
High Availability Deployment

Deploy TigerAccess in a highly available configuration with automatic failover, multi-region support, and a 99.99% uptime SLA.

Overview

A production-grade TigerAccess deployment includes:

  • Multiple Auth Instances: 3+ auth servers for consensus
  • Scalable Proxies: Horizontally scaled proxy tier
  • Replicated Storage: PostgreSQL with streaming replication
  • Multi-Region: Active-passive or active-active deployment

Prerequisites

  • Minimum 6 servers (3 auth, 2 proxy, 1 load balancer)
  • PostgreSQL cluster with replication (or managed service like RDS)
  • Redis cluster or managed service (ElastiCache)
  • Shared storage (S3) for session recordings
  • Load balancer (ALB, NLB, or HAProxy)

HA Architecture

Recommended production architecture:

Load Balancer (ALB/NLB)
  └── Proxy Tier (2+ instances, auto-scaling)
        └── Auth Cluster (3+ instances, etcd consensus)
              ├── PostgreSQL Primary (with 2+ replicas)
              └── Redis Cluster (3+ nodes)

Auth Service Cluster

Configure etcd Backend

# /etc/tigeraccess/config.yaml on each auth server
auth:
  cluster_name: "production"
  # Use etcd for distributed state
  storage:
    type: "etcd"
    peers:
      - "https://auth1.example.com:2379"
      - "https://auth2.example.com:2379"
      - "https://auth3.example.com:2379"
    # mTLS for etcd
    tls_ca_file: "/etc/tigeraccess/etcd-ca.pem"
    tls_cert_file: "/etc/tigeraccess/etcd-cert.pem"
    tls_key_file: "/etc/tigeraccess/etcd-key.pem"
  # External PostgreSQL
  audit:
    type: "postgres"
    conn_string: "postgres://user:pass@postgres.example.com:5432/tigeraccess"
  # Cache configuration
  cache:
    enabled: true
    type: "redis"
    addresses:
      - "redis1.example.com:6379"
      - "redis2.example.com:6379"
      - "redis3.example.com:6379"

Start Auth Cluster

# On each auth server
sudo tigeraccess start --config=/etc/tigeraccess/config.yaml --roles=auth

# Verify cluster health
tac status --cluster
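
In production the service should be supervised rather than started by hand. A minimal systemd unit sketch (unit name and binary path are assumptions; adjust to your install):

# /etc/systemd/system/tigeraccess.service (illustrative)
[Unit]
Description=TigerAccess Auth Service
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/tigeraccess start --config=/etc/tigeraccess/config.yaml --roles=auth
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable it on each auth server with sudo systemctl enable --now tigeraccess.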

Proxy Service Cluster

Configure Proxy Instances

# /etc/tigeraccess/config.yaml on each proxy
proxy:
  enabled: true
  public_addr: "proxy.example.com:3023"
  # Connect to auth cluster
  auth_servers:
    - "auth1.example.com:3025"
    - "auth2.example.com:3025"
    - "auth3.example.com:3025"
  # Session recording to S3
  recording:
    enabled: true
    mode: "proxy-async"
    storage:
      type: "s3"
      bucket: "tigeraccess-recordings"
      region: "us-east-1"
  # Enable all protocols
  ssh:
    enabled: true
    listen_addr: "0.0.0.0:3023"
  kubernetes:
    enabled: true
    listen_addr: "0.0.0.0:3026"
  database:
    enabled: true
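
With proxy-async recording, each proxy instance needs write access to the recordings bucket. A minimal IAM policy sketch (bucket name matches the config above; scope it down further where possible):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::tigeraccess-recordings",
        "arn:aws:s3:::tigeraccess-recordings/*"
      ]
    }
  ]
}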

Auto-Scaling Configuration

# AWS Auto Scaling Group example
{
  "AutoScalingGroupName": "tigeraccess-proxy",
  "MinSize": 2,
  "MaxSize": 10,
  "DesiredCapacity": 3,
  "HealthCheckType": "ELB",
  "HealthCheckGracePeriod": 300,
  "TargetGroupARNs": ["arn:aws:elasticloadbalancing:..."]
}
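
Assuming the JSON above is saved as asg.json (it also needs a LaunchTemplate, omitted here for brevity), the group can be created with the AWS CLI:

# Create the proxy auto-scaling group from the JSON definition
aws autoscaling create-auto-scaling-group --cli-input-json file://asg.json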

Database High Availability

PostgreSQL Replication

# Primary server postgresql.conf
wal_level = replica
max_wal_senders = 5
max_replication_slots = 5
synchronous_commit = on
# A bare list is shorthand for FIRST 1 (standby1, standby2)
synchronous_standby_names = 'standby1,standby2'

# Standby server recovery.conf (PostgreSQL 11 and earlier)
standby_mode = on
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
trigger_file = '/tmp/postgresql.trigger'

# On PostgreSQL 12+ recovery.conf is gone: put primary_conninfo in
# postgresql.conf, create an empty standby.signal file in the data
# directory, and promote with pg_ctl promote instead of a trigger file.
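
Once the standbys are attached, verify replication from the primary:

-- Run on the primary: both standbys should show state = 'streaming'
SELECT client_addr, state, sync_state FROM pg_stat_replication;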

Managed Database Services

Recommended for production:

  • AWS RDS PostgreSQL: Multi-AZ deployment with automated failover
  • Google Cloud SQL: Regional availability with read replicas
  • Azure Database for PostgreSQL: Zone-redundant HA
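
With a managed service, only the connection string changes. For example, pointing the audit backend at an RDS endpoint with TLS verification (hostname illustrative):

# Audit backend against RDS (hostname illustrative)
audit:
  type: "postgres"
  conn_string: "postgres://user:pass@tigeraccess.cluster-abc123.us-east-1.rds.amazonaws.com:5432/tigeraccess?sslmode=verify-full"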

Redis High Availability

# Redis Cluster mode
auth:
  cache:
    type: "redis"
    cluster_mode: true
    addresses:
      - "redis-node-1:6379"
      - "redis-node-2:6379"
      - "redis-node-3:6379"
      - "redis-node-4:6379"
      - "redis-node-5:6379"
      - "redis-node-6:6379"

Load Balancing

AWS Network Load Balancer

SSH and other raw TCP protocols need an NLB; ALBs only terminate HTTP/HTTPS, so reserve an ALB (or the HAProxy web frontend below) for the web UI.

# NLB Target Group for the SSH proxy
{
  "Protocol": "TCP",
  "Port": 3023,
  "VpcId": "vpc-xxx",
  "HealthCheckEnabled": true,
  "HealthCheckProtocol": "TCP",
  "HealthCheckPort": "3023",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 2
}

HAProxy Configuration

# /etc/haproxy/haproxy.cfg
frontend ssh_proxy
    bind *:3023
    mode tcp
    default_backend tigeraccess_proxies

backend tigeraccess_proxies
    mode tcp
    balance roundrobin
    option tcp-check
    server proxy1 proxy1.example.com:3023 check
    server proxy2 proxy2.example.com:3023 check
    server proxy3 proxy3.example.com:3023 check

frontend web_ui
    bind *:443 ssl crt /etc/ssl/certs/tigeraccess.pem
    mode http
    default_backend tigeraccess_web

backend tigeraccess_web
    mode http
    balance leastconn
    option httpchk GET /webapi/ping
    server proxy1 proxy1.example.com:3080 check
    server proxy2 proxy2.example.com:3080 check
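
Validate the configuration before reloading so a typo doesn't drop live sessions:

# Syntax-check, then reload without killing existing TCP connections
haproxy -c -f /etc/haproxy/haproxy.cfg && sudo systemctl reload haproxy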

Monitoring & Alerting

Prometheus Metrics

# Enable Prometheus endpoint
auth:
  metrics:
    enabled: true
    listen_addr: "0.0.0.0:9090"

# Key metrics to monitor:
# - tigeraccess_auth_cluster_health
# - tigeraccess_active_connections
# - tigeraccess_failed_logins
# - tigeraccess_session_duration_seconds
# - tigeraccess_certificate_expiry_seconds
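
On the Prometheus side, a matching scrape job (targets illustrative; the job name must match the alert rules below):

# prometheus.yml
scrape_configs:
  - job_name: "tigeraccess-auth"
    static_configs:
      - targets:
          - "auth1.example.com:9090"
          - "auth2.example.com:9090"
          - "auth3.example.com:9090"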

Health Checks

# Check auth service health
curl https://auth.example.com:3025/healthz

# Check proxy health
curl https://proxy.example.com:3080/webapi/ping

# Check cluster status
tac status --cluster

Alert Rules

# Prometheus alert rules
groups:
  - name: tigeraccess
    rules:
      - alert: AuthServiceDown
        expr: up{job="tigeraccess-auth"} == 0
        for: 1m
        annotations:
          summary: "TigerAccess auth service is down"

      - alert: ProxyHighConnections
        expr: tigeraccess_active_connections > 1000
        for: 5m
        annotations:
          summary: "High number of active connections"
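
Since the metrics list above includes certificate expiry, an expiry alert is a cheap safeguard; appended to the rules list above (threshold illustrative):

      - alert: CertificateExpiringSoon
        expr: tigeraccess_certificate_expiry_seconds < 604800
        for: 30m
        annotations:
          summary: "TigerAccess certificate expires in under 7 days"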

Disaster Recovery

Backup Strategy

# Backup auth configuration
tac backup create --output=/backups/tigeraccess-$(date +%Y%m%d).tar.gz

# Automated daily backups via crontab entry (runs at 02:00)
0 2 * * * /usr/local/bin/tac backup create --output=/backups/daily-$(date +\%Y\%m\%d).tar.gz
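
Local backups don't survive the loss of the host; shipping them off-box is worth a second cron entry (bucket name illustrative):

# Copy the nightly backup to S3 an hour after it is written
0 3 * * * aws s3 cp /backups/daily-$(date +\%Y\%m\%d).tar.gz s3://tigeraccess-backups/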

Multi-Region Deployment

# Active-Passive configuration
# Primary Region (us-east-1)
auth:
  cluster_name: "production-primary"
  replication:
    mode: "primary"
    peers:
      - "https://auth-dr.us-west-2.example.com:3025"

# DR Region (us-west-2)
auth:
  cluster_name: "production-dr"
  replication:
    mode: "secondary"
    primary: "https://auth.us-east-1.example.com:3025"

Failover Procedure

# Promote DR cluster to primary
tac cluster promote --cluster=production-dr

# Update DNS to point to DR region
# Update load balancer health checks
# Verify all services are operational
tac status --cluster
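
The DNS step can be scripted; with Route 53, for example (zone ID and record values illustrative):

# Repoint the proxy record at the DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "proxy.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "proxy.us-west-2.example.com"}]
      }
    }]
  }'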
