Operations Runbook

This runbook covers day-to-day operations, deployment procedures, incident response, and recovery for the TryOn SaaS platform.

Quick reference

Server: 185.246.222.107
Domain: ziex-tryon.com
SSH: ssh -i ~/.ssh/id_ed25519 [email protected]
App root: /opt/tryon-saas
Backups: /opt/tryon-saas-backups/

Quick Health Check

# Platform health (run from anywhere)
curl https://api.ziex-tryon.com/health | python3 -m json.tool

# Strict readiness probe (DB + Redis)
curl -I https://api.ziex-tryon.com/readiness

# SSH to server and check all Docker services
ssh -i ~/.ssh/id_ed25519 [email protected]
sudo docker ps

Expected health response:

{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "worker_heartbeat": "ok",
    "fal_api": "configured",
    "storage_backend": "local",
    "media_base_url_public": true
  }
}

HTTP 200 is not enough

The /health endpoint returns 200 even when individual checks degrade. Always inspect the checks object, not just the HTTP status code.
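A minimal sketch of inspecting the checks object rather than the status code. It assumes the three core checks (database, redis, worker_heartbeat) report "ok" when healthy, as in the expected response above; the function name failing_checks is hypothetical and not part of the platform's scripts.

```shell
#!/usr/bin/env bash
# Read a /health JSON body on stdin and print every core check that is not "ok".
# Prints nothing when database, redis, and worker_heartbeat are all healthy.
failing_checks() {
  local body name
  body="$(cat)"
  for name in database redis worker_heartbeat; do
    # Match "name": "ok", allowing optional whitespace around the colon.
    if ! printf '%s' "$body" | grep -Eq "\"$name\"[[:space:]]*:[[:space:]]*\"ok\""; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (not executed here):
#   curl -s https://api.ziex-tryon.com/health | failing_checks
```

The other checks (fal_api, storage_backend, media_base_url_public) are left out on purpose, because their healthy values depend on configuration.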

Routine Operations

View logs

# Follow admin-api logs
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs -f admin-api

# Follow ml-worker logs
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs -f ml-worker

# All services, last 100 lines
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=100

# Search for errors
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs admin-api | grep '"level":"error"'

Check service status

# All containers
sudo docker ps

# Resource usage
sudo docker stats --no-stream

# UFW firewall
sudo ufw status verbose

# fail2ban (SSH brute-force)
sudo fail2ban-client status sshd

# Disk usage
df -h /
du -sh /opt/tryon-saas-backups/

Manual backup

cd /opt/tryon-saas
bash scripts/backup.sh
ls -lh /opt/tryon-saas-backups/

Run database migrations

cd /opt/tryon-saas
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec admin-api alembic upgrade head

Migrations are not automatic

After deploying code that includes a new Alembic migration, you must run alembic upgrade head manually. create_all only creates missing tables on a fresh install; it does not alter existing tables, so columns added by a migration never appear without the upgrade.
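One way to tell whether a migration is still pending is to compare the output of alembic current with alembic heads. The sketch below assumes a single migration head and that both commands print the revision id as the first word of their last line; the function name migrations_up_to_date is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 when the database revision equals the newest migration head.
# Both arguments are output lines such as "ae1027a6acf (head)".
migrations_up_to_date() {
  local current="$1" heads="$2"
  # Keep only the revision id (the text before the first space).
  [ -n "$current" ] && [ "${current%% *}" = "${heads%% *}" ]
}

# Usage on the server (not executed here; -T disables the pseudo-TTY):
#   C="docker compose -f docker-compose.yml -f docker-compose.prod.yml"
#   current="$($C exec -T admin-api alembic current 2>/dev/null | tail -n 1)"
#   heads="$($C exec -T admin-api alembic heads 2>/dev/null | tail -n 1)"
#   migrations_up_to_date "$current" "$heads" || echo "run: alembic upgrade head"
```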

Restart a single service

cd /opt/tryon-saas
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart admin-api
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart ml-worker

Apply a code change without full redeploy

cd /opt/tryon-saas
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build admin-api
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d admin-api

Deployment

Full production deploy (CI/CD)

Deployment triggers automatically via GitHub Actions when CI passes on main. The workflow:

1. Runs lint + tests
2. SSHs to server
3. git pull
4. Builds changed services (detected by file diff)
5. Runs up -d with ordered startup
6. Calls scripts/health-check.sh to verify
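The "detected by file diff" step can be pictured as mapping changed paths to service names. This is a sketch only: it assumes each service lives in a top-level directory named after the service. Only nginx/ is confirmed by this runbook; admin-api/ and ml-worker/ are hypothetical paths, and the real workflow may detect changes differently.

```shell
#!/usr/bin/env bash
# Read changed file paths on stdin (for example from `git diff --name-only`)
# and print each affected service once.
changed_services() {
  local dir
  # Keep the top-level path component, de-duplicate, then filter to known services.
  cut -d/ -f1 | sort -u | while read -r dir; do
    case "$dir" in
      admin-api|ml-worker|nginx) printf '%s\n' "$dir" ;;
    esac
  done
}

# Usage (not executed here):
#   git diff --name-only HEAD~1 HEAD | changed_services
```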

Manual trigger

You can re-run the deploy workflow from the GitHub Actions UI without pushing a new commit.

Manual production deploy

ssh -i ~/.ssh/id_ed25519 [email protected]
cd /opt/tryon-saas
git pull
bash scripts/deploy.sh --prod

deploy.sh --prod runs pre-flight checks (DOMAIN is set, TLS certificates exist), builds the images, starts services in dependency order, and then runs scripts/health-check.sh.

DOMAIN must be set in .env

docker-compose.prod.yml uses ${DOMAIN} for MEDIA_BASE_URL. If DOMAIN is unset, the URL becomes "https:///uploads" which breaks image delivery. deploy.sh --prod fails fast if DOMAIN is empty.
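The fail-fast behavior can be illustrated with a small pre-flight function. This is a sketch under assumptions, not the contents of scripts/deploy.sh: the function name preflight and its messages are hypothetical, and it checks only the two conditions named in this runbook (DOMAIN set, certificate files present).

```shell
#!/usr/bin/env bash
# Fail fast when DOMAIN is empty or a TLS file is missing.
# Arguments: the DOMAIN value and the directory that holds the certificates.
preflight() {
  local domain="$1" ssl_dir="$2" f
  if [ -z "$domain" ]; then
    echo "DOMAIN is empty; MEDIA_BASE_URL would become https:///uploads" >&2
    return 1
  fi
  for f in fullchain.pem privkey.pem; do
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$ssl_dir/$f" ]; then
      echo "missing or empty certificate file: $ssl_dir/$f" >&2
      return 1
    fi
  done
  return 0
}

# Usage (not executed here):
#   preflight "$DOMAIN" /opt/tryon-saas/ssl || exit 1
```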

First-time deploy (fresh server)

ssh -i ~/.ssh/id_ed25519 [email protected]

# Clone repository
git clone <repo-url> /opt/tryon-saas
cd /opt/tryon-saas

# Place TLS certificates (Cloudflare Origin Certificate)
# ssl/fullchain.pem ← origin.crt
# ssl/privkey.pem   ← origin.key
chmod 644 ssl/fullchain.pem
chmod 600 ssl/privkey.pem

# Generate secrets and create .env
bash scripts/generate-secrets.sh
cp .env.example .env
# Edit .env: fill FAL_API_KEY, FIRST_ADMIN_EMAIL, FIRST_ADMIN_PASSWORD, DOMAIN

# Deploy
bash scripts/deploy.sh --prod

Incident Response

Worker down

Symptoms: Jobs stuck in pending status. /health shows "worker_heartbeat": "stale" or "missing". Telegram alert fires.

# Check worker logs
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=50 ml-worker

# Check if container is running
sudo docker ps | grep ml-worker

# Restart worker
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart ml-worker

# Verify heartbeat recovers (check within 90 seconds)
curl https://api.ziex-tryon.com/health | python3 -m json.tool
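Instead of re-running the health check by hand, the 90-second window can be polled. The helper names wait_until and heartbeat_ok are hypothetical, and the sketch assumes a healthy heartbeat is reported as "worker_heartbeat": "ok", as in the expected health response.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the timeout (seconds) elapses.
# Arguments: timeout, polling interval, then the command to run.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + $timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds when /health reports a healthy worker heartbeat.
heartbeat_ok() {
  curl -fsS https://api.ziex-tryon.com/health \
    | grep -Eq '"worker_heartbeat"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (not executed here):
#   wait_until 90 5 heartbeat_ok && echo "heartbeat recovered" || echo "still stale"
```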

Stale job auto-recovery

The admin-api runs a background sweep every 60 seconds. Jobs stuck in processing for more than 15 minutes are automatically marked failed, as are jobs stuck in pending for more than 30 minutes. Clients should retry failed jobs.
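The two thresholds can be expressed as a small predicate. This only illustrates the rule described above; it is not the admin-api implementation, and the function name is_stale is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 when a job of the given status and age (seconds) would be swept as stale.
is_stale() {
  local status="$1" age="$2"
  case "$status" in
    processing) [ "$age" -gt $((15 * 60)) ] ;;  # more than 15 minutes
    pending)    [ "$age" -gt $((30 * 60)) ] ;;  # more than 30 minutes
    *)          return 1 ;;                     # other statuses are never swept
  esac
}

# Usage (not executed here):
#   is_stale processing 1200 && echo "would be marked failed"
```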

High job failure rate

Symptoms: Telegram alert fires. Grafana shows tryon_completed_total{status="failed"} spiking.

# Check recent job errors in worker logs
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs ml-worker | grep '"level":"error"' | tail -20

# Common causes:
# 1. fal.ai API key invalid or rate-limited
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs ml-worker | grep "fal_api_key"

# 2. fal.ai service outage
curl https://api.ziex-tryon.com/health | python3 -m json.tool | grep fal_api

# 3. Image URL not reachable by fal.ai (MEDIA_BASE_URL issue)
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs admin-api | grep "media_base_url_localhost"

Resolution:

- fal.ai key issue → update FAL_API_KEY in .env, then rebuild and recreate ml-worker (up -d --build --force-recreate ml-worker) so the container picks up the new value; a plain restart does not re-read .env
- fal.ai outage → wait; jobs will be retried when the worker restarts, or mark them failed manually
- MEDIA_BASE_URL issue → ensure MEDIA_BASE_URL in .env is set to a publicly reachable HTTPS URL

API down (admin-api not responding)

Symptoms: curl https://api.ziex-tryon.com/health fails or returns 502.

# Check admin-api container
sudo docker ps | grep admin-api
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=30 admin-api

# Check nginx
sudo docker ps | grep nginx
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=20 nginx

# Restart admin-api
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart admin-api

# If still 502, check nginx can reach admin-api
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec nginx wget -qO- http://admin-api:8000/health

Common admin-api startup failures

- Missing required env var (e.g. JWT_SECRET_KEY too short) → check logs for startup_failed event
- Database not reachable at startup → check postgres container first
- Port conflict → check if port 8000 is bound by another process

Database issues

Symptoms: /readiness returns 503. /health shows "database": "connection_failed".

# Check postgres container
sudo docker ps | grep postgres
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=30 postgres

# Restart postgres (data is on a named Docker volume, safe to restart)
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart postgres

# Wait ~10 seconds then check
curl https://api.ziex-tryon.com/readiness

# Connect directly to check
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec postgres psql -U postgres tryon_saas -c "\l"

Never delete the postgres volume

The database lives in a Docker named volume. Running docker compose down -v would destroy all data. Always use docker compose down without -v.

Redis issues

Symptoms: /readiness returns 503. /health shows "redis": "connection_failed". Rate limiting and job queuing stop working.

# Check redis container
sudo docker ps | grep redis
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs --tail=20 redis

# Restart redis (AOF persistence will replay on restart — takes a few seconds)
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart redis

# Verify
curl https://api.ziex-tryon.com/readiness

Redis AOF persistence

Redis is configured with AOF (append-only file) persistence. After a restart, Redis replays the AOF to recover the queue state. This takes a few seconds and is safe. Pending jobs that were in the queue before the restart will be available again.
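To confirm that AOF is actually enabled on the running instance, the aof_enabled field of INFO persistence can be read. A small sketch follows; the function name aof_status is hypothetical.

```shell
#!/usr/bin/env bash
# Read `redis-cli INFO persistence` output on stdin and report the AOF state.
aof_status() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '$1 == "aof_enabled" { print ($2 == "1" ? "enabled" : "disabled") }'
}

# Usage on the server (not executed here):
#   docker compose -f docker-compose.yml -f docker-compose.prod.yml \
#     exec -T redis redis-cli INFO persistence | aof_status
```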

TLS / SSL issues

Symptoms: HTTPS returns certificate error. Cloudflare shows SSL handshake failure.

# Check nginx logs for TLS errors
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs nginx | grep -i "ssl\|tls\|cert"

# Verify cert files exist and are readable
ls -la /opt/tryon-saas/ssl/
# Expected: fullchain.pem (644) and privkey.pem (600)

# Check cert expiry
openssl x509 -in /opt/tryon-saas/ssl/fullchain.pem -noout -dates

Cloudflare Origin Certificates

The platform uses Cloudflare Origin Certificates (not Let's Encrypt). These are 15-year certificates — expiry is not a routine concern. However, ssl_stapling must be off in nginx.prod.conf for these certs:

ssl_stapling        off;
ssl_stapling_verify off;

OCSP stapling does not work with Cloudflare origin certs and will cause nginx startup failure if enabled.

Disk full

Symptoms: Admin-api fails to save uploaded images. Logs show OSError: No space left on device.

# Check disk usage
df -h /
du -sh /opt/tryon-saas-backups/
du -sh /opt/tryon-saas/uploads/
du -sh /var/lib/docker/

# Remove dumps older than 30 days (already automated by backup.sh; this is a manual override)
find /opt/tryon-saas-backups/ -type f -name '*.sql.gz' -mtime +30 -delete

# Trigger media cleanup manually (removes WebP files older than UPLOAD_RETENTION_HOURS with no active job)
# The cleanup service runs automatically on its own schedule; restarting admin-api triggers a run sooner
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart admin-api

# Prune unused Docker layers
sudo docker system prune -f
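To catch a filling disk before uploads start failing, the df output can be turned into a threshold test. The sketch below uses hypothetical function names, and the 85 percent figure in the usage line is only an example value, not a platform setting.

```shell
#!/usr/bin/env bash
# Read `df -P <path>` output on stdin and print the used percentage as a bare number.
usage_percent() {
  # Line 2 holds the filesystem row; column 5 is the capacity, e.g. "91%".
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Return 0 when usage read from stdin is at or above the given threshold.
disk_over() {
  local threshold="$1" used
  used="$(usage_percent)"
  [ "$used" -ge "$threshold" ]
}

# Usage (not executed here):
#   df -P / | disk_over 85 && echo "root filesystem is above 85% used"
```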

High memory usage

Symptoms: OOM kill in container logs. sudo docker stats shows a service near its memory limit.

# Check memory usage per container
sudo docker stats --no-stream

# Check for OOM kills in kernel log
sudo dmesg | grep -i "oom\|kill" | tail -10

# Restart the affected service (admin-api shown; substitute the service near its limit)
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart admin-api

Resource limits

All services have memory limits defined in docker-compose.yml. If a service consistently hits its limit, review both the configured limit and the service's memory usage pattern before raising it.

Recovery Procedures

Restore database from backup

# List available backups
ls -lh /opt/tryon-saas-backups/

# Restore from a specific dump (this will DROP and recreate the database)
bash /opt/tryon-saas/scripts/restore.sh /opt/tryon-saas-backups/<dump-file>.sql.gz --drop-existing --yes

# Verify restore
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec postgres psql -U postgres tryon_saas -c "SELECT count(*) FROM clients;"

Restore drops existing data

--drop-existing drops the current database before restoring. Always confirm you have the right backup file before using this flag.
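Because the drop happens before the restore, it helps to confirm the dump is readable first. A minimal sketch using gzip -t to test archive integrity; the function name verify_dump is hypothetical and restore.sh may already perform its own checks.

```shell
#!/usr/bin/env bash
# Return 0 only when the dump file exists, is not empty, and is a valid gzip archive.
verify_dump() {
  local dump="$1"
  if [ ! -s "$dump" ]; then
    echo "not found or empty: $dump" >&2
    return 1
  fi
  # gzip -t decompresses to nowhere and fails on a truncated or corrupt archive.
  if ! gzip -t "$dump" 2>/dev/null; then
    echo "corrupt gzip archive: $dump" >&2
    return 1
  fi
}

# Usage (not executed here):
#   verify_dump /opt/tryon-saas-backups/<dump-file>.sql.gz && echo "safe to restore"
```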

Recover admin access (password reset)

If the admin password is lost and no recovery option exists:

# Generate a new bcrypt hash (use Python on the server)
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec admin-api python3 -c "
import bcrypt
pw = b'YourNewPassword123!'
print(bcrypt.hashpw(pw, bcrypt.gensalt(rounds=12)).decode())
"

# Update the password in the database
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec postgres psql -U postgres tryon_saas -c \
  "UPDATE users SET password_hash='\$2b\$12\$...' WHERE email='[email protected]';"

Forgot password endpoint is not implemented

There is no /auth/forgot-password endpoint. Admin password recovery requires direct DB access as shown above. A self-service forgot-password flow is planned for Phase 2 (requires email provider).

Fix broken nginx config

# Test nginx config before applying
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec nginx nginx -t

# If config is broken and nginx won't start, restore from git
cd /opt/tryon-saas
git diff nginx/nginx.prod.conf   # see what changed
git checkout nginx/nginx.prod.conf  # revert
docker compose -f docker-compose.yml -f docker-compose.prod.yml restart nginx

Fix broken sshd config

If you get locked out of SSH due to a bad sshd_config:

Log into the VPS provider web console (xorek.cloud) using the root account

As root via console: restore the backup config

cp /etc/ssh/sshd_config.backup.20260520 /etc/ssh/sshd_config
systemctl reload ssh

Re-enable the deploy user's key if needed:

cat /home/deploy/.ssh/authorized_keys
# Add your public key if missing

Recover from SSH key loss

1. Log into the xorek.cloud web console
2. Access the server as root via the web console
3. Add your new public key to /home/deploy/.ssh/authorized_keys
4. Verify permissions: chmod 700 /home/deploy/.ssh && chmod 600 /home/deploy/.ssh/authorized_keys

Reference

Environment variables (critical)

Variable	Purpose	Where to change
JWT_SECRET_KEY	Signing all JWTs	`/opt/tryon-saas/.env`
FAL_API_KEY	fal.ai inference	`/opt/tryon-saas/.env`
POSTGRES_PASSWORD	Database auth	`/opt/tryon-saas/.env`
DOMAIN	MEDIA_BASE_URL prefix	`/opt/tryon-saas/.env`
MEDIA_BASE_URL	Public URL for uploads sent to fal.ai	`/opt/tryon-saas/.env`
UPLOAD_RETENTION_HOURS	How long to keep images	`/opt/tryon-saas/.env`
ENABLE_DOCS	Disable /docs in production	`/opt/tryon-saas/.env`

After changing .env

Recreate the affected service so it picks up the new values (a plain restart does not re-read .env):

docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --force-recreate admin-api

Firewall (UFW)

Open ports: 22 (SSH), 80 (HTTP→HTTPS redirect), 443 (HTTPS). All other ports are denied.

sudo ufw status verbose

Docker Compose commands

# Always use both compose files in production
COMPOSE="docker compose -f docker-compose.yml -f docker-compose.prod.yml"

$COMPOSE ps                          # service status
$COMPOSE logs -f admin-api           # follow logs
$COMPOSE restart admin-api           # restart one service
$COMPOSE up -d                       # start/update all
$COMPOSE exec admin-api bash         # shell into container
$COMPOSE exec postgres psql -U postgres tryon_saas  # DB shell

Useful one-liners

# Count jobs by status
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec postgres \
  psql -U postgres tryon_saas -c "SELECT status, count(*) FROM jobs GROUP BY status;"

# Check active client count
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec postgres \
  psql -U postgres tryon_saas -c "SELECT count(*) FROM clients WHERE status='active';"

# Check Redis queue depth
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec redis \
  redis-cli LLEN tryon_jobs

# Check worker heartbeat TTL (seconds until the key expires; -2 means the key is missing)
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec redis \
  redis-cli TTL worker_heartbeat

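# Tail structured logs and pretty-print JSON (--no-log-prefix drops the service-name prefix so each line parses)
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs -f --no-log-prefix admin-api | python3 -c "
import sys, json
for line in sys.stdin:
    try:
        print(json.dumps(json.loads(line), indent=2))
    except ValueError:
        # Not JSON (for example a traceback line): pass it through unchanged
        print(line, end='')
"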
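redis-cli TTL reports the seconds remaining before a key expires and uses two sentinel values: -2 when the key does not exist and -1 when it exists without an expiry. A small sketch for reading the worker_heartbeat result; the function name describe_ttl is hypothetical.

```shell
#!/usr/bin/env bash
# Translate a Redis TTL reply into a human-readable state.
describe_ttl() {
  case "$1" in
    -2) echo "missing" ;;            # key does not exist
    -1) echo "no expiry set" ;;      # key exists but never expires
    *)  echo "expires in ${1}s" ;;   # seconds until the key expires
  esac
}

# Usage on the server (not executed here):
#   ttl="$(docker compose -f docker-compose.yml -f docker-compose.prod.yml \
#     exec -T redis redis-cli TTL worker_heartbeat)"
#   describe_ttl "$ttl"
```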

Backup crontab

To set up automated daily backups (if not already configured):

crontab -e
# Add:
0 3 * * * cd /opt/tryon-saas && bash scripts/backup.sh >> /var/log/tryon-backup.log 2>&1

Backups are stored in /opt/tryon-saas-backups/ as timestamped .sql.gz files with a manifest.json. The backup.sh script automatically deletes backups older than 30 days.