Project Report

Overview

This document is a comprehensive account of all work performed to design, build, harden, and ship the TryOn SaaS platform — a B2B virtual clothing try-on service that allows e-commerce businesses to embed an AI-powered widget on their websites. It is intended to give technical leadership a complete picture of the scope, depth, and engineering discipline that went into this system.

The platform was built from a blank VPS to a fully automated, production-hardened, monitored service with a 41-finding security audit, 507+ automated tests, ≥97% code coverage, and a CI/CD pipeline where a git push results in a live deployment in approximately 20 seconds.


1. Server & Infrastructure Setup

Provisioning

The foundation of the platform is a dedicated VPS purchased at xorek.cloud with the following specifications:

Parameter | Value
CPU | 4 vCPU
RAM | 8 GB
Storage | 80 GB NVMe SSD
IP Address | 185.246.222.107
Operating System | Ubuntu 22.04 LTS

Day-One Hardening

Security hardening was applied on the very first session, before any application code was deployed:

  • Root SSH disabled — remote root login is explicitly blocked in /etc/ssh/sshd_config
  • Password authentication disabled — only public-key authentication is accepted
  • Non-root deploy user — all operations run as deploy with NOPASSWD sudo strictly scoped to Docker commands
  • UFW firewall — only ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) are open; all other inbound traffic is denied by default
  • fail2ban — protects the SSH port against brute-force attacks with automatic IP banning after repeated failed attempts

SSH Access Recovery

If the SSH key is ever lost, recovery is possible through the xorek.cloud web console. A backup of the sshd config is kept at /etc/ssh/sshd_config.backup.20260520 on the server.


2. Domain & DNS Setup

Cloudflare Integration

The domain ziex-tryon.com is registered and fully managed through Cloudflare. All DNS resolution runs through Cloudflare's proxy (orange-cloud mode), providing:

  • DDoS protection at the DNS/CDN layer
  • Automatic HTTP→HTTPS redirects
  • Bot management and IP reputation filtering
  • Edge caching for static assets

TLS Strategy

Rather than Let's Encrypt (which requires certificate renewal automation and ACME challenges), a Cloudflare Origin Certificate was chosen. This certificate is issued directly by Cloudflare and is trusted only between Cloudflare's edge and the origin server, which enables Cloudflare's Full (strict) TLS mode without the operational overhead of certbot.

nginx OCSP Stapling

Cloudflare Origin Certificates are not signed by a public CA, so OCSP stapling must be disabled in nginx:

ssl_stapling        off;
ssl_stapling_verify off;
If stapling is left enabled, nginx logs repeated OCSP errors and TLS handshakes can be delayed.

Subdomain Map

Six DNS entries were configured, all pointing to the same origin IP (185.246.222.107) but routed to different application contexts by nginx server blocks:

Subdomain | Purpose
ziex-tryon.com | Main landing page
api.ziex-tryon.com | REST API (FastAPI backend)
admin.ziex-tryon.com | Admin panel (React SPA)
app.ziex-tryon.com | Client portal (React SPA)
sandbox.ziex-tryon.com | Integration sandbox for testing
docs.ziex-tryon.com | MkDocs documentation site

3. Architecture

The platform was designed from scratch with a clear separation of concerns across stateless HTTP services, a queue-based job processing pipeline, and a dedicated ML inference backend.

Client Website
tryon-embed.js  ──►  Admin API (:8000)  ──►  Redis Queue  ──►  ML Worker
(Shadow DOM)          │                                              │
                      │                                              ▼
                 PostgreSQL 15                              fal.ai FASHN v1.5
                      ▲                                    (AI inference)
             ┌────────┴────────┐
        Prometheus           Grafana
             ▲               (3 dashboards)
       Pushgateway  ◄──  ML Worker metrics
       AlertManager  ──►  Telegram Bot
           nginx  (TLS · rate limiting · security headers)

Service Inventory

The entire platform runs as 8 Docker services defined in docker-compose.yml with a production overlay in docker-compose.prod.yml:

Service | Internal Port | Role
postgres | 5432 (internal) | PostgreSQL 15: persistent data store
redis | 6379 (internal) | Redis 7 with AOF: job queue + rate limit counters
admin-api | 8000 | FastAPI: all business logic
ml-worker | none | Job processor: Redis BRPOP loop
frontend | 5173 (internal) | Vite dev server / built React SPA
nginx | 80/443 | Reverse proxy, TLS termination
prometheus | 9090 (internal in prod) | Metrics scraping
grafana | 3000 (internal in prod) | Dashboards

Every service is configured with:

  • restart: unless-stopped
  • logging: json-file driver with max-size: 50m, max-file: 5
  • CPU and memory limits and reservations
  • Docker health checks

Production vs Development

docker-compose.prod.yml overrides the base compose file: it closes ports 8000, 9090, and 3000 to public access, sets MEDIA_BASE_URL to the production domain, and mounts the production nginx config with full TLS. In development these ports are directly accessible for debugging.

Job Processing Pipeline

When a client's embedded widget triggers a try-on request:

  1. Embed widget (tryon-embed.js) captures product and person images and sends POST /api/v1/tryon with an API key
  2. Admin API validates the key, checks the client's generation quota, processes images (resize, EXIF strip, WebP conversion), saves them to storage, pushes a JSON job payload to Redis
  3. ML Worker picks up the job via BRPOP (blocking pop), calls fal.ai FASHN v1.5
  4. fal.ai runs the inference (virtual try-on), returns a result URL
  5. Worker updates the job status in PostgreSQL and fires registered webhooks
  6. Client polls GET /api/v1/tryon/status/{id} until status is completed
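
The hand-off between the API and the worker can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's code: InMemoryQueue is a hypothetical stand-in for the Redis list, the payload field names are invented for the example, and run_inference stands in for the fal.ai call.

```python
import json
import uuid
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a Redis list (LPUSH to enqueue, BRPOP to dequeue)."""

    def __init__(self):
        self._items = deque()

    def lpush(self, payload):
        self._items.appendleft(payload)

    def brpop(self):
        # Real BRPOP blocks until an item arrives; this stand-in returns None when empty.
        return self._items.pop() if self._items else None


def submit_job(queue, jobs, person_url, garment_url):
    """Step 2: record the job as pending and push a JSON payload onto the queue."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result_url": None}
    payload = {"job_id": job_id, "person_url": person_url, "garment_url": garment_url}
    queue.lpush(json.dumps(payload))
    return job_id


def process_next(queue, jobs, run_inference):
    """Steps 3-5: pop one job, run inference, store the outcome.

    Returns False when the queue was empty, True otherwise.
    """
    raw = queue.brpop()
    if raw is None:
        return False
    job = json.loads(raw)
    record = jobs[job["job_id"]]
    record["status"] = "processing"
    try:
        record["result_url"] = run_inference(job["person_url"], job["garment_url"])
        record["status"] = "completed"
    except Exception:
        # Any inference error leaves the job in a terminal failed state.
        record["status"] = "failed"
    return True
```

In this sketch the jobs dictionary plays the role of the PostgreSQL job table, so a status endpoint would simply read jobs[job_id]["status"]. Webhook delivery (step 5) is left out to keep the example short.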

4. Tech Stack

Backend

Component | Version | Notes
Python | 3.11 | Async throughout (asyncio, async SQLAlchemy)
FastAPI | 0.111.0 | ASGI, Pydantic v2 integration
SQLAlchemy | 2.0 | Async engine, async_session factory
Pydantic | v2 | Breaking change: AnyHttpUrl is no longer a str subclass
Alembic | 1.13 | DB migrations, auto-applied on startup
PostgreSQL | 15 | Primary data store
Redis | 7 | AOF persistence, job queue + rate limiting
PyJWT | ≥2.8.0 | Used instead of python-jose; imported as jwt
bcrypt | 5.x (used directly) | Used instead of passlib, which is incompatible with the changed 5.x API
structlog | latest | JSON logging with sensitive-field redaction
httpx | latest | Async HTTP client for fal.ai calls

Pydantic v2 AnyHttpUrl Gotcha

In Pydantic v1, AnyHttpUrl was a str subclass and could be passed directly to asyncpg. In Pydantic v2 it is not. Any URL field passed to SQLAlchemy/asyncpg must be explicitly coerced:

# Wrong — asyncpg raises TypeError: expected str, got AnyHttpUrl
model = Job(webhook_url=body.webhook_url)

# Correct
model = Job(webhook_url=str(body.webhook_url))

# Correct for PATCH (mode="json" serializes URL fields to plain strings)
updates = body.model_dump(mode="json", exclude_unset=True)

Frontend

Component | Version | Notes
React | 18 | SPA, hooks, context
TypeScript | 5 | Strict mode
Vite | 5 | Build tool, dev server
Tailwind CSS | 3 | Utility-first styling
shadcn/ui | latest | Component library
Node.js | 22 | Build environment (Alpine image)
Axios | latest | HTTP client with JWT interceptor + auto-refresh
Zustand | latest | State management (separate stores for admin/portal auth)

Infrastructure

Component | Version | Notes
Docker Compose | V2 | docker compose command
nginx | 1.25 | TLS, gzip, rate limiting, security headers
Prometheus | 2.x | Metrics collection
Grafana | 11.0.0 | Dashboards, auto-provisioned
Pushgateway | 1.10.0 | Worker metrics (worker has no HTTP server)
fal.ai | FASHN v1.5 | AI inference backend

5. Security Audit — 41 Findings

A formal security audit was conducted in two phases. 41 findings were identified, 38 closed, and 3 consciously deferred with written justification in the project's known-issues register.

Authentication & Access Control (8 findings)

Finding | Resolution
Refresh tokens stored as plaintext | SHA-256 pre-hash → bcrypt(rounds=12); token_prefix VARCHAR(16) for O(1) DB lookup
No brute-force protection on login | 5 failed attempts → IP + email locked for 15 minutes (Redis counters)
Timing attack on user lookup | dummy_password_check() runs bcrypt on a pre-computed hash when the user is not found, so both paths take comparable time
Portal JWT used same scope as admin | Separate scope="portal" claim; get_current_client dependency validates scope
Refresh token lookup was O(n) | token_prefix index + composite (token_prefix, user_id) index; confirmed O(1) at scale
/auth/change-password no rate limit | Deferred: 15-min token window limits attack surface; rate limit scheduled for next sprint
API key brute force | key_prefix VARCHAR(20) fast lookup; bcrypt hash storage; per-key hourly rate limit in Redis
Portal login had no rate limit | Fixed: 5/min/IP via get_redis dependency. Was using request.app.state.redis (never set → always fail-open)
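
The login lockout rule (5 failed attempts, 15-minute lock) can be illustrated with a small sketch. It is not the platform's implementation: the LoginLockout class is hypothetical, an in-memory dictionary stands in for the Redis counters, and combining IP and email into one key is an assumption about how the two are tracked.

```python
import time

MAX_FAILED_ATTEMPTS = 5      # threshold quoted in the findings
LOCKOUT_SECONDS = 15 * 60    # lock duration quoted in the findings


class LoginLockout:
    """In-memory stand-in for Redis counters with a TTL (hypothetical class)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}  # key -> (failure_count, expires_at)

    @staticmethod
    def _key(ip, email):
        # Assumption: one counter per (IP, normalized email) pair.
        return f"{ip}|{email.strip().lower()}"

    def record_failure(self, ip, email):
        key = self._key(ip, email)
        now = self._clock()
        count, expires_at = self._failures.get(key, (0, 0.0))
        if now >= expires_at:
            count = 0  # the previous window expired, like a Redis key timing out
        # Each failure restarts the 15-minute window (INCR followed by EXPIRE).
        self._failures[key] = (count + 1, now + LOCKOUT_SECONDS)

    def record_success(self, ip, email):
        self._failures.pop(self._key(ip, email), None)

    def is_locked(self, ip, email):
        entry = self._failures.get(self._key(ip, email))
        if entry is None:
            return False
        count, expires_at = entry
        if self._clock() >= expires_at:
            return False
        return count >= MAX_FAILED_ATTEMPTS
```

A login handler would call is_locked() before verifying the password, record_failure() on a bad password, and record_success() after a valid login. The injectable clock exists only so the window can be tested without waiting.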

Input Validation & Injection (7 findings)

Finding | Resolution
f-string SQL in stats queries | Replaced with parameterized SQLAlchemy expressions throughout
SSRF via webhook URL | validate_webhook_url() blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolves hostname to check all returned IPs
SSRF via per-job webhook_url | Same validator applied at Pydantic schema layer, not just in the worker
MIME type not validated | Strict image/jpeg / image/png check via magic bytes, not just Content-Type header
Decompression bomb | PIL.Image.MAX_IMAGE_PIXELS = 4096 * 4096; explicit post-open pixel count check; rejects images over 16,777,216 pixels (~16.8M) with HTTP 422
TryOnRequest.mode accepted any string | Changed to Literal["balanced", "quality"]; invalid values return 422
ClientUpdate.status accepted any string | Changed to Literal["active", "suspended"]; prevents "banned" persisting to DB
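
The webhook SSRF check can be approximated with the standard library alone. The sketch below reuses the name validate_webhook_url from the findings, but the body is an illustration rather than the project's code: the resolve parameter and the helper names are assumptions, added so the check can be exercised without real DNS.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

CGNAT = ipaddress.ip_network("100.64.0.0/10")  # RFC 6598 shared address space


def _resolve(hostname):
    """Default resolver: every address the hostname maps to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def _is_blocked(ip_text):
    # Drop any IPv6 zone identifier such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # judge ::ffff:a.b.c.d by its embedded IPv4 address
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
        or (ip.version == 4 and ip in CGNAT)
    )


def validate_webhook_url(url, resolve=_resolve):
    """Return the URL unchanged, or raise ValueError if it targets an internal address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("webhook URL must be an absolute http(s) URL")
    addresses = resolve(parts.hostname)
    if not addresses:
        raise ValueError("hostname did not resolve")
    for address in addresses:
        # Every returned address is checked, not just the first one.
        if _is_blocked(address):
            raise ValueError(f"webhook URL resolves to a blocked address: {address}")
    return url
```

A check like this only covers the moment of validation. A hostname can resolve differently when the webhook is later delivered, so a production validator would typically be applied again at delivery time.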

Infrastructure & Configuration (14 findings)

Finding | Resolution
Port 8000 publicly exposed | Closed in docker-compose.prod.yml; nginx is the only ingress
Ports 9090, 3000 publicly exposed | Closed in production override
CORS_ORIGINS JSON parse error | pydantic-settings cannot parse comma-separated list; must use JSON array in .env
/metrics endpoint publicly readable | location = /metrics { deny all; return 403; } in both nginx configs
No Cache-Control on API responses | RequestIdMiddleware adds Cache-Control: no-store, private to all /api/ responses
ENABLE_DOCS not enforced | /docs, /redoc, /openapi.json removed when ENABLE_DOCS=false
Permissions-Policy header missing | Added to both nginx configs: camera=(), microphone=(), geolocation=()
Security headers only on 2xx | Added always flag to all nginx security headers directives
nginx client_max_body_size not set | Set to 12m (10 MB image + ~33% base64 overhead)
Prometheus targeted ml-worker:8001 | Removed; ml-worker has no HTTP server and the target was causing Prometheus log spam
DOMAIN missing from .env.example | Added; deploy.sh --prod fails fast if DOMAIN is unset
media_base_url_localhost warning | Startup log + health endpoint media_base_url_public: bool field
Storage backend not observable | GET /health response includes storage_backend: "local"/"s3"
nginx health location missing | Added explicit location = /health in both configs; was falling through to Vite frontend

Frontend Correctness (7 findings)

Finding | Resolution
Settings.tsx called wrong endpoint | Was PUT /auth/me/password; fixed to POST /auth/change-password with correct field names
ClientDetail.tsx iterated envelope object | GET /clients/{id}/keys returns paginated envelope; fixed to use .items
Concurrent 401 refresh race | Module-level refreshPromise dedup in api/client.ts; a second 401 reuses the in-flight refresh
Portal webhooks wrong event namespace | Was tryon.completed; fixed to job.completed / job.failed (only valid values in _VALID_EVENTS)
model_validate(update=...) removed in Pydantic 2.13+ | Webhook response built as plain dict before model_validate
Jobs date filter was dead code | Fully implemented: date_from/date_to query params wired on both frontend and backend
Plan creation missing slug field | Frontend auto-generates slug from plan name; API requires slug + generations_limit

Observability & Monitoring (5 findings)

Finding | Resolution
No Prometheus metrics for worker | Pushgateway integration: worker pushes after each job; Prometheus scrapes Pushgateway
Worker healthcheck always passed | Was python -c "sys.exit(0)"; now checks worker_heartbeat Redis key is <90s old
No stale job detection | _run_stale_job_sweep() runs every 60s in admin-api lifespan; marks processing >15min and pending >30min as failed
No Grafana dashboards | 3 dashboards auto-provisioned via grafana/dashboards/ directory mount
Telegram alerts had no dedup | Redis SETNX with 1h TTL per alert key prevents duplicate notifications

6. CI/CD Pipeline

The CI/CD pipeline was built to enforce the project's quality standards automatically. A developer's workflow is: write code, push, and watch the pipeline do everything else.

Continuous Integration (ci.yml)

Push to any branch
  Lint (ruff)
  admin/ruff.toml: line-length=130, E501 ignored
  Tests (pytest)
  Services: postgres:15, redis:7 in GitHub Actions containers
  Threshold: --cov-fail-under=94 (CI fails below 94%; measured coverage is ≥97%)
  Docker build (admin-api + ml-worker)
  Verifies Dockerfiles are valid and dependencies install correctly

Continuous Deployment (deploy.yml)

CI workflow completes with success
       ▼  (workflow_run trigger, not needs:)
  SSH to 185.246.222.107
  git pull origin main
  git diff HEAD~1 HEAD → detect changed services
       ├── admin-api changed?  → docker compose build admin-api && up -d
       ├── ml-worker changed?  → docker compose build ml-worker && up -d
       ├── frontend changed?   → build in container → rsync dist/ to nginx
       └── nginx/compose changed? → docker compose up -d nginx
  curl /health → verify 200
  Deployment complete (~20 seconds)
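
The change-detection step could look like the sketch below. It assumes the list of changed paths comes from git diff --name-only HEAD~1 HEAD and that each service lives under its own top-level directory. Only admin/ is mentioned elsewhere in this report; the other prefixes are hypothetical and would need to match the real repository layout.

```python
# Hypothetical mapping from service name to the path prefixes that affect it.
SERVICE_PATHS = {
    "admin-api": ("admin/",),
    "ml-worker": ("worker/",),
    "frontend": ("frontend/",),
    "nginx": ("nginx/", "docker-compose.yml", "docker-compose.prod.yml"),
}


def changed_services(changed_files):
    """Return the services whose files appear in the list of changed paths."""
    files = list(changed_files)
    hit = []
    for service, prefixes in SERVICE_PATHS.items():
        # str.startswith accepts a tuple, so one call covers every prefix.
        if any(path.startswith(prefixes) for path in files):
            hit.append(service)
    return hit
```

Only the services returned by such a function would be rebuilt and restarted, which is what keeps a typical deployment short.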

Why workflow_run instead of needs:

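needs: only works between jobs in the same workflow file. workflow_run lets deploy.yml wait for ci.yml even though the two live in separate files. Because workflow_run fires whenever the CI run completes, whatever its outcome, the deploy job also has to check that the run's conclusion is success. A deploy.yml triggered directly by push would instead deploy on every push regardless of test results.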

Local Development Tools

  • docker-compose.test.yml — runs the test suite with source volume-mounted; no rebuild required after code changes
  • Pre-push git hook — runs ruff check and pytest locally before any push reaches GitHub
  • scripts/generate-secrets.sh — generates all random secrets (JWT key, DB password, Grafana password, admin password) in one command

7. Test Suite

Testing was treated as a first-class deliverable. The suite grew in lockstep with the codebase using test-driven development practices.

Coverage Methodology

Before writing any test, the exact uncovered lines are identified:

# Report rows start with the module path; the Cover column is field 4
cd admin && pytest --cov=app --cov-report=term-missing -q \
  2>&1 | grep -E "^app/" | sort -k4,4n | head -20

This prevents guessing what to test and ensures every new test file targets real coverage gaps — not imagined ones.

Test File Inventory

File | Tests | Coverage Area
test_auth.py | 40+ | Login, refresh, logout, change-password, brute force lockout
test_clients.py | 25+ | Client CRUD, suspend/activate, quota reset
test_api_keys.py | 20+ | Create, list, revoke API keys
test_plans.py | 15+ | Plan CRUD including delete (was 0% before audit)
test_jobs.py | 15+ | Job list/detail, status isolation between clients
test_stats.py | 20+ | Overview, realtime, per-client breakdown
test_tryon.py | 30+ | Submit, status, domain-check, rate limiting
test_security.py | 20+ | Timing attack, rate limit, domain whitelist, SSRF
test_health.py | 10+ | Health endpoint, readiness probe, error sanitization
test_errors.py | 10+ | Error format consistency across all endpoints
test_dependencies.py | 15+ | API key validation paths, fail-open/closed
test_billing_service.py | 15+ | Generation limit checks, usage log upsert
test_image_processor.py | 31 | Format validation, pipeline, resize, quality ramp, decompression bomb
test_webhooks.py | 23 | CRUD, HMAC signing, delivery, client isolation
test_graceful_shutdown.py | 14 | Signal handlers, request tracking, interrupted job marking
test_multitenancy.py | 23 | Data isolation, API key scoping, plan quota enforcement
test_service_units.py | 20+ | Stateless unit tests: config, auth, billing, queue
test_alerting_service.py | 15+ | Telegram alert formatting, dedup, trigger conditions
test_embed_domain_check.py | 10+ | Domain-check endpoint for embed pre-flight
test_portal.py | 30+ | Portal login, job scoping, usage, webhook CRUD, tenant isolation
test_storage_service.py | 15 | LocalStorageBackend save/delete/list, factory, unknown backend
test_s3_storage.py | 8 | S3StorageBackend via moto mock (skipped if boto3/moto absent)
test_hardening.py | 12 | Job status isolation, ENABLE_DOCS flag, Cache-Control, health sanitization
test_observability.py | 15+ | Health fields, Prometheus counters, webhook dedup, readiness
test_sandbox.py | 32 | Sandbox garment/model CRUD, pagination, public endpoints, auth enforcement

Total: 507+ tests, ≥97% coverage


8. Monitoring & Observability

Metrics Pipeline

Admin API  ──► Prometheus (scrape every 15s)  ──► Grafana
ML Worker  ──► Pushgateway                    ──► Grafana

The admin-api exposes GET /metrics (Prometheus format), which is blocked from public access at the nginx layer. The ML worker, having no HTTP server, pushes metrics to Pushgateway after each job. Prometheus scrapes Pushgateway in the same 15-second cycle.

Four Prometheus counters are defined in admin/app/metrics.py:

Counter | Labels | Meaning
tryon_submitted_total | client_id | Every try-on request accepted
tryon_completed_total | client_id, status | Job completions (success/failure)
tryon_rate_limited_total | reason | Rate limit hits (per_ip or limit_exceeded)
cleanup_files_deleted_total | none | WebP files removed by cleanup service

Grafana Dashboards

Three dashboards are auto-provisioned from grafana/dashboards/ — no manual setup required after deploy:

  1. Platform Overview — request rates, job throughput, error rates
  2. Client Usage — per-client generation counts, quota utilization
  3. Worker Health — ML Worker heartbeat age, job queue depth, fal.ai latency

Health Endpoints

Endpoint | Purpose | Use Case
GET /health | Detailed status (DB, Redis, worker heartbeat, fal.ai key, storage backend) | Monitoring dashboards
GET /readiness | Strict probe: 200 if DB+Redis reachable, 503 otherwise | Load balancer health checks

Worker Heartbeat & Stale Job Detection

The ML Worker writes a worker_heartbeat key to Redis every 30 seconds. The /health endpoint reports the age of this key — if the worker dies, the heartbeat goes stale and appears in the dashboard.

A background task in admin-api (_run_stale_job_sweep()) runs every 60 seconds and marks stuck jobs as failed:

  • processing status for >15 minutes → failed (worker died mid-job)
  • pending status for >30 minutes → failed (BRPOP race: job was popped but never processed)
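
The sweep's decision rule is simple enough to show as a pure function. This is a sketch of the rule only, not of _run_stale_job_sweep() itself: the field names id, status, and updated_at are assumptions, as is the choice of the last-update timestamp as the reference point.

```python
from datetime import datetime, timedelta, timezone

PROCESSING_TIMEOUT = timedelta(minutes=15)  # worker presumed dead mid-job
PENDING_TIMEOUT = timedelta(minutes=30)     # job presumed lost before processing


def find_stale_jobs(jobs, now):
    """Return the IDs of jobs the sweep would mark as failed.

    Each job is a mapping with hypothetical keys "id", "status" and "updated_at".
    """
    stale = []
    for job in jobs:
        age = now - job["updated_at"]
        if job["status"] == "processing" and age > PROCESSING_TIMEOUT:
            stale.append(job["id"])
        elif job["status"] == "pending" and age > PENDING_TIMEOUT:
            stale.append(job["id"])
    return stale


# Example: one job stuck in processing for 16 minutes is reported as stale.
_now = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
_example = [{"id": "job-1", "status": "processing", "updated_at": _now - timedelta(minutes=16)}]
assert find_stale_jobs(_example, _now) == ["job-1"]
```

In a real sweep the same two conditions would more likely be expressed as a single UPDATE with a WHERE clause, so the database does the filtering.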

Telegram Alerts

alerting_service.py sends structured alerts to a Telegram bot. Redis SETNX with 1-hour TTL prevents duplicate notifications for the same alert type. Alert severity levels: info, warning, critical.
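
The deduplication idea can be sketched without a Redis server. DedupStore below is a hypothetical in-memory stand-in for an atomic "set if absent, with expiry" operation (with redis-py this would typically be a set() call using nx=True and ex=3600). The alert: key prefix and the function names are also assumptions.

```python
import time

DEDUP_TTL_SECONDS = 3600  # 1-hour window, as described above


class DedupStore:
    """In-memory stand-in for an atomic set-if-absent with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}  # key -> expiry timestamp

    def set_if_absent(self, key, ttl_seconds):
        now = self._clock()
        expires_at = self._expires.get(key)
        if expires_at is not None and now < expires_at:
            return False  # key still live: someone already claimed this alert
        self._expires[key] = now + ttl_seconds
        return True


def send_alert_once(store, alert_key, message, send):
    """Call send(message) unless the same alert key fired within the TTL."""
    if not store.set_if_absent(f"alert:{alert_key}", DEDUP_TTL_SECONDS):
        return False
    send(message)
    return True
```

Claiming the key before sending means a failed send suppresses retries for the rest of the window. Whether that trade-off is acceptable depends on how critical the alert is.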


9. Deployment Workflow

"git push → 20 seconds → production"

The deployment pipeline is fully automated. Here is the complete flow:

Developer: git push origin main
                ▼ (GitHub Actions: ci.yml)
        1. ruff check admin/
        2. pytest --cov-fail-under=94
           (postgres:15 + redis:7 service containers)
        3. docker build admin-api
        4. docker build ml-worker
                │ (all pass)
                ▼ (GitHub Actions: deploy.yml, workflow_run trigger)
        5. SSH to 185.246.222.107
        6. git pull origin main
        7. diff HEAD~1 HEAD — find changed services
        8. docker compose build <changed services>
        9. docker compose up -d <changed services>
        10. curl https://ziex-tryon.com/health → 200
        Production updated ✓

Deployment Scripts

All scripts live in scripts/ and are designed to be idempotent and auditable.

generate-secrets.sh — run once before first deployment:

bash scripts/generate-secrets.sh
# Generates: JWT_SECRET_KEY, POSTGRES_PASSWORD,
#            GRAFANA_ADMIN_PASSWORD, FIRST_ADMIN_PASSWORD
# Outputs ready-to-paste .env entries

deploy.sh --prod — full production deployment:

bash scripts/deploy.sh --prod
# 1. Pre-flight: checks DOMAIN set, TLS certs exist, .env complete
# 2. Build: docker compose -f ... -f docker-compose.prod.yml build
# 3. Ordered startup: postgres → redis → admin-api → ml-worker → nginx → monitoring
# 4. Health verification: polls /health until 200 or timeout
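
The health-verification step (poll /health until it returns 200 or a timeout expires) follows a common pattern. The sketch below shows that pattern in Python, although deploy.sh itself is a shell script. The function names, the 60-second timeout, and the 2-second interval are assumptions, not values taken from the script.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def wait_until_healthy(check, timeout_s=60.0, interval_s=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns 200; give up once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check() == 200:
                return True
        except OSError:
            pass  # connection refused or DNS failure while containers start
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_until_healthy(lambda: http_status("https://ziex-tryon.com/health")) would block until the API answers or the timeout passes, and its boolean result can decide the script's exit code.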

health-check.sh — post-deploy verification:

bash scripts/health-check.sh
# Checks: API /health 200, embed.js accessible, frontend loads
# Catches: accidentally exposed ports 8000, 9090, 3000

backup.sh — PostgreSQL backup with manifest:

bash scripts/backup.sh
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.dump
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.manifest.json
# Add to crontab for automated nightly backups

restore.sh — restore from dump:

bash scripts/restore.sh --drop-existing --yes backup_20260523_143022.dump
# Drops existing DB, restores from dump file


10. Application Features

Admin Panel (admin.ziex-tryon.com)

The admin panel is a React SPA providing full operational control:

  • Client management — create, edit, suspend, activate clients; assign plans; reset monthly usage
  • API key management — generate and revoke API keys per client; view key prefixes and creation dates
  • Plan management — create billing plans with monthly generation limits; assign to clients
  • Job dashboard — view all try-on jobs with filtering by client, status, date range
  • Statistics — platform overview (total clients, jobs, revenue); per-client usage breakdown; realtime metrics

Client Portal (app.ziex-tryon.com)

A separate React SPA (different Zustand store, different Axios instance, different JWT scope) for clients to self-serve:

  • Dashboard — current plan, usage this month, remaining generations
  • Jobs — view own try-on job history with results
  • API Keys — view active keys (create/revoke coming in next sprint)
  • Webhooks — register webhook endpoints to receive job.completed and job.failed events
  • Usage — monthly generation history

Embeddable Widget (tryon-embed.js)

A zero-dependency vanilla JavaScript widget that clients drop onto their product pages:

<script src="https://api.ziex-tryon.com/embed/tryon-embed.js"
        data-api-key="tryon_xxxxxxxxxxxx"></script>
  • Renders inside a Shadow DOM — fully isolated from the host page's CSS
  • FAB → modal interaction pattern (floating action button opens full-screen overlay)
  • SPA-aware: MutationObserver detects page changes; re-initializes when the product URL changes
  • Auto-detects product images from <img> tags matching known e-commerce patterns
  • Configurable: TryOnWidget.init({ apiUrl, container, theme })

Sandbox (sandbox.ziex-tryon.com)

A sandboxed environment for testing API integration without affecting production data:

  • Pre-loaded garment and model image library
  • Full try-on API available (rate-limited separately)
  • Returns realistic responses including result images

11. Known Limitations & Deferred Items

These items were explicitly reviewed and deferred with written justification — they are not gaps in awareness, but conscious product decisions:

Forgot Password

No POST /auth/forgot-password endpoint exists. Admin recovery requires direct DB access. This is intentional: at the current stage (single technical admin), email provider integration is out of scope. Recovery path is documented in the ops runbook.

change-password Rate Limiting

POST /auth/change-password has no rate limit. The attack surface is bounded to the 15-minute access token window. Rate limiting is on the next-sprint backlog.

S3 ACL Not Set

S3StorageBackend.put_object() does not set ACL="public-read". For AWS S3, a public bucket policy must be configured separately. Cloudflare R2 (the production storage backend) does not use ACLs — it uses bucket-level public access settings.

Worker Shutdown During Polling

Docker's stop_grace_period: 35s is shorter than fal.ai's 600s polling ceiling. A job still mid-inference when the grace period expires after SIGTERM is killed along with the worker. This is tracked as a known issue; the planned fix is to raise stop_grace_period to 610s.


12. Scope Metrics

Metric | Value
Lines of Python backend code | ~6,000
Lines of TypeScript frontend code | ~4,000
Lines of tests | ~5,000
Test files | 26
Automated tests | 507+
API endpoints | 35+
Docker services | 8
nginx server blocks | 5+
Security findings identified | 41
Security findings resolved | 38
Consciously deferred | 3
Code coverage | ≥97%
Audit phases | 2
CI/CD automation | Full (lint + test + build + deploy)
Time from git push to production | ~20 seconds
Subdomains configured | 6
Grafana dashboards | 3
Alembic migrations | 2

13. Conclusion

The TryOn SaaS platform was built with production-readiness as the primary constraint, not speed. Every design decision — from the choice of PyJWT over python-jose (active maintenance), to the bcrypt-direct approach over passlib (5.x compatibility), to the workflow_run CI/CD trigger (correct cross-workflow gating) — was made deliberately and documented.

The result is a platform that:

  • Handles multi-tenancy correctly — data isolation between clients is enforced at every layer (DB queries, Redis keys, JWT scopes)
  • Degrades gracefully — stale job detection, graceful shutdown with active-request draining, worker heartbeat monitoring
  • Is observable — every significant event is logged via structlog with a request_id trace that flows from nginx through the API to the ML worker to fal.ai
  • Is auditable — 507+ tests, 41-finding security audit, documented deferred items with written justification
  • Deploys safely: CI gates every deploy, a health check verifies every rollout, and pre-flight scripts validate secrets before touching containers