
Technical Reference

This reference covers the internal architecture, API surface, data models, security design, and development workflow of the TryOn SaaS platform. Audience: engineers onboarding to the codebase.

Source of truth

CLAUDE.md in the repository root is the authoritative reference, auto-maintained alongside code changes.


Overview

TryOn SaaS is a B2B virtual clothing try-on platform. Client businesses embed a JavaScript widget on their product pages; the widget submits garment and model images to the Admin API, which queues an AI inference job. The ML Worker picks up jobs from Redis and calls fal.ai FASHN v1.5 to generate try-on images. Results are stored in the database and returned via polling or webhooks.

The platform is multi-tenant: each client has an account, a plan with generation limits, API keys, and optionally registered webhooks. An admin UI manages clients and monitors usage. A client portal lets clients track their own jobs and manage their integration.


Architecture

Client Website → Embed Script (tryon-embed.js) → Admin API (:8000) → Redis Queue → ML Worker → fal.ai FASHN v1.5
                      ↓                                ↓
                 PostgreSQL 15                    Redis 7 (AOF)
          Prometheus (:9090) → Grafana (:3000)
          ML Worker → Pushgateway (:9091) → Prometheus
          Nginx (:80/:443) → Frontend (:5173) + Embed Static (/embed/)
          Telegram Bot ← AlertManager (health + worker alerts)

Data flow for a try-on request

  1. Client's website loads tryon-embed.js via a <script> tag. The widget renders a floating action button.
  2. User clicks the widget and selects a garment. The widget encodes both images as base64 and sends them to POST /api/v1/tryon with an X-API-Key header.
  3. Admin API validates the API key, checks the client's plan quota, processes images (resize, format normalize, strip EXIF), stores them, and enqueues a job in Redis.
  4. ML Worker receives the job via BRPOP, calls fal.ai, and polls for completion (up to 600 seconds).
  5. On completion the worker writes the result URL to the database and fires registered webhooks.
  6. The widget polls GET /api/v1/tryon/status/{id} until it gets a result URL, then displays the try-on image.
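Steps 2 and 6 can be sketched from the caller's side. This is a minimal illustration, not the widget's code: the `get` callable is a hypothetical stand-in for an HTTP client, and the response field names (`status`, `result_url`) are assumptions. Only the paths, the X-API-Key header, and the request fields come from this reference.

```python
import base64
import time
from typing import Callable, Optional


def build_tryon_payload(model_bytes: bytes, garment_bytes: bytes,
                        category: str = "tops", mode: str = "balanced") -> dict:
    """Encode both images as base64 strings for POST /api/v1/tryon."""
    return {
        "model_image": base64.b64encode(model_bytes).decode("ascii"),
        "garment_image": base64.b64encode(garment_bytes).decode("ascii"),
        "category": category,
        "mode": mode,
    }


def poll_until_done(get: Callable[[str], dict], job_id: str,
                    interval_s: float = 2.0, max_attempts: int = 300,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll the status endpoint until the job reaches a final state.

    Returns the result URL on completion, or None if the job failed,
    was interrupted, or did not finish within max_attempts polls.
    """
    for _ in range(max_attempts):
        body = get(f"/api/v1/tryon/status/{job_id}")
        status = body.get("status")  # assumed field name
        if status == "completed":
            return body.get("result_url")  # assumed field name
        if status in ("failed", "interrupted"):
            return None
        sleep(interval_s)
    return None
```

Passing `get` and `sleep` as parameters keeps the loop testable without network access; a real caller would wrap its HTTP client and attach the X-API-Key header there.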

Infrastructure notes

Nginx reverse-proxies all public traffic. The Admin API and frontend are internal Docker services. Redis and PostgreSQL are internal. Prometheus scrapes admin-api:8000/metrics every 15 seconds; the ML worker pushes metrics to Pushgateway. Grafana reads from Prometheus and serves three auto-provisioned dashboards.


Services

Service | Port | Technology | Purpose
admin-api | 8000 (internal) | FastAPI + SQLAlchemy 2.0 (async) + Pydantic v2 | Core API: auth, clients, jobs, billing, webhooks
ml-worker | internal | Python asyncio + BRPOP loop + fal-client | AI job processor
postgres | internal | PostgreSQL 15 | Primary database
redis | internal | Redis 7 (AOF) | Job queue + rate limiting + refresh token dedup
frontend | internal | React 18 + Vite + TypeScript + shadcn/ui | Admin UI + client portal (proxied via nginx)
nginx | 80/443 | Nginx | Reverse proxy, TLS, rate limiting, security headers
prometheus | 9090 (internal) | Prometheus | Metrics scraping and storage
grafana | 3000 (internal) | Grafana | Dashboards (auto-provisioned)
pushgateway | 9091 (internal) | Prometheus Pushgateway | Receives metrics from ML worker

Note

All services have restart: unless-stopped, logging: json-file (50 MB max / 5 files), and resource limits.


API Reference

Authentication model

  • Access tokens: 15-minute lifetime, HS256 signed with JWT_SECRET_KEY
  • Refresh tokens: 30-day lifetime, rotated on every /auth/refresh. Stored as bcrypt(sha256(token)). token_prefix = first 16 chars for O(1) lookup.
  • Brute force: 5 failed logins → IP + email locked for 15 min
  • Timing attack: dummy_password_check() equalizes response time for missing users
  • API Keys: tryon_ prefix + 32 random chars. key_prefix = first 20 chars.
  • Portal JWT: scope="portal", separate from admin JWTs
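The API key format above (tryon_ prefix, 32 random characters, 20-character lookup prefix) can be illustrated with the standard library. This is a sketch under assumptions: the function name and the alphanumeric alphabet are hypothetical, and the hashing step applied before storage is omitted.

```python
import secrets
import string

KEY_PREFIX = "tryon_"
RANDOM_LEN = 32          # random characters after the prefix
LOOKUP_PREFIX_LEN = 20   # length of key_prefix stored for lookup
_ALPHABET = string.ascii_letters + string.digits  # assumed alphabet


def generate_api_key() -> tuple[str, str]:
    """Return (raw_key, key_prefix); the raw key is shown to the client once."""
    random_part = "".join(secrets.choice(_ALPHABET) for _ in range(RANDOM_LEN))
    raw_key = KEY_PREFIX + random_part
    return raw_key, raw_key[:LOOKUP_PREFIX_LEN]
```

Because `tryon_` takes 6 of the 20 prefix characters, key_prefix exposes 14 of the 32 random characters, which narrows a lookup to a candidate row before the full hash comparison.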

PyJWT, not python-jose

The project uses PyJWT>=2.8.0. Import as import jwt, catch jwt.PyJWTError. Do NOT add python-jose.

Admin endpoints (require admin JWT)

Group | Method + Path | Description
Auth | POST /api/v1/auth/login | Access + refresh tokens
Auth | POST /api/v1/auth/refresh | Rotate refresh token
Auth | POST /api/v1/auth/logout | Invalidate refresh token
Auth | GET /api/v1/auth/me | Current user info
Auth | POST /api/v1/auth/change-password | Change password
Clients | GET/POST /api/v1/clients | List / create clients
Clients | GET/PATCH/DELETE /api/v1/clients/{id} | Client detail / update / delete
Clients | POST /api/v1/clients/{id}/suspend | Suspend client
Clients | POST /api/v1/clients/{id}/activate | Reactivate client
Clients | POST /api/v1/clients/{id}/reset-usage | Reset monthly usage
API Keys | GET /api/v1/clients/{id}/keys | List keys (paginated envelope)
API Keys | POST /api/v1/clients/{id}/keys | Create key (raw value shown once)
API Keys | DELETE /api/v1/clients/{id}/keys/{key_id} | Revoke key
Plans | GET/POST /api/v1/plans | List / create plans
Plans | GET/PATCH/DELETE /api/v1/plans/{id} | Plan CRUD
Jobs | GET /api/v1/jobs | All jobs (supports date_from, date_to)
Jobs | GET /api/v1/jobs/{id} | Job detail
Stats | GET /api/v1/stats/overview | Platform-wide stats
Stats | GET /api/v1/stats/jobs | Job stats
Stats | GET /api/v1/stats/clients/{id} | Per-client stats
Stats | GET /api/v1/stats/realtime | Live stats from Redis

API keys list response shape

GET /api/v1/clients/{id}/keys returns a paginated envelope: {"items": [...], "total": N, "limit": N, "offset": N}. Iterate .items, not the root object.
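A consumer of this envelope might walk all pages as sketched below. The `fetch_page` callable is a hypothetical stand-in for the HTTP call, and passing `limit` and `offset` as request parameters is an assumption inferred from the envelope fields.

```python
from typing import Callable, Iterator


def iter_api_keys(fetch_page: Callable[[int, int], dict],
                  limit: int = 50) -> Iterator[dict]:
    """Yield every key by reading .items from each page of the envelope."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["items"]  # iterate .items, never the root object
        yield from items
        offset += len(items)
        # Stop on an empty page or once all reported items were seen.
        if not items or offset >= page["total"]:
            break
```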

Plan creation requires slug

PlanCreate requires slug (URL-friendly identifier) and generations_limit (plural). Frontend should auto-generate slug from plan name. Wrong field names silently cause HTTP 422.
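The slug derivation could look like the sketch below. It is written in Python only for illustration (the frontend itself is TypeScript); `slugify` and `build_plan_create` are hypothetical names, and any PlanCreate field other than slug and generations_limit is an assumption.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def build_plan_create(name: str, generations_limit: int) -> dict:
    """Build a request body with the two field names called out above."""
    return {
        "name": name,  # assumed field
        "slug": slugify(name),
        "generations_limit": generations_limit,  # plural, as required
    }
```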

Portal endpoints (client API key → portal JWT)

Method + Path | Description
POST /api/v1/portal/auth/login | Login with API key → portal JWT
GET /api/v1/portal/me | Client info
GET /api/v1/portal/jobs | Client's jobs (paginated, page ≥ 1, page_size 1–100)
GET /api/v1/portal/jobs/{id} | Job detail (client-scoped)
GET /api/v1/portal/usage | Usage vs plan quota
GET/POST /api/v1/portal/webhooks | List / create webhooks
DELETE /api/v1/portal/webhooks/{id} | Delete webhook
GET /api/v1/portal/api-keys | List API keys

Portal webhook event names

Events must be job.completed or job.failed — these are the only values in _VALID_EVENTS. Using tryon.completed or similar causes a silent HTTP 422.

Public endpoints (X-API-Key header)

Method + Path | Description
POST /api/v1/tryon | Submit try-on (base64 images, max 10 MB each)
GET /api/v1/tryon/status/{id} | Poll job status (client-scoped)
GET /api/v1/tryon/domain-check | Embed script pre-flight

Try-on request body:

{
  "model_image": "<base64 jpeg/png>",
  "garment_image": "<base64 jpeg/png>",
  "category": "tops | bottoms | dresses | outerwear",
  "mode": "balanced | quality",
  "webhook_url": "<optional HTTPS URL>"
}
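The 10 MB per-image cap and the max_length=14_000_000 string limit listed under Security Design are related by base64 expansion: every 3 input bytes become 4 output characters. The arithmetic can be checked as below, assuming "10 MB" means 10 MiB; the constant and function names are illustrative.

```python
MAX_IMAGE_BYTES = 10 * 1024 * 1024   # assumption: "10 MB" read as 10 MiB
FIELD_MAX_LENGTH = 14_000_000        # string limit from the Security Design table


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


# A maximum-size image encodes to 13,981,016 characters,
# which fits under the 14,000,000-character field limit.
encoded_max = base64_length(MAX_IMAGE_BYTES)
fits = encoded_max <= FIELD_MAX_LENGTH
```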

Category mapping to fal.ai

Internal categories are translated before being sent to fal.ai: dresses → full-body, outerwear → auto, tops/bottoms pass through unchanged.
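The translation described above amounts to a small lookup table. A minimal sketch, with hypothetical names for the mapping and helper:

```python
# Internal category -> value sent to fal.ai, as described above.
FAL_CATEGORY_MAP = {
    "tops": "tops",          # passes through unchanged
    "bottoms": "bottoms",    # passes through unchanged
    "dresses": "full-body",
    "outerwear": "auto",
}


def to_fal_category(category: str) -> str:
    """Translate an internal category; reject anything outside the four values."""
    try:
        return FAL_CATEGORY_MAP[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```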

System endpoints

Path | Auth | Description
GET /health | None | Detailed liveness (fal.ai key status, storage backend, media_base_url_public)
GET /readiness | None | Strict K8s readiness probe; returns 503 if DB or Redis is down
GET /metrics | Blocked at nginx (403) | Prometheus metrics

Use /readiness for load balancer probes

/health is for monitoring dashboards. /readiness is the strict upcheck — use it for load balancer health gates.


Configuration Reference

Variable | Required | Description
POSTGRES_PASSWORD | Yes | PostgreSQL password
JWT_SECRET_KEY | Yes | ≥32 chars; generate with openssl rand -hex 32
FAL_API_KEY | Yes | fal.ai key. Format: {uuid}:{32-char-hex}
FIRST_ADMIN_EMAIL | Yes | Seeded admin email
FIRST_ADMIN_PASSWORD | Yes | Seeded admin password
GRAFANA_ADMIN_PASSWORD | Yes | Grafana admin password
DOMAIN | Prod only | Used for MEDIA_BASE_URL in production compose
MEDIA_BASE_URL | Prod | Public URL prefix for uploads; must be publicly reachable by fal.ai
STORAGE_BACKEND | No | local (default) or s3
S3_BUCKET | If S3 | S3/R2/MinIO bucket name
S3_REGION | If S3 | Default auto (for R2)
S3_ENDPOINT_URL | If S3 | Empty for AWS; set for R2/MinIO
S3_ACCESS_KEY | If S3 | S3 access key
S3_SECRET_KEY | If S3 | S3 secret key
S3_PUBLIC_URL | If S3 | CDN prefix without trailing slash
ENABLE_DOCS | No | false disables /docs in production
UPLOAD_RETENTION_HOURS | No | Media cleanup retention in hours (default 48)
CORS_ORIGINS | No | JSON array, e.g. ["https://a.com"]; NOT comma-separated

CORS_ORIGINS must be a JSON array

pydantic-settings cannot parse a comma-separated string for List[str]. Always use JSON array format in .env:
CORS_ORIGINS=["https://a.com","https://b.com"]
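The difference between the two formats can be reproduced with the standard library's JSON decoder, which is the kind of parsing a list-typed setting goes through. This sketch does not call pydantic-settings; `parse_cors_origins` is a hypothetical helper for illustration only.

```python
import json


def parse_cors_origins(raw: str) -> list[str]:
    """Decode a JSON array of origin strings; raise ValueError otherwise."""
    value = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return value
```

A comma-separated value such as `https://a.com,https://b.com` is not valid JSON, so decoding fails before any origin is read.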

MEDIA_BASE_URL must be publicly reachable

If MEDIA_BASE_URL contains localhost or 127.0.0.1, fal.ai cannot fetch uploaded images. The startup log emits media_base_url_localhost_warning and each affected job is logged. Set to your public domain in production.
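A check of this kind could be written as below. It is a sketch, not the startup code: the helper name is hypothetical, and it compares the parsed hostname rather than searching the whole string, which is an assumption about how strict the real check is.

```python
from urllib.parse import urlparse

_LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def is_local_media_base_url(url: str) -> bool:
    """True when the URL's host is one fal.ai could never reach."""
    host = urlparse(url).hostname or ""
    return host in _LOCAL_HOSTS
```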


Data Models

Table | Key Columns | Notes
users | id, email, password_hash, is_admin | Admin users only
clients | id, name, plan_id, status, allowed_domains | B2B clients
plans | id, slug, name, generations_limit, price_monthly | slug required at creation
api_keys | id, client_id, key_prefix, key_hash, is_active | key_prefix = first 20 chars
jobs | id, client_id, status, model_image_url, garment_url, result_url | Status: pending → processing → completed/failed/interrupted
refresh_tokens | id, user_id, token_prefix, token_hash, expires_at | token_prefix = first 16 chars
usage_logs | client_id, month, generation_count, avg_latency_ms | Monthly per-client
webhook_endpoints | id, client_id, url, secret, events, is_active | Events: job.completed, job.failed
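The job status progression in the jobs row can be modelled as a small state machine. This sketch reads the arrows literally (pending leads only to processing; processing leads to one of three final states). Whether the real code allows other transitions, such as pending directly to failed, is not stated here, so treat the transition table as an assumption.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"


# Assumed transitions, read directly from the status notation above.
_ALLOWED = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.INTERRUPTED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.INTERRUPTED: set(),
}


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """True when moving from current to new follows the assumed progression."""
    return new in _ALLOWED[current]
```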

Security Design

Threat | Protection
Brute force | 5-strike lockout (Redis), 15-min window on IP + email
Timing attacks | dummy_password_check() equalizes response time for missing users
SSRF via webhook URLs | Blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolved at registration
Cross-tenant job reads | Status lookup filtered by client_id matching the API key
Rate limit abuse | Per-key hourly + per-IP 30/min on tryon submit
Secrets in logs | structlog redacts sensitive fields
Large images | nginx 12m + Pydantic max_length=14_000_000
MIME spoofing | Magic bytes validation
Decompression bombs | MAX_IMAGE_PIXELS = 4096×4096
Cache poisoning | Cache-Control: no-store, private on all /api/ responses
Metrics leakage | /metrics: deny all; return 403 at nginx
Internal port exposure | Production compose closes ports 8000/9090/3000
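The address classes named in the SSRF row can be checked with the standard ipaddress module. This is a sketch of the range test only, with a hypothetical function name; it assumes the hostname has already been resolved and does not reproduce the project's validator.

```python
import ipaddress

# Ranges named in the table: RFC 1918, loopback, link-local, CGNAT.
_BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
        "100.64.0.0/10",                                   # CGNAT (RFC 6598)
    )
]


def is_blocked_address(addr: str) -> bool:
    """True when a resolved IP falls in any blocked range."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the v4 ranges.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in _BLOCKED_NETWORKS)
```

Every address returned by DNS resolution would need to pass this test, since one hostname can resolve to several IPs.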

Security headers applied with the always flag (also on error responses): X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, full CSP.

Domain whitelist only works in browsers

validate_api_key() checks Origin/Referer headers. Server-to-server API calls (curl, SDKs) don't send these headers — the domain check is skipped. If an API key is leaked, it can be used from any environment. Clients must keep API keys secret.
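The gap can be made concrete with a simplified model of the check. This is not the body of validate_api_key(): the function name is hypothetical, and exact hostname matching against allowed_domains is an assumption.

```python
from typing import Optional, Sequence
from urllib.parse import urlparse


def domain_check_passes(origin: Optional[str], referer: Optional[str],
                        allowed_domains: Sequence[str]) -> bool:
    """Model of a whitelist check that only runs when a browser header exists."""
    header = origin or referer
    if not header:
        # No Origin/Referer (curl, SDKs): the check is skipped entirely.
        return True
    host = urlparse(header).hostname
    return host in allowed_domains
```

The first branch is the reason a leaked key works from any server-side environment.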


Image Processing Pipeline

Images go through a 9-step pipeline in admin/app/services/image_processor.py:

  1. Magic bytes validation (JPEG/PNG only)
  2. HEIC → WebP conversion
  3. Alpha channel flatten (white background)
  4. EXIF orientation normalize
  5. EXIF metadata strip
  6. Palette → RGB/RGBA conversion
  7. Resize to max 1500×1500 px
  8. WebP re-encode (quality 85→55 adaptive ramp)
  9. Minimum 256×256 px check
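Step 1 can be illustrated with the well-known JPEG and PNG file signatures. This is a sketch with a hypothetical function name, not the code in image_processor.py.

```python
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"        # JPEG start-of-image marker
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # 8-byte PNG signature


def sniff_image_type(data: bytes) -> Optional[str]:
    """Identify the format from leading bytes, ignoring any declared MIME type."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None
```

Checking the bytes rather than a filename or Content-Type is what defeats MIME spoofing: a renamed file still carries its original signature.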

Decompression bomb guard

PILImage.MAX_IMAGE_PIXELS is set to 4096 * 4096 (16,777,216 pixels, about 16.8M) at module level. Images exceeding this limit are rejected with HTTP 422 before any decode.


Monitoring

Prometheus metrics

Metric | Labels | Description
tryon_submitted_total | client_id | Jobs submitted
tryon_completed_total | client_id, status | Jobs finished
tryon_rate_limited_total | reason (per_ip, limit_exceeded) | Rate limit hits
cleanup_files_deleted_total | (none listed) | Files deleted by cleanup service

Grafana dashboards

Three dashboards are auto-provisioned at startup:

  • Platform Overview: submission rates, completion rates, error rates
  • ML Worker: job processing latency, fal.ai call duration, heartbeat status
  • Infrastructure: CPU, memory, disk, Redis and Postgres connection counts

Alerting

Telegram alerts fire for:

  • Worker heartbeat stale > 5 minutes
  • Job failure rate > 5% in 10 minutes
  • DB or Redis unreachable


Testing

507+ tests, ≥97% coverage. Tests use an in-memory database and fakeredis.

cd admin
pytest --cov=app --cov-report=term-missing -q

Always check term-missing before writing tests

Run pytest --cov=app --cov-report=term-missing -q 2>&1 | grep -E "^\s+app/" | sort -k4 -t% -n | head -20 first. Never guess what's uncovered — read the exact missing lines.

Key test patterns:

  • validate_api_key calls redis.pipeline() — tests must provide a working pipeline or override the dependency
  • SSRF validator calls socket.getaddrinfo() — mock in unit tests
  • Webhook URL fixtures use https://example.com/... (resolves in CI; bare new.example.com does not)
  • FastAPI returns 422 (not 401) when a required header is absent entirely; 401 only when the token is present but invalid
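The socket.getaddrinfo() pattern can be sketched with unittest.mock. The `resolve_ips` helper is a hypothetical stand-in for the SSRF validator's resolution step; the test shows DNS being replaced so the outcome does not depend on the network.

```python
import socket
from unittest.mock import patch


def resolve_ips(host: str) -> list[str]:
    """Hypothetical helper: unique IPs a hostname resolves to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def test_resolve_ips_uses_mocked_dns() -> None:
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples.
    fake = [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.5", 0))]
    with patch("socket.getaddrinfo", return_value=fake) as mocked:
        assert resolve_ips("internal.example.com") == ["10.0.0.5"]
    mocked.assert_called_once_with("internal.example.com", None)
```

Patching works here because the helper looks up `socket.getaddrinfo` at call time; code that imports the function by name (`from socket import getaddrinfo`) would need the patch applied to its own module instead.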

Development Workflow

# Clone and start
git clone <repo> && cd tryon-saas
cp .env.example .env   # fill in required values
docker-compose up -d

# View logs
docker-compose logs -f admin-api

# Run tests
cd admin && pytest -x --tb=short

# Apply DB migrations manually (auto-migrate is NOT enabled)
docker-compose exec admin-api alembic upgrade head

# Lint
cd admin && ruff check app/   # run from admin/ dir, NOT from repo root

# Rebuild after frontend changes
docker-compose build frontend && docker-compose up -d frontend

Migrations do not run automatically

main.py calls Base.metadata.create_all (creates tables from ORM models for fresh installs) but does NOT run alembic upgrade head. New columns from migrations require a manual alembic upgrade head.

Frontend is baked into the Docker image

After changing frontend source, you must run docker-compose build frontend before up -d. Old images cause runtime errors when the API response shape has changed but the container still runs the old frontend.


Known Constraints

Constraint | Detail
CORS_ORIGINS | Must be a JSON array, not comma-separated
Frontend deploys | Rebuild Docker image after every source change
API keys list | Paginated envelope {items, total, limit, offset}; iterate .items
bcrypt | 5.x is incompatible with passlib; use bcrypt directly
JWT library | PyJWT only, not python-jose
change-password rate limiting | Not implemented (known gap)
S3 ACL | Not set by default; configure public bucket policy separately
Portal webhook events | Must be job.completed or job.failed
avg_latency_ms | Not recalculated on CONFLICT; shows first job's value
fal.ai polling timeout | 600s ceiling; a worker shut down mid-poll is killed by Docker after the 35s grace period