Technical Reference

This reference covers the internal architecture, API surface, data models, security design, and development workflow of the TryOn SaaS platform. Audience: engineers onboarding to the codebase.

Source of truth

CLAUDE.md in the repository root is the authoritative reference, auto-maintained alongside code changes.

Overview

TryOn SaaS is a B2B virtual clothing try-on platform. Client businesses embed a JavaScript widget on their product pages; the widget submits garment and model images to the Admin API, which queues an AI inference job. The ML Worker picks up jobs from Redis and calls fal.ai FASHN v1.5 to generate try-on images. Results are stored in the database and returned via polling or webhooks.

The platform is multi-tenant: each client has an account, a plan with generation limits, API keys, and optionally registered webhooks. An admin UI manages clients and monitors usage. A client portal lets clients track their own jobs and manage their integration.

Architecture

Client Website → Embed Script (tryon-embed.js) → Admin API (:8000) → Redis Queue → ML Worker → fal.ai FASHN v1.5
                      ↓                                ↓
                 PostgreSQL 15                    Redis 7 (AOF)
                      ↑
          Prometheus (:9090) → Grafana (:3000)
          ML Worker → Pushgateway (:9091) → Prometheus
          Nginx (:80/:443) → Frontend (:5173) + Embed Static (/embed/)
          Telegram Bot ← AlertManager (health + worker alerts)

Data flow for a try-on request

Client's website loads tryon-embed.js via a <script> tag. The widget renders a floating action button.
The user clicks the widget and selects a garment. The widget encodes the model and garment images as base64 and sends them to POST /api/v1/tryon with an X-API-Key header.
Admin API validates the API key, checks the client's plan quota, processes images (resize, format normalize, strip EXIF), stores them, and enqueues a job in Redis.
ML Worker receives the job via BRPOP, calls fal.ai, and polls for completion (up to 600 seconds).
On completion the worker writes the result URL to the database and fires registered webhooks.
The widget polls GET /api/v1/tryon/status/{id} until it gets a result URL, then displays the try-on image.
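
The polling step can be sketched as a small loop. This is an illustration only, not the widget's code (the widget is JavaScript). The `fetch_status` callable, the `status` and `result_url` response field names, the 2-second interval, and the 600-second ceiling on the client side are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a job until it reaches a terminal state.

    fetch_status is a hypothetical callable wrapping
    GET /api/v1/tryon/status/{id}; it returns the decoded JSON body.
    Returns the result URL on success, None on failure or timeout.
    """
    waited = 0.0
    while waited <= timeout_s:
        body = fetch_status(job_id)
        status = body.get("status")
        if status == "completed":
            return body.get("result_url")
        if status in ("failed", "interrupted"):
            return None
        # Still pending or processing: wait and try again.
        sleep(interval_s)
        waited += interval_s
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real delays.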

Infrastructure notes

Nginx reverse-proxies all public traffic. The Admin API and frontend are internal Docker services. Redis and PostgreSQL are internal. Prometheus scrapes admin-api:8000/metrics every 15 seconds; the ML worker pushes metrics to Pushgateway. Grafana reads from Prometheus and serves three auto-provisioned dashboards.

Services

Service	Port	Technology	Purpose
admin-api	8000 (internal)	FastAPI + SQLAlchemy 2.0 (async) + Pydantic v2	Core API: auth, clients, jobs, billing, webhooks
ml-worker	internal	Python asyncio + BRPOP loop + fal-client	AI job processor
postgres	internal	PostgreSQL 15	Primary database
redis	internal	Redis 7 (AOF)	Job queue + rate limiting + refresh token dedup
frontend	internal	React 18 + Vite + TypeScript + shadcn/ui	Admin UI + client portal (proxied via nginx)
nginx	80/443	Nginx	Reverse proxy, TLS, rate limiting, security headers
prometheus	9090 (internal)	Prometheus	Metrics scraping and storage
grafana	3000 (internal)	Grafana	Dashboards (auto-provisioned)
pushgateway	9091 (internal)	Prometheus Pushgateway	Receives metrics from ML worker

Note

All services have restart: unless-stopped, logging: json-file (50 MB max / 5 files), and resource limits.

API Reference

Authentication model

Access tokens — 15-minute lifetime, HS256 signed with JWT_SECRET_KEY
Refresh tokens — 30-day lifetime, rotated on every /auth/refresh. Stored as bcrypt(sha256(token)). token_prefix = first 16 chars for O(1) lookup.
Brute force — 5 failed logins → IP + email locked 15 min
Timing attack — dummy_password_check() equalizes response time for missing users
API Keys — tryon_ prefix + 32 random chars. key_prefix = first 20 chars.
Portal JWT — scope="portal", separate from admin JWTs
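
The key and token layouts above can be illustrated with the standard library. This is a sketch, not the project's code: the alphanumeric character set is an assumption, the function names are hypothetical, and the bcrypt step applied on top of the SHA-256 digest is left out so the example has no third-party dependencies.

```python
import hashlib
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed character set


def generate_api_key() -> tuple[str, str]:
    """Return (raw_key, key_prefix) following the documented layout."""
    raw_key = "tryon_" + "".join(secrets.choice(ALPHABET) for _ in range(32))
    return raw_key, raw_key[:20]  # key_prefix = first 20 chars


def refresh_token_lookup_fields(token: str) -> tuple[str, str]:
    """Return (token_prefix, sha256_hex) for a refresh token.

    The platform stores bcrypt(sha256(token)); only the SHA-256 step
    is shown here. token_prefix is the first 16 chars of the raw token.
    """
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token[:16], digest
```

The prefix narrows the lookup to a single candidate row, after which the stored hash is verified.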

PyJWT, not python-jose

The project uses PyJWT>=2.8.0. Import as import jwt, catch jwt.PyJWTError. Do NOT add python-jose.

Admin endpoints (require admin JWT)

Group	Method + Path	Description
Auth	POST /api/v1/auth/login	Access + refresh tokens
Auth	POST /api/v1/auth/refresh	Rotate refresh token
Auth	POST /api/v1/auth/logout	Invalidate refresh token
Auth	GET /api/v1/auth/me	Current user info
Auth	POST /api/v1/auth/change-password	Change password
Clients	GET/POST /api/v1/clients	List / create clients
Clients	GET/PATCH/DELETE /api/v1/clients/{id}	Client detail / update / delete
Clients	POST /api/v1/clients/{id}/suspend	Suspend client
Clients	POST /api/v1/clients/{id}/activate	Reactivate client
Clients	POST /api/v1/clients/{id}/reset-usage	Reset monthly usage
API Keys	GET /api/v1/clients/{id}/keys	List keys (paginated envelope)
API Keys	POST /api/v1/clients/{id}/keys	Create key (raw value shown once)
API Keys	DELETE /api/v1/clients/{id}/keys/{key_id}	Revoke key
Plans	GET/POST /api/v1/plans	List / create plans
Plans	GET/PATCH/DELETE /api/v1/plans/{id}	Plan CRUD
Jobs	GET /api/v1/jobs	All jobs (supports date_from, date_to)
Jobs	GET /api/v1/jobs/{id}	Job detail
Stats	GET /api/v1/stats/overview	Platform-wide stats
Stats	GET /api/v1/stats/jobs	Job stats
Stats	GET /api/v1/stats/clients/{id}	Per-client stats
Stats	GET /api/v1/stats/realtime	Live stats from Redis

API keys list response shape

GET /api/v1/clients/{id}/keys returns a paginated envelope: {"items": [...], "total": N, "limit": N, "offset": N}. Iterate .items, not the root object.
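
A minimal sketch of walking the envelope page by page. `fetch_page` is a hypothetical wrapper around the HTTP call, and passing `limit` and `offset` as request parameters is an assumption; the section only documents them as response fields.

```python
from typing import Callable, Iterator


def iter_api_keys(
    fetch_page: Callable[[int, int], dict], limit: int = 50
) -> Iterator[dict]:
    """Yield every key by walking the {items, total, limit, offset} envelope.

    fetch_page(limit, offset) is a hypothetical wrapper around
    GET /api/v1/clients/{id}/keys that returns the decoded JSON body.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["items"]  # iterate .items, never the root object
        yield from items
        offset += len(items)
        # Stop on an empty page or once every item has been seen.
        if not items or offset >= page["total"]:
            break
```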

Plan creation requires slug

PlanCreate requires slug (a URL-friendly identifier) and generations_limit (plural). The frontend should auto-generate the slug from the plan name. Misspelled field names, such as a singular generation_limit, cause an HTTP 422 with no other visible symptom.
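
One possible slug rule, shown in Python for consistency with the backend even though the real generator would live in the frontend. The `slugify` and `build_plan_payload` helpers are hypothetical, and including `name` in the payload is an assumption based on the plans table.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Turn a plan name into a lowercase, hyphen-separated slug."""
    # Strip accents by decomposing and dropping non-ASCII code points.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def build_plan_payload(name: str, generations_limit: int) -> dict:
    """Build a PlanCreate body using the field names given in this section."""
    return {
        "name": name,  # assumed field, taken from the plans table
        "slug": slugify(name),
        "generations_limit": generations_limit,  # plural
    }
```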

Portal endpoints (client API key → portal JWT)

Method + Path	Description
POST /api/v1/portal/auth/login	Login with API key → portal JWT
GET /api/v1/portal/me	Client info
GET /api/v1/portal/jobs	Client's jobs (paginated, page 1–∞, page_size 1–100)
GET /api/v1/portal/jobs/{id}	Job detail (client-scoped)
GET /api/v1/portal/usage	Usage vs plan quota
GET/POST /api/v1/portal/webhooks	List / create webhooks
DELETE /api/v1/portal/webhooks/{id}	Delete webhook
GET /api/v1/portal/api-keys	List API keys

Portal webhook event names

Events must be job.completed or job.failed — these are the only values in _VALID_EVENTS. Using tryon.completed or similar causes a silent HTTP 422.
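
A client-side pre-check can catch a bad event name before the request is sent. The sketch below mirrors the two documented values; the `invalid_events` helper is hypothetical and is not the server's validator.

```python
VALID_EVENTS = frozenset({"job.completed", "job.failed"})


def invalid_events(events: list[str]) -> list[str]:
    """Return the event names the API would reject, in input order."""
    return [event for event in events if event not in VALID_EVENTS]
```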

Public endpoints (X-API-Key header)

Method + Path	Description
POST /api/v1/tryon	Submit try-on (base64 images, max 10 MB each)
GET /api/v1/tryon/status/{id}	Poll job status (client-scoped)
GET /api/v1/tryon/domain-check	Embed script pre-flight

Try-on request body:

{
  "model_image": "<base64 jpeg/png>",
  "garment_image": "<base64 jpeg/png>",
  "category": "tops | bottoms | dresses | outerwear",
  "mode": "balanced | quality",
  "webhook_url": "<optional HTTPS URL>"
}
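
A sketch of building this body from raw image bytes, with client-side checks that mirror the documented constraints. The `build_tryon_body` helper is hypothetical, and applying the 10 MB limit to the raw bytes (rather than to the base64 text) is an assumption.

```python
import base64
from typing import Optional

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: limit applies to raw bytes
CATEGORIES = {"tops", "bottoms", "dresses", "outerwear"}
MODES = {"balanced", "quality"}


def build_tryon_body(
    model_image: bytes,
    garment_image: bytes,
    category: str,
    mode: str,
    webhook_url: Optional[str] = None,
) -> dict:
    """Validate inputs client-side and return the JSON-serializable body."""
    if category not in CATEGORIES:
        raise ValueError(f"unsupported category: {category!r}")
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    for label, blob in (("model_image", model_image), ("garment_image", garment_image)):
        if len(blob) > MAX_IMAGE_BYTES:
            raise ValueError(f"{label} exceeds 10 MB")
    body = {
        "model_image": base64.b64encode(model_image).decode("ascii"),
        "garment_image": base64.b64encode(garment_image).decode("ascii"),
        "category": category,
        "mode": mode,
    }
    if webhook_url is not None:
        if not webhook_url.startswith("https://"):
            raise ValueError("webhook_url must be an HTTPS URL")
        body["webhook_url"] = webhook_url
    return body
```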

Category mapping to fal.ai

Internal categories are translated before being sent to fal.ai: dresses → full-body, outerwear → auto, tops/bottoms pass through unchanged.
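
The translation amounts to a lookup table. A minimal sketch, with hypothetical names:

```python
# Internal category -> value sent to fal.ai, as described above.
FAL_CATEGORY = {
    "tops": "tops",
    "bottoms": "bottoms",
    "dresses": "full-body",
    "outerwear": "auto",
}


def to_fal_category(category: str) -> str:
    """Translate an internal category, rejecting unknown values."""
    try:
        return FAL_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```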

System endpoints

Path	Auth	Description
GET /health	None	Detailed liveness (fal.ai key status, storage backend, media_base_url_public)
GET /readiness	None	Strict K8s readiness probe — 503 if DB or Redis down
GET /metrics	Blocked at nginx (403)	Prometheus metrics

Use /readiness for load balancer probes

/health is for monitoring dashboards. /readiness is the strict check: it returns 503 when the database or Redis is down, so use it for load balancer health gates.

Configuration Reference

Variable	Required	Description
POSTGRES_PASSWORD	Yes	PostgreSQL password
JWT_SECRET_KEY	Yes	≥32 chars — `openssl rand -hex 32`
FAL_API_KEY	Yes	fal.ai key. Format: `{uuid}:{32-char-hex}`
FIRST_ADMIN_EMAIL	Yes	Seeded admin email
FIRST_ADMIN_PASSWORD	Yes	Seeded admin password
GRAFANA_ADMIN_PASSWORD	Yes	Grafana admin password
DOMAIN	Prod only	Used for MEDIA_BASE_URL in production compose
MEDIA_BASE_URL	Prod	Public URL prefix for uploads — must be publicly reachable by fal.ai
STORAGE_BACKEND	No	`local` (default) or `s3`
S3_BUCKET	If S3	S3/R2/MinIO bucket name
S3_REGION	If S3	Default `auto` (for R2)
S3_ENDPOINT_URL	If S3	Empty for AWS; set for R2/MinIO
S3_ACCESS_KEY	If S3	S3 access key
S3_SECRET_KEY	If S3	S3 secret key
S3_PUBLIC_URL	If S3	CDN prefix without trailing slash
ENABLE_DOCS	No	`false` disables /docs in production
UPLOAD_RETENTION_HOURS	No	48h default for media cleanup
CORS_ORIGINS	No	JSON array `["https://a.com"]` — NOT comma-separated

CORS_ORIGINS must be a JSON array

pydantic-settings cannot parse a comma-separated string for List[str]. Always use JSON array format in .env:
CORS_ORIGINS=["https://a.com","https://b.com"]
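
The difference between the two formats can be seen with plain JSON decoding, which is how pydantic-settings treats list-typed fields by default. The sketch below uses only the standard library and a hypothetical `parse_cors_origins` helper; it is not the project's settings class.

```python
import json


def parse_cors_origins(raw: str) -> list[str]:
    """Decode a CORS_ORIGINS value as a JSON array of strings."""
    value = json.loads(raw)  # raises ValueError on non-JSON input
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return value
```

A comma-separated value such as `https://a.com,https://b.com` is not valid JSON, so decoding fails before the application starts serving requests.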

MEDIA_BASE_URL must be publicly reachable

If MEDIA_BASE_URL contains localhost or 127.0.0.1, fal.ai cannot fetch uploaded images. The startup log emits media_base_url_localhost_warning and each affected job is logged. Set to your public domain in production.
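
A sketch of such a check. It compares the parsed hostname rather than searching the whole string, which is a slightly stricter variant than the substring test described above; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_local_media_base_url(url: str) -> bool:
    """Return True when the URL points at the local machine."""
    host = (urlsplit(url).hostname or "").lower()
    return host in ("localhost", "127.0.0.1")
```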

Data Models

Table	Key Columns	Notes
users	id, email, password_hash, is_admin	Admin users only
clients	id, name, plan_id, status, allowed_domains	B2B clients
plans	id, slug, name, generations_limit, price_monthly	`slug` required at creation
api_keys	id, client_id, key_prefix, key_hash, is_active	`key_prefix` = first 20 chars
jobs	id, client_id, status, model_image_url, garment_url, result_url	Status: pending→processing→completed/failed/interrupted
refresh_tokens	id, user_id, token_prefix, token_hash, expires_at	`token_prefix` = first 16 chars
usage_logs	client_id, month, generation_count, avg_latency_ms	Monthly per-client
webhook_endpoints	id, client_id, url, secret, events, is_active	Events: `job.completed`, `job.failed`

Security Design

Threat	Protection
Brute force	5-strike lockout (Redis), 15-min window on IP + email
Timing attacks	`dummy_password_check()` equalizes response time for missing users
SSRF via webhook URLs	Blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolved at registration
Cross-tenant job reads	Status lookup filtered by `client_id` matching the API key
Rate limit abuse	Per-key hourly + per-IP 30/min on tryon submit
Secrets in logs	structlog redacts sensitive fields
Large images	nginx 12m + Pydantic `max_length=14_000_000`
MIME spoofing	Magic bytes validation
Decompression bombs	`MAX_IMAGE_PIXELS = 4096×4096`
Cache poisoning	`Cache-Control: no-store, private` on all /api/ responses
Metrics leakage	`/metrics: deny all; return 403` at nginx
Internal port exposure	Production compose closes ports 8000/9090/3000
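
The address classes in the SSRF row map onto Python's `ipaddress` module roughly as follows. This is a sketch of the classification only, with a hypothetical function name; it does not perform the DNS resolution step, and `is_private` covers more ranges than RFC 1918 alone.

```python
import ipaddress

# Carrier-grade NAT range; ipaddress does not report it as private.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked_address(address: str) -> bool:
    """Return True for addresses a webhook URL must not resolve to."""
    ip = ipaddress.ip_address(address)
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return True
    return ip.version == 4 and ip in CGNAT
```

Every address returned by DNS for the webhook host would need to pass this check, not just the first one.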

Security headers applied with the always flag (also on error responses): X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, full CSP.

Domain whitelist only works in browsers

validate_api_key() checks Origin/Referer headers. Server-to-server API calls (curl, SDKs) don't send these headers — the domain check is skipped. If an API key is leaked, it can be used from any environment. Clients must keep API keys secret.
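
The skip behavior can be made concrete with a small sketch. This is not the body of validate_api_key(); the function name, the preference for Origin over Referer, and exact-hostname matching against allowed_domains are all assumptions.

```python
from typing import Optional
from urllib.parse import urlsplit


def domain_check(
    origin: Optional[str], referer: Optional[str], allowed_domains: list[str]
) -> bool:
    """Return True when the request may proceed."""
    source = origin or referer
    if not source:
        # Server-to-server call: no header to compare, so the check is skipped.
        return True
    host = (urlsplit(source).hostname or "").lower()
    return host in {domain.lower() for domain in allowed_domains}
```

The last branch is why the whitelist is not a substitute for key secrecy: a caller that omits both headers is never compared against the list.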

Image Processing Pipeline

Images go through a 9-step pipeline in admin/app/services/image_processor.py:

Magic bytes validation (JPEG/PNG only)
HEIC → WebP conversion
Alpha channel flatten (white background)
EXIF orientation normalize
EXIF metadata strip
Palette → RGB/RGBA conversion
Resize to max 1500×1500 px
WebP re-encode (quality 85→55 adaptive ramp)
Minimum 256×256 px check
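
The first step, magic bytes validation, can be sketched without any imaging library. The signatures are the standard JPEG and PNG file headers; the function name is hypothetical and this is not the code in image_processor.py.

```python
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data: bytes) -> Optional[str]:
    """Identify an image by its leading bytes, ignoring any declared MIME type."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None  # anything else is rejected
```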

Decompression bomb guard

PILImage.MAX_IMAGE_PIXELS is set to 4096 * 4096 at module level. Images exceeding 16,777,216 pixels (about 16.8 million) are rejected with HTTP 422 before any decode.
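
The arithmetic behind the guard can be shown without decoding any pixel data. The sketch below reads the dimensions from a PNG header and compares them with the same budget. It is a standard-library illustration with hypothetical function names, not the Pillow mechanism the project relies on.

```python
import struct

MAX_IMAGE_PIXELS = 4096 * 4096  # 16,777,216
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Read width and height from a PNG's IHDR chunk without decoding pixels."""
    # Layout: 8-byte signature, 4-byte chunk length, "IHDR", width, height.
    if len(data) < 24 or not data.startswith(PNG_MAGIC) or data[12:16] != b"IHDR":
        raise ValueError("not a PNG with a leading IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def exceeds_pixel_budget(width: int, height: int) -> bool:
    """Return True when the image has more pixels than the budget allows."""
    return width * height > MAX_IMAGE_PIXELS
```

A 5000×4000 image has 20,000,000 pixels and is over budget; a 4096×4096 image sits exactly at the limit and passes.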

Monitoring

Prometheus metrics

Metric	Labels	Description
tryon_submitted_total	client_id	Jobs submitted
tryon_completed_total	client_id, status	Jobs finished
tryon_rate_limited_total	reason (per_ip, limit_exceeded)	Rate limit hits
cleanup_files_deleted_total	—	Files deleted by cleanup service

Grafana dashboards

Three dashboards are auto-provisioned at startup:

Platform Overview: submission rates, completion rates, error rates
ML Worker: job processing latency, fal.ai call duration, heartbeat status
Infrastructure: CPU, memory, disk, Redis and Postgres connection counts

Alerting

Telegram alerts fire for:

Worker heartbeat stale > 5 minutes
Job failure rate > 5% in 10 minutes
DB or Redis unreachable

Testing

507+ tests, ≥97% coverage. Tests use an in-memory database and fakeredis.

cd admin
pytest --cov=app --cov-report=term-missing -q

Always check term-missing before writing tests

Run pytest --cov=app --cov-report=term-missing -q 2>&1 | grep -E "^app/" | sort -k4 -n | head -20 first to list the 20 least-covered modules. Never guess what's uncovered; read the exact missing lines.

Key test patterns:

validate_api_key calls redis.pipeline() — tests must provide a working pipeline or override the dependency
SSRF validator calls socket.getaddrinfo() — mock in unit tests
Webhook URL fixtures use https://example.com/... (resolves in CI; bare new.example.com does not)
FastAPI returns 422 (not 401) when a required header is absent entirely; 401 only when the token is present but invalid
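
The getaddrinfo pattern can be sketched with unittest.mock. The `host_resolves_to_private` function below is a hypothetical stand-in for the project's SSRF validator, written only so the mocking technique has something to exercise.

```python
import ipaddress
import socket
from unittest import mock


def host_resolves_to_private(hostname: str) -> bool:
    """Hypothetical stand-in for the SSRF validator: resolve, then inspect IPs."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return any(ipaddress.ip_address(info[4][0]).is_private for info in infos)


def fake_getaddrinfo(ip: str):
    """Build a getaddrinfo replacement that always resolves to `ip`."""
    def _fake(host, port, *args, **kwargs):
        return [(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "", (ip, port))]
    return _fake


def test_private_target_is_detected():
    # No real DNS lookup happens while the patch is active.
    with mock.patch("socket.getaddrinfo", fake_getaddrinfo("10.0.0.5")):
        assert host_resolves_to_private("internal.example.com")


def test_public_target_passes():
    with mock.patch("socket.getaddrinfo", fake_getaddrinfo("93.184.216.34")):
        assert not host_resolves_to_private("example.com")
```

Patching `socket.getaddrinfo` works here because the function looks the name up on the `socket` module at call time; a validator that does `from socket import getaddrinfo` would need the patch applied to its own module instead.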

Development Workflow

# Clone and start
git clone <repo> && cd tryon-saas
cp .env.example .env   # fill in required values
docker-compose up -d

# View logs
docker-compose logs -f admin-api

# Run tests
cd admin && pytest -x --tb=short

# Apply DB migrations manually (auto-migrate is NOT enabled)
docker-compose exec admin-api alembic upgrade head

# Lint
cd admin && ruff check app/   # run from admin/ dir, NOT from repo root

# Rebuild after frontend changes
docker-compose build frontend && docker-compose up -d frontend

Migrations do not run automatically

main.py calls Base.metadata.create_all (creates tables from ORM models for fresh installs) but does NOT run alembic upgrade head. New columns from migrations require a manual alembic upgrade head.

Frontend is baked into the Docker image

After changing frontend source, you must run docker-compose build frontend before up -d. Old images cause runtime errors when the API response shape has changed but the container still runs the old frontend.

Known Constraints

Constraint	Detail
CORS_ORIGINS	Must be a JSON array, not comma-separated
Frontend deploys	Rebuild Docker image after every source change
API keys list	Paginated envelope `{items, total, limit, offset}` — iterate `.items`
bcrypt	5.x is incompatible with passlib — use bcrypt directly
JWT library	PyJWT only, not python-jose
change-password rate limiting	Not implemented (known gap)
S3 ACL	Not set by default — configure public bucket policy separately
Portal webhook events	Must be `job.completed` or `job.failed`
avg_latency_ms	Not recalculated on CONFLICT — shows first job's value
fal.ai polling timeout	600s ceiling; a worker that is shut down mid-poll is killed by Docker after a 35s grace period