Technical Reference
This reference covers the internal architecture, API surface, data models, security design, and development workflow of the TryOn SaaS platform. Audience: engineers onboarding to the codebase.
Source of truth
CLAUDE.md in the repository root is the authoritative reference, auto-maintained alongside code changes.
Overview
TryOn SaaS is a B2B virtual clothing try-on platform. Client businesses embed a JavaScript widget on their product pages; the widget submits garment and model images to the Admin API, which queues an AI inference job. The ML Worker picks up jobs from Redis and calls fal.ai FASHN v1.5 to generate try-on images. Results are stored in the database and returned via polling or webhooks.
The platform is multi-tenant: each client has an account, a plan with generation limits, API keys, and optionally registered webhooks. An admin UI manages clients and monitors usage. A client portal lets clients track their own jobs and manage their integration.
Architecture
Client Website → Embed Script (tryon-embed.js) → Admin API (:8000) → Redis Queue → ML Worker → fal.ai FASHN v1.5
↓ ↓
PostgreSQL 15 Redis 7 (AOF)
↑
Prometheus (:9090) → Grafana (:3000)
ML Worker → Pushgateway (:9091) → Prometheus
Nginx (:80/:443) → Frontend (:5173) + Embed Static (/embed/)
Telegram Bot ← AlertManager (health + worker alerts)
Data flow for a try-on request
- Client's website loads tryon-embed.js via a <script> tag. The widget renders a floating action button.
- User clicks the widget and selects a garment. The widget encodes both images as base64 and sends them to POST /api/v1/tryon with an X-API-Key header.
- Admin API validates the API key, checks the client's plan quota, processes images (resize, format normalize, strip EXIF), stores them, and enqueues a job in Redis.
- ML Worker receives the job via BRPOP, calls fal.ai, and polls for completion (up to 600 seconds).
- On completion the worker writes the result URL to the database and fires registered webhooks.
- The widget polls GET /api/v1/tryon/status/{id} until it gets a result URL, then displays the try-on image.
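The polling step at the end of this flow can be sketched as follows. This is a minimal illustration, not the widget's actual code: the `status` and `result_url` response fields are assumed from the jobs table columns, and `get_status` is a hypothetical callable that wraps the HTTP request to the status endpoint.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a job until it finishes; return the result URL or None.

    get_status is a hypothetical wrapper around
    GET /api/v1/tryon/status/{id} that returns the decoded JSON body.
    """
    waited = 0.0
    while waited <= timeout_s:
        job = get_status(job_id)
        status = job.get("status")
        if status == "completed":
            return job.get("result_url")
        if status in ("failed", "interrupted"):
            return None  # terminal state without a result
        sleep(interval_s)
        waited += interval_s
    return None  # gave up after timeout_s seconds
```

Injecting `sleep` keeps the loop testable without real delays; the 600-second default mirrors the worker's polling ceiling and is otherwise arbitrary.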
Infrastructure notes
Nginx reverse-proxies all public traffic. The Admin API and frontend are internal Docker services. Redis and PostgreSQL are internal. Prometheus scrapes admin-api:8000/metrics every 15 seconds; the ML worker pushes metrics to Pushgateway. Grafana reads from Prometheus and serves three auto-provisioned dashboards.
Services
| Service | Port | Technology | Purpose |
|---|---|---|---|
| admin-api | 8000 (internal) | FastAPI + SQLAlchemy 2.0 (async) + Pydantic v2 | Core API: auth, clients, jobs, billing, webhooks |
| ml-worker | internal | Python asyncio + BRPOP loop + fal-client | AI job processor |
| postgres | internal | PostgreSQL 15 | Primary database |
| redis | internal | Redis 7 (AOF) | Job queue + rate limiting + refresh token dedup |
| frontend | internal | React 18 + Vite + TypeScript + shadcn/ui | Admin UI + client portal (proxied via nginx) |
| nginx | 80/443 | Nginx | Reverse proxy, TLS, rate limiting, security headers |
| prometheus | 9090 (internal) | Prometheus | Metrics scraping and storage |
| grafana | 3000 (internal) | Grafana | Dashboards (auto-provisioned) |
| pushgateway | 9091 (internal) | Prometheus Pushgateway | Receives metrics from ML worker |
Note
All services have restart: unless-stopped, logging: json-file (50 MB max / 5 files), and resource limits.
API Reference
Authentication model
- Access tokens: 15-minute lifetime, HS256 signed with JWT_SECRET_KEY
- Refresh tokens: 30-day lifetime, rotated on every /auth/refresh. Stored as bcrypt(sha256(token)). token_prefix = first 16 chars for O(1) lookup.
- Brute force: 5 failed logins → IP + email locked 15 min
- Timing attack: dummy_password_check() equalizes response time for missing users
- API Keys: tryon_ prefix + 32 random chars. key_prefix = first 20 chars.
- Portal JWT: scope="portal", separate from admin JWTs
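The prefix-based lookup for API keys can be illustrated with a small standard-library sketch. It assumes the key format is the literal prefix `tryon_` followed by 32 random characters and that the random part is alphanumeric. The helper names, the dictionary used as an index, and the SHA-256 digest standing in for the stored `key_hash` are all assumptions for illustration; the document does not say which hash the project uses for API keys.

```python
import hashlib
import hmac
import secrets
import string

KEY_PREFIX_LEN = 20  # from the document: key_prefix = first 20 chars
ALPHABET = string.ascii_letters + string.digits  # assumed character set


def generate_api_key() -> str:
    """Return 'tryon_' plus 32 random characters (38 characters total)."""
    return "tryon_" + "".join(secrets.choice(ALPHABET) for _ in range(32))


def make_record(raw_key: str) -> dict:
    """Build the stored form of a key: a lookup prefix plus a digest.

    SHA-256 is a stand-in here; the real key_hash algorithm is not
    specified in this reference.
    """
    return {
        "key_prefix": raw_key[:KEY_PREFIX_LEN],
        "key_hash": hashlib.sha256(raw_key.encode()).hexdigest(),
    }


def verify_key(raw_key: str, index: dict) -> bool:
    """Find the candidate row by prefix, then compare digests in constant time."""
    record = index.get(raw_key[:KEY_PREFIX_LEN])
    if record is None:
        return False
    candidate = hashlib.sha256(raw_key.encode()).hexdigest()
    return hmac.compare_digest(candidate, record["key_hash"])
```

The prefix narrows the search to one row, so the expensive comparison runs once instead of once per stored key.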
PyJWT, not python-jose
The project uses PyJWT>=2.8.0. Import as import jwt, catch jwt.PyJWTError. Do NOT add python-jose.
Admin endpoints (require admin JWT)
| Group | Method + Path | Description |
|---|---|---|
| Auth | POST /api/v1/auth/login | Access + refresh tokens |
| Auth | POST /api/v1/auth/refresh | Rotate refresh token |
| Auth | POST /api/v1/auth/logout | Invalidate refresh token |
| Auth | GET /api/v1/auth/me | Current user info |
| Auth | POST /api/v1/auth/change-password | Change password |
| Clients | GET/POST /api/v1/clients | List / create clients |
| Clients | GET/PATCH/DELETE /api/v1/clients/{id} | Client detail / update / delete |
| Clients | POST /api/v1/clients/{id}/suspend | Suspend client |
| Clients | POST /api/v1/clients/{id}/activate | Reactivate client |
| Clients | POST /api/v1/clients/{id}/reset-usage | Reset monthly usage |
| API Keys | GET /api/v1/clients/{id}/keys | List keys (paginated envelope) |
| API Keys | POST /api/v1/clients/{id}/keys | Create key (raw value shown once) |
| API Keys | DELETE /api/v1/clients/{id}/keys/{key_id} | Revoke key |
| Plans | GET/POST /api/v1/plans | List / create plans |
| Plans | GET/PATCH/DELETE /api/v1/plans/{id} | Plan CRUD |
| Jobs | GET /api/v1/jobs | All jobs (supports date_from, date_to) |
| Jobs | GET /api/v1/jobs/{id} | Job detail |
| Stats | GET /api/v1/stats/overview | Platform-wide stats |
| Stats | GET /api/v1/stats/jobs | Job stats |
| Stats | GET /api/v1/stats/clients/{id} | Per-client stats |
| Stats | GET /api/v1/stats/realtime | Live stats from Redis |
API keys list response shape
GET /api/v1/clients/{id}/keys returns a paginated envelope: {"items": [...], "total": N, "limit": N, "offset": N}. Iterate .items, not the root object.
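A consumer of this envelope might page through it as sketched below. `fetch_page` is a hypothetical callable wrapping the HTTP request; passing `limit` and `offset` as request parameters is an assumption based on the envelope fields.

```python
from typing import Callable, Iterator


def iter_api_keys(
    fetch_page: Callable[[int, int], dict], limit: int = 50
) -> Iterator[dict]:
    """Yield every key by walking the paginated envelope.

    fetch_page(limit, offset) is a hypothetical wrapper that returns
    {"items": [...], "total": N, "limit": N, "offset": N}.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["items"]  # iterate .items, never the root object
        yield from items
        offset += len(items)
        # Stop on an empty page or once every reported row has been seen.
        if not items or offset >= page["total"]:
            break
```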
Plan creation requires slug
PlanCreate requires slug (URL-friendly identifier) and generations_limit (plural). Frontend should auto-generate slug from plan name. Wrong field names silently cause HTTP 422.
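Slug generation from a plan name can be sketched as below. The frontend itself is TypeScript, so this Python version only illustrates the transformation. The exact slug rules are an assumption (lowercase ASCII, hyphen-separated), and `build_plan_payload` is a hypothetical helper that covers only the fields named in this reference.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Turn a display name into a lowercase, hyphen-separated slug."""
    # Drop accents by decomposing and discarding non-ASCII code points.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def build_plan_payload(name: str, generations_limit: int) -> dict:
    """Hypothetical request body; other PlanCreate fields may be required."""
    return {
        "name": name,
        "slug": slugify(name),
        "generations_limit": generations_limit,  # plural, per the note above
    }
```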
Portal endpoints (client API key → portal JWT)
| Method + Path | Description |
|---|---|
| POST /api/v1/portal/auth/login | Login with API key → portal JWT |
| GET /api/v1/portal/me | Client info |
| GET /api/v1/portal/jobs | Client's jobs (paginated, page 1–∞, page_size 1–100) |
| GET /api/v1/portal/jobs/{id} | Job detail (client-scoped) |
| GET /api/v1/portal/usage | Usage vs plan quota |
| GET/POST /api/v1/portal/webhooks | List / create webhooks |
| DELETE /api/v1/portal/webhooks/{id} | Delete webhook |
| GET /api/v1/portal/api-keys | List API keys |
Portal webhook event names
Events must be job.completed or job.failed — these are the only values in _VALID_EVENTS. Using tryon.completed or similar causes a silent HTTP 422.
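A client-side pre-check can catch a wrong event name before the request is sent. The set below copies the two values named above; `invalid_events` is a hypothetical helper, not part of the API.

```python
_VALID_EVENTS = frozenset({"job.completed", "job.failed"})


def invalid_events(events: list[str]) -> list[str]:
    """Return the event names the portal would reject, in input order."""
    return [event for event in events if event not in _VALID_EVENTS]
```

An empty return value means the list is safe to submit.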
Public endpoints (X-API-Key header)
| Method + Path | Description |
|---|---|
| POST /api/v1/tryon | Submit try-on (base64 images, max 10 MB each) |
| GET /api/v1/tryon/status/{id} | Poll job status (client-scoped) |
| GET /api/v1/tryon/domain-check | Embed script pre-flight |
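Base64 encoding inflates a payload by roughly 4/3, which is how a 10 MB image relates to the character limit of 14,000,000 listed under Security Design. The arithmetic can be checked with a short sketch; `base64_length` is a hypothetical helper name.

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    # Every 3 input bytes become 4 output characters, rounded up.
    return 4 * ((raw_bytes + 2) // 3)


# Cross-check the formula against the real encoder on a small input.
sample = b"x" * 10
assert base64_length(len(sample)) == len(base64.b64encode(sample))

ten_mib = 10 * 1024 * 1024
encoded_chars = base64_length(ten_mib)  # 13_981_016
fits_limit = encoded_chars <= 14_000_000
```

A 10 MiB image encodes to 13,981,016 characters, just under the 14,000,000 limit.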
Try-on request body:
{
  "model_image": "<base64 jpeg/png>",
  "garment_image": "<base64 jpeg/png>",
  "category": "tops | bottoms | dresses | outerwear",
  "mode": "balanced | quality",
  "webhook_url": "<optional HTTPS URL>"
}
Category mapping to fal.ai
Internal categories are translated before being sent to fal.ai: dresses → full-body, outerwear → auto, tops/bottoms pass through unchanged.
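The translation amounts to a lookup table. The sketch below uses the mapping stated above; the function name and the error raised for unknown categories are assumptions.

```python
# Internal category -> value sent to fal.ai, as described above.
FAL_CATEGORY = {
    "tops": "tops",
    "bottoms": "bottoms",
    "dresses": "full-body",
    "outerwear": "auto",
}


def to_fal_category(category: str) -> str:
    """Translate an internal category; reject anything not in the table."""
    try:
        return FAL_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unsupported category: {category!r}") from None
```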
System endpoints
| Path | Auth | Description |
|---|---|---|
| GET /health | None | Detailed liveness (fal.ai key status, storage backend, media_base_url_public) |
| GET /readiness | None | Strict K8s readiness probe — 503 if DB or Redis down |
| GET /metrics | Blocked at nginx (403) | Prometheus metrics |
Use /readiness for load balancer probes
/health is for monitoring dashboards. /readiness is the strict check (503 if the database or Redis is down); use it for load balancer health gates.
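The strict behaviour of /readiness can be sketched as a pure function: any failing dependency turns the response into a 503. The check callables and the shape of the returned body are assumptions; the real handler is a FastAPI route.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each dependency check; return (HTTP status, per-check results)."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```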
Configuration Reference
| Variable | Required | Description |
|---|---|---|
| POSTGRES_PASSWORD | Yes | PostgreSQL password |
| JWT_SECRET_KEY | Yes | ≥32 chars — openssl rand -hex 32 |
| FAL_API_KEY | Yes | fal.ai key. Format: {uuid}:{32-char-hex} |
| FIRST_ADMIN_EMAIL | Yes | Seeded admin email |
| FIRST_ADMIN_PASSWORD | Yes | Seeded admin password |
| GRAFANA_ADMIN_PASSWORD | Yes | Grafana admin password |
| DOMAIN | Prod only | Used for MEDIA_BASE_URL in production compose |
| MEDIA_BASE_URL | Prod | Public URL prefix for uploads — must be publicly reachable by fal.ai |
| STORAGE_BACKEND | No | local (default) or s3 |
| S3_BUCKET | If S3 | S3/R2/MinIO bucket name |
| S3_REGION | If S3 | Default auto (for R2) |
| S3_ENDPOINT_URL | If S3 | Empty for AWS; set for R2/MinIO |
| S3_ACCESS_KEY | If S3 | S3 access key |
| S3_SECRET_KEY | If S3 | S3 secret key |
| S3_PUBLIC_URL | If S3 | CDN prefix without trailing slash |
| ENABLE_DOCS | No | false disables /docs in production |
| UPLOAD_RETENTION_HOURS | No | 48h default for media cleanup |
| CORS_ORIGINS | No | JSON array ["https://a.com"] — NOT comma-separated |
CORS_ORIGINS must be a JSON array
pydantic-settings cannot parse a comma-separated string for List[str]. Always use JSON array format in .env:
CORS_ORIGINS=["https://a.com","https://b.com"]
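The difference between the two formats is easy to see with plain JSON decoding. The sketch below uses only `json.loads` as a stand-in for the parsing described above; it is not the pydantic-settings code path, and `parse_cors_origins` is a hypothetical helper.

```python
import json


def parse_cors_origins(raw: str) -> list[str]:
    """Decode a JSON array of origin strings; raise ValueError otherwise."""
    value = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(value, list) or not all(isinstance(o, str) for o in value):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return value


origins = parse_cors_origins('["https://a.com","https://b.com"]')

try:
    parse_cors_origins("https://a.com,https://b.com")
    comma_separated_ok = True
except ValueError:
    comma_separated_ok = False  # the comma-separated form is not valid JSON
```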
MEDIA_BASE_URL must be publicly reachable
If MEDIA_BASE_URL contains localhost or 127.0.0.1, fal.ai cannot fetch uploaded images. The startup log emits media_base_url_localhost_warning and each affected job is logged. Set to your public domain in production.
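A check of this kind can be approximated by inspecting the URL's hostname. This is a sketch of the idea, not the project's startup check; the function name is hypothetical, and matching on the parsed hostname rather than a substring is an assumption.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def is_local_media_base_url(url: str) -> bool:
    """True when the URL points at this machine, so fal.ai cannot reach it."""
    host = urlparse(url).hostname or ""
    return host in LOCAL_HOSTS
```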
Data Models
| Table | Key Columns | Notes |
|---|---|---|
| users | id, email, password_hash, is_admin | Admin users only |
| clients | id, name, plan_id, status, allowed_domains | B2B clients |
| plans | id, slug, name, generations_limit, price_monthly | slug required at creation |
| api_keys | id, client_id, key_prefix, key_hash, is_active | key_prefix = first 20 chars |
| jobs | id, client_id, status, model_image_url, garment_url, result_url | Status: pending→processing→completed/failed/interrupted |
| refresh_tokens | id, user_id, token_prefix, token_hash, expires_at | token_prefix = first 16 chars |
| usage_logs | client_id, month, generation_count, avg_latency_ms | Monthly per-client |
| webhook_endpoints | id, client_id, url, secret, events, is_active | Events: job.completed, job.failed |
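The job status lifecycle in the table can be modelled as a small state machine. The status values come from the jobs row above. Treating completed, failed, and interrupted as terminal is an assumption, since the reference does not say whether interrupted jobs are retried.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"


# pending -> processing -> completed / failed / interrupted
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {
        JobStatus.COMPLETED,
        JobStatus.FAILED,
        JobStatus.INTERRUPTED,
    },
    # Assumed terminal: no outgoing transitions.
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.INTERRUPTED: set(),
}


def can_transition(current: JobStatus, new: JobStatus) -> bool:
    """True when moving from current to new follows the lifecycle above."""
    return new in ALLOWED_TRANSITIONS[current]
```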
Security Design
| Threat | Protection |
|---|---|
| Brute force | 5-strike lockout (Redis), 15-min window on IP + email |
| Timing attacks | dummy_password_check() equalizes response time for missing users |
| SSRF via webhook URLs | Blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolved at registration |
| Cross-tenant job reads | Status lookup filtered by client_id matching the API key |
| Rate limit abuse | Per-key hourly + per-IP 30/min on tryon submit |
| Secrets in logs | structlog redacts sensitive fields |
| Large images | nginx 12m + Pydantic max_length=14_000_000 |
| MIME spoofing | Magic bytes validation |
| Decompression bombs | MAX_IMAGE_PIXELS = 4096×4096 |
| Cache poisoning | Cache-Control: no-store, private on all /api/ responses |
| Metrics leakage | /metrics: deny all; return 403 at nginx |
| Internal port exposure | Production compose closes ports 8000/9090/3000 |
Security headers applied with the always flag (also on error responses): X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy, Permissions-Policy, full CSP.
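The SSRF row in the table can be illustrated with the standard `ipaddress` module. This is a sketch under assumptions, not the project's validator: the function names are hypothetical, the HTTPS-only rule is inferred from the try-on request body, and the resolver is injectable so the logic can be exercised without real DNS.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlparse

# Carrier-grade NAT range; ipaddress does not report it as private.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked_ip(ip: str) -> bool:
    """True for RFC 1918, loopback, link-local, and CGNAT addresses."""
    addr = ipaddress.ip_address(ip.split("%")[0])  # drop any IPv6 zone id
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr in CGNAT
    )


def validate_webhook_url(
    url: str, resolver: Callable = socket.getaddrinfo
) -> bool:
    """Resolve the host once and reject the URL if any address is blocked."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    addresses = {info[4][0] for info in resolver(parsed.hostname, None)}
    return not any(is_blocked_ip(ip) for ip in addresses)
```

Checking every resolved address matters because a hostname can resolve to both a public and an internal address.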
Domain whitelist only works in browsers
validate_api_key() checks Origin/Referer headers. Server-to-server API calls (curl, SDKs) don't send these headers — the domain check is skipped. If an API key is leaked, it can be used from any environment. Clients must keep API keys secret.
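The gap described here follows directly from how such a check has to be written. The sketch below is illustrative only: `domain_check` is a hypothetical function, header names are assumed to be lowercased, and exact hostname matching against `allowed_domains` is an assumption.

```python
from typing import Mapping, Sequence
from urllib.parse import urlparse


def domain_check(headers: Mapping[str, str], allowed_domains: Sequence[str]) -> bool:
    """Allow the request unless a browser header names a foreign domain."""
    source = headers.get("origin") or headers.get("referer")
    if not source:
        # Server-to-server callers send neither header, so there is
        # nothing to compare and the check is skipped.
        return True
    host = urlparse(source).hostname or ""
    return host in allowed_domains
```

With an empty header mapping the function returns True, which is exactly why a leaked key works from any non-browser environment.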
Image Processing Pipeline
Images go through a 9-step pipeline in admin/app/services/image_processor.py:
- Magic bytes validation (JPEG/PNG only)
- HEIC → WebP conversion
- Alpha channel flatten (white background)
- EXIF orientation normalize
- EXIF metadata strip
- Palette → RGB/RGBA conversion
- Resize to max 1500×1500 px
- WebP re-encode (quality 85→55 adaptive ramp)
- Minimum 256×256 px check
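Three of these steps need no imaging library and can be sketched with plain Python: the magic bytes check, the resize bound, and the minimum size check. The function names are hypothetical, and preserving aspect ratio during the resize is an assumption; the real logic lives in image_processor.py.

```python
from typing import Optional, Tuple

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_format(data: bytes) -> Optional[str]:
    """Identify JPEG or PNG from the leading bytes; None for anything else."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None


def fit_within(width: int, height: int, max_side: int = 1500) -> Tuple[int, int]:
    """Scale down (never up) so neither side exceeds max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def meets_minimum(width: int, height: int, min_side: int = 256) -> bool:
    """True when both sides are at least min_side pixels."""
    return width >= min_side and height >= min_side
```

Checking the leading bytes rather than the declared content type is what defeats MIME spoofing: a renamed file still carries its real signature.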
Decompression bomb guard
PILImage.MAX_IMAGE_PIXELS is set to 4096 * 4096 at module level. Images exceeding 16,777,216 pixels (about 16.8M) are rejected with HTTP 422 before any decode.
Monitoring
Prometheus metrics
| Metric | Labels | Description |
|---|---|---|
| tryon_submitted_total | client_id | Jobs submitted |
| tryon_completed_total | client_id, status | Jobs finished |
| tryon_rate_limited_total | reason (per_ip, limit_exceeded) | Rate limit hits |
| cleanup_files_deleted_total | — | Files deleted by cleanup service |
Grafana dashboards
Three dashboards are auto-provisioned at startup:
- Platform Overview: submission rates, completion rates, error rates
- ML Worker: job processing latency, fal.ai call duration, heartbeat status
- Infrastructure: CPU, memory, disk, Redis and Postgres connection counts
Alerting
Telegram alerts fire for:
- Worker heartbeat stale > 5 minutes
- Job failure rate > 5% in 10 minutes
- DB or Redis unreachable
Testing
507+ tests, ≥97% coverage. Tests use an in-memory database and fakeredis.
Always check term-missing before writing tests
Run pytest --cov=app --cov-report=term-missing -q 2>&1 | grep -E "^\s+app/" | sort -k4 -t% -n | head -20 first. Never guess what's uncovered — read the exact missing lines.
Key test patterns:
- validate_api_key calls redis.pipeline(): tests must provide a working pipeline or override the dependency
- SSRF validator calls socket.getaddrinfo(): mock in unit tests
- Webhook URL fixtures use https://example.com/... (resolves in CI; bare new.example.com does not)
- FastAPI returns 422 (not 401) when a required header is absent entirely; 401 only when the token is present but invalid
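The getaddrinfo pattern can be sketched with `unittest.mock.patch`. `resolve_host` is a hypothetical stand-in for the validator's DNS lookup, not the project's function; the point is only how the patch target and the fake return value fit together.

```python
import socket
from unittest.mock import patch


def resolve_host(host: str) -> list[str]:
    """Hypothetical helper standing in for the SSRF validator's DNS lookup."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def fake_addrinfo(ip: str) -> list:
    """Build a getaddrinfo-shaped result containing a single IPv4 address."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (ip, 0))]


def test_resolve_host_uses_mocked_dns() -> None:
    # Patch the attribute on the socket module so no real DNS query runs.
    with patch("socket.getaddrinfo", return_value=fake_addrinfo("10.0.0.5")) as mocked:
        assert resolve_host("internal.example.com") == ["10.0.0.5"]
    mocked.assert_called_once_with("internal.example.com", None)
```

If the code under test imports the function directly (`from socket import getaddrinfo`), the patch target would need to be the importing module instead.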
Development Workflow
# Clone and start
git clone <repo> && cd tryon-saas
cp .env.example .env # fill in required values
docker-compose up -d
# View logs
docker-compose logs -f admin-api
# Run tests
cd admin && pytest -x --tb=short
# Apply DB migrations manually (auto-migrate is NOT enabled)
docker-compose exec admin-api alembic upgrade head
# Lint
cd admin && ruff check app/ # run from admin/ dir, NOT from repo root
# Rebuild after frontend changes
docker-compose build frontend && docker-compose up -d frontend
Migrations do not run automatically
main.py calls Base.metadata.create_all (creates tables from ORM models for fresh installs) but does NOT run alembic upgrade head. New columns from migrations require a manual alembic upgrade head.
Frontend is baked into the Docker image
After changing frontend source, you must run docker-compose build frontend before up -d. Old images cause runtime errors when the API response shape has changed but the container still runs the old frontend.
Known Constraints
| Constraint | Detail |
|---|---|
| CORS_ORIGINS | Must be a JSON array, not comma-separated |
| Frontend deploys | Rebuild Docker image after every source change |
| API keys list | Paginated envelope {items, total, limit, offset} — iterate .items |
| bcrypt | 5.x is incompatible with passlib — use bcrypt directly |
| JWT library | PyJWT only, not python-jose |
| change-password rate limiting | Not implemented (known gap) |
| S3 ACL | Not set by default — configure public bucket policy separately |
| Portal webhook events | Must be job.completed or job.failed |
| avg_latency_ms | Not recalculated on CONFLICT — shows first job's value |
| fal.ai polling timeout | 600s ceiling; a worker still polling when shutdown begins is killed by Docker after the 35s grace period |