Project Report

Overview

This document is a comprehensive account of all work performed to design, build, harden, and ship the TryOn SaaS platform — a B2B virtual clothing try-on service that allows e-commerce businesses to embed an AI-powered widget on their websites. It is intended to give technical leadership a complete picture of the scope, depth, and engineering discipline that went into this system.

The platform was taken from a blank VPS to a fully automated, production-hardened, monitored service with a 41-finding security audit, 507+ automated tests, ≥97% code coverage, and a CI/CD pipeline in which a git push results in a live deployment in approximately 20 seconds.

1. Server & Infrastructure Setup

Provisioning

The foundation of the platform is a dedicated VPS purchased from xorek.cloud with the following specifications:

Parameter	Value
CPU	4 vCPU
RAM	8 GB
Storage	80 GB NVMe SSD
IP Address	185.246.222.107
Operating System	Ubuntu 22.04 LTS

Day-One Hardening

Security hardening was applied on the very first session, before any application code was deployed:

Root SSH disabled — remote root login is explicitly blocked in /etc/ssh/sshd_config
Password authentication disabled — only public-key authentication is accepted
Non-root deploy user — all operations run as deploy with NOPASSWD sudo strictly scoped to Docker commands
UFW firewall — only ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) are open; all other inbound traffic is denied by default
fail2ban — protects the SSH port against brute-force attacks with automatic IP banning after repeated failed attempts

SSH Access Recovery

If the SSH key is ever lost, recovery is possible through the xorek.cloud web console. A backup of the sshd config is kept at /etc/ssh/sshd_config.backup.20260520 on the server.

2. Domain & DNS Setup

Cloudflare Integration

The domain ziex-tryon.com is registered and fully managed through Cloudflare. All DNS resolution runs through Cloudflare's proxy (orange-cloud mode), providing:

DDoS protection at the DNS/CDN layer
Automatic HTTP→HTTPS redirects
Bot management and IP reputation filtering
Edge caching for static assets

TLS Strategy

Rather than Let's Encrypt (which requires certificate renewal automation and ACME challenges), a Cloudflare Origin Certificate was chosen. This certificate is issued directly by Cloudflare and trusted only between Cloudflare's edge and the origin server, which enables Full (strict) TLS mode without the operational overhead of certbot.

nginx OCSP Stapling

Cloudflare Origin Certificates are not signed by a public CA, so OCSP stapling must be disabled in nginx:

ssl_stapling        off;
ssl_stapling_verify off;

Failing to set this causes nginx to log repeated OCSP errors and can delay TLS handshakes.

Subdomain Map

Six DNS entries were configured, all pointing to the same origin IP (185.246.222.107) but routed to different application contexts by nginx server blocks:

Subdomain	Purpose
`ziex-tryon.com`	Main landing page
`api.ziex-tryon.com`	REST API (FastAPI backend)
`admin.ziex-tryon.com`	Admin panel (React SPA)
`app.ziex-tryon.com`	Client portal (React SPA)
`sandbox.ziex-tryon.com`	Integration sandbox for testing
`docs.ziex-tryon.com`	MkDocs documentation site

3. Architecture

The platform was designed from scratch with a clear separation of concerns across stateless HTTP services, a queue-based job processing pipeline, and a dedicated ML inference backend.

Client Website
      │
      ▼
tryon-embed.js  ──►  Admin API (:8000)  ──►  Redis Queue  ──►  ML Worker
(Shadow DOM)          │                                              │
                      │                                              ▼
                 PostgreSQL 15                              fal.ai FASHN v1.5
                      ▲                                    (AI inference)
                      │
             ┌────────┴────────┐
        Prometheus           Grafana
             ▲               (3 dashboards)
             │
       Pushgateway  ◄──  ML Worker metrics
             │
       AlertManager  ──►  Telegram Bot
             │
           nginx  (TLS · rate limiting · security headers)

Service Inventory

The entire platform runs as 8 Docker services defined in docker-compose.yml with a production overlay in docker-compose.prod.yml:

Service	Internal Port	Role
`postgres`	5432 (internal)	PostgreSQL 15 — persistent data store
`redis`	6379 (internal)	Redis 7 AOF — job queue + rate limit counters
`admin-api`	8000	FastAPI — all business logic
`ml-worker`	none	Job processor — Redis BRPOP loop
`frontend`	5173 (internal)	Vite dev / built React SPA
`nginx`	80/443	Reverse proxy, TLS termination
`prometheus`	9090 (internal in prod)	Metrics scraping
`grafana`	3000 (internal in prod)	Dashboards

Every service is configured with:

restart: unless-stopped
logging: json-file driver with max-size: 50m, max-file: 5
CPU and memory limits and reservations
Docker health checks

Production vs Development

docker-compose.prod.yml overrides the base compose file to close ports 8000, 9090, and 3000 to public access, sets MEDIA_BASE_URL to the production domain, and mounts the production nginx config with full TLS. In development these ports are directly accessible for debugging.

Job Processing Pipeline

When a client's embedded widget triggers a try-on request:

Embed widget (tryon-embed.js) captures product and person images, sends a POST /api/v1/tryon with an API key
Admin API validates the key, checks the client's generation quota, processes images (resize, EXIF strip, WebP conversion), saves them to storage, pushes a JSON job payload to Redis
ML Worker picks up the job via BRPOP (blocking pop), calls fal.ai FASHN v1.5
fal.ai runs the inference (virtual try-on), returns a result URL
Worker updates the job status in PostgreSQL and fires registered webhooks
Client polls GET /api/v1/tryon/status/{id} until status is completed
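
The final step, client-side polling, can be sketched in Python as follows. This is a minimal illustration and not the platform's client code: `wait_for_job`, the injected `fetch_status` callable, the interval and timeout defaults, and the `{"status": ...}` response shape are assumptions made for the example. Only the status values (pending, processing, completed, failed) come from this report.

```python
import time
from typing import Callable, Dict

# Statuses after which polling can stop (names taken from this report).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[str], Dict[str, str]],
    job_id: str,
    interval_s: float = 2.0,    # assumed polling interval
    timeout_s: float = 600.0,   # assumed overall deadline
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        # In a real client this would call GET /api/v1/tryon/status/{id}.
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {job['status']} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a network or real delays; a caller would typically pass a function that wraps its HTTP client.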

4. Tech Stack

Backend

Component	Version	Notes
Python	3.11	Async throughout (asyncio, async SQLAlchemy)
FastAPI	0.111.0	ASGI, Pydantic v2 integration
SQLAlchemy	2.0	Async engine, `async_session` factory
Pydantic	v2	Breaking change: `AnyHttpUrl` is no longer a `str` subclass
Alembic	1.13	DB migrations, auto-applied on startup
PostgreSQL	15	Primary data store
Redis	7	AOF persistence, job queue + rate limiting
PyJWT	≥2.8.0	Not python-jose — import as `import jwt`
bcrypt	direct (5.x)	Not passlib — 5.x API changed, passlib incompatible
structlog	latest	JSON logging with sensitive-field redaction
httpx	latest	Async HTTP client for fal.ai calls

Pydantic v2 AnyHttpUrl Gotcha

In Pydantic v1, AnyHttpUrl was a str subclass and could be passed directly to asyncpg. In Pydantic v2 it is not. Any URL field passed to SQLAlchemy/asyncpg must be explicitly coerced:

# Wrong — asyncpg raises TypeError: expected str, got AnyHttpUrl
model = Job(webhook_url=body.webhook_url)

# Correct
model = Job(webhook_url=str(body.webhook_url))

# Correct for PATCH (mode="json" serializes URL fields to plain strings)
updates = body.model_dump(mode="json", exclude_unset=True)

Frontend

Component	Version	Notes
React	18	SPA, hooks, context
TypeScript	5	Strict mode
Vite	5	Build tool, dev server
Tailwind CSS	3	Utility-first styling
shadcn/ui	latest	Component library
Node.js	22	Build environment (Alpine image)
Axios	latest	HTTP client with JWT interceptor + auto-refresh
Zustand	latest	State management (separate stores for admin/portal auth)

Infrastructure

Component	Version	Notes
Docker Compose	V2	`docker compose` command
nginx	1.25	TLS, gzip, rate limiting, security headers
Prometheus	2.x	Metrics collection
Grafana	11.0.0	Dashboards, auto-provisioned
Pushgateway	1.10.0	Worker metrics (worker has no HTTP server)
fal.ai FASHN	v1.5	AI inference backend

5. Security Audit: 41 Findings

A formal security audit was conducted in two phases. 41 findings were identified, 38 closed, and 3 consciously deferred with written justification in the project's known-issues register.

Authentication & Access Control (8 findings)

Finding	Resolution
Refresh tokens stored as plaintext	SHA-256 pre-hash → bcrypt(rounds=12); `token_prefix` VARCHAR(16) for O(1) DB lookup
No brute-force protection on login	5 failed attempts → IP + email locked for 15 minutes (Redis counters)
Timing attack on user lookup	`dummy_password_check()` runs bcrypt against a pre-computed hash when the user is not found, so both paths take comparable time
Portal JWT used same scope as admin	Separate `scope="portal"` claim; `get_current_client` dependency validates scope
Refresh token lookup was O(n)	`token_prefix` index + composite `(token_prefix, user_id)` index — confirmed O(1) at scale
`/auth/change-password` no rate limit	Deferred: 15-min token window limits attack surface; rate limit scheduled for next sprint
API key brute force	`key_prefix` VARCHAR(20) fast lookup; bcrypt hash storage; per-key hourly rate limit in Redis
Portal login had no rate limit	Fixed: 5/min/IP via `get_redis` dependency. Was using `request.app.state.redis` (never set → always fail-open)
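
The login lockout rule in the table above (5 failed attempts, then a 15-minute lock per IP and email) can be illustrated with a small sketch. It is not the project's code: the `LoginLockout` class is hypothetical, and an in-memory dict with an injectable clock stands in for the Redis counters so the example runs on its own. Restarting the window on every failure is also an assumption.

```python
import time
from typing import Callable, Dict, Tuple

MAX_FAILURES = 5            # threshold from the audit resolution
LOCKOUT_SECONDS = 15 * 60   # lock duration from the audit resolution


class LoginLockout:
    """In-memory stand-in for Redis counters keyed by (ip, email)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (failure count, time at which the counter expires)
        self._failures: Dict[Tuple[str, str], Tuple[int, float]] = {}

    def is_locked(self, ip: str, email: str) -> bool:
        count, expires_at = self._failures.get((ip, email), (0, 0.0))
        if self._clock() >= expires_at:
            # Counter expired, which mimics a Redis key TTL running out.
            self._failures.pop((ip, email), None)
            return False
        return count >= MAX_FAILURES

    def record_failure(self, ip: str, email: str) -> None:
        count, expires_at = self._failures.get((ip, email), (0, 0.0))
        if self._clock() >= expires_at:
            count = 0
        # Assumption: every failure restarts the 15-minute window.
        self._failures[(ip, email)] = (count + 1, self._clock() + LOCKOUT_SECONDS)

    def record_success(self, ip: str, email: str) -> None:
        self._failures.pop((ip, email), None)
```

A login handler would call `is_locked` before verifying the password, then `record_failure` or `record_success` depending on the outcome.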

Input Validation & Injection (7 findings)

Finding	Resolution
f-string SQL in stats queries	Replaced with parameterized SQLAlchemy expressions throughout
SSRF via webhook URL	`validate_webhook_url()` blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolves hostname to check all returned IPs
SSRF via per-job webhook_url	Same validator applied at Pydantic schema layer — not just in the worker
MIME type not validated	Strict `image/jpeg` / `image/png` check via magic bytes, not just Content-Type header
Decompression bomb	`PIL.Image.MAX_IMAGE_PIXELS = 4096 * 4096`; explicit post-open pixel count check; rejects >16.7M pixels with HTTP 422
`TryOnRequest.mode` accepted any string	Changed to `Literal["balanced", "quality"]`; invalid values return 422
`ClientUpdate.status` accepted any string	Changed to `Literal["active", "suspended"]`; prevents `"banned"` persisting to DB
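
To make the SSRF rows above concrete, here is a hedged sketch of a webhook URL check built on the standard `ipaddress` module. It reuses the name `validate_webhook_url` for readability but is an illustration only, not the project's validator. The injectable `resolver` parameter, the `is_blocked` helper, and the inclusion of the IPv6 unique-local range are assumptions.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlsplit

BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16", "fe80::/10",                     # link-local
    "100.64.0.0/10",                                   # CGNAT
    "fc00::/7",  # IPv6 unique local (assumption, not listed in the audit)
)]


def resolve_host(host: str) -> List[str]:
    """Return every address the hostname resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_blocked(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:192.168.1.1.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def validate_webhook_url(url: str, resolver: Callable[[str], List[str]] = resolve_host) -> str:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("webhook URL must be http(s) with a hostname")
    addresses = resolver(parts.hostname)
    # Reject if ANY resolved address is internal, not just the first one.
    if not addresses or any(is_blocked(a) for a in addresses):
        raise ValueError("webhook URL resolves to a blocked address")
    return url
```

Checking every resolved address matters because a hostname can return both a public and an internal record. A check at validation time does not by itself rule out DNS rebinding between validation and delivery.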

Infrastructure & Configuration (14 findings)

Finding	Resolution
Port 8000 publicly exposed	Closed in `docker-compose.prod.yml`; nginx is the only ingress
Ports 9090, 3000 publicly exposed	Closed in production override
`CORS_ORIGINS` JSON parse error	pydantic-settings cannot parse comma-separated list; must use JSON array in `.env`
`/metrics` endpoint publicly readable	`location = /metrics { deny all; return 403; }` in both nginx configs
No `Cache-Control` on API responses	`RequestIdMiddleware` adds `Cache-Control: no-store, private` to all `/api/` responses
`ENABLE_DOCS` not enforced	`/docs`, `/redoc`, `/openapi.json` removed when `ENABLE_DOCS=false`
`Permissions-Policy` header missing	Added to both nginx configs: `camera=(), microphone=(), geolocation=()`
Security headers only on 2xx	Added `always` flag to all nginx security headers directives
`nginx client_max_body_size` not set	Set to `12m` (10 MB image + ~33% base64 overhead)
Prometheus targeted ml-worker:8001	Removed — ml-worker has no HTTP server; was causing Prometheus log spam
DOMAIN missing from `.env.example`	Added; `deploy.sh --prod` fails fast if `DOMAIN` is unset
`media_base_url_localhost` warning	Startup log + health endpoint `media_base_url_public: bool` field
Storage backend not observable	`GET /health` response includes `storage_backend: "local"/"s3"`
nginx `health` location missing	Added explicit `location = /health` in both configs; was falling through to Vite frontend
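
The `CORS_ORIGINS` row above comes down to how list-valued settings are decoded: pydantic-settings treats complex types from the environment as JSON. The sketch below reproduces that behaviour with the standard `json` module only. `parse_origins` is a hypothetical helper written for illustration, not part of pydantic-settings or of this project.

```python
import json
from typing import List


def parse_origins(raw: str) -> List[str]:
    """Decode a list-valued setting the way a JSON-based loader would."""
    value = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a JSON array of strings")
    return value


# Works: a JSON array, as required in .env
good = '["https://admin.ziex-tryon.com","https://app.ziex-tryon.com"]'

# Fails: a comma-separated list is not valid JSON
bad = "https://admin.ziex-tryon.com,https://app.ziex-tryon.com"
```

Calling `parse_origins(good)` returns the two origins as a list, while `parse_origins(bad)` raises a `ValueError`, which is the class of startup failure the finding describes.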

Frontend Correctness (7 findings)

Finding	Resolution
`Settings.tsx` called wrong endpoint	Was `PUT /auth/me/password`; fixed to `POST /auth/change-password` with correct field names
`ClientDetail.tsx` iterated envelope object	`GET /clients/{id}/keys` returns paginated envelope; fixed to use `.items`
Concurrent 401 refresh race	Module-level `refreshPromise` dedup in `api/client.ts` — second 401 reuses in-flight refresh
Portal webhooks wrong event namespace	Was `tryon.completed`; fixed to `job.completed` / `job.failed` (only valid values in `_VALID_EVENTS`)
`model_validate(update=...)` removed in Pydantic 2.13+	Webhook response built as plain dict before `model_validate`
Jobs date filter was dead code	Fully implemented: `date_from`/`date_to` query params wired on both frontend and backend
Plan creation missing `slug` field	Frontend auto-generates slug from plan name; API requires `slug` + `generations_limit`
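
The refresh-race fix above relies on a single-flight pattern: concurrent callers share one in-flight refresh instead of each starting their own. The real fix lives in TypeScript (`api/client.ts`); the sketch below is a Python `asyncio` analogue of the same idea, with a hypothetical `TokenRefresher` class and an injected `do_refresh` coroutine function, offered only to show the mechanism.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class TokenRefresher:
    """Share one in-flight refresh between concurrent callers."""

    def __init__(self, do_refresh: Callable[[], Awaitable[str]]) -> None:
        self._do_refresh = do_refresh
        self._inflight: Optional["asyncio.Future[str]"] = None

    async def refresh(self) -> str:
        if self._inflight is None:
            # The first caller starts the refresh; later callers await the same task.
            self._inflight = asyncio.ensure_future(self._run())
        return await self._inflight

    async def _run(self) -> str:
        try:
            return await self._do_refresh()
        finally:
            # Clear the slot so the next 401 triggers a fresh refresh.
            self._inflight = None
```

Three requests that hit a 401 at the same time would all await the same task and receive the same new token, and the refresh endpoint is called once.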

Observability & Monitoring (5 findings)

Finding	Resolution
No Prometheus metrics for worker	Pushgateway integration — worker pushes after each job; Prometheus scrapes Pushgateway
Worker healthcheck always passed	Was `python -c "sys.exit(0)"`; now checks `worker_heartbeat` Redis key is <90s old
No stale job detection	`_run_stale_job_sweep()` runs every 60s in admin-api lifespan; marks `processing >15min` and `pending >30min` as failed
No Grafana dashboards	3 dashboards auto-provisioned via `grafana/dashboards/` directory mount
Telegram alerts had no dedup	Redis `SETNX` with 1h TTL per alert key prevents duplicate notifications

6. CI/CD Pipeline

The CI/CD pipeline was built to enforce the project's quality standards automatically. A developer's workflow is: write code, push, and watch the pipeline do everything else.

Continuous Integration (`ci.yml`)

Push to any branch
       │
       ▼
  Lint (ruff)
  admin/ruff.toml: line-length=130, E501 ignored
       │
       ▼
  Tests (pytest)
  Services: postgres:15, redis:7 in GitHub Actions containers
  Threshold: --cov-fail-under=94 (≥97% coverage maintained)
       │
       ▼
  Docker build (admin-api + ml-worker)
  Verifies Dockerfiles are valid and dependencies install correctly

Continuous Deployment (`deploy.yml`)

CI workflow completes with success
       │
       ▼  (workflow_run trigger, not needs:)
  SSH to 185.246.222.107
       │
       ▼
  git pull origin main
       │
       ▼
  git diff HEAD~1 HEAD → detect changed services
       │
       ├── admin-api changed?  → docker compose build admin-api && up -d
       ├── ml-worker changed?  → docker compose build ml-worker && up -d
       ├── frontend changed?   → build in container → rsync dist/ to nginx
       └── nginx/compose changed? → docker compose up -d nginx
       │
       ▼
  curl /health → verify 200
       │
       ▼
  Deployment complete (~20 seconds)
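
The "detect changed services" step in the flow above amounts to mapping changed file paths to service names. The sketch below shows that mapping in Python for illustration; the actual deploy step is presumably shell-based. The `changed_services` function and every path prefix except `admin/` are hypothetical, since this report does not describe the repository layout.

```python
from typing import Iterable, List

# Hypothetical path prefixes; the real repository layout may differ.
SERVICE_PATHS = {
    "admin-api": ("admin/",),
    "ml-worker": ("worker/",),
    "frontend": ("frontend/",),
    "nginx": ("nginx/", "docker-compose"),
}


def changed_services(changed_files: Iterable[str]) -> List[str]:
    """Map file paths (e.g. from `git diff --name-only HEAD~1 HEAD`) to services."""
    files = list(changed_files)
    return [
        service
        for service, prefixes in SERVICE_PATHS.items()
        # str.startswith accepts a tuple of prefixes.
        if any(path.startswith(prefixes) for path in files)
    ]
```

Files that match no prefix, such as documentation, select no service, so a docs-only push would rebuild nothing.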

Why workflow_run instead of needs:

needs: only orders jobs within the same workflow file. workflow_run lets deploy.yml run after ci.yml finishes and gate on its success, even though the two live in separate files. Without this, deploy would fire on every push regardless of test results.

Local Development Tools

docker-compose.test.yml — runs the test suite with source volume-mounted; no rebuild required after code changes
Pre-push git hook — runs ruff check and pytest locally before any push reaches GitHub
scripts/generate-secrets.sh — generates all random secrets (JWT key, DB password, Grafana password, admin password) in one command

7. Test Suite

Testing was treated as a first-class deliverable. The suite grew in lockstep with the codebase using test-driven development practices.

Coverage Methodology

Before writing any test, the exact uncovered lines are identified:

cd admin && pytest --cov=app --cov-report=term-missing -q \
  2>&1 | grep -E "^app/" | sort -k4,4n | head -20

This prevents guessing what to test and ensures every new test file targets real coverage gaps — not imagined ones.

Test File Inventory

File	Tests	Coverage Area
`test_auth.py`	40+	Login, refresh, logout, change-password, brute force lockout
`test_clients.py`	25+	Client CRUD, suspend/activate, quota reset
`test_api_keys.py`	20+	Create, list, revoke API keys
`test_plans.py`	15+	Plan CRUD including delete (was 0% before audit)
`test_jobs.py`	15+	Job list/detail, status isolation between clients
`test_stats.py`	20+	Overview, realtime, per-client breakdown
`test_tryon.py`	30+	Submit, status, domain-check, rate limiting
`test_security.py`	20+	Timing attack, rate limit, domain whitelist, SSRF
`test_health.py`	10+	Health endpoint, readiness probe, error sanitization
`test_errors.py`	10+	Error format consistency across all endpoints
`test_dependencies.py`	15+	API key validation paths, fail-open/closed
`test_billing_service.py`	15+	Generation limit checks, usage log upsert
`test_image_processor.py`	31	Format validation, pipeline, resize, quality ramp, decompression bomb
`test_webhooks.py`	23	CRUD, HMAC signing, delivery, client isolation
`test_graceful_shutdown.py`	14	Signal handlers, request tracking, interrupted job marking
`test_multitenancy.py`	23	Data isolation, API key scoping, plan quota enforcement
`test_service_units.py`	20+	Stateless unit tests: config, auth, billing, queue
`test_alerting_service.py`	15+	Telegram alert formatting, dedup, trigger conditions
`test_embed_domain_check.py`	10+	Domain-check endpoint for embed pre-flight
`test_portal.py`	30+	Portal login, job scoping, usage, webhook CRUD, tenant isolation
`test_storage_service.py`	15	LocalStorageBackend save/delete/list, factory, unknown backend
`test_s3_storage.py`	8	S3StorageBackend via moto mock (skipped if boto3/moto absent)
`test_hardening.py`	12	Job status isolation, ENABLE_DOCS flag, Cache-Control, health sanitization
`test_observability.py`	15+	Health fields, Prometheus counters, webhook dedup, readiness
`test_sandbox.py`	32	Sandbox garment/model CRUD, pagination, public endpoints, auth enforcement

Total: 507+ tests, ≥97% coverage

8. Monitoring & Observability

Metrics Pipeline

Admin API  ──► Prometheus (scrape every 15s)  ──► Grafana
ML Worker  ──► Pushgateway  ──► Prometheus    ──► Grafana

The admin-api exposes GET /metrics (Prometheus format), which is blocked from public access at the nginx layer. The ML worker has no HTTP server, so it pushes metrics to Pushgateway after each job. Prometheus scrapes Pushgateway on the same 15-second cycle.

Four Prometheus counters are defined in admin/app/metrics.py:

Counter	Labels	Meaning
`tryon_submitted_total`	`client_id`	Every try-on request accepted
`tryon_completed_total`	`client_id`, `status`	Job completions (success/failure)
`tryon_rate_limited_total`	`reason`	Rate limit hits (`per_ip` or `limit_exceeded`)
`cleanup_files_deleted_total`	none	WebP files removed by cleanup service

Grafana Dashboards

Three dashboards are auto-provisioned from grafana/dashboards/ — no manual setup required after deploy:

Platform Overview — request rates, job throughput, error rates
Client Usage — per-client generation counts, quota utilization
Worker Health — ML Worker heartbeat age, job queue depth, fal.ai latency

Health Endpoints

Endpoint	Purpose	Use Case
`GET /health`	Detailed status (DB, Redis, worker heartbeat, fal.ai key, storage backend)	Monitoring dashboards
`GET /readiness`	Strict probe: 200 if DB+Redis reachable, 503 otherwise	Load balancer health checks

Worker Heartbeat & Stale Job Detection

The ML Worker writes a worker_heartbeat key to Redis every 30 seconds. The /health endpoint reports the age of this key — if the worker dies, the heartbeat goes stale and appears in the dashboard.
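
The heartbeat check can be sketched as a simple age comparison. The 30-second write interval and the 90-second healthcheck threshold come from this report. The function names and the assumption that the key stores a Unix timestamp are illustrative.

```python
import time
from typing import Optional

HEARTBEAT_INTERVAL_S = 30  # write interval stated in the text
HEARTBEAT_MAX_AGE_S = 90   # healthcheck threshold from the audit table


def heartbeat_age(last_beat_ts: Optional[float], now: Optional[float] = None) -> Optional[float]:
    """Seconds since the last heartbeat, or None if none was ever written."""
    if last_beat_ts is None:
        return None
    current = time.time() if now is None else now
    return max(0.0, current - last_beat_ts)


def worker_is_healthy(last_beat_ts: Optional[float], now: Optional[float] = None) -> bool:
    """Healthy only if a heartbeat exists and is younger than the threshold."""
    age = heartbeat_age(last_beat_ts, now)
    return age is not None and age < HEARTBEAT_MAX_AGE_S
```

With a 30-second interval, the 90-second threshold tolerates two consecutive missed writes before the worker is reported unhealthy.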

A background task in admin-api (_run_stale_job_sweep()) runs every 60 seconds and marks stuck jobs as failed:

processing status for >15 minutes → failed (worker died mid-job)
pending status for >30 minutes → failed (BRPOP race: job was popped but never processed)
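
The two thresholds can be expressed as a small predicate. This is a sketch under stated assumptions rather than the body of `_run_stale_job_sweep()`: the `is_stale` helper and the idea that each job carries an `updated_at` timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds from the text: how long a job may sit in each status.
STALE_AFTER = {
    "processing": timedelta(minutes=15),
    "pending": timedelta(minutes=30),
}


def is_stale(status: str, updated_at: datetime, now: Optional[datetime] = None) -> bool:
    """True when a job has sat in `status` longer than its threshold."""
    limit = STALE_AFTER.get(status)
    if limit is None:
        return False  # terminal statuses (completed, failed) are never swept
    current = now or datetime.now(timezone.utc)
    return current - updated_at > limit
```

A sweep would run this check, or an equivalent SQL filter, over open jobs every 60 seconds and mark the matches as failed.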

Telegram Alerts

alerting_service.py sends structured alerts to a Telegram bot. Redis SETNX with 1-hour TTL prevents duplicate notifications for the same alert type. Alert severity levels: info, warning, critical.
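
The dedup behaviour described above (set a key only if it does not exist, with a 1-hour expiry) can be sketched without a Redis server. The `AlertDeduper` class below is a hypothetical in-memory stand-in, and the alert key names used with it are illustrative.

```python
import time
from typing import Callable, Dict

DEDUP_TTL_SECONDS = 3600  # 1-hour TTL from the text


class AlertDeduper:
    """In-memory stand-in for Redis SETNX with an expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        """Return True only for the first alert with this key per TTL window."""
        now = self._clock()
        if self._expires_at.get(alert_key, float("-inf")) > now:
            return False  # key still set: duplicate, suppress it
        self._expires_at[alert_key] = now + DEDUP_TTL_SECONDS
        return True
```

With the redis-py client, the same check is commonly written as a single atomic call, `client.set(key, "1", nx=True, ex=3600)`, which returns a truthy value only when the key was newly set.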

9. Deployment Workflow

"git push → 20 seconds → production"

The deployment pipeline is fully automated. Here is the complete flow:

Developer: git push origin main
                │
                ▼ (GitHub Actions: ci.yml)
        1. ruff check admin/
        2. pytest --cov-fail-under=94
           (postgres:15 + redis:7 service containers)
        3. docker build admin-api
        4. docker build ml-worker
                │ (all pass)
                ▼ (GitHub Actions: deploy.yml, workflow_run trigger)
        5. SSH to 185.246.222.107
        6. git pull origin main
        7. diff HEAD~1 HEAD — find changed services
        8. docker compose build <changed services>
        9. docker compose up -d <changed services>
        10. curl https://ziex-tryon.com/health → 200
                │
                ▼
        Production updated ✓

Deployment Scripts

All scripts live in scripts/ and are designed to be idempotent and auditable.

generate-secrets.sh — run once before first deployment:

bash scripts/generate-secrets.sh
# Generates: JWT_SECRET_KEY, POSTGRES_PASSWORD,
#            GRAFANA_ADMIN_PASSWORD, FIRST_ADMIN_PASSWORD
# Outputs ready-to-paste .env entries
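
For readers who want to see what such a generator does, here is a rough Python equivalent using the standard `secrets` module. It is not a port of generate-secrets.sh: the function names and the 32-byte length are assumptions, and only the four variable names come from the comment above.

```python
import secrets
from typing import Dict

# Variable names listed for generate-secrets.sh in this report.
SECRET_NAMES = (
    "JWT_SECRET_KEY",
    "POSTGRES_PASSWORD",
    "GRAFANA_ADMIN_PASSWORD",
    "FIRST_ADMIN_PASSWORD",
)


def generate_env_entries(nbytes: int = 32) -> Dict[str, str]:
    """Create one URL-safe random value per secret name."""
    return {name: secrets.token_urlsafe(nbytes) for name in SECRET_NAMES}


def render_env(entries: Dict[str, str]) -> str:
    """Format the entries as NAME=value lines ready to paste into .env."""
    return "\n".join(f"{name}={value}" for name, value in entries.items())
```

`secrets.token_urlsafe` draws from the operating system's cryptographic random source, which is the property that matters for keys and passwords.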

deploy.sh --prod — full production deployment:

bash scripts/deploy.sh --prod
# 1. Pre-flight: checks DOMAIN set, TLS certs exist, .env complete
# 2. Build: docker compose -f ... -f docker-compose.prod.yml build
# 3. Ordered startup: postgres → redis → admin-api → ml-worker → nginx → monitoring
# 4. Health verification: polls /health until 200 or timeout

health-check.sh — post-deploy verification:

bash scripts/health-check.sh
# Checks: API /health 200, embed.js accessible, frontend loads
# Catches: accidentally exposed ports 8000, 9090, 3000

backup.sh — PostgreSQL backup with manifest:

bash scripts/backup.sh
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.dump
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.manifest.json
# Add to crontab for automated nightly backups

restore.sh — restore from dump:

bash scripts/restore.sh --drop-existing --yes backup_20260523_143022.dump
# Drops existing DB, restores from dump file

10. Application Features

Admin Panel (`admin.ziex-tryon.com`)

The admin panel is a React SPA providing full operational control:

Client management — create, edit, suspend, activate clients; assign plans; reset monthly usage
API key management — generate and revoke API keys per client; view key prefixes and creation dates
Plan management — create billing plans with monthly generation limits; assign to clients
Job dashboard — view all try-on jobs with filtering by client, status, date range
Statistics — platform overview (total clients, jobs, revenue); per-client usage breakdown; realtime metrics

Client Portal (`app.ziex-tryon.com`)

A separate React SPA (different Zustand store, different Axios instance, different JWT scope) for clients to self-serve:

Dashboard — current plan, usage this month, remaining generations
Jobs — view own try-on job history with results
API Keys — view active keys (create/revoke coming in next sprint)
Webhooks — register webhook endpoints to receive job.completed and job.failed events
Usage — monthly generation history

Embeddable Widget (`tryon-embed.js`)

A zero-dependency vanilla JavaScript widget that clients drop onto their product pages:

<script src="https://api.ziex-tryon.com/embed/tryon-embed.js"
        data-api-key="tryon_xxxxxxxxxxxx"></script>

Renders inside a Shadow DOM — fully isolated from the host page's CSS
FAB → modal interaction pattern (floating action button opens full-screen overlay)
SPA-aware: MutationObserver detects page changes; re-initializes when the product URL changes
Auto-detects product images from <img> tags matching known e-commerce patterns
Configurable: TryOnWidget.init({ apiUrl, container, theme })

Sandbox (`sandbox.ziex-tryon.com`)

A sandboxed environment for testing API integration without affecting production data:

Pre-loaded garment and model image library
Full try-on API available (rate-limited separately)
Returns realistic responses including result images

11. Known Limitations & Deferred Items

These items were explicitly reviewed and deferred with written justification — they are not gaps in awareness, but conscious product decisions:

Forgot Password

No POST /auth/forgot-password endpoint exists. Admin recovery requires direct DB access. This is intentional: at the current stage (single technical admin), email provider integration is out of scope. Recovery path is documented in the ops runbook.

change-password Rate Limiting

POST /auth/change-password has no rate limit. The attack surface is bounded to the 15-minute access token window. Rate limiting is on the next-sprint backlog.

S3 ACL Not Set

S3StorageBackend.put_object() does not set ACL="public-read". For AWS S3, a public bucket policy must be configured separately. Cloudflare R2 (the production storage backend) does not use ACLs — it uses bucket-level public access settings.

Worker Shutdown During Polling

Docker's stop_grace_period: 35s is shorter than fal.ai's 600s polling ceiling. A job mid-inference when SIGTERM arrives will be killed. Marked as a known issue; increasing stop_grace_period to 610s is the fix.

12. Scope Metrics

Metric	Value
Lines of Python backend code	~6,000
Lines of TypeScript frontend code	~4,000
Lines of tests	~5,000
Test files	26
Automated tests	507+
API endpoints	35+
Docker services	8
nginx server blocks	5+
Security findings identified	41
Security findings resolved	38
Consciously deferred	3
Code coverage	≥97%
Audit phases	2
CI/CD automation	Full (lint + test + build + deploy)
Time from git push to production	~20 seconds
Subdomains configured	6
Grafana dashboards	3
Alembic migrations	2

13. Conclusion

The TryOn SaaS platform was built with production-readiness as the primary constraint, not speed. Every design decision — from the choice of PyJWT over python-jose (active maintenance), to the bcrypt-direct approach over passlib (5.x compatibility), to the workflow_run CI/CD trigger (correct cross-workflow gating) — was made deliberately and documented.

The result is a platform that:

Handles multi-tenancy correctly — data isolation between clients is enforced at every layer (DB queries, Redis keys, JWT scopes)
Degrades gracefully — stale job detection, graceful shutdown with active-request draining, worker heartbeat monitoring
Is observable — every significant event is logged via structlog with a request_id trace that flows from nginx through the API to the ML worker to fal.ai
Is auditable — 507+ tests, 41-finding security audit, documented deferred items with written justification
Deploys safely — CI gates every deploy, health checks gate every deployment, pre-flight scripts validate secrets before touching containers