Project Report
Overview¶
This document is a comprehensive account of all work performed to design, build, harden, and ship the TryOn SaaS platform — a B2B virtual clothing try-on service that allows e-commerce businesses to embed an AI-powered widget on their websites. It is intended to give technical leadership a complete picture of the scope, depth, and engineering discipline that went into this system.
The platform was built from a blank VPS to a fully automated, production-hardened, monitored service with a 41-finding security audit, 507+ automated tests, ≥97% code coverage, and a CI/CD pipeline where a git push results in a live deployment in approximately 20 seconds.
1. Server & Infrastructure Setup¶
Provisioning¶
The foundation of the platform is a dedicated VPS purchased from xorek.cloud with the following specifications:
| Parameter | Value |
|---|---|
| CPU | 4 vCPU |
| RAM | 8 GB |
| Storage | 80 GB NVMe SSD |
| IP Address | 185.246.222.107 |
| Operating System | Ubuntu 22.04 LTS |
Day-One Hardening¶
Security hardening was applied in the very first session, before any application code was deployed:
- Root SSH disabled: remote root login is explicitly blocked in /etc/ssh/sshd_config
- Password authentication disabled: only public-key authentication is accepted
- Non-root deploy user: all operations run as deploy, with NOPASSWD sudo strictly scoped to Docker commands
- UFW firewall: only ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) are open; all other inbound traffic is denied by default
- fail2ban: protects the SSH port against brute-force attacks with automatic IP banning after repeated failed attempts
SSH Access Recovery
If the SSH key is ever lost, recovery is possible through the xorek.cloud web console. A backup of the sshd config is kept at /etc/ssh/sshd_config.backup.20260520 on the server.
2. Domain & DNS Setup¶
Cloudflare Integration¶
The domain ziex-tryon.com is registered and fully managed through Cloudflare. All DNS resolution runs through Cloudflare's proxy (orange-cloud mode), providing:
- DDoS protection at the DNS/CDN layer
- Automatic HTTP→HTTPS redirects
- Bot management and IP reputation filtering
- Edge caching for static assets
TLS Strategy¶
Rather than Let's Encrypt (which requires certificate renewal automation and ACME challenges), a Cloudflare Origin Certificate was chosen. This is a certificate issued directly by Cloudflare, trusted only between Cloudflare's edge and the origin server — enabling full-strict TLS mode without the operational overhead of certbot.
nginx OCSP Stapling
Cloudflare Origin Certificates are not signed by a public CA, so OCSP stapling must be disabled in nginx (the ssl_stapling and ssl_stapling_verify directives set to off).
Failing to set this causes nginx to log repeated OCSP errors and can delay TLS handshakes.
Subdomain Map¶
Six DNS entries were configured, all pointing to the same origin IP (185.246.222.107) but routed to different application contexts by nginx server blocks:
| Subdomain | Purpose |
|---|---|
| ziex-tryon.com | Main landing page |
| api.ziex-tryon.com | REST API (FastAPI backend) |
| admin.ziex-tryon.com | Admin panel (React SPA) |
| app.ziex-tryon.com | Client portal (React SPA) |
| sandbox.ziex-tryon.com | Integration sandbox for testing |
| docs.ziex-tryon.com | MkDocs documentation site |
3. Architecture¶
The platform was designed from scratch with a clear separation of concerns across stateless HTTP services, a queue-based job processing pipeline, and a dedicated ML inference backend.
Client Website
      │
      ▼
tryon-embed.js ──► Admin API (:8000) ──► Redis Queue ──► ML Worker
 (Shadow DOM)              │                                 │
                           │                                 ▼
                     PostgreSQL 15                   fal.ai FASHN v1.5
                           ▲                          (AI inference)
                           │
                  ┌────────┴────────┐
             Prometheus          Grafana
                  ▲          (3 dashboards)
                  │
             Pushgateway ◄── ML Worker metrics
                  │
            AlertManager ──► Telegram Bot
                  │
                nginx (TLS · rate limiting · security headers)
Service Inventory¶
The entire platform runs as 8 Docker services defined in docker-compose.yml with a production overlay in docker-compose.prod.yml:
| Service | Internal Port | Role |
|---|---|---|
| postgres | 5432 (internal) | PostgreSQL 15: persistent data store |
| redis | 6379 (internal) | Redis 7 AOF: job queue + rate limit counters |
| admin-api | 8000 | FastAPI: all business logic |
| ml-worker | none | Job processor: Redis BRPOP loop |
| frontend | 5173 (internal) | Vite dev / built React SPA |
| nginx | 80/443 | Reverse proxy, TLS termination |
| prometheus | 9090 (internal in prod) | Metrics scraping |
| grafana | 3000 (internal in prod) | Dashboards |
Every service is configured with:
- restart: unless-stopped
- logging: json-file driver with max-size: 50m, max-file: 5
- CPU and memory limits and reservations
- Docker health checks
Production vs Development
docker-compose.prod.yml overrides the base compose to close ports 8000, 9090, and 3000 from public access, sets MEDIA_BASE_URL to the production domain, and mounts the production nginx config with full TLS. In development these ports are accessible directly for debugging.
Job Processing Pipeline¶
When a client's embedded widget triggers a try-on request:
- Embed widget (tryon-embed.js) captures product and person images, sends a POST /api/v1/tryon with an API key
- Admin API validates the key, checks the client's generation quota, processes images (resize, EXIF strip, WebP conversion), saves them to storage, pushes a JSON job payload to Redis
- ML Worker picks up the job via BRPOP (blocking pop), calls fal.ai FASHN v1.5
- fal.ai runs the inference (virtual try-on), returns a result URL
- Worker updates the job status in PostgreSQL and fires registered webhooks
- Client polls GET /api/v1/tryon/status/{id} until status is completed
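The hand-off between the Admin API and the ML Worker described above can be sketched with an in-memory stand-in for Redis. This is an illustration under stated assumptions, not the platform's code: FakeRedis, the queue name tryon:jobs, the payload fields, and the run_inference callable are all hypothetical, and a real worker would use a Redis client's LPUSH and blocking BRPOP.

```python
import json
from collections import deque

QUEUE = "tryon:jobs"  # hypothetical queue name


class FakeRedis:
    """In-memory stand-in for the two Redis list commands the pipeline relies on."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        # Real BRPOP blocks until an item arrives; this stand-in returns None when empty.
        items = self.lists.get(key)
        return items.pop() if items else None


def submit_job(redis, jobs, job_id, person_url, garment_url):
    """API side: record the job as pending, then enqueue a JSON payload."""
    jobs[job_id] = {"status": "pending", "result_url": None}
    payload = {"job_id": job_id, "person": person_url, "garment": garment_url}
    redis.lpush(QUEUE, json.dumps(payload))


def work_once(redis, jobs, run_inference):
    """Worker side: pop one job, run inference, store the outcome.

    Returns the processed job id, or None when the queue is empty.
    """
    raw = redis.rpop(QUEUE)
    if raw is None:
        return None
    payload = json.loads(raw)
    job = jobs[payload["job_id"]]
    job["status"] = "processing"
    try:
        job["result_url"] = run_inference(payload["person"], payload["garment"])
        job["status"] = "completed"
    except Exception:
        job["status"] = "failed"
    return payload["job_id"]
```

Pushing on the left and popping on the right gives first-in, first-out ordering, so jobs are processed in submission order.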
4. Tech Stack¶
Backend¶
| Component | Version | Notes |
|---|---|---|
| Python | 3.11 | Async throughout (asyncio, async SQLAlchemy) |
| FastAPI | 0.111.0 | ASGI, Pydantic v2 integration |
| SQLAlchemy | 2.0 | Async engine, async_session factory |
| Pydantic | v2 | Breaking change: AnyHttpUrl is no longer a str subclass |
| Alembic | 1.13 | DB migrations, auto-applied on startup |
| PostgreSQL | 15 | Primary data store |
| Redis | 7 | AOF persistence, job queue + rate limiting |
| PyJWT | ≥2.8.0 | Not python-jose — import as import jwt |
| bcrypt | direct (5.x) | Not passlib — 5.x API changed, passlib incompatible |
| structlog | latest | JSON logging with sensitive-field redaction |
| httpx | latest | Async HTTP client for fal.ai calls |
Pydantic v2 AnyHttpUrl Gotcha
In Pydantic v1, AnyHttpUrl was a str subclass and could be passed directly to asyncpg. In Pydantic v2 it is not. Any URL field passed to SQLAlchemy/asyncpg must be explicitly coerced to a string first, for example with str(url).
Frontend¶
| Component | Version | Notes |
|---|---|---|
| React | 18 | SPA, hooks, context |
| TypeScript | 5 | Strict mode |
| Vite | 5 | Build tool, dev server |
| Tailwind CSS | 3 | Utility-first styling |
| shadcn/ui | latest | Component library |
| Node.js | 22 | Build environment (Alpine image) |
| Axios | latest | HTTP client with JWT interceptor + auto-refresh |
| Zustand | latest | State management (separate stores for admin/portal auth) |
Infrastructure¶
| Component | Version | Notes |
|---|---|---|
| Docker Compose | V2 | docker compose command |
| nginx | 1.25 | TLS, gzip, rate limiting, security headers |
| Prometheus | 2.x | Metrics collection |
| Grafana | 11.0.0 | Dashboards, auto-provisioned |
| Pushgateway | 1.10.0 | Worker metrics (worker has no HTTP server) |
| fal.ai FASHN | v1.5 | AI inference backend |
5. Security Audit — 41 Findings¶
A formal security audit was conducted in two phases. 41 findings were identified, 38 closed, and 3 consciously deferred with written justification in the project's known-issues register.
Authentication & Access Control (8 findings)¶
| Finding | Resolution |
|---|---|
| Refresh tokens stored as plaintext | SHA-256 pre-hash → bcrypt(rounds=12); token_prefix VARCHAR(16) for O(1) DB lookup |
| No brute-force protection on login | 5 failed attempts → IP + email locked for 15 minutes (Redis counters) |
| Timing attack on user lookup | dummy_password_check() runs bcrypt on a pre-computed hash when the user is not found, so both paths take comparable time |
| Portal JWT used same scope as admin | Separate scope="portal" claim; get_current_client dependency validates scope |
| Refresh token lookup was O(n) | token_prefix index + composite (token_prefix, user_id) index — confirmed O(1) at scale |
| /auth/change-password no rate limit | Deferred: 15-min token window limits attack surface; rate limit scheduled for next sprint |
| API key brute force | key_prefix VARCHAR(20) fast lookup; bcrypt hash storage; per-key hourly rate limit in Redis |
| Portal login had no rate limit | Fixed: 5/min/IP via get_redis dependency. Was using request.app.state.redis (never set → always fail-open) |
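The login lockout rule from the table above (5 failed attempts, then a 15-minute lock on IP and email) can be sketched as follows. This is a minimal illustration, not the platform's implementation: the LoginLockout class, the key prefixes, and the choice to count IP and email separately are assumptions, and a dictionary stands in for the Redis counters.

```python
import time

MAX_FAILURES = 5
LOCK_SECONDS = 15 * 60


class LoginLockout:
    """Counts failed logins per key; in-memory stand-in for Redis INCR + EXPIRE."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (failure_count, expires_at)

    @staticmethod
    def _keys(ip, email):
        # Hypothetical key layout: one counter per IP and one per email.
        return ("ip:" + ip, "email:" + email)

    def _live_count(self, key):
        count, expires_at = self._entries.get(key, (0, 0.0))
        if count and self._clock() >= expires_at:
            del self._entries[key]  # counter expired, like a Redis TTL
            return 0
        return count

    def is_locked(self, ip, email):
        """True when either the IP or the email has reached the failure limit."""
        return any(self._live_count(k) >= MAX_FAILURES for k in self._keys(ip, email))

    def record_failure(self, ip, email):
        """Increment both counters; each failure refreshes the 15-minute TTL."""
        for key in self._keys(ip, email):
            count = self._live_count(key)
            self._entries[key] = (count + 1, self._clock() + LOCK_SECONDS)
```

A login handler would call is_locked before checking the password and record_failure after a failed check. Injecting the clock keeps the expiry logic testable without sleeping.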
Input Validation & Injection (7 findings)¶
| Finding | Resolution |
|---|---|
| f-string SQL in stats queries | Replaced with parameterized SQLAlchemy expressions throughout |
| SSRF via webhook URL | validate_webhook_url() blocks RFC-1918, loopback, link-local, CGNAT; DNS-resolves hostname to check all returned IPs |
| SSRF via per-job webhook_url | Same validator applied at Pydantic schema layer — not just in the worker |
| MIME type not validated | Strict image/jpeg / image/png check via magic bytes, not just Content-Type header |
| Decompression bomb | PIL.Image.MAX_IMAGE_PIXELS = 4096 * 4096; explicit post-open pixel count check; rejects >16.7M pixels with HTTP 422 |
| TryOnRequest.mode accepted any string | Changed to Literal["balanced", "quality"]; invalid values return 422 |
| ClientUpdate.status accepted any string | Changed to Literal["active", "suspended"]; prevents "banned" persisting to DB |
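The SSRF check described in the table above (block private, loopback, link-local, and CGNAT ranges, and test every address a hostname resolves to) can be sketched with the standard library. This is not the project's validate_webhook_url(): the function names, the injectable resolver parameter, and the extra rejection of unspecified, multicast, and reserved ranges are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

CGNAT = ipaddress.ip_network("100.64.0.0/10")  # RFC 6598 shared address space


def is_blocked_ip(ip):
    """True for loopback, RFC 1918/private, link-local, CGNAT and other non-public targets."""
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_unspecified
        or ip.is_multicast
        or ip.is_reserved
        or (ip.version == 4 and ip in CGNAT)
    )


def resolve_all(host):
    """Return every address the host maps to; IP literals map to themselves."""
    try:
        return [ipaddress.ip_address(host)]
    except ValueError:
        infos = socket.getaddrinfo(host, None)
        return [ipaddress.ip_address(info[4][0]) for info in infos]


def validate_webhook_url_sketch(url, resolver=resolve_all):
    """Raise ValueError unless the URL is http(s) and every resolved IP is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("webhook URL must be http(s) with a host")
    addresses = resolver(parts.hostname)
    if not addresses or any(is_blocked_ip(ip) for ip in addresses):
        raise ValueError("webhook URL resolves to a non-public address")
    return url
```

Checking all returned addresses matters because a hostname can resolve to a mix of public and internal IPs. DNS answers can also change between validation and delivery, so a check like this is usually repeated when the webhook is actually sent.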
Infrastructure & Configuration (14 findings)¶
| Finding | Resolution |
|---|---|
| Port 8000 publicly exposed | Closed in docker-compose.prod.yml; nginx is the only ingress |
| Ports 9090, 3000 publicly exposed | Closed in production override |
| CORS_ORIGINS JSON parse error | pydantic-settings cannot parse comma-separated list; must use JSON array in .env |
| /metrics endpoint publicly readable | location = /metrics { deny all; return 403; } in both nginx configs |
| No Cache-Control on API responses | RequestIdMiddleware adds Cache-Control: no-store, private to all /api/ responses |
| ENABLE_DOCS not enforced | /docs, /redoc, /openapi.json removed when ENABLE_DOCS=false |
| Permissions-Policy header missing | Added to both nginx configs: camera=(), microphone=(), geolocation=() |
| Security headers only on 2xx | Added always flag to all nginx security headers directives |
| nginx client_max_body_size not set | Set to 12m (10 MB image + ~33% base64 overhead) |
| Prometheus targeted ml-worker:8001 | Removed: ml-worker has no HTTP server; was causing Prometheus log spam |
| DOMAIN missing from .env.example | Added; deploy.sh --prod fails fast if DOMAIN is unset |
| media_base_url_localhost warning | Startup log + health endpoint media_base_url_public: bool field |
| Storage backend not observable | GET /health response includes storage_backend: "local"/"s3" |
| nginx health location missing | Added explicit location = /health in both configs; was falling through to Vite frontend |
Frontend Correctness (7 findings)¶
| Finding | Resolution |
|---|---|
| Settings.tsx called wrong endpoint | Was PUT /auth/me/password; fixed to POST /auth/change-password with correct field names |
| ClientDetail.tsx iterated envelope object | GET /clients/{id}/keys returns paginated envelope; fixed to use .items |
| Concurrent 401 refresh race | Module-level refreshPromise dedup in api/client.ts; second 401 reuses in-flight refresh |
| Portal webhooks wrong event namespace | Was tryon.completed; fixed to job.completed / job.failed (only valid values in _VALID_EVENTS) |
| model_validate(update=...) removed in Pydantic 2.13+ | Webhook response built as plain dict before model_validate |
| Jobs date filter was dead code | Fully implemented: date_from/date_to query params wired on both frontend and backend |
| Plan creation missing slug field | Frontend auto-generates slug from plan name; API requires slug + generations_limit |
Observability & Monitoring (5 findings)¶
| Finding | Resolution |
|---|---|
| No Prometheus metrics for worker | Pushgateway integration — worker pushes after each job; Prometheus scrapes Pushgateway |
| Worker healthcheck always passed | Was python -c "sys.exit(0)"; now checks worker_heartbeat Redis key is <90s old |
| No stale job detection | _run_stale_job_sweep() runs every 60s in admin-api lifespan; marks processing >15min and pending >30min as failed |
| No Grafana dashboards | 3 dashboards auto-provisioned via grafana/dashboards/ directory mount |
| Telegram alerts had no dedup | Redis SETNX with 1h TTL per alert key prevents duplicate notifications |
6. CI/CD Pipeline¶
The CI/CD pipeline was built to enforce the project's quality standards automatically. A developer's workflow is: write code, push, and watch the pipeline do everything else.
Continuous Integration (ci.yml)¶
Push to any branch
│
▼
Lint (ruff)
admin/ruff.toml: line-length=130, E501 ignored
│
▼
Tests (pytest)
Services: postgres:15, redis:7 in GitHub Actions containers
Threshold: --cov-fail-under=94 (≥97% coverage maintained)
│
▼
Docker build (admin-api + ml-worker)
Verifies Dockerfiles are valid and dependencies install correctly
Continuous Deployment (deploy.yml)¶
CI workflow completes with success
│
▼ (workflow_run trigger, not needs:)
SSH to 185.246.222.107
│
▼
git pull origin main
│
▼
git diff HEAD~1 HEAD → detect changed services
│
├── admin-api changed? → docker compose build admin-api && up -d
├── ml-worker changed? → docker compose build ml-worker && up -d
├── frontend changed? → build in container → rsync dist/ to nginx
└── nginx/compose changed? → docker compose up -d nginx
│
▼
curl /health → verify 200
│
▼
Deployment complete (~20 seconds)
Why workflow_run instead of needs:
needs: only works within the same workflow file. workflow_run allows deploy.yml to gate on ci.yml success across separate files. Without this, deploy would fire on every push regardless of test results.
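The "detect changed services" step in the deployment flow above can be illustrated with a small mapping from changed file paths to services. This is a sketch only: apart from admin/, which appears elsewhere in this report, the path prefixes are hypothetical, and the real deploy workflow may use a different rule.

```python
# Hypothetical path-prefix -> service mapping; the real repository layout may differ.
SERVICE_PATHS = {
    "admin/": "admin-api",
    "ml-worker/": "ml-worker",
    "frontend/": "frontend",
    "nginx/": "nginx",
    "docker-compose": "nginx",
}


def changed_services(changed_files):
    """Map file paths (e.g. from `git diff --name-only HEAD~1 HEAD`) to services to redeploy."""
    services = set()
    for path in changed_files:
        for prefix, service in SERVICE_PATHS.items():
            if path.startswith(prefix):
                services.add(service)
    return sorted(services)
```

For example, changed_services(["admin/app/main.py", "README.md"]) selects only admin-api. Note that a diff of HEAD~1 against HEAD covers a single commit, so a push containing several commits would need a wider range to catch every changed service.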
Local Development Tools¶
- docker-compose.test.yml: runs the test suite with source volume-mounted; no rebuild required after code changes
- Pre-push git hook: runs ruff check and pytest locally before any push reaches GitHub
- scripts/generate-secrets.sh: generates all random secrets (JWT key, DB password, Grafana password, admin password) in one command
7. Test Suite¶
Testing was treated as a first-class deliverable. The suite grew in lockstep with the codebase using test-driven development practices.
Coverage Methodology¶
Before writing any test, the exact uncovered lines are identified:
cd admin && pytest --cov=app --cov-report=term-missing -q \
2>&1 | grep -E "^\s+app/" | sort -k4 -t% -n | head -20
This prevents guessing what to test and ensures every new test file targets real coverage gaps — not imagined ones.
Test File Inventory¶
| File | Tests | Coverage Area |
|---|---|---|
| test_auth.py | 40+ | Login, refresh, logout, change-password, brute force lockout |
| test_clients.py | 25+ | Client CRUD, suspend/activate, quota reset |
| test_api_keys.py | 20+ | Create, list, revoke API keys |
| test_plans.py | 15+ | Plan CRUD including delete (was 0% before audit) |
| test_jobs.py | 15+ | Job list/detail, status isolation between clients |
| test_stats.py | 20+ | Overview, realtime, per-client breakdown |
| test_tryon.py | 30+ | Submit, status, domain-check, rate limiting |
| test_security.py | 20+ | Timing attack, rate limit, domain whitelist, SSRF |
| test_health.py | 10+ | Health endpoint, readiness probe, error sanitization |
| test_errors.py | 10+ | Error format consistency across all endpoints |
| test_dependencies.py | 15+ | API key validation paths, fail-open/closed |
| test_billing_service.py | 15+ | Generation limit checks, usage log upsert |
| test_image_processor.py | 31 | Format validation, pipeline, resize, quality ramp, decompression bomb |
| test_webhooks.py | 23 | CRUD, HMAC signing, delivery, client isolation |
| test_graceful_shutdown.py | 14 | Signal handlers, request tracking, interrupted job marking |
| test_multitenancy.py | 23 | Data isolation, API key scoping, plan quota enforcement |
| test_service_units.py | 20+ | Stateless unit tests: config, auth, billing, queue |
| test_alerting_service.py | 15+ | Telegram alert formatting, dedup, trigger conditions |
| test_embed_domain_check.py | 10+ | Domain-check endpoint for embed pre-flight |
| test_portal.py | 30+ | Portal login, job scoping, usage, webhook CRUD, tenant isolation |
| test_storage_service.py | 15 | LocalStorageBackend save/delete/list, factory, unknown backend |
| test_s3_storage.py | 8 | S3StorageBackend via moto mock (skipped if boto3/moto absent) |
| test_hardening.py | 12 | Job status isolation, ENABLE_DOCS flag, Cache-Control, health sanitization |
| test_observability.py | 15+ | Health fields, Prometheus counters, webhook dedup, readiness |
| test_sandbox.py | 32 | Sandbox garment/model CRUD, pagination, public endpoints, auth enforcement |
Total: 507+ tests, ≥97% coverage
8. Monitoring & Observability¶
Metrics Pipeline¶
The admin-api exposes GET /metrics (Prometheus format), which is blocked from public access at the nginx layer. The ML worker has no HTTP server, so it pushes its metrics to Pushgateway after each job. Prometheus scrapes both admin-api and Pushgateway on the same 15-second cycle.
Four Prometheus counters are defined in admin/app/metrics.py:
| Counter | Labels | Meaning |
|---|---|---|
| tryon_submitted_total | client_id | Every try-on request accepted |
| tryon_completed_total | client_id, status | Job completions (success/failure) |
| tryon_rate_limited_total | reason | Rate limit hits (per_ip or limit_exceeded) |
| cleanup_files_deleted_total | none | WebP files removed by cleanup service |
Grafana Dashboards¶
Three dashboards are auto-provisioned from grafana/dashboards/ — no manual setup required after deploy:
- Platform Overview — request rates, job throughput, error rates
- Client Usage — per-client generation counts, quota utilization
- Worker Health — ML Worker heartbeat age, job queue depth, fal.ai latency
Health Endpoints¶
| Endpoint | Purpose | Use Case |
|---|---|---|
| GET /health | Detailed status (DB, Redis, worker heartbeat, fal.ai key, storage backend) | Monitoring dashboards |
| GET /readiness | Strict probe: 200 if DB+Redis reachable, 503 otherwise | Load balancer health checks |
Worker Heartbeat & Stale Job Detection¶
The ML Worker writes a worker_heartbeat key to Redis every 30 seconds. The /health endpoint reports the age of this key — if the worker dies, the heartbeat goes stale and appears in the dashboard.
A background task in admin-api (_run_stale_job_sweep()) runs every 60 seconds and marks jobs as failed if they are stuck:
- processing status for >15 minutes → failed (worker died mid-job)
- pending status for >30 minutes → failed (BRPOP race — job was popped but never processed)
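The heartbeat and sweep rules above reduce to two time comparisons. The sketch below shows them as pure functions; it is not the project's _run_stale_job_sweep(). The job dictionaries, their keys (id, status, updated_at), and the function names are hypothetical, and the 90-second heartbeat threshold is the one quoted for the worker healthcheck in the audit section.

```python
from datetime import datetime, timedelta

HEARTBEAT_MAX_AGE = timedelta(seconds=90)
TIMEOUTS = {
    "processing": timedelta(minutes=15),  # worker died mid-job
    "pending": timedelta(minutes=30),     # job never picked up or lost after pop
}


def worker_is_alive(last_heartbeat: "datetime | None", now: datetime) -> bool:
    """True when a heartbeat exists and is less than 90 seconds old."""
    return last_heartbeat is not None and now - last_heartbeat < HEARTBEAT_MAX_AGE


def sweep_stale_jobs(jobs: list, now: datetime) -> list:
    """Mark stuck jobs as failed in place and return their ids.

    Each job is a dict with hypothetical keys: id, status, updated_at.
    """
    failed_ids = []
    for job in jobs:
        timeout = TIMEOUTS.get(job["status"])
        if timeout is not None and now - job["updated_at"] > timeout:
            job["status"] = "failed"
            failed_ids.append(job["id"])
    return failed_ids
```

In the running service the equivalent logic would be executed as a database update every 60 seconds; keeping the rule in a pure function makes the thresholds easy to unit-test.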
Telegram Alerts¶
alerting_service.py sends structured alerts to a Telegram bot. Redis SETNX with 1-hour TTL prevents duplicate notifications for the same alert type. Alert severity levels: info, warning, critical.
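The deduplication described above behaves like a set-if-absent with an expiry. The following sketch mimics that behaviour in memory; it is not the code in alerting_service.py. The AlertDeduper class and the alert key names are hypothetical, and in production the same effect comes from a single Redis SET with the NX and EX options.

```python
import time

DEDUP_TTL_SECONDS = 3600  # 1-hour window from the text


class AlertDeduper:
    """In-memory stand-in for a Redis set-if-absent with a one-hour expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires_at = {}  # alert key -> expiry timestamp

    def should_send(self, alert_key: str) -> bool:
        """Return True the first time a key is seen within the TTL window, else False."""
        now = self._clock()
        expiry = self._expires_at.get(alert_key)
        if expiry is not None and now < expiry:
            return False  # key still set: duplicate, suppress
        self._expires_at[alert_key] = now + DEDUP_TTL_SECONDS
        return True
```

A caller would send the Telegram message only when should_send returns True, so a flapping condition produces at most one notification per alert key per hour.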
9. Deployment Workflow¶
"git push → 20 seconds → production"¶
The deployment pipeline is fully automated. Here is the complete flow:
Developer: git push origin main
│
▼ (GitHub Actions: ci.yml)
1. ruff check admin/
2. pytest --cov-fail-under=94
(postgres:15 + redis:7 service containers)
3. docker build admin-api
4. docker build ml-worker
│ (all pass)
▼ (GitHub Actions: deploy.yml, workflow_run trigger)
5. SSH to 185.246.222.107
6. git pull origin main
7. diff HEAD~1 HEAD — find changed services
8. docker compose build <changed services>
9. docker compose up -d <changed services>
10. curl https://ziex-tryon.com/health → 200
│
▼
Production updated ✓
Deployment Scripts¶
All scripts live in scripts/ and are designed to be idempotent and auditable.
generate-secrets.sh — run once before first deployment:
bash scripts/generate-secrets.sh
# Generates: JWT_SECRET_KEY, POSTGRES_PASSWORD,
# GRAFANA_ADMIN_PASSWORD, FIRST_ADMIN_PASSWORD
# Outputs ready-to-paste .env entries
deploy.sh --prod — full production deployment:
bash scripts/deploy.sh --prod
# 1. Pre-flight: checks DOMAIN set, TLS certs exist, .env complete
# 2. Build: docker compose -f ... -f docker-compose.prod.yml build
# 3. Ordered startup: postgres → redis → admin-api → ml-worker → nginx → monitoring
# 4. Health verification: polls /health until 200 or timeout
health-check.sh — post-deploy verification:
bash scripts/health-check.sh
# Checks: API /health 200, embed.js accessible, frontend loads
# Catches: accidentally exposed ports 8000, 9090, 3000
backup.sh — PostgreSQL backup with manifest:
bash scripts/backup.sh
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.dump
# Creates: /opt/tryon-saas-backups/backup_20260523_143022.manifest.json
# Add to crontab for automated nightly backups
restore.sh — restore from dump:
bash scripts/restore.sh --drop-existing --yes backup_20260523_143022.dump
# Drops existing DB, restores from dump file
10. Application Features¶
Admin Panel (admin.ziex-tryon.com)¶
The admin panel is a React SPA providing full operational control:
- Client management — create, edit, suspend, activate clients; assign plans; reset monthly usage
- API key management — generate and revoke API keys per client; view key prefixes and creation dates
- Plan management — create billing plans with monthly generation limits; assign to clients
- Job dashboard — view all try-on jobs with filtering by client, status, date range
- Statistics — platform overview (total clients, jobs, revenue); per-client usage breakdown; realtime metrics
Client Portal (app.ziex-tryon.com)¶
A separate React SPA (different Zustand store, different Axios instance, different JWT scope) for clients to self-serve:
- Dashboard: current plan, usage this month, remaining generations
- Jobs: view own try-on job history with results
- API Keys: view active keys (create/revoke coming in next sprint)
- Webhooks: register webhook endpoints to receive job.completed and job.failed events
- Usage: monthly generation history
Embeddable Widget (tryon-embed.js)¶
A zero-dependency vanilla JavaScript widget that clients drop onto their product pages:
<script src="https://api.ziex-tryon.com/embed/tryon-embed.js"
data-api-key="tryon_xxxxxxxxxxxx"></script>
- Renders inside a Shadow DOM: fully isolated from the host page's CSS
- FAB → modal interaction pattern (floating action button opens full-screen overlay)
- SPA-aware: MutationObserver detects page changes; re-initializes when the product URL changes
- Auto-detects product images from <img> tags matching known e-commerce patterns
- Configurable: TryOnWidget.init({ apiUrl, container, theme })
Sandbox (sandbox.ziex-tryon.com)¶
A sandboxed environment for testing API integration without affecting production data:
- Pre-loaded garment and model image library
- Full try-on API available (rate-limited separately)
- Returns realistic responses including result images
11. Known Limitations & Deferred Items¶
These items were explicitly reviewed and deferred with written justification — they are not gaps in awareness, but conscious product decisions:
Forgot Password
No POST /auth/forgot-password endpoint exists. Admin recovery requires direct DB access. This is intentional: at the current stage (single technical admin), email provider integration is out of scope. Recovery path is documented in the ops runbook.
change-password Rate Limiting
POST /auth/change-password has no rate limit. The attack surface is bounded to the 15-minute access token window. Rate limiting is on the next-sprint backlog.
S3 ACL Not Set
S3StorageBackend.put_object() does not set ACL="public-read". For AWS S3, a public bucket policy must be configured separately. Cloudflare R2 (the production storage backend) does not use ACLs — it uses bucket-level public access settings.
Worker Shutdown During Polling
Docker's stop_grace_period: 35s is shorter than fal.ai's 600s polling ceiling. A job that is still mid-inference 35 seconds after SIGTERM arrives will be killed. Marked as a known issue; increasing stop_grace_period to 610s is the planned fix.
12. Scope Metrics¶
| Metric | Value |
|---|---|
| Lines of Python backend code | ~6,000 |
| Lines of TypeScript frontend code | ~4,000 |
| Lines of tests | ~5,000 |
| Test files | 26 |
| Automated tests | 507+ |
| API endpoints | 35+ |
| Docker services | 8 |
| nginx server blocks | 5+ |
| Security findings identified | 41 |
| Security findings resolved | 38 |
| Consciously deferred | 3 |
| Code coverage | ≥97% |
| Audit phases | 2 |
| CI/CD automation | Full (lint + test + build + deploy) |
| Time from git push to production | ~20 seconds |
| Subdomains configured | 6 |
| Grafana dashboards | 3 |
| Alembic migrations | 2 |
13. Conclusion¶
The TryOn SaaS platform was built with production-readiness as the primary constraint, not speed. Every design decision — from the choice of PyJWT over python-jose (active maintenance), to the bcrypt-direct approach over passlib (5.x compatibility), to the workflow_run CI/CD trigger (correct cross-workflow gating) — was made deliberately and documented.
The result is a platform that:
- Handles multi-tenancy correctly: data isolation between clients is enforced at every layer (DB queries, Redis keys, JWT scopes)
- Degrades gracefully: stale job detection, graceful shutdown with active-request draining, worker heartbeat monitoring
- Is observable: every significant event is logged via structlog with a request_id trace that flows from nginx through the API to the ML worker to fal.ai
- Is auditable: 507+ tests, 41-finding security audit, documented deferred items with written justification
- Deploys safely: CI gates every deploy, health checks gate every deployment, pre-flight scripts validate secrets before touching containers