Security Audit

Status: Production-deployed

Platform URL: https://ziex-tryon.com
All 41 security findings resolved or consciously deferred. 507 tests. ≥97% coverage.


Overview

The TryOn SaaS platform is a B2B virtual clothing try-on service deployed at ziex-tryon.com. Since May 2026, 41 security and code-quality findings have been identified; each was either resolved or consciously deferred with written justification. The platform now runs with 507 automated tests, at least 97% code coverage, observability through Prometheus metrics and three Grafana dashboards, and automated CI/CD deployment with post-deploy health verification.


What Was Built

Core Platform

  1. Virtual Try-On API — Businesses integrate via REST API. Shoppers submit two images; AI returns a try-on result in 15–30 seconds.
  2. Multi-tenant billing — Each client has an account, monthly plan, API keys, and optional webhook callbacks.
  3. Client portal — Self-service: job history, usage vs quota, webhook management.
  4. Embeddable widget — Single <script> tag, auto-detects products, Shadow DOM isolation.

Security Hardening — 41 Findings Resolved

By category

| Category | Findings | Status |
|---|---|---|
| Authentication and access control | 8 | All resolved |
| Input validation and injection | 7 | All resolved |
| Infrastructure and configuration | 14 | Resolved + 3 deferred by design |
| Frontend correctness | 7 | All resolved |
| Observability and monitoring | 5 | Resolved in Phase 2 |

Key risks closed

| Risk | Before | After |
|---|---|---|
| Malicious URL injection via webhooks | No validation | Blocked: private IPs, loopback, cloud metadata. DNS-checked at registration. |
| Unlimited API abuse | No per-IP limit | 30 req/min per IP enforced |
| Brute-force password attacks | No lockout | 5-strike, 15-min lockout on IP + email |
| Cross-client job reads | No tenant scoping | Every status lookup filtered by client_id |
| Exposed monitoring data | Prometheus metrics publicly readable | Blocked at nginx (403) |
| Exposed internal ports | DB/Redis/Prometheus externally accessible | Production closes all internal ports |
| Unsafe image uploads | No decompression protection | Magic bytes check, 16.7M pixel limit, EXIF strip |
| Timing-based user enumeration | Login timing revealed username existence | Constant-time comparison on all code paths |
| Stuck worker jobs | No polling timeout | 600s ceiling; stale jobs auto-failed after 15 min |
| SQL injection | f-string SQL in 7 places | All parameterized |
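
The webhook row above describes blocking private, loopback, and cloud-metadata targets with a DNS check at registration. A minimal sketch of such a check follows; it is an illustration under stated assumptions, not the platform's actual code. The function names, the `resolve` parameter (added so the check can be exercised without network access), and the choice to accept both http and https are all hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return True only for globally routable addresses."""
    ip = ipaddress.ip_address(address)
    # is_global is False for private ranges, loopback, and link-local
    # addresses (link-local includes the 169.254.169.254 metadata endpoint).
    return ip.is_global


def validate_webhook_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the URL could point at an internal address.

    Hypothetical helper: name, signature, and accepted schemes are assumptions.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve {parts.hostname!r}") from exc
    if not infos:
        raise ValueError(f"no addresses for {parts.hostname!r}")
    # Reject the URL if ANY resolved address is non-public.
    for info in infos:
        address = info[4][0]
        if not is_public_ip(address):
            raise ValueError(
                f"{parts.hostname!r} resolves to non-public address {address}"
            )
```

Because DNS answers can change after registration, a check like this may also be worth repeating at delivery time.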

Current Platform State

| Metric | Value |
|---|---|
| Automated tests | 507+ across 26 test files |
| Code coverage | ≥97% |
| Open security findings | 0 |
| Monitoring | Prometheus + Grafana (3 auto-provisioned dashboards) |
| Alerting | Telegram (worker health, error rates, DB/Redis reachability) |
| Backup | Daily automated, 30-day retention |
| CI/CD | lint → test → build → deploy on CI success |
| TLS | Cloudflare proxy, HSTS, TLSv1.2+ |

Audit sign-off

Package 10 audit completed 2026-05-18. All 44 browser E2E steps pass (Admin UI + Portal + Embed). All production readiness checks pass (health, security headers, nginx, docker-compose.prod.yml, backup, deploy.sh). Zero ruff warnings.


Consciously Deferred Items

These items were evaluated and explicitly deferred — they are not open issues.

| Item | Rationale | Residual Risk |
|---|---|---|
| Forgot password / email recovery | Requires email provider. Admin recovery via direct DB is sufficient at current scale. | Low |
| Portal API key self-service (create/revoke) | Admin-managed keys are secure. Self-service is a UX improvement for Phase 3. | None |
| Presigned S3 upload URLs | Current base64 flow works at this volume. | None |
| change-password rate limiting | 15-min access token window limits the attack surface. | Very low |
| Kubernetes GPU worker scaling | Single server handles current volume. | None |

Deferred ≠ ignored

Each deferred item has a written justification and will be revisited when the trigger condition is met (e.g., email recovery when an email provider is added; K8s when load requires it).


Technical Debt Register

These are known code-level limitations that do not block launch but should be addressed in upcoming sprints.

| Issue | Impact | Effort |
|---|---|---|
| avg_latency_ms not recalculated on CONFLICT | Usage dashboard shows first job's latency for a client, not a running average | Low: SQL formula fix |
| Admin webhook CRUD: user.id == client.id assumption | A second admin user without a matching Client record gets 404 on all webhook operations | Low: query fix |
| TRYON_COMPLETED Prometheus counter not incremented | Dashboard completion counter always shows zero | Low: one .inc() call in worker |
| Node.js 20 EOL June 2026 | Frontend build uses an unsupported runtime after June 2026 | Low: update base image to node:22-alpine |
| CI diff uses HEAD~1 | Only detects last commit's changes; multi-commit pushes may miss earlier changed services | Medium: switch to github.event.before...after diff |
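
For the avg_latency_ms item above, the usual fix is to fold each new latency into the stored mean inside the upsert: new_avg = (old_avg × n + x) / (n + 1). The sketch below shows that formula with an `ON CONFLICT ... DO UPDATE` clause. It runs against in-memory SQLite purely for illustration; the table name `client_usage`, its columns other than avg_latency_ms, and the `record_job` helper are hypothetical, and the production schema may differ.

```python
import sqlite3

# Hypothetical schema: one usage row per client per month.
SCHEMA = """
CREATE TABLE client_usage (
    client_id      INTEGER NOT NULL,
    month          TEXT    NOT NULL,
    job_count      INTEGER NOT NULL,
    avg_latency_ms REAL    NOT NULL,
    PRIMARY KEY (client_id, month)
)
"""

# On conflict, right-hand sides see the row's values from BEFORE the update,
# so job_count in the average is still the old count n.
UPSERT = """
INSERT INTO client_usage (client_id, month, job_count, avg_latency_ms)
VALUES (?, ?, 1, ?)
ON CONFLICT (client_id, month) DO UPDATE SET
    avg_latency_ms = (client_usage.avg_latency_ms * client_usage.job_count
                      + excluded.avg_latency_ms)
                     / (client_usage.job_count + 1),
    job_count = client_usage.job_count + 1
"""


def record_job(conn: sqlite3.Connection, client_id: int, month: str,
               latency_ms: float) -> None:
    """Insert the first job of the month, or update the running average."""
    conn.execute(UPSERT, (client_id, month, latency_ms))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for ms in (100.0, 200.0, 300.0):
        record_job(conn, 1, "2026-05", ms)
    print(conn.execute(
        "SELECT job_count, avg_latency_ms FROM client_usage"
    ).fetchone())
```

The values are bound as parameters rather than interpolated into the SQL string, in line with the parameterization already applied elsewhere in the audit.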

Node.js 20 EOL

The frontend Dockerfile uses node:20-alpine. Node.js 20 reaches end-of-life on 2026-06-02, so the base image should be upgraded to node:22-alpine before that date. No code changes are expected, since Vite 5 supports Node 22.


Next phases, prioritized by business impact:

| Phase | Item | Estimated Effort |
|---|---|---|
| 1 | Stripe billing integration: paid plan enforcement, invoicing | 2–3 weeks |
| 2 | Email notifications (SendGrid/Postmark): forgot password, job alerts | 1 week |
| 3 | Portal self-service improvements: API key create/revoke, plan upgrade requests | 1–2 weeks |
| 4 | Performance scaling, when triggered by load metrics | 3–4 weeks |

Phase 1 prerequisite

Stripe integration requires an email provider (for billing confirmations and receipts), so Phases 1 and 2 are best developed together.


Deployment Architecture Summary

Internet → Cloudflare Proxy (DDoS, WAF)
         → nginx (:443, TLS 1.2+, HSTS)
         → admin-api (FastAPI, internal only)
         → postgres / redis (internal only)
         → ml-worker (internal, BRPOP loop)
         → fal.ai FASHN v1.5 (external AI)

Monitoring: Prometheus → Grafana (internal)
Alerting:   AlertManager → Telegram Bot
Backups:    pg_dump → /opt/tryon-saas-backups/ (daily, 30-day retention)
CI/CD:      GitHub Actions → SSH deploy on CI success

None of the internal services (postgres, redis, prometheus, grafana) is exposed to the internet. In production, UFW allows inbound traffic only on ports 22, 80, and 443.
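
The backup line in the summary above (daily pg_dump, 30-day retention) implies some pruning step. One way that step might look is sketched below; this is an assumption about the mechanism, not a description of the deployed script. The `prune_backups` name, the `*.sql.gz` file pattern, and the use of file modification time as the backup age are all hypothetical.

```python
import time
from pathlib import Path

RETENTION_DAYS = 30  # matches the retention period stated in the summary


def prune_backups(backup_dir, retention_days=RETENTION_DAYS, now=None,
                  pattern="*.sql.gz"):
    """Delete backup files older than the retention window.

    Returns the list of removed paths. `pattern` is an assumed naming
    convention; adjust it to whatever the dump job actually writes.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Age is judged by modification time, which assumes dumps are
        # never touched after they are written.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A call such as `prune_backups("/opt/tryon-saas-backups/")` after each daily dump would keep the directory within the retention window.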