Security Audit

Status: Production-deployed

Platform URL: https://ziex-tryon.com
All 41 security findings resolved or consciously deferred. 507 tests. ≥97% coverage.

Overview

The TryOn SaaS platform is a B2B virtual clothing try-on service deployed at ziex-tryon.com. Since May 2026, 41 security and code-quality findings have been identified; each was either resolved or consciously deferred with written justification. The platform now operates with 507 automated tests, at least 97% code coverage, full observability via Prometheus metrics and three Grafana dashboards, and automated CI/CD deployment with post-deploy health verification.

What Was Built

Core Platform

Virtual Try-On API — Businesses integrate via REST API. Shoppers submit two images; AI returns a try-on result in 15–30 seconds.
Multi-tenant billing — Each client has an account, monthly plan, API keys, and optional webhook callbacks.
Client portal — Self-service: job history, usage vs quota, webhook management.
Embeddable widget — Single `<script>` tag, auto-detects products, Shadow DOM isolation.

Security Hardening — 41 Findings Resolved

By category

Category	Findings	Status
Authentication and access control	8	All resolved
Input validation and injection	7	All resolved
Infrastructure and configuration	14	Resolved + 3 deferred by design
Frontend correctness	7	All resolved
Observability and monitoring	5	Resolved in Phase 2

Key risks closed

Risk	Before	After
Malicious URL injection via webhooks	No validation	Blocked: private IPs, loopback, cloud metadata. DNS-checked at registration.
Unlimited API abuse	No per-IP limit	30 req/min per IP enforced
Brute-force password attacks	No lockout	5-strike, 15-min lockout on IP + email
Cross-client job reads	No tenant scoping	Every status lookup filtered by `client_id`
Exposed monitoring data	Prometheus metrics publicly readable	Blocked at nginx (403)
Exposed internal ports	DB/Redis/Prometheus externally accessible	Production closes all internal ports
Unsafe image uploads	No decompression protection	Magic bytes check, 16.7M pixel limit, EXIF strip
Timing-based user enumeration	Login timing revealed username existence	Constant-time comparison on all code paths
Stuck worker jobs	No polling timeout	600s ceiling; stale jobs auto-failed after 15 min
SQL injection	f-string SQL in 7 places	All parameterized
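
The webhook row above describes blocking private, loopback, and cloud-metadata addresses with a DNS check at registration. The platform's actual validator is not shown in this report, so the following is only a minimal sketch of that kind of check. The names `validate_webhook_url`, `is_blocked_ip`, and the injectable `resolver` parameter are hypothetical, and the https-only rule is an assumption rather than a documented requirement.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _default_resolver(host: str) -> list:
    """Resolve a hostname to the IP address strings it maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_blocked_ip(addr: str) -> bool:
    """True for private, loopback, link-local, and other non-public addresses.

    The cloud metadata address 169.254.169.254 is link-local, so it is covered.
    """
    ip = ipaddress.ip_address(addr.split("%", 1)[0])  # drop any IPv6 scope id
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:  # treat ::ffff:a.b.c.d as the IPv4 address it wraps
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def validate_webhook_url(url: str, resolver=_default_resolver) -> None:
    """Raise ValueError unless the URL resolves only to public addresses."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # assumption: only https callbacks are accepted
        raise ValueError("webhook URL must use https")
    host = parts.hostname
    if not host:
        raise ValueError("webhook URL has no host")
    try:
        ipaddress.ip_address(host)
        addresses = [host]  # host is already an IP literal
    except ValueError:
        try:
            addresses = resolver(host)
        except OSError as exc:
            raise ValueError(f"cannot resolve {host!r}") from exc
    if not addresses:
        raise ValueError(f"{host!r} did not resolve to any address")
    for addr in addresses:
        if is_blocked_ip(addr):
            raise ValueError(f"{host!r} resolves to blocked address {addr}")
```

A check at registration time does not by itself protect against a hostname whose DNS record changes later, so a design along these lines would normally repeat the same check immediately before each delivery.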

Current Platform State

Metric	Value
Automated tests	507+ across 26 test files
Code coverage	≥97%
Open security findings	0
Monitoring	Prometheus + Grafana (3 auto-provisioned dashboards)
Alerting	Telegram (worker health, error rates, DB/Redis reachability)
Backup	Daily automated, 30-day retention
CI/CD	lint → test → build → deploy on CI success
TLS	Cloudflare proxy, HSTS, TLSv1.2+

Audit sign-off

Package 10 audit completed 2026-05-18. All 44 browser E2E steps pass (Admin UI + Portal + Embed). All production readiness checks pass (health, security headers, nginx, docker-compose.prod.yml, backup, deploy.sh). Zero ruff warnings.

Consciously Deferred Items

These items were evaluated and explicitly deferred — they are not open issues.

Item	Rationale	Residual Risk
Forgot password / email recovery	Requires email provider. Admin recovery via direct DB is sufficient at current scale.	Low
Portal API key self-service (create/revoke)	Admin-managed keys are secure. Self-service is a UX improvement for Phase 3.	None
Presigned S3 upload URLs	Current base64 flow works at this volume.	None
`change-password` rate limiting	15-min access token window limits the attack surface.	Very low
Kubernetes GPU worker scaling	Single server handles current volume.	None

Deferred ≠ ignored

Each deferred item has a written justification and will be revisited when the trigger condition is met (e.g., email recovery when an email provider is added; K8s when load requires it).

Technical Debt Register

These are known code-level limitations that do not block launch but should be addressed in upcoming sprints.

Issue	Impact	Effort
`avg_latency_ms` not recalculated on CONFLICT	Usage dashboard shows first job's latency for a client, not a running average	Low — SQL formula fix
Admin webhook CRUD: `user.id == client.id` assumption	A second admin user without a matching Client record gets 404 on all webhook operations	Low — query fix
`TRYON_COMPLETED` Prometheus counter not incremented	Dashboard completion counter always shows zero	Low — one `.inc()` call in worker
Node.js 20 EOL June 2026	Frontend build uses an unsupported runtime after June 2026	Low — update base image to `node:22-alpine`
CI diff uses `HEAD~1`	Only detects last commit's changes; multi-commit pushes may miss earlier changed services	Medium — switch to `github.event.before...after` diff
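
For the `avg_latency_ms` item, one possible shape of the "SQL formula fix" is an upsert that folds each new latency into a running average: new_avg = (old_avg × job_count + latency) / (job_count + 1). The sketch below uses an in-memory SQLite database as a stand-in for the production Postgres instance. The table name `client_usage` and the `job_count` column are hypothetical, since the real schema is not shown here.

```python
import sqlite3

# In-memory SQLite stands in for the production database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE client_usage (
        client_id      TEXT PRIMARY KEY,
        job_count      INTEGER NOT NULL,
        avg_latency_ms REAL NOT NULL
    )
    """
)

# On conflict, the right-hand sides read the existing row (client_usage.*)
# and the incoming row (excluded.*), so the average is updated incrementally.
UPSERT = """
    INSERT INTO client_usage (client_id, job_count, avg_latency_ms)
    VALUES (?, 1, ?)
    ON CONFLICT (client_id) DO UPDATE SET
        avg_latency_ms = (client_usage.avg_latency_ms * client_usage.job_count
                          + excluded.avg_latency_ms)
                         / (client_usage.job_count + 1),
        job_count = client_usage.job_count + 1
"""


def record_job(client_id: str, latency_ms: float) -> None:
    """Record one finished job and update the client's running average."""
    conn.execute(UPSERT, (client_id, float(latency_ms)))


def avg_latency(client_id: str) -> float:
    """Return the stored running average for a client."""
    row = conn.execute(
        "SELECT avg_latency_ms FROM client_usage WHERE client_id = ?",
        (client_id,),
    ).fetchone()
    return row[0]


for latency in (100, 200, 300):
    record_job("client-a", latency)
print(avg_latency("client-a"))  # 200.0
```

Keeping a job count next to the average is what makes the incremental update possible. Without it, the conflict branch has no way to weight the previous value against the new one.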

Node.js 20 EOL

The frontend Dockerfile uses `node:20-alpine`. Node.js 20 reaches end-of-life on 2026-06-02. Upgrade to `node:22-alpine` before that date. No code changes are expected, since Vite 5 supports Node 22.

Recommended Next Phase

Prioritized by business impact:

Phase	Item	Estimated Effort
1	Stripe billing integration — paid plan enforcement, invoicing	2–3 weeks
2	Email notifications (SendGrid/Postmark) — forgot password, job alerts	1 week
3	Portal self-service improvements — API key create/revoke, plan upgrade requests	1–2 weeks
4	Performance scaling — when triggered by load metrics	3–4 weeks

Phase 1 prerequisite

Stripe integration requires an email provider (for billing confirmations and receipts), so Phases 1 and 2 are best developed together.

Deployment Architecture Summary

Internet → Cloudflare Proxy (DDoS, WAF)
         → nginx (:443, TLS 1.2+, HSTS)
         → admin-api (FastAPI, internal only)
         → postgres / redis (internal only)
         → ml-worker (internal, BRPOP loop)
         → fal.ai FASHN v1.5 (external AI)

Monitoring: Prometheus → Grafana (internal)
Alerting:   AlertManager → Telegram Bot
Backups:    pg_dump → /opt/tryon-saas-backups/ (daily, 30-day retention)
CI/CD:      GitHub Actions → SSH deploy on CI success

None of the internal services (postgres, redis, prometheus, grafana) is exposed to the internet. In production, UFW leaves only ports 22, 80, and 443 open.
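
The ml-worker in the diagram waits on an external AI service, and the key risks table gives a 600s ceiling for that wait. The worker's real loop is not shown in this report, so this is only a minimal sketch of bounded polling. The name `poll_until_done`, the `PollTimeout` exception, and the 2-second interval are hypothetical. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class PollTimeout(Exception):
    """Raised when the job is still not finished at the polling ceiling."""


def poll_until_done(
    check: Callable[[], Optional[str]],
    ceiling_s: float = 600.0,   # ceiling taken from the key risks table
    interval_s: float = 2.0,    # assumed interval, not from the report
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check() until it returns a result or the ceiling would be exceeded.

    check() returns None while the job is still running and a result string
    (for example an output URL) once it has finished.
    """
    deadline = clock() + ceiling_s
    while True:
        result = check()
        if result is not None:
            return result
        # Stop before sleeping past the deadline so the wait stays bounded.
        if clock() + interval_s > deadline:
            raise PollTimeout(f"job not finished after {ceiling_s:.0f}s")
        sleep(interval_s)
```

A caller would catch `PollTimeout` and mark the job as failed, so that a stalled upstream request cannot hold a worker indefinitely.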