Security Audit
Status: Production-deployed
Platform URL: https://ziex-tryon.com
All 41 security findings resolved or consciously deferred. 507 tests. ≥97% coverage.
Overview
The TryOn SaaS platform is a B2B virtual clothing try-on service deployed at ziex-tryon.com. Since May 2026, 41 security and code-quality findings have been identified; each was either resolved or consciously deferred with a written justification. The platform now runs 507 automated tests at ≥97% code coverage, has full observability through Prometheus metrics and three Grafana dashboards, and deploys automatically through CI/CD with post-deploy health verification.
What Was Built
Core Platform
- Virtual Try-On API: Businesses integrate via REST API. Shoppers submit two images; AI returns a try-on result in 15–30 seconds.
- Multi-tenant billing: Each client has an account, monthly plan, API keys, and optional webhook callbacks.
- Client portal: Self-service job history, usage vs quota, and webhook management.
- Embeddable widget: Single `<script>` tag, auto-detects products, Shadow DOM isolation.
Security Hardening: 41 Findings Resolved
By category
| Category | Findings | Status |
|---|---|---|
| Authentication and access control | 8 | All resolved |
| Input validation and injection | 7 | All resolved |
| Infrastructure and configuration | 14 | Resolved + 3 deferred by design |
| Frontend correctness | 7 | All resolved |
| Observability and monitoring | 5 | Resolved in Phase 2 |
Key risks closed
| Risk | Before | After |
|---|---|---|
| Malicious URL injection via webhooks | No validation | Blocked: private IPs, loopback, cloud metadata. DNS-checked at registration. |
| Unlimited API abuse | No per-IP limit | 30 req/min per IP enforced |
| Brute-force password attacks | No lockout | 5-strike, 15-min lockout on IP + email |
| Cross-client job reads | No tenant scoping | Every status lookup filtered by client_id |
| Exposed monitoring data | Prometheus metrics publicly readable | Blocked at nginx (403) |
| Exposed internal ports | DB/Redis/Prometheus externally accessible | Production closes all internal ports |
| Unsafe image uploads | No decompression protection | Magic bytes check, 16.7M pixel limit, EXIF strip |
| Timing-based user enumeration | Login timing revealed username existence | Constant-time comparison on all code paths |
| Stuck worker jobs | No polling timeout | 600s ceiling; stale jobs auto-failed after 15 min |
| SQL injection | f-string SQL in 7 places | All parameterized |
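The webhook URL check in the table above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the function names `resolve_host` and `is_safe_webhook_url`, the injectable `resolver` parameter, and the HTTPS-only rule are all assumptions made for the example.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host: str) -> list[str]:
    """Return every IP address the host name resolves to (system resolver)."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_webhook_url(url: str, resolver=resolve_host) -> bool:
    """Accept only https URLs whose host resolves exclusively to public IPs.

    Hypothetical helper: the https-only rule is an assumption of this sketch.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    if not addresses:
        return False
    for raw in addresses:
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(raw.split("%")[0])
        # is_global is False for private, loopback, and link-local ranges,
        # which includes the 169.254.169.254 cloud metadata address.
        if not ip.is_global:
            return False
    return True
```

Rejecting the URL when any resolved address is non-public, rather than only the first, avoids accepting hosts that mix public and internal records. Running the same check again at delivery time would additionally cover hosts whose DNS records change after registration.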
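For the per-IP limit of 30 requests per minute listed above, one common approach is a sliding-window counter. The sketch below keeps state in process memory and uses a hypothetical `PerIpRateLimiter` class with an injectable clock; where and how the platform actually enforces the limit is not stated in this report.

```python
import time
from collections import defaultdict, deque

LIMIT = 30      # requests allowed per window (from the table above)
WINDOW = 60.0   # window length in seconds


class PerIpRateLimiter:
    """Sliding-window limiter keyed by client IP (in-memory sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW:
            hits.popleft()
        if len(hits) >= LIMIT:
            return False  # over the limit; the caller would answer HTTP 429
        hits.append(now)
        return True
```

An in-memory limiter only works for a single process; a deployment with several API workers would need shared state for the counters.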
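The 5-strike, 15-minute lockout could look something like the sketch below. It assumes the lockout is keyed on the (IP, email) pair, which is one reading of "IP + email"; the `LoginLockout` class and its method names are hypothetical, and state is held in memory purely for illustration.

```python
import time

MAX_FAILURES = 5           # strikes before lockout
LOCKOUT_SECONDS = 15 * 60  # lockout duration


class LoginLockout:
    """Lock an (ip, email) pair after repeated failed logins."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}      # key -> consecutive failure count
        self._locked_until = {}  # key -> time at which the lockout ends

    def is_locked(self, ip: str, email: str) -> bool:
        key = (ip, email.lower())
        until = self._locked_until.get(key)
        if until is None:
            return False
        if self._clock() >= until:
            del self._locked_until[key]  # lockout has expired
            return False
        return True

    def record_failure(self, ip: str, email: str) -> None:
        key = (ip, email.lower())
        count = self._failures.get(key, 0) + 1
        if count >= MAX_FAILURES:
            # Fifth strike: start the lockout and reset the counter.
            self._locked_until[key] = self._clock() + LOCKOUT_SECONDS
            count = 0
        self._failures[key] = count

    def record_success(self, ip: str, email: str) -> None:
        self._failures.pop((ip, email.lower()), None)
```

A login handler would call `is_locked` before checking credentials, then `record_failure` or `record_success` depending on the outcome.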
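The tenant-scoping and SQL-injection rows above come down to the same pattern: bind every value as a parameter and include `client_id` in the predicate. The sketch below uses the standard-library `sqlite3` module so it runs on its own; the `jobs` table, its columns, and `get_job_status` are hypothetical names, and a PostgreSQL driver would use its own placeholder style (for example `%s` or `$1`) instead of `?`.

```python
import sqlite3


def get_job_status(conn, client_id, job_id):
    """Return the job's status, or None if it does not belong to client_id."""
    row = conn.execute(
        # Placeholders keep user input out of the SQL text, and the
        # client_id predicate scopes the lookup to the calling tenant.
        "SELECT status FROM jobs WHERE id = ? AND client_id = ?",
        (job_id, client_id),
    ).fetchone()
    return row[0] if row else None


# Demo data: one job owned by client 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id TEXT PRIMARY KEY, client_id INTEGER, status TEXT)"
)
conn.execute("INSERT INTO jobs VALUES ('job-1', 1, 'completed')")
```

With this shape, a lookup for another tenant's job ID returns nothing rather than an authorization error, so a caller cannot tell whether the ID exists.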
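As an illustration of the upload checks, the sketch below validates a PNG's magic bytes and rejects images above the pixel limit by reading only the header, before any decompression. It assumes the 16.7M limit corresponds to 4096 × 4096 pixels, handles PNG only, and does not cover EXIF stripping; `check_png` is a hypothetical helper, and a real implementation would more likely rely on an imaging library.

```python
import struct

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
MAX_PIXELS = 4096 * 4096  # 16,777,216; assumed meaning of the 16.7M limit


def check_png(data: bytes) -> tuple[int, int]:
    """Validate magic bytes and pixel count from the IHDR chunk.

    Returns (width, height) or raises ValueError.
    """
    if len(data) < 24 or data[:8] != PNG_MAGIC:
        raise ValueError("not a PNG file")
    # Bytes 8-11 hold the chunk length, bytes 12-15 the chunk type.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # IHDR stores width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    if width == 0 or height == 0 or width * height > MAX_PIXELS:
        raise ValueError("image dimensions out of range")
    return width, height
```

Checking the declared dimensions first means a small file that claims enormous dimensions is rejected without allocating memory for it.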
Current Platform State
| Metric | Value |
|---|---|
| Automated tests | 507+ across 26 test files |
| Code coverage | ≥97% |
| Open security findings | 0 |
| Monitoring | Prometheus + Grafana (3 auto-provisioned dashboards) |
| Alerting | Telegram (worker health, error rates, DB/Redis reachability) |
| Backup | Daily automated, 30-day retention |
| CI/CD | lint → test → build → deploy on CI success |
| TLS | Cloudflare proxy, HSTS, TLSv1.2+ |
Audit sign-off
Package 10 audit completed 2026-05-18. All 44 browser E2E steps pass (Admin UI + Portal + Embed). All production readiness checks pass (health, security headers, nginx, docker-compose.prod.yml, backup, deploy.sh). Zero ruff warnings.
Consciously Deferred Items
These items were evaluated and explicitly deferred — they are not open issues.
| Item | Rationale | Residual Risk |
|---|---|---|
| Forgot password / email recovery | Requires email provider. Admin recovery via direct DB is sufficient at current scale. | Low |
| Portal API key self-service (create/revoke) | Admin-managed keys are secure. Self-service is a UX improvement for Phase 3. | None |
| Presigned S3 upload URLs | Current base64 flow works at this volume. | None |
| `change-password` rate limiting | 15-min access token window limits the attack surface. | Very low |
| Kubernetes GPU worker scaling | Single server handles current volume. | None |
Deferred ≠ ignored
Each deferred item has a written justification and will be revisited when the trigger condition is met (e.g., email recovery when an email provider is added; K8s when load requires it).
Technical Debt Register
These are known code-level limitations that do not block launch but should be addressed in upcoming sprints.
| Issue | Impact | Effort |
|---|---|---|
| `avg_latency_ms` not recalculated on CONFLICT | Usage dashboard shows first job's latency for a client, not a running average | Low: SQL formula fix |
| Admin webhook CRUD: `user.id == client.id` assumption | A second admin user without a matching Client record gets 404 on all webhook operations | Low: query fix |
| `TRYON_COMPLETED` Prometheus counter not incremented | Dashboard completion counter always shows zero | Low: one `.inc()` call in worker |
| Node.js 20 EOL June 2026 | Frontend build uses an unsupported runtime after June 2026 | Low: update base image to `node:22-alpine` |
| CI diff uses `HEAD~1` | Only detects last commit's changes; multi-commit pushes may miss earlier changed services | Medium: switch to `github.event.before...after` diff |
Node.js 20 EOL
The frontend Dockerfile uses node:20-alpine. Node.js 20 reaches end-of-life on 2026-06-02. Upgrade to node:22-alpine before that date. No code changes expected — Vite 5 supports Node 22.
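For the `avg_latency_ms` item in the register above, the "SQL formula fix" would plausibly be an incremental mean computed inside the upsert. The sketch below demonstrates the idea with the standard-library `sqlite3` module (upsert needs SQLite 3.24 or later); the `client_usage` table, the `job_count` column, and `record_job` are hypothetical names, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client_usage ("
    "client_id INTEGER PRIMARY KEY, "
    "job_count INTEGER NOT NULL, "
    "avg_latency_ms REAL NOT NULL)"
)

# On conflict, fold the new sample into the stored mean:
#   new_avg = (old_avg * old_count + sample) / (old_count + 1)
# The right-hand sides of SET all see the row's pre-update values.
UPSERT = """
INSERT INTO client_usage (client_id, job_count, avg_latency_ms)
VALUES (?, 1, ?)
ON CONFLICT (client_id) DO UPDATE SET
    avg_latency_ms = (client_usage.avg_latency_ms * client_usage.job_count
                      + excluded.avg_latency_ms) / (client_usage.job_count + 1),
    job_count = client_usage.job_count + 1
"""


def record_job(conn, client_id, latency_ms):
    """Insert the first sample or update the running average."""
    conn.execute(UPSERT, (client_id, latency_ms))
```

The same `ON CONFLICT ... DO UPDATE` form, with `excluded` referring to the proposed row, is also available in PostgreSQL.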
Recommended Next Phase
Prioritized by business impact:
| Phase | Item | Estimated Effort |
|---|---|---|
| 1 | Stripe billing integration — paid plan enforcement, invoicing | 2–3 weeks |
| 2 | Email notifications (SendGrid/Postmark) — forgot password, job alerts | 1 week |
| 3 | Portal self-service improvements — API key create/revoke, plan upgrade requests | 1–2 weeks |
| 4 | Performance scaling — when triggered by load metrics | 3–4 weeks |
Phase 1 prerequisite
Stripe integration requires an email provider (for billing confirmations and receipts), so Phases 1 and 2 are best developed together.
Deployment Architecture Summary
Internet → Cloudflare Proxy (DDoS, WAF)
→ nginx (:443, TLS 1.2+, HSTS)
→ admin-api (FastAPI, internal only)
→ postgres / redis (internal only)
→ ml-worker (internal, BRPOP loop)
→ fal.ai FASHN v1.5 (external AI)
Monitoring: Prometheus → Grafana (internal)
Alerting: AlertManager → Telegram Bot
Backups: pg_dump → /opt/tryon-saas-backups/ (daily, 30-day retention)
CI/CD: GitHub Actions → SSH deploy on CI success
None of the internal services (postgres, redis, prometheus, grafana) is exposed to the internet. In production, UFW leaves only ports 22, 80, and 443 open.