AI-governance-first monitoring unifying metrics, logs, and traces across the multi-tenant BPMN workflow platform.
The NestJS API has zero instrumentation. Prometheus, Grafana, Loki, and Tempo are not deployed. No unified observability profile exists in Docker for local development. This creates blind spots in incident response and prevents data-driven AI governance decisions.
Prometheus, Grafana, Loki, Tempo, Promtail, Node Exporter, AlertManager
HTTP requests, duration, errors, active connections for NestJS API
Platform Overview, Workflow Metrics, AI/LLM Monitoring, Infrastructure
9 prompt enhancement + 9 Styx workflow alerts consolidated in Prometheus
| Goal | Success Criteria | Timeline |
|---|---|---|
| Complete Visibility | All services instrumented with metrics, logs, traces | Q1 2026 |
| AI Governance | GOV-011 through GOV-018 controls mapped and monitored | Q1 2026 |
| Production Parity | Local Docker observability matches GKE production | Q1 2026 |
| Operational Excellence | MTTD < 5 min, MTTR < 30 min for P0 incidents | Q2 2026 |
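The MTTD and MTTR targets in the last row can be checked arithmetically from incident timestamps. The following is a minimal sketch, not the platform's implementation: the `Incident` shape and its field names are hypothetical, and MTTR is assumed here to run from detection to resolution (other definitions measure it from the start of the fault).

```typescript
// Hypothetical incident record; field names are illustrative, not from the PRD.
interface Incident {
  startedAt: number;  // epoch ms when the fault began
  detectedAt: number; // epoch ms when an alert fired or a human noticed
  resolvedAt: number; // epoch ms when service was restored
}

const MINUTE_MS = 60000;

function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// MTTD: average minutes from fault start to detection.
function mttdMinutes(incidents: Incident[]): number {
  return mean(incidents.map((i) => (i.detectedAt - i.startedAt) / MINUTE_MS));
}

// MTTR: average minutes from detection to resolution (assumed definition).
function mttrMinutes(incidents: Incident[]): number {
  return mean(incidents.map((i) => (i.resolvedAt - i.detectedAt) / MINUTE_MS));
}

// Targets from the goals table: MTTD < 5 min, MTTR < 30 min for P0 incidents.
// An empty incident list yields NaN and therefore returns false.
function meetsP0Targets(incidents: Incident[]): boolean {
  return mttdMinutes(incidents) < 5 && mttrMinutes(incidents) < 30;
}

// Two made-up P0 incidents, for illustration only.
const sampleIncidents: Incident[] = [
  { startedAt: 0, detectedAt: 2 * MINUTE_MS, resolvedAt: 22 * MINUTE_MS },
  { startedAt: 0, detectedAt: 6 * MINUTE_MS, resolvedAt: 30 * MINUTE_MS },
];
```

For the two made-up incidents, the mean detection time is 4 minutes and the mean resolution time is 22 minutes, so both targets are met even though one incident took 6 minutes to detect. Because the targets are means, a percentile might be a stricter choice.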
Unified metrics, logs, and traces flowing through OpenTelemetry to purpose-built storage backends with Grafana as the single pane of glass.
Metrics: rival_api_http_requests_total, rival_api_http_request_duration_seconds, rival_api_errors_total, rival_api_active_connections, styx_workflow_* (20+ worker metrics)
All observability services run behind --profile observability for lean local development. One command to start: ./scripts/dev.sh --observability
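To make the rival_api_* metrics concrete, here is a minimal in-memory sketch of a counter and a gauge rendered in the Prometheus text exposition format. It is an illustration under assumptions, not the API's actual instrumentation: the `Metric` class, the help strings, the `method` and `status` labels, and the request hooks are all hypothetical. In practice, a client library such as prom-client would likely take the place of this hand-rolled code.

```typescript
type Labels = Record<string, string>;

// Escape backslash, double quote, and newline as the exposition format requires.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render labels as {k="v",...} with sorted keys so that the same label set
// always maps to the same series key.
function formatLabels(labels: Labels): string {
  const keys = Object.keys(labels).sort();
  if (keys.length === 0) return "";
  return "{" + keys.map((k) => `${k}="${escapeLabelValue(labels[k])}"`).join(",") + "}";
}

// Hypothetical minimal metric: one numeric value per label set.
class Metric {
  private series: Record<string, number> = {};

  constructor(
    readonly name: string,
    readonly help: string,
    readonly type: "counter" | "gauge"
  ) {}

  add(labels: Labels, delta: number): void {
    if (this.type === "counter" && delta < 0) {
      throw new Error("counters can only increase");
    }
    const key = formatLabels(labels);
    this.series[key] = (this.series[key] || 0) + delta;
  }

  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} ${this.type}`,
    ];
    Object.keys(this.series).forEach((key) => {
      lines.push(`${this.name}${key} ${this.series[key]}`);
    });
    return lines.join("\n");
  }
}

// Metric names come from the list above; help text and labels are assumptions.
const requests = new Metric("rival_api_http_requests_total", "Total HTTP requests", "counter");
const active = new Metric("rival_api_active_connections", "Open connections", "gauge");

// Hypothetical request lifecycle hooks, e.g. called from middleware.
function onRequestStart(): void {
  active.add({}, 1);
}
function onRequestEnd(method: string, status: number): void {
  active.add({}, -1);
  requests.add({ method, status: String(status) }, 1);
}

onRequestStart();
onRequestEnd("GET", 200);

// Text that a metrics endpoint could return for Prometheus to scrape.
const metricsBody = [requests.render(), active.render()].join("\n") + "\n";
```

The duration metric would normally be a histogram, which needs bucket handling and is left out to keep the sketch short.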
| Service | Image | Port | Purpose |
|---|---|---|---|
| Prometheus | prom/prometheus:v2.45.0 | 9090 | Metrics storage & alerting |
| Grafana | grafana/grafana:10.2.3 | 3333 | Dashboards & visualization |
| Loki | grafana/loki:2.9.0 | 3100 | Log aggregation |
| Tempo | grafana/tempo:2.2.0 | 3200 | Trace storage |
| Promtail | grafana/promtail:2.9.0 | n/a | Log shipping to Loki |
| Node Exporter | prom/node-exporter:v1.6.0 | 9100 | System metrics |
| AlertManager | prom/alertmanager:v0.27.0 | 9093 | Alert routing |
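Once the profile is up, a quick smoke test is to probe each service on its listed port. The sketch below only builds the URLs; it does not perform any network calls. The names and ports come from the table above, while the health paths are the endpoints these projects commonly document and should be treated as assumptions to verify against the deployed versions. The `ObsService` type and `healthUrls` helper are hypothetical.

```typescript
// Service catalog copied from the table; healthPath values are assumptions.
interface ObsService {
  name: string;
  port: number | null;       // null: no port listed in the table
  healthPath: string | null; // null: nothing to probe over HTTP
}

const obsServices: ObsService[] = [
  { name: "Prometheus", port: 9090, healthPath: "/-/healthy" },
  { name: "Grafana", port: 3333, healthPath: "/api/health" },
  { name: "Loki", port: 3100, healthPath: "/ready" },
  { name: "Tempo", port: 3200, healthPath: "/ready" },
  { name: "Promtail", port: null, healthPath: null },
  { name: "Node Exporter", port: 9100, healthPath: "/metrics" },
  { name: "AlertManager", port: 9093, healthPath: "/-/healthy" },
];

// Build one probe URL per service that exposes a port and a health path.
function healthUrls(host: string, list: ObsService[]): string[] {
  const urls: string[] = [];
  for (const s of list) {
    if (s.port !== null && s.healthPath !== null) {
      urls.push(`http://${host}:${s.port}${s.healthPath}`);
    }
  }
  return urls;
}

const localProbes = healthUrls("localhost", obsServices);
```

Each URL in `localProbes` could then be fetched by a script or CI step, with a non-2xx response treated as a failed start.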
Volumes: prometheus_data, grafana_data, loki_data, tempo_data, alertmanager_data
Alerts previously scattered across 2 YAML files are consolidated into a single Prometheus-native alert configuration with governance control mapping.
Prompt enhancement alerts:
| Alert | Condition |
|---|---|
| QualityScoreDrift | Median < 0.70 for 1h |
| HighRetryRate | >20% retry rate for 15m |
| ModelRoutingAnomaly | Opus > 40% for 30m |
| LowHaikuUtilization | Haiku < 30% for 2h |
| LatencySpike | P95 > 300s for 10m |
| ThroughputDegradation | <50% normal for 15m |
| GovernanceComplianceDrift | Parse failure >10% for 30m |
| HighManualReviewRate | >50% manual for 1h |
| CortexOneFunctionErrors | >10% error for 10m |

Styx workflow alerts:
| Alert | Condition |
|---|---|
| styx_workflow_stuck | Active > 8h for 5m |
| styx_agent_failure_rate | >5% for 15m |
| styx_sla_breach_count | >3 breaches/h for 5m |
| styx_hitl_task_aging | Pending > 4h |
| styx_circuit_breaker_open | State = open for 5m |
| styx_confidence_trending_down | <0.70 avg for 2h |
| styx_high_regeneration_rate | >30% for 20m |
| styx_cortexone_latency | P95 > 300s for 30m |
| styx_workflow_completion_rate | <80% for 1h |
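Most of the conditions above have the form "threshold breached for a duration". In Prometheus-style alerting, such a rule stays pending while the condition holds and fires only once it has held continuously for the full duration. A single healthy evaluation resets the clock. The sketch below illustrates that state logic for HighRetryRate (>20% for 15m). The `Sample` type, the `evaluateAlert` function, and the 5-minute evaluation spacing are illustrative assumptions, not the platform's rule engine.

```typescript
type AlertState = "inactive" | "pending" | "firing";

interface Sample {
  t: number;     // evaluation time in seconds
  value: number; // observed value, e.g. retry rate as a fraction
}

// Walk the evaluations in order and return the alert state after the last one.
function evaluateAlert(
  samples: Sample[],
  breached: (value: number) => boolean,
  forSeconds: number
): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = "inactive";
  for (const s of samples) {
    if (!breached(s.value)) {
      // Any healthy evaluation resets the pending window.
      activeSince = null;
      state = "inactive";
      continue;
    }
    if (activeSince === null) activeSince = s.t;
    state = s.t - activeSince >= forSeconds ? "firing" : "pending";
  }
  return state;
}

// HighRetryRate from the table: retry rate above 20% for 15 minutes.
const retryRateTooHigh = (v: number): boolean => v > 0.2;
const FIFTEEN_MINUTES = 15 * 60;

// Breached at every 5-minute evaluation for a full 15 minutes.
const sustained: Sample[] = [
  { t: 0, value: 0.25 }, { t: 300, value: 0.3 },
  { t: 600, value: 0.22 }, { t: 900, value: 0.27 },
];

// Same series with one healthy evaluation at t=600, which resets the window.
const interrupted: Sample[] = [
  { t: 0, value: 0.25 }, { t: 300, value: 0.3 },
  { t: 600, value: 0.1 }, { t: 900, value: 0.27 },
];

const sustainedState = evaluateAlert(sustained, retryRateTooHigh, FIFTEEN_MINUTES);
const interruptedState = evaluateAlert(interrupted, retryRateTooHigh, FIFTEEN_MINUTES);
```

The sustained series ends in the firing state, while the interrupted one is only pending at the last evaluation. This is why longer durations such as 1h or 2h trade detection speed for fewer false alarms.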
Every alert rule maps to a governance control (GOV-011 through GOV-018) with evidence collection for SOC2 CC7.2 and EU AI Act compliance.
| Control | Description | Monitoring Metric | Alert |
|---|---|---|---|
| GOV-011 | Model Quality Assurance | styx_confidence_score | QualityScoreDrift |
| GOV-012 | Model Routing Efficiency | styx_model_selected_total | ModelRoutingAnomaly |
| GOV-013 | Cost Optimization | styx_tokens_used_total | LowHaikuUtilization |
| GOV-014 | Human Oversight | styx_hitl_tasks_pending | styx_hitl_task_aging |
| GOV-015 | Structured Output Integrity | styx_structured_output_parse_total | GovernanceComplianceDrift, HighRetryRate |
| GOV-016 | Workflow SLA Compliance | styx_sla_breach_total | styx_sla_breach_count |
| GOV-017 | Audit Trail Completeness | styx_workflow_duration_seconds | styx_workflow_stuck |
| GOV-018 | Circuit Breaker Resilience | styx_circuit_breaker_state | styx_circuit_breaker_open |
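Because the mapping is meant to cover every alert, a small coverage check can flag alert names that have no control assigned. The sketch below uses the 18 alert names and the control-to-alert pairs exactly as they appear in this summary. The `unmappedAlerts` helper is hypothetical, and the full PRD may contain mappings that this summary table omits.

```typescript
// The 18 alert names listed in this document.
const allAlerts: string[] = [
  "QualityScoreDrift", "HighRetryRate", "ModelRoutingAnomaly",
  "LowHaikuUtilization", "LatencySpike", "ThroughputDegradation",
  "GovernanceComplianceDrift", "HighManualReviewRate", "CortexOneFunctionErrors",
  "styx_workflow_stuck", "styx_agent_failure_rate", "styx_sla_breach_count",
  "styx_hitl_task_aging", "styx_circuit_breaker_open",
  "styx_confidence_trending_down", "styx_high_regeneration_rate",
  "styx_cortexone_latency", "styx_workflow_completion_rate",
];

// Control-to-alert mapping as shown in the governance table.
const controlAlerts: Record<string, string[]> = {
  "GOV-011": ["QualityScoreDrift"],
  "GOV-012": ["ModelRoutingAnomaly"],
  "GOV-013": ["LowHaikuUtilization"],
  "GOV-014": ["styx_hitl_task_aging"],
  "GOV-015": ["GovernanceComplianceDrift", "HighRetryRate"],
  "GOV-016": ["styx_sla_breach_count"],
  "GOV-017": ["styx_workflow_stuck"],
  "GOV-018": ["styx_circuit_breaker_open"],
};

// Return every alert that no control in the mapping refers to.
function unmappedAlerts(alerts: string[], mapping: Record<string, string[]>): string[] {
  const mapped: Record<string, boolean> = {};
  Object.keys(mapping).forEach((control) => {
    (mapping[control] || []).forEach((alert) => {
      mapped[alert] = true;
    });
  });
  return alerts.filter((alert) => !mapped[alert]);
}

const uncovered = unmappedAlerts(allAlerts, controlAlerts);
```

Run against the two lists as they appear here, the check returns nine alert names, beginning with LatencySpike. That may simply mean the table is abridged. A check like this could run in CI so that a new alert cannot be added without a control.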
--profile observability
GET /api/metrics endpoint
dev.sh --observability
Phoenix (Arize) for LLM traces: token usage, cost, bias detection, hallucination risk.
No competitor offers native AI governance monitoring combined with cost-efficient self-hosting.
Every alert mapped to GOV-011 through GOV-018 controls. 13-month retention for SOC2 CC7.2. EU AI Act audit trail built in. Evidence collection automated.
100% open source: Prometheus, Grafana, Loki, Tempo, OpenTelemetry. Portable across clouds. We own the data. No per-host or per-GB pricing.
One command: ./scripts/dev.sh --observability. Pre-provisioned dashboards. Local dev parity with GKE production. Under 15-minute onboarding.
Full PRD available at docs/prd/observability-platform.md (1,640 lines, 73KB)