CC7.2 — System monitoring (metrics, logs, alerts)

Status: Partially implemented (audit-chain integrity + on-chain anchor monitoring live; Grafana dashboard + Prometheus stack target Phase 1)
Owner: Agent #38 (Senior Compliance Lead, SOC 2 + ISO 27001)
Last reviewed: 2026-05-28
Next review: 2026-08-28

Trust Services Criteria reference

The entity monitors system components and the operation of those components for anomalies that are indicative of malicious acts, natural disasters, and errors affecting the entity's ability to meet its objectives; anomalies are analysed to determine whether they represent security events. The control covers metrics collection, log retention, alert thresholds, the on-call rotation, and the integration with incident-response.

How ZeroAuth meets this control

The monitoring stack is split between always-on technical signals and human-cadence reviews.

Audit-chain integrity monitoring. The drift detector mandated by ADR 0013 ("Drift detection" section) runs hourly and replays the last N rows per tenant against the recorded event_hash. A mismatch triggers a severity-1 alert. The /api/admin/audit-integrity endpoint (commit d634b2d) exposes the integrity-status surface; the dashboard's AuditIntegrityView consumes it for the Scene-5 demo. Threat-model rows A-14, A-22 capture the in-DB tampering attack surface.
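The replay step can be sketched as follows. This is a minimal illustration, not the ADR 0013 implementation: the row shape, the hashing scheme (SHA-256 over the previous hash plus the payload), and the names computeEventHash, buildChain, and findFirstDrift are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit_events schema may differ.
interface AuditRow {
  id: number;
  payload: string;   // canonical serialisation of the event
  prevHash: string;  // event_hash of the previous row in the tenant's chain
  eventHash: string; // hash recorded when the row was written
}

// Assumed scheme: event_hash = SHA-256(prevHash + "\n" + payload), hex-encoded.
function computeEventHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(`${prevHash}\n${payload}`).digest("hex");
}

// Test helper: builds a well-formed chain starting from a known anchor hash.
function buildChain(payloads: string[], anchorHash: string): AuditRow[] {
  const rows: AuditRow[] = [];
  let prevHash = anchorHash;
  payloads.forEach((payload, index) => {
    const eventHash = computeEventHash(prevHash, payload);
    rows.push({ id: index + 1, payload, prevHash, eventHash });
    prevHash = eventHash;
  });
  return rows;
}

// Replays rows in chain order; returns the id of the first row that fails
// either the linkage check or the recomputed-hash check, or null if intact.
function findFirstDrift(rows: AuditRow[], anchorHash: string): number | null {
  let expectedPrev = anchorHash;
  for (const row of rows) {
    if (row.prevHash !== expectedPrev) return row.id;
    if (computeEventHash(row.prevHash, row.payload) !== row.eventHash) return row.id;
    expectedPrev = row.eventHash;
  }
  return null;
}
```

In this sketch a non-null result is the condition that would raise the severity-1 alert; the hourly job would call findFirstDrift once per tenant over the last N rows.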

On-chain anchor monitoring. ADR 0014 schedules a daily anchor at 00:30 IST. The audit_anchors table records every successful anchor with the on-chain tx hash. Failure recovery: retry every 60 min for 6 hours, then page on-call. Two consecutive missed-anchor days put the tenant in an "anchor-degraded" state with a dashboard banner. The AuditAnchor contract (commit d6c6a4e) emits an event so that block explorers can index each anchor.
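The failure-recovery rules above can be expressed as two small decision functions. This is a sketch under assumptions: the function names are hypothetical, and treating the 6-hour mark itself as the point where retries stop and paging begins is one reading of "retry every 60 min for 6 hours".

```typescript
type AnchorAction = "retry" | "page-on-call";

const RETRY_INTERVAL_MIN = 60;    // a retry is attempted every 60 minutes
const RETRY_WINDOW_MIN = 6 * 60;  // retries continue for 6 hours

// Decides what to do after a failed attempt, given the minutes elapsed since
// the first failure. Assumption: at exactly 6 hours the job pages instead of retrying.
function nextAction(minutesSinceFirstFailure: number): AnchorAction {
  return minutesSinceFirstFailure < RETRY_WINDOW_MIN ? "retry" : "page-on-call";
}

// anchoredByDay holds one flag per scheduled day, oldest first.
// The tenant is anchor-degraded when the two most recent days both lack an anchor.
function isAnchorDegraded(anchoredByDay: boolean[]): boolean {
  const lastTwo = anchoredByDay.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((anchored) => !anchored);
}
```

Keeping both rules as pure functions makes the thresholds easy to unit-test separately from the scheduler and the paging integration.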

Boot-time integrity check. ADR 0015 and commit e98d158 install a SHA-256 check on verification_key.json at boot. On a mismatch the service refuses to start, so the failure mode is loud (service down) rather than silent (service running with the wrong vkey). This closes audit finding C-7.
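A fail-closed check of this kind might look like the sketch below. It is illustrative only: the function name assertVerificationKeyIntegrity and the way the pinned digest is supplied are assumptions, not the code in commit e98d158.

```typescript
import { createHash } from "node:crypto";

// Throws when the key material does not match the pinned SHA-256 digest.
// Letting the error propagate at boot keeps the process from starting.
function assertVerificationKeyIntegrity(
  keyContents: Uint8Array | string,
  pinnedSha256Hex: string,
): void {
  const actual = createHash("sha256").update(keyContents).digest("hex");
  if (actual !== pinnedSha256Hex.toLowerCase()) {
    throw new Error(
      `verification_key.json SHA-256 mismatch: expected ${pinnedSha256Hex}, got ${actual}`,
    );
  }
}

// Assumed usage at boot (not executed here):
//   assertVerificationKeyIntegrity(readFileSync("verification_key.json"), PINNED_DIGEST);
//   ...and only then start listening for requests.
```

The check throws rather than returning a boolean so that a caller cannot ignore the result by accident, which matches the loud failure mode described above.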

Application logs. The app emits structured JSON logs through Winston. The non-goal in CLAUDE.md ("Never log biometric-derived raw data") sets the log-content constraint; the source-grep + zod-validation guards (commit c09c081, ADR 0016 commit 76f8d4e) keep raw biometric data out of the logs by preventing it from entering the application at all.
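The idea of rejecting forbidden keys at the application boundary can be illustrated with a plain TypeScript stand-in for the validation guard. The key names in FORBIDDEN_KEYS and the function findForbiddenKeys are hypothetical; the project's actual deny-list and zod schemas are not shown in this document.

```typescript
// Hypothetical deny-list, for illustration only.
const FORBIDDEN_KEYS = new Set(["fingerprint_template", "face_embedding", "iris_code"]);

// Walks an arbitrary payload and collects the dotted paths of forbidden keys.
function findForbiddenKeys(value: unknown, path = "", hits: string[] = []): string[] {
  if (Array.isArray(value)) {
    value.forEach((item, i) => findForbiddenKeys(item, `${path}[${i}]`, hits));
  } else if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      const childPath = path ? `${path}.${key}` : key;
      if (FORBIDDEN_KEYS.has(key.toLowerCase())) hits.push(childPath);
      findForbiddenKeys(obj[key], childPath, hits);
    }
  }
  return hits;
}
```

A request handler could reject any payload for which this returns a non-empty list, so the offending fields never reach a logger call.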

Health check. /api/health is the unauthenticated subsystem-status surface (returns DB connectivity, Redis connectivity, blockchain RPC connectivity, recent error rate). Public via Caddy; production-monitored externally.
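For an external monitor, the four signals listed above have to be collapsed into a single status. The sketch below shows one way to do that; the field names, the three-level status, the default 5% error-rate threshold, and the policy itself are assumptions rather than the real /api/health contract.

```typescript
// Hypothetical response shape; field names are illustrative.
interface HealthReport {
  db: boolean;
  redis: boolean;
  blockchainRpc: boolean;
  recentErrorRate: number; // fraction of failed requests in the recent window, 0..1
}

type HealthStatus = "ok" | "degraded" | "down";

// Assumed policy: losing the database is fatal; any other failed dependency
// or an elevated error rate marks the service as degraded.
function summariseHealth(report: HealthReport, maxErrorRate = 0.05): HealthStatus {
  if (!report.db) return "down";
  if (!report.redis || !report.blockchainRpc || report.recentErrorRate > maxErrorRate) {
    return "degraded";
  }
  return "ok";
}
```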

CVE monitoring. scripts/cve-monitor.sh (commit f8a756c) runs on a nightly schedule. A high-severity finding triggers a page. Detailed in CC6.8 and CC7.1.

CI monitoring. .github/workflows/ci.yml runs on every push. A red CI run notifies the responsible agent: today through the Friday status post, and in future through the dedicated chat channel tracked as the CC2.2 gap.

On-call rotation. The escalation matrix in 06-ways-of-working.md "Escalation" defines the on-call response surface: severity-1 production incidents → Roles 5, 21, 26 → Role 1 with a 15-minute pageable SLA. The roster is currently Role 5 (SRE VP) + Role 21 (DevOps lead). Quarterly on-call-rotation reviews land in the compliance retros.

Caddy access logs. Structured logs emitted by Caddy; rotated by host syslog. Compliance-relevant access patterns also write to audit_events through appendAuditEvent (commit a475ed8) so the access trail is tamper-evident.

The Grafana + Prometheus dashboard stack is the named gap. Compliance roadmap §3.2 lists it implicitly as part of the SOC 2 Type I evidence-period preparation (week 14, 2026-08-24). ADR 0016 (commit 76f8d4e) introduces a validation_error_count_total{route, reason} counter when zod-validation lands (C-022 sprint 2) — the first piece of the Prometheus instrumentation.
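To make the shape of that counter concrete, here is a dependency-free sketch of a counter labelled by route and reason, rendered in the Prometheus text exposition format. The class LabelledCounter and the helper escapeLabelValue are hypothetical; a real deployment would more likely use an established Prometheus client library, and the example label values are invented.

```typescript
// Escapes backslash, double quote, and newline, as the text format requires for label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Minimal in-memory counter keyed by its rendered label set.
class LabelledCounter {
  private readonly name: string;
  private readonly counts = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  inc(route: string, reason: string): void {
    const labels = `route="${escapeLabelValue(route)}",reason="${escapeLabelValue(reason)}"`;
    this.counts.set(labels, (this.counts.get(labels) ?? 0) + 1);
  }

  // Renders one sample line per label set, preceded by the TYPE line.
  render(): string {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    this.counts.forEach((value, labels) => {
      lines.push(`${this.name}{${labels}} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}

const validationErrors = new LabelledCounter("validation_error_count_total");
```

A validation middleware would call validationErrors.inc(route, reason) on each rejected request, and a metrics endpoint would serve the output of render() for Prometheus to scrape.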

Audit signal coverage in the threat model is partial. Some A-NN rows have explicit audit-signal entries (e.g., A-02 "audit_events.action = 'zkp.verify' is recorded"); others have "no special signal yet" or "MISSING" (e.g., A-01 cross-tenant query blocked). These gaps are tracked for Phase 1 sprint 2.

Evidence references

  • ADR 0013-audit-log-hash-chain.md (commit 27ed93c) "Drift detection" — hourly chain-replay job.
  • ADR 0014-on-chain-anchor-cadence.md (commit 27ed93c) — daily anchor + missed-anchor alerting.
  • ADR 0015-circuit-version-pinning.md (commit 27ed93c) — boot-time integrity check.
  • Commit e98d158 — boot vkey hash check.
  • Commit d634b2d — /api/admin/audit-integrity endpoint.
  • Commit d6c6a4e — AuditAnchor contract emits event for block-explorer indexing.
  • Commit a475ed8 — appendAuditEvent writes structured audit trail.
  • Commit c09c081 — biometric forbidden-key grep (log-content constraint).
  • Commit 76f8d4e — ADR 0016 introduces validation-error Prometheus counter.
  • Commit f8a756c — nightly CVE monitor.
  • docs/threat_model.md — audit-signal column tracks per-attack instrumentation.
  • docs/plan/bfsi-v1/06-ways-of-working.md "Escalation" — sev-1 on-call SLA.

Open gaps + remediation roadmap

  • Grafana + Prometheus stack — target Phase 1 week 8 (2026-07-20); first SLO dashboard.
  • Per-attack-class audit signal — gaps in docs/threat_model.md rows A-01, A-02, A-05, etc.; close by Phase 1 sprint 2.
  • Centralised log aggregation (Loki / OpenSearch) — target week 14 (2026-08-24); supports SOC 2 Type I evidence claims about log retention.
  • SLO + SLI definitions — target week 14; pairs with the Grafana stack.
  • Alert routing into a real on-call tool (PagerDuty / Opsgenie) — target week 14; replaces the implicit-Friday-status notification.

Test or audit query

  • curl -s https://api.zeroauth.dev/api/health returns the subsystem-status JSON.
  • psql ... -c "SELECT count(*) FROM audit_anchors WHERE anchored_at > now() - interval '36 hours'" should return ≥ 1 per active tenant (proves the anchor cron is firing).
  • grep -E "audit-signal" docs/threat_model.md | wc -l returns the count of audit-signal rows.
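The anchor query above returns a single total, so confirming "≥ 1 per active tenant" needs a per-tenant comparison. The sketch below shows that comparison in application code. It assumes each anchor row carries a tenant identifier; the AnchorRow shape and the function tenantsMissingAnchor are hypothetical.

```typescript
// Hypothetical projection of audit_anchors; assumes a tenant identifier per row.
interface AnchorRow {
  tenantId: string;
  anchoredAt: Date;
}

const FRESHNESS_WINDOW_MS = 36 * 60 * 60 * 1000; // the 36-hour window used by the query

// Returns the active tenants with no anchor inside the freshness window.
function tenantsMissingAnchor(activeTenants: string[], anchors: AnchorRow[], now: Date): string[] {
  const cutoff = now.getTime() - FRESHNESS_WINDOW_MS;
  const fresh = new Set(
    anchors.filter((a) => a.anchoredAt.getTime() > cutoff).map((a) => a.tenantId),
  );
  return activeTenants.filter((tenant) => !fresh.has(tenant));
}
```

An empty result corresponds to the passing case of the audit query; any tenant returned is a candidate for the missed-anchor alerting path.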