CC7.2 — System monitoring (metrics, logs, alerts)
Status: Partially implemented (audit-chain integrity + on-chain anchor monitoring live; Grafana dashboard + Prometheus stack target Phase 1)
Owner: Agent #38 (Senior Compliance Lead, SOC 2 + ISO 27001)
Last reviewed: 2026-05-28
Next review: 2026-08-28
Trust Services Criteria reference
The entity monitors system components and the operation of those components for anomalies that are indicative of malicious acts, natural disasters, and errors affecting the entity's ability to meet its objectives; anomalies are analysed to determine whether they represent security events. The control covers metrics collection, log retention, alert thresholds, the on-call rotation, and the integration with incident-response.
How ZeroAuth meets this control
The monitoring stack is split between always-on technical signals and human-cadence reviews.
Audit-chain integrity monitoring. The drift detector mandated by ADR 0013 ("Drift detection" section) runs hourly and replays the last N rows per tenant against the recorded event_hash. A mismatch triggers a severity-1 alert. The /api/admin/audit-integrity endpoint (commit d634b2d) exposes the integrity-status surface; the dashboard's AuditIntegrityView consumes it for the Scene-5 demo. Threat-model rows A-14, A-22 capture the in-DB tampering attack surface.
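The replay step can be pictured with a minimal sketch. The row shape, the "genesis" seed, and the hash rule (SHA-256 over the previous hash, a newline, and the payload) are assumptions made for illustration; ADR 0013 defines the real construction, and `buildChain` and `findFirstDrift` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit_events schema is defined by ADR 0013.
interface AuditRow {
  id: number;
  payload: string;   // assumed canonical serialisation of the event
  prevHash: string;  // assumed link to the preceding row's event_hash
  eventHash: string; // the recorded event_hash
}

// Assumed hash rule: SHA-256 over prevHash, a newline, then the payload.
function computeEventHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(`${prevHash}\n${payload}`).digest("hex");
}

// Builds a well-formed chain from payloads (used here only to produce sample data).
function buildChain(payloads: string[]): AuditRow[] {
  const rows: AuditRow[] = [];
  let prevHash = "genesis";
  payloads.forEach((payload, index) => {
    const eventHash = computeEventHash(prevHash, payload);
    rows.push({ id: index + 1, payload, prevHash, eventHash });
    prevHash = eventHash;
  });
  return rows;
}

// Replays rows in order; returns the id of the first row that fails, or null if intact.
function findFirstDrift(rows: AuditRow[]): number | null {
  let previous: AuditRow | null = null;
  for (const row of rows) {
    // Link check: each row must point at its predecessor's recorded hash.
    if (previous !== null && row.prevHash !== previous.eventHash) return row.id;
    // Content check: the recomputed hash must equal the recorded one.
    if (computeEventHash(row.prevHash, row.payload) !== row.eventHash) return row.id;
    previous = row;
  }
  return null;
}

const chain = buildChain(["login", "zkp.verify", "logout"]);
console.log(findFirstDrift(chain)); // null for an untouched chain
```

In a job like the one described, a non-null result would be the condition that raises the severity-1 alert.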
On-chain anchor monitoring. ADR 0014 schedules a daily anchor at 00:30 IST. The audit_anchors table records every successful anchor with the on-chain tx hash. Failure recovery: retry every 60 min for 6 hours, then page on-call. Two consecutive missed-anchor days put the tenant in an "anchor-degraded" state with a dashboard banner. The AuditAnchor contract (commit d6c6a4e) emits an event so block explorers index it.
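The recovery rule above can be sketched as two small decision functions. The function names are hypothetical, and two details are assumptions rather than statements from ADR 0014: that paging starts once exactly 6 hours have elapsed, and that the degraded state is judged on the two most recent days.

```typescript
type AnchorAction = "retry-in-60-min" | "page-on-call";

const RETRY_WINDOW_MINUTES = 6 * 60;

// Decides the next step after a failed anchor attempt.
// Assumption: the retry window is measured from the first failure of the day.
function nextAnchorAction(minutesSinceFirstFailure: number): AnchorAction {
  return minutesSinceFirstFailure < RETRY_WINDOW_MINUTES ? "retry-in-60-min" : "page-on-call";
}

// A tenant is treated as anchor-degraded when the two most recent days
// both lack a successful anchor. Input is oldest-first, one flag per day.
function isAnchorDegraded(dailyAnchorSucceeded: boolean[]): boolean {
  const lastTwo = dailyAnchorSucceeded.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((succeeded) => !succeeded);
}

console.log(nextAnchorAction(120));                  // "retry-in-60-min"
console.log(isAnchorDegraded([true, false, false])); // true
```

Keeping both rules as pure functions makes the thresholds easy to unit-test separately from the scheduler and the paging integration.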
Boot-time integrity check. ADR 0015 + commit e98d158 install the SHA-256 check on verification_key.json at boot. On a mismatch the service refuses to start. The failure mode is loud (service down) rather than silent (service running with the wrong vkey). Closes audit finding C-7.
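A minimal sketch of a pinned-hash check follows. The function names and the way the pinned digest is supplied are assumptions; commit e98d158 holds the actual implementation.

```typescript
import { createHash } from "node:crypto";

// Returns the lowercase hex SHA-256 digest of the given bytes.
function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Throws when the key material does not match the pinned digest,
// so the caller can abort startup instead of serving with the wrong key.
function assertVerificationKeyPinned(keyBytes: string | Uint8Array, pinnedSha256: string): void {
  const actual = sha256Hex(keyBytes);
  if (actual !== pinnedSha256.toLowerCase()) {
    throw new Error(`verification key hash mismatch: expected ${pinnedSha256}, got ${actual}`);
  }
}
```

At boot, a service following this pattern would read verification_key.json (for example with `readFileSync` from `node:fs`), pass the bytes to `assertVerificationKeyPinned`, and let the uncaught error terminate the process.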
Application logs. The app uses Winston for structured JSON logging. The non-goal in CLAUDE.md ("Never log biometric-derived raw data") sets the log-content constraint; the source-grep and zod-validation guards (commit c09c081; ADR 0016, commit 76f8d4e) keep raw biometric data out of the logs by keeping it out of the application altogether.
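The idea of rejecting forbidden keys at the boundary can be illustrated with a plain-TypeScript stand-in. This is not the zod schema from ADR 0016, and the key names below are hypothetical examples rather than the list enforced by commit c09c081.

```typescript
// Hypothetical forbidden key names, for illustration only.
const FORBIDDEN_KEYS = new Set(["fingerprint_image", "face_embedding", "iris_template"]);

// Returns the dotted path of every forbidden key found anywhere in the value,
// descending into nested objects and arrays.
function findForbiddenKeys(value: unknown, path = ""): string[] {
  if (value === null || typeof value !== "object") return [];
  const hits: string[] = [];
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const childPath = path === "" ? key : `${path}.${key}`;
    if (FORBIDDEN_KEYS.has(key.toLowerCase())) hits.push(childPath);
    hits.push(...findForbiddenKeys(child, childPath));
  }
  return hits;
}

const requestBody = { user: { id: 1, face_embedding: [0.1] }, items: [{ iris_template: "x" }] };
console.log(findForbiddenKeys(requestBody)); // ["user.face_embedding", "items.0.iris_template"]
```

A request handler could reject any body for which the returned list is non-empty, which means nothing downstream, the logger included, ever sees the offending fields.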
Health check. /api/health is the unauthenticated subsystem-status surface (returns DB connectivity, Redis connectivity, blockchain RPC connectivity, recent error rate). Public via Caddy; production-monitored externally.
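One possible shape for the aggregated response is sketched below. The field names, the "ok"/"degraded" status values, and the 5% error-rate threshold are all assumptions; the document lists which subsystems are reported but does not specify the payload.

```typescript
// Hypothetical shapes; the real /api/health payload is not specified here.
interface ProbeResult {
  ok: boolean;
  detail?: string;
}

interface HealthReport {
  status: "ok" | "degraded";
  db: ProbeResult;
  redis: ProbeResult;
  blockchainRpc: ProbeResult;
  recentErrorRate: number;
}

// Assumed threshold: a recent error rate above 5% marks the service degraded.
const ERROR_RATE_THRESHOLD = 0.05;

// Combines the three connectivity probes and the error rate into one report.
function buildHealthReport(
  probes: { db: ProbeResult; redis: ProbeResult; blockchainRpc: ProbeResult },
  recentErrorRate: number,
): HealthReport {
  const allProbesOk = probes.db.ok && probes.redis.ok && probes.blockchainRpc.ok;
  const status = allProbesOk && recentErrorRate <= ERROR_RATE_THRESHOLD ? "ok" : "degraded";
  return { status, ...probes, recentErrorRate };
}
```

An external monitor can then alert on the single `status` field while operators read the per-subsystem entries for diagnosis.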
CVE monitoring. scripts/cve-monitor.sh (commit f8a756c) runs on a nightly schedule. High-severity finding → page. Detailed in CC6.8 + CC7.1.
CI monitoring. .github/workflows/ci.yml runs on every push. Red CI surfaces a Slack-equivalent notification to the responsible agent (today via the Friday status post; in future via the dedicated chat channel tracked as the CC2.2 gap).
On-call rotation. The escalation matrix in 06-ways-of-working.md "Escalation" defines the on-call response surface: severity-1 production incidents → Roles 5, 21, 26 → Role 1 with a 15-minute pageable SLA. The roster is currently Role 5 (SRE VP) + Role 21 (DevOps lead). Quarterly on-call-rotation reviews land in the compliance retros.
Caddy access logs. Structured logs emitted by Caddy; rotated by host syslog. Compliance-relevant access patterns also write to audit_events through appendAuditEvent (commit a475ed8) so the access trail is tamper-evident.
The Grafana + Prometheus dashboard stack is the named gap. Compliance roadmap §3.2 lists it implicitly as part of the SOC 2 Type I evidence-period preparation (week 14, 2026-08-24). ADR 0016 (commit 76f8d4e) introduces a validation_error_count_total{route, reason} counter when zod-validation lands (C-022 sprint 2) — the first piece of the Prometheus instrumentation.
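To show what the validation_error_count_total{route, reason} counter would expose, here is a self-contained sketch that renders the Prometheus text exposition format by hand. It stands in for a real Prometheus client library; the class name, the HELP text, and the sample route and reason values are hypothetical.

```typescript
// Escapes a label value per the Prometheus text format (backslash, quote, newline).
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Minimal in-memory counter keyed by the (route, reason) label pair.
class ValidationErrorCounter {
  private readonly counts = new Map<string, number>();

  // Adds one to the series identified by the two labels.
  inc(route: string, reason: string): void {
    const key = JSON.stringify([route, reason]);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Renders all series in the text exposition format a scraper reads.
  render(): string {
    const lines = [
      "# HELP validation_error_count_total Requests rejected by input validation.",
      "# TYPE validation_error_count_total counter",
    ];
    this.counts.forEach((count, key) => {
      const [route, reason] = JSON.parse(key) as [string, string];
      const labels = `route="${escapeLabel(route)}",reason="${escapeLabel(reason)}"`;
      lines.push(`validation_error_count_total{${labels}} ${count}`);
    });
    return lines.join("\n") + "\n";
  }
}

const counter = new ValidationErrorCounter();
counter.inc("/api/verify", "missing_field");
counter.inc("/api/verify", "missing_field");
counter.inc("/api/enroll", "bad_type");
console.log(counter.render());
```

Labelling by route and reason keeps cardinality bounded as long as reasons come from a fixed set of validation failure types rather than free-form error messages.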
Audit signal coverage in the threat model is partial. Some A-NN rows have explicit audit-signal entries (e.g., A-02 "audit_events.action = 'zkp.verify' is recorded"); others have "no special signal yet" or "MISSING" (e.g., A-01 cross-tenant query blocked). These gaps are tracked for Phase 1 sprint 2.
Evidence references
- ADR 0013-audit-log-hash-chain.md (commit 27ed93c) "Drift detection": hourly chain-replay job.
- ADR 0014-on-chain-anchor-cadence.md (commit 27ed93c): daily anchor + missed-anchor alerting.
- ADR 0015-circuit-version-pinning.md (commit 27ed93c): boot-time integrity check.
- Commit e98d158: boot vkey hash check.
- Commit d634b2d: /api/admin/audit-integrity endpoint.
- Commit d6c6a4e: AuditAnchor contract emits event for block-explorer indexing.
- Commit a475ed8: appendAuditEvent writes structured audit trail.
- Commit c09c081: biometric forbidden-key grep (log-content constraint).
- Commit 76f8d4e: ADR 0016 introduces validation-error Prometheus counter.
- Commit f8a756c: nightly CVE monitor.
- docs/threat_model.md: audit-signal column tracks per-attack instrumentation.
- docs/plan/bfsi-v1/06-ways-of-working.md "Escalation": sev-1 on-call SLA.
Open gaps + remediation roadmap
- Grafana + Prometheus stack: target Phase 1 week 8 (2026-07-20); first SLO dashboard.
- Per-attack-class audit signal: gaps in docs/threat_model.md rows A-01, A-02, A-05, etc.; close by Phase 1 sprint 2.
- Centralised log aggregation (Loki / OpenSearch): target week 14 (2026-08-24); supports SOC 2 Type I evidence claims about log retention.
- SLO + SLI definitions: target week 14; pairs with the Grafana stack.
- Alert routing into a real on-call tool (PagerDuty / Opsgenie): target week 14; replaces the implicit-Friday-status notification.
Test or audit query
- curl -s https://api.zeroauth.dev/api/health returns the subsystem-status JSON.
- psql ... -c "SELECT count(*) FROM audit_anchors WHERE anchored_at > now() - interval '36 hours'" should return ≥ 1 per active tenant (proves the anchor cron is firing). As written the query returns one total across all tenants, so the per-tenant reading needs the count grouped or filtered by tenant.
- grep -E "audit-signal" docs/threat_model.md | wc -l returns the count of audit-signal rows.