SOC2-A1-AVAILABILITY-MONITORING
Availability Monitoring and Alerting
What this control does
Continuous uptime monitoring with automated alerting and defined on-call response procedures to minimize downtime.
Implementation guidance
Monitor all customer-facing endpoints with synthetic checks run from multiple regions (e.g., Datadog, PagerDuty, Better Uptime). Set alert thresholds so the on-call engineer is notified within 2 minutes of detected downtime. Define and document the on-call rotation and escalation path. Maintain an uptime history for audit evidence.
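The check-and-page loop a managed platform runs under the hood can be sketched in a few lines. The sketch below is illustrative only: the endpoint URLs and PagerDuty routing key are placeholder assumptions, and in practice your vendor's managed synthetic checks (run from multiple regions) replace this hand-rolled loop. The call shown uses the PagerDuty Events API v2 trigger format.

```python
import time

import requests

# Placeholders (assumptions): substitute your real integration key and endpoints.
PD_ROUTING_KEY = "YOUR-PAGERDUTY-EVENTS-V2-ROUTING-KEY"
HEALTH_ENDPOINTS = [
    "https://api.example.com/healthz",
    "https://app.example.com/healthz",
]
TIMEOUT_SECONDS = 5
CHECK_INTERVAL_SECONDS = 60  # <= 1 minute keeps detection inside the 2-minute target


def is_healthy(url: str) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        return requests.get(url, timeout=TIMEOUT_SECONDS).status_code == 200
    except requests.RequestException:
        return False


def page_on_call(url: str) -> None:
    """Open a PagerDuty incident via the Events API v2."""
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PD_ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": f"synthetic-check:{url}",  # collapse repeat alerts per endpoint
            "payload": {
                "summary": f"Synthetic check failed: {url}",
                "source": "synthetic-monitor",
                "severity": "critical",
            },
        },
        timeout=TIMEOUT_SECONDS,
    )


if __name__ == "__main__":
    while True:
        for endpoint in HEALTH_ENDPOINTS:
            if not is_healthy(endpoint):
                page_on_call(endpoint)
        time.sleep(CHECK_INTERVAL_SECONDS)
```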
Requirements satisfied
SOC 2 Trust Services Criteria: Availability (A1)
Why it matters
Undetected service outages directly impact customer availability commitments (SLAs) and erode trust; without automated monitoring and rapid on-call response, degradation or failures can persist for hours before discovery. A documented, tested on-call process ensures incident response is coordinated rather than ad hoc, reducing mean time to recovery (MTTR) and demonstrating compliance with the availability commitments in your SOC 2 report.
Evidence to collect
- Screenshots or exports from the monitoring platform (Datadog, PagerDuty, Better Uptime) showing configured synthetic checks for all customer-facing endpoints, with alert thresholds and multi-region coverage
- On-call rotation schedule (current and historical) documenting escalation paths, response time targets (e.g., acknowledge alert within 2 minutes), and on-call team members
- Alert notification logs showing timestamp of outage detection, alert triggered, on-call notified, and incident acknowledged within defined SLA
- Uptime dashboard or monthly/quarterly uptime reports showing historical availability percentage by service/endpoint (a sketch for deriving this percentage from incident logs follows this list)
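Since auditors will spot-check the reported numbers, it helps to be able to recompute uptime from raw incident records. A minimal sketch, assuming a CSV export with `started_at` and `resolved_at` ISO-8601 UTC columns; the schema and file name are assumptions, so adapt them to your platform's export format.

```python
import csv
from datetime import datetime, timezone


def monthly_uptime(csv_path: str, window_start: datetime, window_end: datetime) -> float:
    """Availability % over the window, computed from outage start/end records."""
    downtime_seconds = 0.0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Assumes ISO-8601 UTC timestamps, e.g. "2024-01-15T03:20:00+00:00".
            start = datetime.fromisoformat(row["started_at"])
            end = datetime.fromisoformat(row["resolved_at"])
            # Clip each outage to the reporting window before summing.
            start, end = max(start, window_start), min(end, window_end)
            if end > start:
                downtime_seconds += (end - start).total_seconds()
    window_seconds = (window_end - window_start).total_seconds()
    return 100.0 * (1.0 - downtime_seconds / window_seconds)


if __name__ == "__main__":
    jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
    feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(f"January uptime: {monthly_uptime('incidents.csv', jan, feb):.3f}%")
```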
Testing procedure
Request the monitoring platform configuration and verify that all customer-facing endpoints (API, web portal, webhooks) have synthetic health checks running from at least 2 geographic regions with alert thresholds set to trigger within 2 minutes of downtime. Review the documented on-call runbook and verify escalation times are defined (e.g., notify on-call immediately, escalate to manager if unacknowledged in 10 minutes). Pull incident logs from the past 6 months and sample 3–5 confirmed outages, verifying that alerts fired, on-call was notified, and response occurred within SLA. Confirm uptime data is retained and accurate by spot-checking reported metrics against incident logs.
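The alert-to-acknowledgment check in the sampling step can be scripted against a log export rather than read off by hand. A minimal sketch, assuming a JSON array of alert records with `id`, `detected_at`, and `acknowledged_at` fields; the field names and file path are assumptions about your alerting platform's export.

```python
import json
from datetime import datetime

ACK_SLA_SECONDS = 120  # acknowledge within 2 minutes, per the documented target


def spot_check(log_path: str) -> None:
    """Print the detection-to-acknowledgment delta for each sampled alert."""
    with open(log_path) as f:
        alerts = json.load(f)  # assumed: a JSON array of alert records
    for alert in alerts:
        detected = datetime.fromisoformat(alert["detected_at"])
        acked = datetime.fromisoformat(alert["acknowledged_at"])
        delta = (acked - detected).total_seconds()
        verdict = "OK" if delta <= ACK_SLA_SECONDS else "SLA BREACH"
        print(f"{alert['id']}: acknowledged in {delta:.0f}s [{verdict}]")


if __name__ == "__main__":
    spot_check("alert_log.json")  # path is illustrative
```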
Common gotchas
Teams often configure alerts too broadly (false-positive storms that train responders to ignore alerts) or too narrowly (missing real degradation); validate that thresholds are tuned to catch meaningful issues without causing alert fatigue. On-call schedules frequently become outdated or exist only informally in Slack; audit the current rotation monthly and require a documented handoff process when team members change.
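One common tuning pattern for the false-positive problem is to require several consecutive failed checks before paging. A minimal sketch of that gating logic, with the threshold value purely illustrative; most managed monitoring platforms expose an equivalent setting natively.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3  # page only after 3 consecutive failed checks (illustrative)

_consecutive_failures: defaultdict[str, int] = defaultdict(int)


def should_page(endpoint: str, healthy: bool) -> bool:
    """Record a check result; return True exactly when paging should fire."""
    if healthy:
        _consecutive_failures[endpoint] = 0  # any success resets the streak
        return False
    _consecutive_failures[endpoint] += 1
    # Fire once, at the moment the streak first crosses the threshold,
    # rather than on every subsequent failure.
    return _consecutive_failures[endpoint] == FAILURE_THRESHOLD
```

Note the trade-off: a 3-check threshold at a 60-second interval pushes detection past the 2-minute notification target, so higher thresholds require proportionally shorter check intervals (roughly 30 seconds or less in this example).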