SOC2-A1-CAPACITY-PLANNING
Capacity Planning and Monitoring
What this control does
Processes to ensure infrastructure capacity meets current and projected demand, preventing availability failures due to resource exhaustion.
Implementation guidance
Define capacity thresholds (CPU >80%, memory >85%, disk >80%) and configure alerts. Review resource utilization monthly and plan upgrades at least 4 weeks before projected capacity limits. For cloud environments, configure auto-scaling with defined minimum and maximum limits.
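The threshold and lead-time logic above can be sketched in a short script. This is a minimal illustration, not a specific monitoring tool's API; the names (`check_thresholds`, `weeks_until_limit`) and the linear growth projection are assumptions for demonstration.

```python
# Hypothetical sketch of the control's logic: flag resources that breach
# the stated thresholds (CPU >80%, memory >85%, disk >80%) and project
# how many weeks remain before a capacity limit is reached.

THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "disk": 80.0}

def check_thresholds(utilization: dict) -> list:
    """Return the resources whose utilization (%) exceeds its threshold."""
    return [
        resource
        for resource, pct in utilization.items()
        if pct > THRESHOLDS.get(resource, 100.0)
    ]

def weeks_until_limit(current_pct: float, weekly_growth_pct: float,
                      limit_pct: float = 100.0) -> float:
    """Linear projection of weeks remaining before the capacity limit;
    used to confirm upgrades can be planned at least 4 weeks out."""
    if weekly_growth_pct <= 0:
        return float("inf")  # flat or shrinking usage: no projected limit
    return (limit_pct - current_pct) / weekly_growth_pct

# Example: CPU at 83.5% breaches its 80% threshold; disk at 72% growing
# 2%/week leaves 14 weeks of headroom, well beyond the 4-week lead time.
breached = check_thresholds({"cpu": 83.5, "memory": 70.2, "disk": 72.0})
headroom = weeks_until_limit(72.0, 2.0)
```

In practice the same comparisons would live in the monitoring platform's alert rules rather than a script; the sketch only makes the arithmetic behind the thresholds and the 4-week lead time explicit.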
Requirements satisfied
- SOC 2 Trust Services Criteria A1.1: the entity maintains, monitors, and evaluates current processing capacity and use of system components to manage demand and enable additional capacity.
Why it matters
Capacity exhaustion causes service outages, violating the availability commitments in a SOC 2 Type II report and damaging customer trust. Without proactive monitoring and planning, unplanned downtime becomes inevitable as demand grows, making this a direct availability risk that auditors scrutinize heavily during fieldwork.
Evidence to collect
- Monitoring dashboard screenshots showing CPU, memory, and disk utilization thresholds with alert rules configured (e.g., PagerDuty/Datadog rules)
- Capacity planning spreadsheet or Confluence doc with monthly utilization trends, growth projections, and upgrade timelines (with sign-off dates)
- Auto-scaling policy documentation showing minimum/maximum limits, scaling rules, and testing records for cloud environments
- Change management tickets or calendar entries for the last 12 months showing 4-week advance planning before capacity increases
Testing procedure
Obtain the last 12 months of monthly utilization reports and verify they were reviewed on schedule. Select 2–3 recent infrastructure upgrades (or auto-scaling adjustments in cloud) and confirm the upgrade request was submitted at least 4 weeks before the capacity threshold would have been exceeded. Test alert thresholds by checking that alerting rules fire when utilization hits 80% CPU, 85% memory, and 80% disk; verify the SOC team received and responded to at least one test alert in the past 90 days.
Common gotchas
Teams often set thresholds but never test them or act on alerts, rendering the control ineffective; audit that alerts actually trigger and drive documented remediation. Cloud auto-scaling can create a false sense of security: teams forget to set reasonable maximum limits, leading to unbudgeted costs and runaway resource consumption that breaks SLAs just as badly as exhaustion does.
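The missing-maximum gotcha is easy to lint for mechanically. A sketch, assuming a generic policy dict with `min`/`max` keys rather than any specific cloud provider's schema:

```python
# Hypothetical lint for auto-scaling policies: flag a policy with no maximum
# instance count (runaway cost risk) or one where min >= max (scaling
# effectively disabled). The policy shape is an assumption for illustration.

def lint_autoscaling_policy(policy: dict) -> list:
    """Return a list of findings; an empty list means the policy passes."""
    findings = []
    minimum, maximum = policy.get("min"), policy.get("max")
    if maximum is None:
        findings.append("no maximum instance count: unbounded cost exposure")
    elif minimum is not None and minimum >= maximum:
        findings.append("min >= max: auto-scaling cannot add capacity")
    return findings

# A policy with min and max set passes; one missing max is flagged.
clean = lint_autoscaling_policy({"min": 2, "max": 10})
risky = lint_autoscaling_policy({"min": 2})
```

Running a check like this against exported policy documents is one way to generate the auto-scaling evidence listed above without waiting for a cost overrun to surface the gap.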