Internet Services
The Limits of Risk Models Under Complexity
After reflecting on recent extended online service outages, such as the AWS DynamoDB incident, I would like to highlight a different but related issue: the incompleteness of risk models when applied to complex, concurrent services.
Why Models Fall Short
Risk models are built on assumptions and measurable factors, such as the probability of single-point failures and expected recovery times. But online services are not closed systems. They are open, evolving, and deeply interconnected. Dependencies are often hidden, for example a team quietly fixing an issue or upgrading a background service, and the cloud platforms underneath change continuously as well. A model cannot capture such a moving target; at best it provides a map, not the territory itself.
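As a minimal sketch of this gap, consider how a failure estimate changes once two components are allowed to fail together rather than independently. The probabilities below and the shared "platform" fault that couples the two services are purely hypothetical assumptions for illustration, not figures from any real provider.

```python
import random

# Hypothetical per-request failure probabilities (illustrative only).
P_FAIL_A = 0.001     # service A failing on its own
P_FAIL_B = 0.001     # service B failing on its own
P_PLATFORM = 0.0005  # shared platform fault that takes down both at once

def trial(correlated: bool) -> bool:
    """Return True if the request fails in this simulated trial."""
    if correlated and random.random() < P_PLATFORM:
        return True  # hidden coupling: one event fails both services
    a_down = random.random() < P_FAIL_A
    b_down = random.random() < P_FAIL_B
    return a_down or b_down

def estimate(correlated: bool, n: int = 1_000_000) -> float:
    """Monte Carlo estimate of the per-request failure rate."""
    return sum(trial(correlated) for _ in range(n)) / n

if __name__ == "__main__":
    print("independent model :", estimate(correlated=False))
    print("with shared fault :", estimate(correlated=True))
```

The point is not the particular numbers but the shape of the error: as long as the shared fault is absent from the model, the estimated failure rate stays systematically optimistic.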
Risk assessments are therefore aspirational rather than complete. The problem is hidden variables: couplings between services that only reveal themselves during breakdowns, and human factors, such as misconfigurations, rushed patches, and operational shortcuts, that are difficult to quantify.
Troubleshooters and DevOps engineers can tell endless stories about the convoluted, hidden couplings between components and the problems they cause.
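One way such couplings are surfaced deliberately, in the spirit of the chaos engineering reference below, is to inject faults into a dependency and watch how its callers cope. The sketch that follows is a toy fault injector; the decorator, the lookup_exchange_rate stand-in, and the failure rate are assumptions made for illustration only.

```python
import random
from functools import wraps

def inject_faults(failure_rate: float = 0.1, exc: type = TimeoutError):
    """Randomly raise `exc` instead of calling the wrapped dependency."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.2)
def lookup_exchange_rate(currency: str) -> float:
    # Stand-in for a real third-party API call.
    return 1.0

if __name__ == "__main__":
    # Exercise the dependency repeatedly; hidden couplings tend to surface
    # in how callers behave when it misbehaves.
    failures = 0
    for _ in range(1000):
        try:
            lookup_exchange_rate("EUR")
        except TimeoutError:
            failures += 1
    print(f"injected failures: {failures}/1000")
```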
Consider a global payment provider. Even if its core transaction engine is robust, it still depends on DNS, cloud networking, third-party APIs, and regional data centers. A disruption in any of these layers can ripple outward. The risk model may have accounted for transaction downtime, but not for a DNS misconfiguration or a sudden API rate limit imposed by a partner. This gap between modeled risk and real-world complexity is where surprise comes in.
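To make that gap concrete, the sketch below multiplies out the availability of a hypothetical payment path. All component names and availability figures are invented for illustration; the point is only that the end-to-end number is dominated by the weakest dependencies, not by the hardened core.

```python
# Hypothetical availabilities for one payment request path (illustrative only).
components = {
    "core transaction engine": 0.9999,
    "DNS": 0.9995,
    "cloud networking": 0.999,
    "third-party API": 0.995,
    "regional data center": 0.9995,
}

# If the request needs every component in series, availabilities multiply.
end_to_end = 1.0
for name, availability in components.items():
    end_to_end *= availability

print(f"core engine alone : {components['core transaction engine']:.4%}")
print(f"end-to-end path   : {end_to_end:.4%}")
```

Under these assumed figures, the full path is roughly an order of magnitude less available than the core engine alone, and a model that only covered transaction-engine downtime never sees it coming.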
References
On complexity and risk: https://queue.acm.org/detail.cfm?id=945134
On chaos engineering: https://principlesofchaos.org/