On January 15, 1990, the AT&T long-distance network suffered a complete collapse due to a single line of code and an obscure set of circumstances. The most painful part of the event was that the system was designed to isolate problem switches, ones deemed “crazy”. In this instance, however, all switches went “crazy” at the same time.
The full article is here:
This is an old event, but the lessons learned are just as relevant today, as more and more systems become independent and autonomous.
- How do you safeguard the safeguards?
- What is your strategy for responding to catastrophic cascading failures?
I think it is important to design cascading-failure logic into systems, so that they become aware of unexpected operations and reset to a known-good state before a runaway failure path completes. This can be a tedious affair, but I fully support it as a valid end goal.
When developing self-healing systems, define levels of reset operations with time thresholds:
- If failure A occurs, then compensate with reaction 1.
- If reaction 1 does not resolve it, then apply reaction 2.
- If that doesn’t resolve it within the threshold, then classify the situation as failure B and apply reaction 3.
- Continue until the layers of correction have exhausted all possible avenues for correction and manual intervention is required.
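The escalation ladder above can be sketched in code. This is a minimal illustration, not a production pattern from the article; the reaction names, thresholds, and callback signatures are all hypothetical.

```python
import time

# Hypothetical escalation ladder: each tier pairs a remediation with a
# time threshold. If the failure persists past the threshold, escalate
# to the next tier; if every tier is exhausted, hand off to a human.
ESCALATION = [
    ("reaction_1", 5.0),   # e.g. retry / soft reset
    ("reaction_2", 15.0),  # e.g. restart the failing component
    ("reaction_3", 30.0),  # e.g. fail over to a standby (failure B)
]

def self_heal(apply_reaction, is_healthy, now=time.monotonic, sleep=time.sleep):
    """Walk the escalation ladder in order.

    Returns the name of the reaction that resolved the failure, or
    None when all avenues are exhausted and manual intervention is
    required. Clock functions are injectable for testing.
    """
    for reaction, threshold in ESCALATION:
        apply_reaction(reaction)
        deadline = now() + threshold
        while now() < deadline:
            if is_healthy():
                return reaction
            sleep(0.5)  # poll interval between health checks
    return None  # exhausted all layers of correction -> page a human
```

The key design choice is that each tier is bounded in time: a reaction is never allowed to retry forever, which is precisely the kind of unbounded loop that can turn a safeguard into the outage itself.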
Another example of a self-healing system that caused more problems was seen recently at a McDonald's McCafe. An order-entry display system was experiencing issues during rush hour, and the solution applied was to re-image the system.
In a microservices environment this makes a lot of sense, because a fresh image can be pulled and started almost instantaneously. In a low-bandwidth branch office with a large image and no localized copy of it, however, this caused an outage of over an hour.
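One way to avoid that trap is to cost out a heavy remediation before firing it. The sketch below is a hypothetical guard, not anything from the McCafe system; the function name, remediation labels, and numbers are all illustrative assumptions.

```python
def choose_remediation(image_size_mb, link_speed_mbps, max_outage_min, has_local_image):
    """Pick a fix whose estimated downtime stays within an acceptable bound.

    A full re-image is only chosen when the image is locally cached or the
    WAN transfer would finish inside the outage budget; otherwise fall back
    to a cheaper, bounded cure.
    """
    if has_local_image:
        return "reimage_from_local_cache"   # fast: no WAN transfer needed
    # Rough transfer-time estimate in minutes over the branch link
    # (megabytes -> megabits -> seconds -> minutes).
    transfer_min = (image_size_mb * 8) / link_speed_mbps / 60
    if transfer_min > max_outage_min:
        return "restart_service_only"       # cure bounded by the ailment
    return "reimage_over_wan"
```

With an 8 GB image over a 10 Mbps branch link, the estimate comes out well over an hour, so the guard would have refused the re-image and tried a cheaper restart first.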
When designing self-healing systems, ensure that the cure isn't worse than the ailment.