Transparency in the GitLab site failure of January 31st, 2017, and the post-mortem


On January 31st, 2017, the GitLab site had to be taken down for emergency maintenance. A spam-driven load spike had broken database replication, and while troubleshooting it an engineer accidentally deleted the primary database's data directory. The planned recovery process was thrown out the window because, in their own words:

“out of five backup/replication techniques deployed none are working reliably or set up in the first place. We ended up restoring a six-hour-old backup.”

What I found in this incident was a level of transparency and process that is rarely made public. They ran a YouTube live stream of their internal discussions about what they thought the problems were and how to resolve them. They kept a public Google Doc with real-time notes on what they were doing. They had a public monitoring page, built with Grafana, showing detailed metrics.

They even had a dedicated Twitter account just for status updates (https://twitter.com/gitlabstatus).

They owned the issue without trying to hide anything, and showed the world not only the entire process of how they fixed it, but also the timeline and steps they took to get there: the thought processes, the interactions. These are the real moments when something great can be learned.

After the issue was resolved and thoroughly investigated, a post-mortem document was created that covered the incident in great depth, including the following (a minimal skeleton for such a document is sketched after the list):

  1. What happened (summary)
  2. The configuration / topology
  3. The timeline
  4. Broken procedures
  5. Recovery process
  6. Publication details of the incident
  7. Impact of the event on data
  8. Impact of the event on the company
  9. Root cause analysis
  10. Steps to improve recovery
  11. Troubleshooting FAQ
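
As a purely illustrative sketch (not GitLab's actual tooling or file layout), a small script could scaffold a post-mortem document with these sections so that each one gets filled in during the review:

```python
#!/usr/bin/env python3
"""Generate a blank post-mortem skeleton with the sections listed above.

The file name, markdown format, and section wording are assumptions for
illustration only, not part of GitLab's process.
"""
from datetime import date
from pathlib import Path

SECTIONS = [
    "What happened (summary)",
    "Configuration / topology",
    "Timeline",
    "Broken procedures",
    "Recovery process",
    "Publication details of the incident",
    "Impact of the event on data",
    "Impact of the event on the company",
    "Root cause analysis",
    "Steps to improve recovery",
    "Troubleshooting FAQ",
]


def write_skeleton(incident_name: str, out_dir: Path = Path(".")) -> Path:
    """Write an empty post-mortem file with one heading per section."""
    path = out_dir / f"postmortem-{date.today().isoformat()}.md"
    lines = [f"# Post-mortem: {incident_name}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical usage: creates e.g. postmortem-2017-02-10.md
    print(write_skeleton("Database outage of January 31st"))
```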

This was an excellent example of how to address an incident, learn from it, make improvements, and socialize the knowledge gained. See the link below for GitLab's full post-mortem.

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/