Site Reliability Engineering and embracing risk

Over the years, Google has launched many different services. Some succeeded in the market; others did not. One thing you could always count on, though, was the reliability of the services it did have. Gmail, for example, has over 1 billion active users and has grown 300% year over year for the last three years. Gmail also has 99.978% availability and no scheduled downtime.
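To put that availability figure in perspective, here is a rough sketch that converts an availability percentage into the downtime it implies. It assumes availability is measured as a fraction of time over a 365-day year, which may not be how the Gmail number was actually calculated; the function name allowed_downtime_minutes is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, implied by an availability percentage.

    Assumes time-based availability: uptime / (uptime + downtime).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_minutes


if __name__ == "__main__":
    # The 99.978% figure quoted above, over one year.
    print(f"{allowed_downtime_minutes(99.978):.1f} minutes per year")
```

Under those assumptions, 99.978% works out to a little under two hours of unavailability per year.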

So how can these services be so reliable? The secret sauce is the way the services are designed, deployed, and maintained. The division between developers and operations becomes blurred, hence the term DevOps. This, however, is more than DevOps: it’s Site Reliability Engineering, or SRE.

Here is how the book “Site Reliability Engineering – How Google Runs Production Systems” contrasts DevOps and SRE:

The term “DevOps” emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.

SRE has now spread across the industry and is widely accepted as one of the best ways to manage services in large environments.

Google has now released this book for everyone to read for free. I particularly like the chapter on embracing risk, as it meshes perfectly with the book I have written. There are a lot of parallels in the material, though the approaches are different.

Here is the link to Chapter 3 – Embracing Risk, from the Site Reliability Engineering book.