Book review: Site Reliability Engineering

Google has been working for about 2 years on a book about Site Reliability Engineering, the discipline and organization that keeps Google’s large-scale systems running smoothly. “Site Reliability Engineering” was finally published last week. It spans some 500 pages, and offers a rare inside glimpse into how Google actually works. The authors are remarkably open, naming technologies and projects, and explaining how systems work. There may not be source code, but there’s lots here that can be implemented outside Google. That makes it a great read for startups expecting to scale and small-to-medium tech companies that want to up their reliability game.

Right off the top, you should know I work as an SRE at Google, and I was involved in putting together a small part of Chapter 33, which talks about how SRE principles are found outside of tech. I had the opportunity to read that chapter prior to publication, and I thought it told a story well worth telling, so I’ve been excited about this book’s release for a while now. The google.com/sre landing page has launched (and has lots of great resources), and the book went on sale (O’Reilly, Amazon, Google Books) on April 6, 2016.

Obviously, the book isn’t likely to have many earth-shattering revelations for me, coming from inside the organization that put this book together. Still, I’m interested in what we’re saying to the rest of the tech community. Google has historically struggled to articulate well what SRE is. That uncertainty was something I had to wrestle with when I was first offered a job here. I’m very glad I decided to come here and be immersed in Google’s SRE tech and culture. Google is remarkably good at running large systems, and SRE is the collection of practices that resulted in building the tech and culture that allow us to do that effectively.

The good news is that even though you don’t have Borg or Chubby, you can still do a lot of the same things that Google SRE does. You do not have to give up and go home because you don’t have the source code for Borgmon. Just use Prometheus. This book gives lots of practical advice that you can actually start using right away. It can also give you ideas of what to aim for as you start building larger and larger systems – if you are going to write your own Borgmon, there’s lots of great information on how to do it (and some on how not to – the book is refreshingly honest about missteps).

The book (to my knowledge) doesn’t talk about any technologies that weren’t already public. Papers and public talks on Piper, Borg, Maglev, and more have come out in recent years, so the authors can talk about them publicly now. Specific technologies are interesting as case studies, but the principles of SRE are what guided Google in making those investments, and the book is about SRE, not specific software projects or systems. For the most part, this book doesn’t describe ready-made systems, but rather practices and principles that you can take individually as you start out, though they work best as a coherent whole. Luckily, there are tractable pieces of advice for lots of different kinds of readers. I talk more about that at the end of the review.

Overall, I’m impressed with how open the authors are. I’ve been surprised to see products and projects mentioned by name, so tech voyeurs may find the book interesting for that reason. There are some great “war” stories – some I had heard, some I hadn’t. The war stories are invariably instructive, and serve as a reminder that though Google may look from the outside like a duck calmly bobbing across a pond, underneath, the duck’s feet are paddling like mad. SRE are the duck’s feet. Our SRE teams are engineers who make it seem like Google’s systems always work perfectly. They don’t. There’s pretty much always something failing – but we’ve built systems to withstand failure of all scales, from a bad hard drive to a hurricane taking out a whole datacentre. The SRE book tries to show how you can too.

A guided tour

The book is divided into five sections: Introduction, Principles, Practices (the largest by far, ~60% of the book), Management, Conclusions. I want to talk briefly about specific chapters that I think will be especially interesting or valuable for readers. If that doesn’t interest you, skip ahead to my reflections on the value of the SRE book for readers.

“Introduction” is actually important to set the stage for talking about specific practices, and I urge you not to skip it. The first chapter outlines what SRE is, and draws distinctions between SRE, system administration, and DevOps. The second chapter gives a high-level overview of Google’s production environment, from Borg to storage, networking, and the development environment. As you read this, think of alternatives that are available externally: Apache Mesos and Kubernetes are mentioned specifically, but Nomad is a newcomer that also looks promising.

“Principles” expands on chapter 1, beginning with risk management. This is key to understanding the tension inherent in SRE. If we wanted 100% stability, we’d never allow product teams to change anything, and that would kill the business. Instead, we want to manage the level of risk we take so we move as fast as possible without breaking things worse than our error budget allows.
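
To make the error-budget arithmetic concrete, here is a minimal Python sketch. It is an illustration only, not code from the book or from Google: the function names, the request-based way of counting the budget, the "release only while budget remains" policy, and the example numbers are all assumptions.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates over the measurement window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1 - failed_requests / error_budget(slo, total_requests)


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: keep shipping only while some budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Assumed numbers: a 99.9% SLO over one million requests
# tolerates about 1,000 failed requests.
print(round(error_budget(0.999, 1_000_000)))
print(can_release(0.999, 1_000_000, 250))    # budget mostly intact
print(can_release(0.999, 1_000_000, 1_500))  # budget overspent
```

With these assumed numbers, 250 failures leave roughly three quarters of the budget, so releases continue; 1,500 failures overspend it, which is the signal to slow down.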

Skipping ahead, chapters 6 and 10 discuss in some depth how Google monitors systems of such massive scale, and how we alert ourselves when things go wrong (they also discuss what “wrong” means, which is very important). This is perhaps as hard a problem as building the systems to be monitored in the first place, and pulling it off is a feat of software and systems engineering.

Chapter 7 talks about SRE’s dedication to automation. Automation is incredibly valuable at our scale, but as Google’s scale increases, we aspire to build more systems which go beyond this. We want systems that are autonomous. This is the only way we can ever hope to operate our largest systems, and the systems of the future.

One of the most important things about SRE is the culture. This is discussed in pockets throughout the book, but a key element is “postmortem culture”. Chapter 15 explains what this means, and in particular why postmortems must be blameless.

Blameless postmortems are a tenet of SRE culture… You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.

This is really key. It is critical that postmortems be blameless so we can understand honestly and fully what happened, why the people involved did what they did, and how to make the system more reliable, even though it has unreliable components (power supply, humans, etc).

Another purpose of postmortems that isn’t discussed until chapter 28 is that some postmortems are “teachable” – they can give really great insight into how systems (don’t) work, how incidents are handled, and also serve as proof that your postmortem culture takes the “blameless” part seriously.

Google’s founders Larry Page and Sergey Brin host TGIF, a weekly all-hands held live at our headquarters in Mountain View, California, and broadcast to Google offices around the world. A 2014 TGIF focused on “The Art of the Postmortem,” which featured SRE discussion of high-impact incidents. One SRE discussed a release he had recently pushed; despite thorough testing, an unexpected interaction inadvertently took down a critical service for four minutes. The incident only lasted four minutes because the SRE had the presence of mind to roll back the change immediately, averting a much longer and larger-scale outage. Not only did this engineer receive two peer bonuses immediately afterward in recognition of his quick and level-headed handling of the incident, but he also received a huge round of applause from the TGIF audience.

When we say “blameless” postmortems, we mean it, and you should too. Displaying that for new team members is an important part of the acculturation process. Teachable postmortems also let new team members understand the systems and how they interact. When I was preparing to interview with Google, I found this a really useful exercise: reading postmortems from companies that are more public with postmortems than is typical helped me think about larger-scale systems than I’d encountered previously. When I arrived at Google, I found that my team had a great collection of postmortems that I read throughout the onboarding process. That helped me get better at systems thinking in general, and learn about our systems specifically.

Chapter 17 discusses testing for reliability. This is one of the few chapters I felt was a let-down. While it discusses critical strategies like load testing and canaries, they weren’t discussed in much detail. It’s possible the authors didn’t want to say more publicly, or the material was cut for length (if so, I would have cut elsewhere), but I was disappointed nonetheless. I’m not sure there’s enough meat there for folks who don’t know about these practices already to get a good grip on how they should incorporate them into their existing environment.

There’s also a section that talks about the “order” of a software fault. I don’t use those concepts in my work, and I don’t know anyone who does either. Perhaps we should. And perhaps I should take a more positive view on this – Google’s SRE teams are not all the same, and we continually learn from one another. This may be something my team should learn and internalize. I’ll be coming back to chapter 17 in the coming weeks.

A series of four chapters details how Google does load balancing at various levels (chapters 19 & 20), and how Google handles overload and avoids cascading failure (chapters 21 & 22). These are interrelated topics, and deserve their collective 60 pages. We have standard server- and client-side implementations for backpressure and respecting backpressure, weighted round robin load balancing, backend subsetting, request priority/criticality and load shedding, query cost, and more. These are all important to get right in order to avoid overload and cascading failure, so letting everyone learn from their own mistakes individually is a bad plan.

We learned this lesson the hard way: modeling capacity as “queries per second” or using static features of the requests that are believed to be a proxy for the resources they consume (e.g., “how many keys are the requests reading”) often makes for a poor metric… A moving target makes a poor metric for designing and implementing load balancing… We often speak about the cost of a request to refer to a normalized measure of how much CPU time it has consumed.
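
The combination of request criticality, load shedding, and per-request cost can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation or anything taken from the book: the three criticality levels, the utilization thresholds, and the `Server` class are all hypothetical, and a request's cost is assumed to be known up front as a normalized number.

```python
from enum import IntEnum


class Criticality(IntEnum):
    # Hypothetical levels for illustration; lower value = more important.
    CRITICAL = 0
    NORMAL = 1
    SHEDDABLE = 2


# Assumed policy: shed a request if accepting it would push utilization
# above the threshold for its criticality level.
SHED_ABOVE = {
    Criticality.SHEDDABLE: 0.8,
    Criticality.NORMAL: 0.9,
    Criticality.CRITICAL: 1.0,
}


class Server:
    """Toy server that tracks in-flight work by cost, not by request count."""

    def __init__(self, capacity: float):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.in_flight_cost = 0.0

    def try_accept(self, criticality: Criticality, cost: float) -> bool:
        """Accept the request, or return False to shed it (backpressure signal)."""
        projected = (self.in_flight_cost + cost) / self.capacity
        if projected > SHED_ABOVE[criticality]:
            return False
        self.in_flight_cost += cost
        return True

    def finish(self, cost: float) -> None:
        """Release the capacity held by a completed request."""
        self.in_flight_cost = max(0.0, self.in_flight_cost - cost)


server = Server(capacity=10)
print(server.try_accept(Criticality.SHEDDABLE, 7))  # accepted: server is idle
print(server.try_accept(Criticality.SHEDDABLE, 2))  # shed: would exceed 80%
print(server.try_accept(Criticality.CRITICAL, 2))   # accepted: critical may use up to 100%
```

The design choice worth noticing is that the least important traffic is refused first as the server fills up, so critical requests keep working during overload, and a well-behaved client treats a refusal as a cue to back off rather than retry immediately.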

The next pair of chapters, 23 and 24, cover distributed consensus systems and Borgcron, Google’s distributed cron service, which uses distributed consensus. Distributed cron is harder than you might think, and the exercise of walking through the design in layers, building from single-machine cron up to Borgcron is instructive.

Part 4 concerns management of SRE teams. Since this is less interesting (to me) than the technical parts of the book, I want to skip ahead. I will point out that my team’s work gets a mention at the end of chapter 32, in the section “Evolving Services Development: Frameworks and SRE Platform” (p. 451). We believe this sort of platform standardization effort is key for scaling Google SRE, and therefore for scaling Google systems.

Part 5 leads off with some stories about how other industries achieve high reliability. I’m quoted a few times, and this is the chapter I was asked to review pre-publication. I’m not sure how much value it adds to the book, but it is satisfying to see commonalities between other industries that require high levels of reliability, suggesting that Google SRE isn’t totally on the wrong track.

Some reflections

With that whirlwind tour of some of the most notable parts of the book completed, I want to discuss the value this book can have. Is this just Google showing off, or is there value for the reader as well? Are the practices described in the book really practicable for smaller companies? I think so. There’s lots of tractable advice in here for open-source projects, smaller companies, and even for large tech companies that don’t have a mature SRE organization like Google’s.

For open-source projects, many of the software development and testing practices described can be easily implemented. You should be designing with backpressure and graceful degradation in mind; with white-box monitoring in mind; with extensive testing that goes beyond unit tests; and so on.

For small companies, perhaps with a handful of engineers doing operations in a traditional sysadmin way, how can this book help you? First, it can show you a different path. But implementing all of this can seem overwhelming, and in fact, I don’t think that’s the right approach. Instead, you should start with the principles of SRE. You should alter your training and hiring so that your sysadmin team gets the software development chops it may be missing. You should make sure you’re doing blameless postmortems, and fixing all the action items when outages happen. You should have SLOs, and slow down release cadence when you exhaust your error budgets. In particular, look at chapter 30 “Embedding an SRE to Recover from Operational Overload” – all you’ll be missing is that SRE who embeds with the team having challenges. It’ll be up to you to hire that person, and then make sure they’re set up for success as they try to steer your ship. See chapters 1 and 28 as well.

For larger companies with lots of engineering talent, but without a mature SRE organization, the likely paths are more numerous, and depend on what you think you’re not getting out of your engineering organization. If you still have a pure operations department, or if you do DevOps, but without a dedicated SRE team to build a platform for your DevOps folks, it may be time to change how your engineering organization is structured. Larger companies are also in the best position to look at the specific technological choices Google has made. For example, if you don’t have standards for avoiding overload and cascading failure, invest the software engineering cycles into building that infrastructure. If you’re having problems with load balancing, take cues from how Google is doing it.

Last, I want to say that the book was actually really enjoyable to read.

I strongly recommend that folks interested in DevOps, operations, reliability, and software engineering for scale read the book. It’s a valuable glimpse into Google’s inner workings, and it offers tractable advice, even if you don’t have access to Google engineers and source code. You absolutely can find implementable ideas in here.

If you’re interested in talking about SRE or the SRE book, a bunch of us hang out on Slack.