Copyright
Table of Contents
Introduction
And So It Begins...
Origin Story
Voices
Forward in All Directions! (Apologies to 3 Mustaphas 3.)
Acknowledgments
Part I. SRE Implementation
Chapter 1. Context Versus Control in SRE
Contributor Bio
Chapter 2. Interviewing Site Reliability Engineers
Interviewing 101
Who Is Involved
Industry Versus University
Biases
The Funnel
SRE Funnels
Phone Screens
The Onsite Interview
Take-Home Questions
Advice for Hiring Managers
Final Thoughts on Interviewing SREs
Further Reading
Contributor Bio
Chapter 3. So, You Want to Build an SRE Team?
Choose SRE for the Right Reasons
Orienting to a Data-Driven Approach
Commitment to SRE
Making a Decision About SRE
Contributor Bio
Chapter 4. Using Incident Metrics to Improve SRE at Scale
The Virtuous Cycle to the Rescue: If You Don’t Measure It…
Metrics Review: If a Metric Falls in the Forest…
Surrogate Metrics
Repair Debt
Virtual Repair Debt: Exorcising the Ghost in the Machine
Real-Time Dashboards: The Bread and Butter of SRE
Learnings: TL;DR
Further Reading
Contributor Bio
Chapter 5. Working with Third Parties Shouldn’t Suck
Build, Buy, or Adopt?
Establish Importance
Identify Stakeholders
Make a Decision
Acknowledge Reality
Third Parties as First-Class Citizens
When They’re Down, You’re Down
Running the Black Box Like a Service
Service-Level Indicators, Service-Level Objectives, and SLAs
Playbook: From Staging to Production
Closing Thoughts
Contributor Bio
Chapter 6. How to Apply SRE Principles Without Dedicated SRE Teams
SREs to the Rescue! (and How They Failed)
A Matter of Scale in Terms of Headcount
The Embedded SRE
You Build It, You Run It
The Deployment Platform
Closing the Loop: Take Your Own Pager
Introducing Production Engineering
Some Implementation Details
Developers’ Productivity and Health Versus the Pager
Resolving Cross-Team Reliability Issues by Using Postmortems
Uniform Infrastructure and Tooling Versus Autonomy and Innovation
Getting Buy-In
Conclusion
Further Reading
Contributor Bios
Chapter 7. SRE Without SRE: The Spotify Case Study
Tabula Rasa: 2006–2007
Prelude
Key Learnings
Beta and Release: 2008–2009
Prelude
Bringing Scalability and Reliability to the Forefront
Key Learnings
The Curse of Success: 2010
Prelude
A New Ownership Model
Formalizing Core Services
Blessed Deployment Time Slots
On-Call and Alerting
Spawning Off Internal Office Support
Addressing the Remaining Top Concerns
Creating Detectives
Key Learnings
Pets and Cattle, and Agile: 2011
Prelude
Forming Bad Habits
Breaking Those Bad Habits
Key Learnings
A System That Didn’t Scale: 2012
Prelude
Manual Work Hits a Cliff
Key Learnings
Introducing Ops-in-Squads: 2013–2015
Prelude
Building on Trust
Driving the Paradigm Shift
Key Learnings
Autonomy Versus Consistency: 2015–2017
Prelude
Benefits
Trade-Offs
Key Learnings
The Future: Speed at Scale, Safely
Contributor Bios
Chapter 8. Introducing SRE in Large Enterprises
Background
Introducing SRE
Defining Current State
Identifying and Educating Stakeholders
Presenting the Business Case
Implementing the SRE Team
Lessons Learned
Sample Implementation Roadmap
Closing Thoughts
Further Reading
Contributor Bio
Chapter 9. From SysAdmin to SRE in 8,963 Words
Clarifying Terminology
Service-Level Indicator
SLA
Service-Level Objective
Establishing SLAs for Internal Components
Understanding External Dependencies
Nontechnical Solutions
Tracking Availability Level
Dealing with Corner Cases
Conclusion
Contributor Bio
Chapter 10. Clearing the Way for SRE in the Enterprise
Toil, the Enemy of SRE
Toil in the Enterprise
Silos, Queues, and Tickets
Silos Get in the Way
Ticket-Driven Request Queues Are Expensive
Take Action Now
Start by Leaning on Lean
Get Rid of as Many Handoffs as Possible
Replace Remaining Handoffs with Self-Service
Self-Service Is More Than a Button
Self-Service Helps SREs in Multiple Ways
Operations as a Service
Error Budgets, Toil Limits, and Other Tools for Empowering Humans
Error Budgets
Toil Limits
Leverage Existing Enthusiasm for DevOps
Unify Backlogs and Protect Capacity
Psychological Safety and Human Factors
Join the Movement
Contributor Bio
Chapter 11. SRE Patterns Loved by DevOps People Everywhere
Pattern 1: Birth of Automated Testing at Google
Pattern 2: Launch and Handoff Readiness Review at Google
Pattern 3: Create a Shared Source Code Repository
Conclusion
Further Reading and Source Material
Contributor Bio
Chapter 12. DevOps and SRE: Voices from the Community
Background
Method
Results
Replies
Chapter 13. Production Engineering at Facebook
Contributor Bio
Part II. Near Edge SRE
Chapter 14. In the Beginning, There Was Chaos
The Problem with Systems
Economic Pillars of Complexity
Beginning Chaos
Navigating Complexity for Safety
Chaos Goes Big
Formalization
Advanced Principles
Frequently Asked Questions
Conclusion
Contributor Bio
Chapter 15. The Intersection of Reliability and Privacy
The Intersection of Reliability and Privacy
The General Landscape of Privacy Engineering
Privacy and SRE: Common Approaches
Reducing Toil
Efficient and Deliberate Problem Solving
Relationship Management
Early Intervention and Education Through Evangelism
Nuances, Differences, and Trade-Offs
Conclusion
Further Reading
Contributor Bios
Chapter 16. Database Reliability Engineering
Guiding Principles of the Database Reliability Engineer
Protect the Data
Self-Service for Scale
Databases Are Not Special
A Culture of Database Reliability Engineering
Recoverability
Considerations for Recovery
Anatomy of a Recovery Strategy
Building Block 1: Detection
Building Block 2: Diverse Storage
Building Block 3: A Varied Toolbox
Building Block 4: Testing
Championing Recovery Reliability
Continuous Delivery: From Development to Production
Education and Collaboration
Collaboration
Deployment
Migrations and Versioning
Impact Analysis
Migration Patterns
Championing CD
Making the Case for DBRE
Further Reading
Contributor Bio
Chapter 17. Engineering for Data Durability
Replication Is Table Stakes
Backups
Replication
Real-World Durability
Isolation
Protection
Testing
Safeguards
Recovery
Verification
The Power of Zero
Verification Coverage
Watching the Watchers
Automation
Window of Vulnerability
Operator Fatigue
Reliability
Conclusion
Contributor Bio
Chapter 18. Introduction to Machine Learning for SRE
Why Use Machine Learning for SRE?
Why and How Should My Company Be Engaging in This?
Some SRE Problems Machine Learning Can Help Solve
The Awakening of Applied AI
What Is Machine Learning?
What Do We Mean by Learning?
From Chess to Go: How Deep Can We Dive?
Why Now? What Changed for Us?
What Are Neural Networks?
Neurons and Neural Networks
How and When Should We Apply Neural Networks?
What Kinds of Data Can We Use?
Practical Machine Learning
Popular Libraries for Neural Networks
Practical Machine Learning Examples
Success Stories
Further Reading
My GitHub Repository
Recommended Books
Contributor Bio
Part III. SRE Best Practices and Technologies
Chapter 19. Do Docs Better: Integrating Documentation into the Engineering Workflow
Defining Quality: What Do Good Docs Look Like?
Functional Requirements for SRE Documentation
Integrating Docs into the Engineering Workflow
The Google Experience: g3doc and EngPlay
What We Learned
Doing Docs Better: Best Practices
Create Templates for Each Documentation Type
Better > Best: Set Realistic Standards for Quality
Require Docs as Part of Code Review
Ruthlessly Prune Your Docs
Recognize and Reward Documentation
Communicating the Value of Documentation
Further Reading
Contributor Bios
Chapter 20. Active Teaching and Learning
Active Learning
Active Learning Example: Wheel of Misfortune
Active Learning Example: Incident Manager (a Card Game)
Active Learning Example: SRE Classroom
The Costs of Failing to Learn
Learning Habits of Effective SRE Teams
Production Meetings
Postmortems
A Call to Action: Ditch the Boring Slides
Bio
Chapter 21. The Art and Science of the Service-Level Objective
Why Set Goals?
Availability
Time Quanta
Transactions
Transactions over Time Quanta
On Evaluating SLOs
Histograms
Where Percentiles Fall Down (and Histograms Step Up)
Parting Thought: Looking at SLOs Upside Down
Further Reading
Contributor Bio
Chapter 22. SRE as a Success Culture
Where Did SRE Come From?
Key Values for SRE
Keeping the Site Up
Empowering Teams to “Do the Right Thing”
Approaching Operations as an Engineering Problem
Achieving Business Success Through Promises (Service Levels)
Critical Enabling Functions of SRE
Monitoring, Metrics, and KPIs
Incident Management and Emergency Response
Capacity Planning and Demand Forecasting
Performance Analysis and Optimization
Provisioning, Change Management, and Velocity
Phases of SRE Execution
Phase 1: Firefighting/Reactive
Phase 2: Gatekeepers
Phase 3: Advocates/Partners
Phase 4: Catalytic
Complications of Differing Phases
Focus on the Details of Success
Further Reading
Contributor Bio
Chapter 23. SRE Antipatterns
Antipattern 1: Site Reliability Operations
Antipattern 2: Humans Staring at Screens
Antipattern 3: Mob Incident Response
Antipattern 4: Root Cause = Human Error
Antipattern 5: Passing the Pager
Antipattern 6: Magic Smoke Jumping!
Antipattern 7: Alert Reliability Engineering
Antipattern 8: Hiring a Dog-Walker to Tend Your Pets
Antipattern 9: Speed-Bump Engineering
Antipattern 10: Design Chokepoints
Antipattern 11: Too Much Stick, Not Enough Carrot
Antipattern 12: Postponing Production
Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
Antipattern 14: Dependency Hell
Antipattern 15: Ungainly Governance
Antipattern 16: Ill-Considered SLOh-Ohs
Antipattern 17: Tossing Your API Over the Firewall
Antipattern 18: Fixing the Ops Team
So, That’s It, Then?
Contributor Bio
Chapter 24. Immutable Infrastructure and SRE
Scalability, Reliability, and Performance
Failure Recovery
Simpler Operations
Faster Startup Times
Known State
Continuous Integration/Continuous Deployment with Confidence
Security
Multiregion Operations
Release Engineering
Building the Base Image
Deploying Applications
Disadvantages
Conclusion
Contributor Bio
Chapter 25. Scriptable Load Balancers
Scriptable Load Balancers: The New Kid on the Block
Why Scriptable Load Balancers?
Making the Difficult Easy
Shard-Aware Routing
Harnessing Potential
Case Study: Intermission
Service-Level Middleware
Middleware to the Rescue
APIs of Service-Level Middleware
Case Study: WAF/Bot Mitigation
Avoiding Disaster
Getting Clever with State
Case Study: Checkout Queue
Looking to the Future and Further Reading
Contributor Bio
Chapter 26. The Service Mesh: Wrangler of Your Microservices?
Ready to Get Rid of the Monolith?
Current State of Microservice Networking
Service Mesh to the Rescue
The Benefits of a Sidecar Proxy
Eventually Consistent Service Discovery
Observability and Alarming
Sidecar Performance Implications
Thin Libraries and Context Propagation
Configuration Management (Control Plane Versus Data Plane)
The Service Mesh in Practice
The Origin and Development of Envoy at Lyft
Operating Envoy at Lyft
The Future of the Service Mesh
Further Reading
Bio
Part IV. The Human Side of SRE
Chapter 27. Psychological Safety in SRE
The Primary Indicator of a Successful Team
How to Build Psychological Safety into Your Own Team
Further Reading
Bio
Chapter 28. SRE Cognitive Work
Introduction
What Do SRE People Do?
Why Should We Care About Practitioner Cognition?
Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
Human Performance in Modern Complex Systems: The Main Themes
Observations on SRE Cognitive Work Around Incidents
Every Incident Could Have Been Worse
Sacrifice Decisions Take Place Under Uncertainty
Repairs to Functional Systems
Special Knowledge About Complex Systems
Managing the Costs of Coordination
SREs Are Cognitive Agents Working in a Joint Cognitive System
The Calibration Problem
Mental Models
Incidents Trigger Individual Recalibration
Incidents Are Opportunities for Collective Recalibration
What Are the Implications of All This?
Incidents Will Continue
Incidents Will Impose Costs
Incident Patterns Will Change
Incidents Point to Specific Calibration Problems and Locations
What Should Happen Next?
Build a Corpus of Cases
Focus on Making Automation a Team Player in SRE Work
Address the Calibration Problem
What Can You Do?
Conclusion
References
Contributor Bio
Chapter 29. Beyond Burnout
Defining Mental Disorders
Mental Disorders Are Missing from the Diversity Conversation
Sanity Isn’t a Business Requirement
Thoughts and Prayers Aren’t Scalable
Full-Stack Inclusivity
Application
Interviewing
Compensation
Benefits
Onboarding
Working Conditions
Job Duties
Training
Promotion
Leaving
Inclusivity for Anyone Helps Everyone
Mental Disorder Resources
Contributor Bio
Chapter 30. Against On-Call: A Polemic
The Rationale for On-Call
First, Do No Harm
Parallels with SRE
Differences with SRE
Underlying Assumptions Driving On-Call for Engineers
On-Call Is Emergency Medicine Instead of Ward Medicine
Counterarguments
The Cost to Humans of Doing On-Call
We Don’t Need Another Hero
Actual Solutions
Training
Prioritization
Improving On-the-Job Performance
We Need a Fundamental Change in Approach
Strong-Anti-On-Call
Weak-Anti-On-Call
A Union of the Two
Conclusion
Contributor Bio
Chapter 31. Elegy for Complex Systems
The Computer and Human Systems Cannot Be Separated
Decoherence and Cascading Failure
Always in a State of Partial Failure
Novelty Priority Inversion
Nobody Anticipates the Overhead of Coordination
Your healthcare.gov Is Out There
To Get Involved
Further Reading
Contributor Bio
Chapter 32. Intersections Between Operations and Social Activism
Before, During, After
Creating the Perfect Plan
Principles of Organizing
Managing Crisis: Responding When Things Break Down
Writing Our Own History: Making Sense of What Went Down
The Long Tail: Turning Action into Change
Activism and Change Within a Company
Conclusion
Contributor Bio
Chapter 33. Conclusion
Index
About the Editor
Colophon