Seeking SRE Conversations About Running Production Systems at Sc....pdf

Copyright
Table of Contents
Introduction
And So It Begins...
Origin Story
Voices
Forward in All Directions! (Apologies to 3 Mustaphas 3.)
Acknowledgments
Part I. SRE Implementation
Chapter 1. Context Versus Control in SRE
Contributor Bio
Chapter 2. Interviewing Site Reliability Engineers
Interviewing 101
Who Is Involved
Industry Versus University
Biases
The Funnel
SRE Funnels
Phone Screens
The Onsite Interview
Take-Home Questions
Advice for Hiring Managers
Final Thoughts on Interviewing SREs
Further Reading
Contributor Bio
Chapter 3. So, You Want to Build an SRE Team?
Choose SRE for the Right Reasons
Orienting to a Data-Driven Approach
Commitment to SRE
Making a Decision About SRE
Contributor Bio
Chapter 4. Using Incident Metrics to Improve SRE at Scale
The Virtuous Cycle to the Rescue: If You Don’t Measure It…
Metrics Review: If a Metric Falls in the Forest…
Surrogate Metrics
Repair Debt
Virtual Repair Debt: Exorcising the Ghost in the Machine
Real-Time Dashboards: The Bread and Butter of SRE
Learnings: TL;DR
Further Reading
Contributor Bio
Chapter 5. Working with Third Parties Shouldn’t Suck
Build, Buy, or Adopt?
Establish Importance
Identify Stakeholders
Make a Decision
Acknowledge Reality
Third Parties as First-Class Citizens
When They’re Down, You’re Down
Running the Black Box Like a Service
Service-Level Indicators, Service-Level Objectives, and SLAs
Playbook: From Staging to Production
Closing Thoughts
Contributor Bio
Chapter 6. How to Apply SRE Principles Without Dedicated SRE Teams
SREs to the Rescue! (and How They Failed)
A Matter of Scale in Terms of Headcount
The Embedded SRE
You Build It, You Run It
The Deployment Platform
Closing the Loop: Take Your Own Pager
Introducing Production Engineering
Some Implementation Details
Developers’ Productivity and Health Versus the Pager
Resolving Cross-Team Reliability Issues by Using Postmortems
Uniform Infrastructure and Tooling Versus Autonomy and Innovation
Getting Buy-In
Conclusion
Further Reading
Contributor Bios
Chapter 7. SRE Without SRE: The Spotify Case Study
Tabula Rasa: 2006–2007
Prelude
Key Learnings
Beta and Release: 2008–2009
Prelude
Bringing Scalability and Reliability to the Forefront
Key Learnings
The Curse of Success: 2010
Prelude
A New Ownership Model
Formalizing Core Services
Blessed Deployment Time Slots
On-Call and Alerting
Spawning Off Internal Office Support
Addressing the Remaining Top Concerns
Creating Detectives
Key Learnings
Pets and Cattle, and Agile: 2011
Prelude
Forming Bad Habits
Breaking Those Bad Habits
Key Learnings
A System That Didn’t Scale: 2012
Prelude
Manual Work Hits a Cliff
Key Learnings
Introducing Ops-in-Squads: 2013–2015
Prelude
Building on Trust
Driving the Paradigm Shift
Key Learnings
Autonomy Versus Consistency: 2015–2017
Prelude
Benefits
Trade-Offs
Key Learnings
The Future: Speed at Scale, Safely
Contributor Bios
Chapter 8. Introducing SRE in Large Enterprises
Background
Introducing SRE
Defining Current State
Identifying and Educating Stakeholders
Presenting the Business Case
Implementing the SRE Team
Lessons Learned
Sample Implementation Roadmap
Closing Thoughts
Further Reading
Contributor Bio
Chapter 9. From SysAdmin to SRE in 8,963 Words
Clarifying Terminology
Service-Level Indicator
SLA
Service-Level Objective
Establishing SLAs for Internal Components
Understanding External Dependencies
Nontechnical Solutions
Tracking Availability Level
Dealing with Corner Cases
Conclusion
Contributor Bio
Chapter 10. Clearing the Way for SRE in the Enterprise
Toil, the Enemy of SRE
Toil in the Enterprise
Silos, Queues, and Tickets
Silos Get in the Way
Ticket-Driven Request Queues Are Expensive
Take Action Now
Start by Leaning on Lean
Get Rid of as Many Handoffs as Possible
Replace Remaining Handoffs with Self-Service
Self-Service Is More Than a Button
Self-Service Helps SREs in Multiple Ways
Operations as a Service
Error Budgets, Toil Limits, and Other Tools for Empowering Humans
Error Budgets
Toil Limits
Leverage Existing Enthusiasm for DevOps
Unify Backlogs and Protect Capacity
Psychological Safety and Human Factors
Join the Movement
Contributor Bio
Chapter 11. SRE Patterns Loved by DevOps People Everywhere
Pattern 1: Birth of Automated Testing at Google
Pattern 2: Launch and Handoff Readiness Review at Google
Pattern 3: Create a Shared Source Code Repository
Conclusion
Further Reading and Source Material
Contributor Bio
Chapter 12. DevOps and SRE: Voices from the Community
Background
Method
Results
Replies
Chapter 13. Production Engineering at Facebook
Contributor Bio
Part II. Near Edge SRE
Chapter 14. In the Beginning, There Was Chaos
The Problem with Systems
Economic Pillars of Complexity
Beginning Chaos
Navigating Complexity for Safety
Chaos Goes Big
Formalization
Advanced Principles
Frequently Asked Questions
Conclusion
Contributor Bio
Chapter 15. The Intersection of Reliability and Privacy
The Intersection of Reliability and Privacy
The General Landscape of Privacy Engineering
Privacy and SRE: Common Approaches
Reducing Toil
Efficient and Deliberate Problem Solving
Relationship Management
Early Intervention and Education Through Evangelism
Nuances, Differences, and Trade-Offs
Conclusion
Further Reading
Contributor Bios
Chapter 16. Database Reliability Engineering
Guiding Principles of the Database Reliability Engineer
Protect the Data
Self-Service for Scale
Databases Are Not Special
A Culture of Database Reliability Engineering
Recoverability
Considerations for Recovery
Anatomy of a Recovery Strategy
Building Block 1: Detection
Building Block 2: Diverse Storage
Building Block 3: A Varied Toolbox
Building Block 4: Testing
Championing Recovery Reliability
Continuous Delivery: From Development to Production
Education and Collaboration
Collaboration
Deployment
Migrations and Versioning
Impact Analysis
Migration Patterns
Championing CD
Making the Case for DBRE
Further Reading
Contributor Bio
Chapter 17. Engineering for Data Durability
Replication Is Table Stakes
Backups
Replication
Real-World Durability
Isolation
Protection
Testing
Safeguards
Recovery
Verification
The Power of Zero
Verification Coverage
Watching the Watchers
Automation
Window of Vulnerability
Operator Fatigue
Reliability
Conclusion
Contributor Bio
Chapter 18. Introduction to Machine Learning for SRE
Why Use Machine Learning for SRE?
Why and How Should My Company Be Engaging in This?
Some SRE Problems Machine Learning Can Help Solve
The Awakening of Applied AI
What Is Machine Learning?
What Do We Mean by Learning?
From Chess to Go: How Deep Can We Dive?
Why Now? What Changed for Us?
What Are Neural Networks?
Neurons and Neural Networks
How and When Should We Apply Neural Networks?
What Kinds of Data Can We Use?
Practical Machine Learning
Popular Libraries for Neural Networks
Practical Machine Learning Examples
Success Stories
Further Reading
My GitHub Repository
Recommended Books
Contributor Bio
Part III. SRE Best Practices and Technologies
Chapter 19. Do Docs Better: Integrating Documentation into the Engineering Workflow
Defining Quality: What Do Good Docs Look Like?
Functional Requirements for SRE Documentation
Integrating Docs into the Engineering Workflow
The Google Experience: g3doc and EngPlay
What We Learned
Doing Docs Better: Best Practices
Create Templates for Each Documentation Type
Better > Best: Set Realistic Standards for Quality
Require Docs as Part of Code Review
Ruthlessly Prune Your Docs
Recognize and Reward Documentation
Communicating the Value of Documentation
Further Reading
Contributor Bios
Chapter 20. Active Teaching and Learning
Active Learning
Active Learning Example: Wheel of Misfortune
Active Learning Example: Incident Manager (a Card Game)
Active Learning Example: SRE Classroom
The Costs of Failing to Learn
Learning Habits of Effective SRE Teams
Production Meetings
Postmortems
A Call to Action: Ditch the Boring Slides
Bio
Chapter 21. The Art and Science of the Service-Level Objective
Why Set Goals?
Availability
Time Quanta
Transactions
Transactions over Time Quanta
On Evaluating SLOs
Histograms
Where Percentiles Fall Down (and Histograms Step Up)
Parting Thought: Looking at SLOs Upside Down
Further Reading
Contributor Bio
Chapter 22. SRE as a Success Culture
Where Did SRE Come From?
Key Values for SRE
Keeping the Site Up
Empowering Teams to “Do the Right Thing”
Approaching Operations as an Engineering Problem
Achieving Business Success Through Promises (Service Levels)
Critical Enabling Functions of SRE
Monitoring, Metrics, and KPIs
Incident Management and Emergency Response
Capacity Planning and Demand Forecasting
Performance Analysis and Optimization
Provisioning, Change Management, and Velocity
Phases of SRE Execution
Phase 1: Firefighting/Reactive
Phase 2: Gatekeepers
Phase 3: Advocates/Partners
Phase 4: Catalytic
Complications of Differing Phases
Focus on the Details of Success
Further Reading
Contributor Bio
Chapter 23. SRE Antipatterns
Antipattern 1: Site Reliability Operations
Antipattern 2: Humans Staring at Screens
Antipattern 3: Mob Incident Response
Antipattern 4: Root Cause = Human Error
Antipattern 5: Passing the Pager
Antipattern 6: Magic Smoke Jumping!
Antipattern 7: Alert Reliability Engineering
Antipattern 8: Hiring a Dog-Walker to Tend Your Pets
Antipattern 9: Speed-Bump Engineering
Antipattern 10: Design Chokepoints
Antipattern 11: Too Much Stick, Not Enough Carrot
Antipattern 12: Postponing Production
Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
Antipattern 14: Dependency Hell
Antipattern 15: Ungainly Governance
Antipattern 16: Ill-Considered SLOh-Ohs
Antipattern 17: Tossing Your API Over the Firewall
Antipattern 18: Fixing the Ops Team
So, That’s It, Then?
Contributor Bio
Chapter 24. Immutable Infrastructure and SRE
Scalability, Reliability, and Performance
Failure Recovery
Simpler Operations
Faster Startup Times
Known State
Continuous Integration/Continuous Deployment with Confidence
Security
Multiregion Operations
Release Engineering
Building the Base Image
Deploying Applications
Disadvantages
Conclusion
Contributor Bio
Chapter 25. Scriptable Load Balancers
Scriptable Load Balancers: The New Kid on the Block
Why Scriptable Load Balancers?
Making the Difficult Easy
Shard-Aware Routing
Harnessing Potential
Case Study: Intermission
Service-Level Middleware
Middleware to the Rescue
APIs of Service-Level Middleware
Case Study: WAF/Bot Mitigation
Avoiding Disaster
Getting Clever with State
Case Study: Checkout Queue
Looking to the Future and Further Reading
Contributor Bio
Chapter 26. The Service Mesh: Wrangler of Your Microservices?
Ready to Get Rid of the Monolith?
Current State of Microservice Networking
Service Mesh to the Rescue
The Benefits of a Sidecar Proxy
Eventually Consistent Service Discovery
Observability and Alarming
Sidecar Performance Implications
Thin Libraries and Context Propagation
Configuration Management (Control Plane Versus Data Plane)
The Service Mesh in Practice
The Origin and Development of Envoy at Lyft
Operating Envoy at Lyft
The Future of the Service Mesh
Further Reading
Bio
Part IV. The Human Side of SRE
Chapter 27. Psychological Safety in SRE
The Primary Indicator of a Successful Team
How to Build Psychological Safety into Your Own Team
Further Reading
Bio
Chapter 28. SRE Cognitive Work
Introduction
What Do SRE People Do?
Why Should We Care About Practitioner Cognition?
Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
Human Performance in Modern Complex Systems: The Main Themes
Observations on SRE Cognitive Work Around Incidents
Every Incident Could Have Been Worse
Sacrifice Decisions Take Place Under Uncertainty
Repairs to Functional Systems
Special Knowledge About Complex Systems
Managing the Costs of Coordination
SREs Are Cognitive Agents Working in a Joint Cognitive System
The Calibration Problem
Mental Models
Incidents Trigger Individual Recalibration
Incidents Are Opportunities for Collective Recalibration
What Are the Implications of All This?
Incidents Will Continue
Incidents Will Impose Costs
Incident Patterns Will Change
Incidents Point to Specific Calibration Problems and Locations
What Should Happen Next?
Build a Corpus of Cases
Focus on Making Automation a Team Player in SRE Work
Address the Calibration Problem
What Can You Do?
Conclusion
References
Contributor Bio
Chapter 29. Beyond Burnout
Defining Mental Disorders
Mental Disorders Are Missing from the Diversity Conversation
Sanity Isn’t a Business Requirement
Thoughts and Prayers Aren’t Scalable
Full-Stack Inclusivity
Application
Interviewing
Compensation
Benefits
Onboarding
Working Conditions
Job Duties
Training
Promotion
Leaving
Inclusivity for Anyone Helps Everyone
Mental Disorder Resources
Contributor Bio
Chapter 30. Against On-Call: A Polemic
The Rationale for On-Call
First, Do No Harm
Parallels with SRE
Differences with SRE
Underlying Assumptions Driving On-Call for Engineers
On-Call Is Emergency Medicine Instead of Ward Medicine
Counterarguments
The Cost to Humans of Doing On-Call
We don’t need another hero
Actual Solutions
Training
Prioritization
Improving On-the-Job Performance
We Need a Fundamental Change in Approach
Strong-Anti-On-Call
Weak-Anti-On-Call
A Union of the Two
Conclusion
Contributor Bio
Chapter 31. Elegy for Complex Systems
The Computer and Human Systems Cannot Be Separated
Decoherence and Cascading Failure
Always in a State of Partial Failure
Novelty Priority Inversion
Nobody Anticipates the Overhead of Coordination
Your healthcare.gov Is Out There
To Get Involved
Further Reading
Contributor Bio
Chapter 32. Intersections Between Operations and Social Activism
Before, During, After
Creating the Perfect Plan
Principles of Organizing
Managing Crisis: Responding When Things Break Down
Writing Our Own History: Making Sense of What Went Down
The Long Tail: Turning Action into Change
Activism and Change Within a Company
Conclusion
Contributor Bio
Chapter 33. Conclusion
Index
About the Editor
Colophon
Seeking SRE CONVERSATIONS ABOUT RUNNING PRODUCTION SYSTEMS AT SCALE Curated and edited by David N. Blank-Edelman
Praise for Seeking SRE

“Reading this book is like being a fly on the wall as SREs discuss the challenges and successes they’ve had implementing SRE strategies outside of Google. A must-read for everyone in tech!”
- Thomas A. Limoncelli, SRE Manager, Stack Overflow, Inc.; Google SRE Alum

“A fantastic collection of SRE insights and principles from engineers at Google, Netflix, Dropbox, SoundCloud, Spotify, Amazon, and more. Seeking SRE shares the secrets to high availability and durability for many of the most popular products we all know and use.”
- Tammy Butow, Principal SRE, Gremlin

“Imagine you invited all your favorite SREs to a big dinner party where you just walked around all night quietly eavesdropping. What would you hear? This book is that. These are the conversations that happen between the sessions at conferences or over lunch. These are the (sometimes animated, but always principled) debates we have among ourselves. This book is your seat at the SRE family kitchen table.”
- Dave Rensin, Director of Google CRE
“Although Google’s two SRE books have been a force for good in the industry, they primarily frame the SRE narrative in the context of the solutions Google decided upon, and those may or may not work for every organization. Seeking SRE does an excellent job of demonstrating how SRE tenets can be adopted (or adapted) in various contexts across different organizations, while still staying true to the core principles championed by Google. In addition to providing the rationale and technical underpinning behind several of the infrastructural paradigms du jour that are required to build resilient systems, Seeking SRE also underscores the cultural scaffolding needed to ensure their successful implementation. The result is an actionable blueprint that the reader can use to make informed choices about when, why, and how to introduce these changes into existing infrastructures and organizations.”
- Cindy Sridharan, Distributed Systems Engineer
Seeking SRE
Conversations About Running Production Systems at Scale
Curated and edited by David N. Blank-Edelman

Beijing, Boston, Farnham, Sebastopol, Tokyo
Seeking SRE
Curated and edited by David N. Blank-Edelman

Copyright © 2018 David N. Blank-Edelman. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nikki McDonald
Editor: Virginia Wilson
Production Editors: Kristen Brown and Melanie Yarbrough
Copyeditor: Octal Publishing Services, Inc.
Proofreader: Rachel Monaghan
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2018: First Edition

Revision History for the First Edition
2018-08-21: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491978863 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Seeking SRE, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97886-3
[GP]
Table of Contents

Introduction

Part I. SRE Implementation
1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
3. So, You Want to Build an SRE Team?
4. Using Incident Metrics to Improve SRE at Scale
5. Working with Third Parties Shouldn’t Suck
6. How to Apply SRE Principles Without Dedicated SRE Teams
7. SRE Without SRE: The Spotify Case Study
8. Introducing SRE in Large Enterprises
9. From SysAdmin to SRE in 8,963 Words
10. Clearing the Way for SRE in the Enterprise
11. SRE Patterns Loved by DevOps People Everywhere
12. DevOps and SRE: Voices from the Community
13. Production Engineering at Facebook

Part II. Near Edge SRE
14. In the Beginning, There Was Chaos
15. The Intersection of Reliability and Privacy
16. Database Reliability Engineering
17. Engineering for Data Durability
18. Introduction to Machine Learning for SRE

Part III. SRE Best Practices and Technologies
19. Do Docs Better: Integrating Documentation into the Engineering Workflow
20. Active Teaching and Learning
21. The Art and Science of the Service-Level Objective
22. SRE as a Success Culture
23. SRE Antipatterns
24. Immutable Infrastructure and SRE
25. Scriptable Load Balancers
26. The Service Mesh: Wrangler of Your Microservices?

Part IV. The Human Side of SRE
27. Psychological Safety in SRE
28. SRE Cognitive Work
29. Beyond Burnout
30. Against On-Call: A Polemic
31. Elegy for Complex Systems
32. Intersections Between Operations and Social Activism
33. Conclusion

Index