Seeking SRE Conversations About Running Production Systems at Sc....pdf

Published: 2022-06-08 | Posted by: admin | Category: Manuals | File size: 14.93M | Format: pdf

Table of Contents

Introduction

And So It Begins...

Origin Story

Voices

Forward in All Directions! [Footnote 1: Apologies to 3 Mustaphas 3.]

Acknowledgments

Part I. SRE Implementation

Chapter 1. Context Versus Control in SRE

Contributor Bio

Chapter 2. Interviewing Site Reliability Engineers

Interviewing 101

Who Is Involved

Industry Versus University

Biases

The Funnel

SRE Funnels

Phone Screens

The Onsite Interview

Take-Home Questions

Advice for Hiring Managers

Final Thoughts on Interviewing SREs

Further Reading

Contributor Bio

Chapter 3. So, You Want to Build an SRE Team?

Choose SRE for the Right Reasons

Orienting to a Data-Driven Approach

Commitment to SRE

Making a Decision About SRE

Contributor Bio

Chapter 4. Using Incident Metrics to Improve SRE at Scale

The Virtuous Cycle to the Rescue: If You Don’t Measure It…

Metrics Review: If a Metric Falls in the Forest…

Surrogate Metrics

Repair Debt

Virtual Repair Debt: Exorcising the Ghost in the Machine

Real-Time Dashboards: The Bread and Butter of SRE

Learnings: TL;DR

Further Reading

Contributor Bio

Chapter 5. Working with Third Parties Shouldn’t Suck

Build, Buy, or Adopt?

Establish Importance

Identify Stakeholders

Make a Decision

Acknowledge Reality

Third Parties as First-Class Citizens

When They’re Down, You’re Down

Running the Black Box Like a Service

Service-Level Indicators, Service-Level Objectives, and SLAs

Playbook: From Staging to Production

Closing Thoughts

Contributor Bio

Chapter 6. How to Apply SRE Principles Without Dedicated SRE Teams

SREs to the Rescue! (and How They Failed)

A Matter of Scale in Terms of Headcount

The Embedded SRE

You Build It, You Run It

The Deployment Platform

Closing the Loop: Take Your Own Pager

Introducing Production Engineering

Some Implementation Details

Developers’ Productivity and Health Versus the Pager

Resolving Cross-Team Reliability Issues by Using Postmortems

Uniform Infrastructure and Tooling Versus Autonomy and Innovation

Getting Buy-In

Conclusion

Further Reading

Contributor Bios

Chapter 7. SRE Without SRE: The Spotify Case Study

Tabula Rasa: 2006–2007

Prelude

Key Learnings

Beta and Release: 2008–2009

Prelude

Bringing Scalability and Reliability to the Forefront

Key Learnings

The Curse of Success: 2010

Prelude

A New Ownership Model

Formalizing Core Services

Blessed Deployment Time Slots

On-Call and Alerting

Spawning Off Internal Office Support

Addressing the Remaining Top Concerns

Creating Detectives

Key Learnings

Pets and Cattle, and Agile: 2011

Prelude

Forming Bad Habits

Breaking Those Bad Habits

Key Learnings

A System That Didn’t Scale: 2012

Prelude

Manual Work Hits a Cliff

Key Learnings

Introducing Ops-in-Squads: 2013–2015

Prelude

Building on Trust

Driving the Paradigm Shift

Key Learnings

Autonomy Versus Consistency: 2015–2017

Prelude

Benefits

Trade-Offs

Key Learnings

The Future: Speed at Scale, Safely

Contributor Bios

Chapter 8. Introducing SRE in Large Enterprises

Background

Introducing SRE

Defining Current State

Identifying and Educating Stakeholders

Presenting the Business Case

Implementing the SRE Team

Lessons Learned

Sample Implementation Roadmap

Closing Thoughts

Further Reading

Contributor Bio

Chapter 9. From SysAdmin to SRE in 8,963 Words

Clarifying Terminology

Service-Level Indicator

SLA

Service-Level Objective

Establishing SLAs for Internal Components

Understanding External Dependencies

Nontechnical Solutions

Tracking Availability Level

Dealing with Corner Cases

Conclusion

Contributor Bio

Chapter 10. Clearing the Way for SRE in the Enterprise

Toil, the Enemy of SRE

Toil in the Enterprise

Silos, Queues, and Tickets

Silos Get in the Way

Ticket-Driven Request Queues Are Expensive

Take Action Now

Start by Leaning on Lean

Get Rid of as Many Handoffs as Possible

Replace Remaining Handoffs with Self-Service

Self-Service Is More Than a Button

Self-Service Helps SREs in Multiple Ways

Operations as a Service

Error Budgets, Toil Limits, and Other Tools for Empowering Humans

Error Budgets

Toil Limits

Leverage Existing Enthusiasm for DevOps

Unify Backlogs and Protect Capacity

Psychological Safety and Human Factors

Join the Movement

Contributor Bio

Chapter 11. SRE Patterns Loved by DevOps People Everywhere

Pattern 1: Birth of Automated Testing at Google

Pattern 2: Launch and Handoff Readiness Review at Google

Pattern 3: Create a Shared Source Code Repository

Conclusion

Further Reading

Contributor Bios

Chapter 12. DevOps and SRE: Voices from the Community

Chapter 13. Production Engineering at Facebook

Part II. Near Edge SRE

Chapter 14. In the Beginning, There Was Chaos

Chapter 15. The Intersection of Reliability and Privacy

Chapter 16. Database Reliability Engineering

Guiding Principles of the Database Reliability Engineer

Protect the Data

Self-Service for Scale

Databases Are Not Special

A Culture of Database Reliability Engineering

Recoverability

Considerations for Recovery

Anatomy of a Recovery Strategy

Building Block 1: Detection

Building Block 2: Diverse Storage

Building Block 3: A Varied Toolbox

Building Block 4: Testing

Championing Recovery Reliability

Continuous Delivery: From Development to Production

Education and Collaboration

Collaboration

Deployment

Migrations and Versioning

Impact Analysis

Migration Patterns

Championing CD

Making the Case for DBRE

Further Reading

Contributor Bio

Chapter 17. Engineering for Data Durability

Replication Is Table Stakes

Backups

Replication

Real-World Durability

Isolation

Protection

Testing

Safeguards

Recovery

Verification

The Power of Zero

Verification Coverage

Watching the Watchers

Automation

Window of Vulnerability

Operator Fatigue

Reliability

Conclusion

Contributor Bio

Chapter 18. Introduction to Machine Learning for SRE

Why Use Machine Learning for SRE?

Why and How Should My Company Be Engaging in This?

Some SRE Problems Machine Learning Can Help Solve

The Awakening of Applied AI

What Is Machine Learning?

What Do We Mean by Learning?

From Chess to Go: How Deep Can We Dive?

Why Now? What Changed for Us?

What Are Neural Networks?

Neurons and Neural Networks

How and When Should We Apply Neural Networks?

What Kinds of Data Can We Use?

Practical Machine Learning

Popular Libraries for Neural Networks

Practical Machine Learning Examples

Success Stories

Further Reading

My GitHub Repository

Recommended Books

Contributor Bio

Part III. SRE Best Practices and Technologies

Chapter 19. Do Docs Better: Integrating Documentation into the Engineering Workflow

Defining Quality: What Do Good Docs Look Like?

Functional Requirements for SRE Documentation

Integrating Docs into the Engineering Workflow

The Google Experience: g3doc and EngPlay

What We Learned

Doing Docs Better: Best Practices

Create Templates for Each Documentation Type

Better > Best: Set Realistic Standards for Quality

Require Docs as Part of Code Review

Ruthlessly Prune Your Docs

Recognize and Reward Documentation

Communicating the Value of Documentation

Further Reading

Contributor Bios

Chapter 20. Active Teaching and Learning

Active Learning

Active Learning Example: Wheel of Misfortune

Active Learning Example: Incident Manager (a Card Game)

Active Learning Example: SRE Classroom

The Costs of Failing to Learn

Learning Habits of Effective SRE Teams

Production Meetings

Postmortems

A Call to Action: Ditch the Boring Slides

Bio

Chapter 21. The Art and Science of the Service-Level Objective

Why Set Goals?

Availability

Time Quanta

Transactions

Transactions over Time Quanta

On Evaluating SLOs

Histograms

Where Percentiles Fall Down (and Histograms Step Up)

Parting Thought: Looking at SLOs Upside Down

Further Reading

Contributor Bio

Chapter 22. SRE as a Success Culture

Where Did SRE Come From?

Key Values for SRE

Keeping the Site Up

Empowering Teams to “Do the Right Thing”

Approaching Operations as an Engineering Problem

Achieving Business Success Through Promises (Service Levels)

Critical Enabling Functions of SRE

Monitoring, Metrics, and KPIs

Incident Management and Emergency Response

Capacity Planning and Demand Forecasting

Performance Analysis and Optimization

Provisioning, Change Management, and Velocity

Phases of SRE Execution

Phase 1: Firefighting/Reactive

Phase 2: Gatekeepers

Phase 3: Advocates/Partners

Phase 4: Catalytic

Complications of Differing Phases

Focus on the Details of Success

Further Reading

Contributor Bio

Chapter 23. SRE Antipatterns

Antipattern 1: Site Reliability Operations

Antipattern 2: Humans Staring at Screens

Antipattern 3: Mob Incident Response

Antipattern 4: Root Cause = Human Error

Antipattern 5: Passing the Pager

Antipattern 6: Magic Smoke Jumping!

Antipattern 7: Alert Reliability Engineering

Antipattern 8: Hiring a Dog-Walker to Tend Your Pets

Antipattern 9: Speed-Bump Engineering

Antipattern 10: Design Chokepoints

Antipattern 11: Too Much Stick, Not Enough Carrot

Antipattern 12: Postponing Production

Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)

Antipattern 14: Dependency Hell

Antipattern 15: Ungainly Governance

Antipattern 16: Ill-Considered SLOh-Ohs

Antipattern 17: Tossing Your API Over the Firewall

Antipattern 18: Fixing the Ops Team

So, That’s It, Then?

Contributor Bio

Chapter 24. Immutable Infrastructure and SRE

Scalability, Reliability, and Performance

Failure Recovery

Simpler Operations

Faster Startup Times

Known State

Continuous Integration/Continuous Deployment with Confidence

Security

Multiregion Operations

Release Engineering

Building the Base Image

Deploying Applications

Disadvantages

Conclusion

Contributor Bio

Chapter 25. Scriptable Load Balancers

Scriptable Load Balancers: The New Kid on the Block

Why Scriptable Load Balancers?

Making the Difficult Easy

Shard-Aware Routing

Harnessing Potential

Case Study: Intermission

Service-Level Middleware

Middleware to the Rescue

APIs of Service-Level Middleware

Case Study: WAF/Bot Mitigation

Avoiding Disaster

Getting Clever with State

Case Study: Checkout Queue

Looking to the Future and Further Reading

Contributor Bio

Chapter 26. The Service Mesh: Wrangler of Your Microservices?

Ready to Get Rid of the Monolith?

Current State of Microservice Networking

Service Mesh to the Rescue

The Benefits of a Sidecar Proxy

Eventually Consistent Service Discovery

Observability and Alarming

Sidecar Performance Implications

Thin Libraries and Context Propagation

Configuration Management (Control Plane Versus Data Plane)

The Service Mesh in Practice

The Origin and Development of Envoy at Lyft

Operating Envoy at Lyft

The Future of the Service Mesh

Further Reading

Bio

Part IV. The Human Side of SRE

Chapter 27. Psychological Safety in SRE

The Primary Indicator of a Successful Team

How to Build Psychological Safety into Your Own Team

Further Reading

Bio

Chapter 28. SRE Cognitive Work

Introduction

What Do SRE People Do?

Why Should We Care About Practitioner Cognition?

Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted

Human Performance in Modern Complex Systems: The Main Themes

Observations on SRE Cognitive Work Around Incidents

Every Incident Could Have Been Worse

Sacrifice Decisions Take Place Under Uncertainty

Repairs to Functional Systems

Special Knowledge About Complex Systems

Managing the Costs of Coordination

SREs Are Cognitive Agents Working in a Joint Cognitive System

The Calibration Problem

Mental Models

Incidents Trigger Individual Recalibration

Incidents Are Opportunities for Collective Recalibration

What Are the Implications of All This?

Incidents Will Continue

Incidents Will Impose Costs

Incident Patterns Will Change

Incidents Point to Specific Calibration Problems and Locations

What Should Happen Next?

Build a Corpus of Cases

Focus on Making Automation a Team Player in SRE Work

Address the Calibration Problem

What Can You Do?

Conclusion

References

Contributor Bio

Chapter 29. Beyond Burnout

Defining Mental Disorders

Mental Disorders Are Missing from the Diversity Conversation

Sanity Isn’t a Business Requirement

Thoughts and Prayers Aren’t Scalable

Full-Stack Inclusivity

Application

Interviewing

Compensation

Benefits

Onboarding

Working Conditions

Job Duties

Training

Promotion

Leaving

Inclusivity for Anyone Helps Everyone

Mental Disorder Resources

Contributor Bio

Chapter 30. Against On-Call: A Polemic

The Rationale for On-Call

First, Do No Harm

Parallels with SRE

Differences with SRE

Underlying Assumptions Driving On-Call for Engineers

On-Call Is Emergency Medicine Instead of Ward Medicine

Counterarguments

The Cost to Humans of Doing On-Call

We don’t need another hero

Actual Solutions

Training

Prioritization

Improving On-the-Job Performance

We Need a Fundamental Change in Approach

Strong-Anti-On-Call

Weak-Anti-On-Call

A Union of the Two

Conclusion

Contributor Bio

Chapter 31. Elegy for Complex Systems

The Computer and Human Systems Cannot Be Separated

Decoherence and Cascading Failure

Always in a State of Partial Failure

Novelty Priority Inversion

Nobody Anticipates the Overhead of Coordination

Your healthcare.gov Is Out There

To Get Involved

Further Reading

Contributor Bio

Chapter 32. Intersections Between Operations and Social Activism

Before, During, After

Creating the Perfect Plan

Principles of Organizing

Managing Crisis: Responding When Things Break Down

Writing Our Own History: Making Sense of What Went Down

The Long Tail: Turning Action into Change

Activism and Change Within a Company

Conclusion

Contributor Bio

Chapter 33. Conclusion

Index

About the Editor

Colophon

Seeking SRE CONVERSATIONS ABOUT RUNNING PRODUCTION SYSTEMS AT SCALE Curated and edited by David N. Blank-Edelman

Praise for Seeking SRE

“Reading this book is like being a fly on the wall as SREs discuss the challenges and successes they’ve had implementing SRE strategies outside of Google. A must-read for everyone in tech!”
—Thomas A. Limoncelli, SRE Manager, Stack Overflow, Inc.; Google SRE Alum

“A fantastic collection of SRE insights and principles from engineers at Google, Netflix, Dropbox, SoundCloud, Spotify, Amazon, and more. Seeking SRE shares the secrets to high availability and durability for many of the most popular products we all know and use.”
—Tammy Butow, Principal SRE, Gremlin

“Imagine you invited all your favorite SREs to a big dinner party where you just walked around all night quietly eavesdropping. What would you hear? This book is that. These are the conversations that happen between the sessions at conferences or over lunch. These are the (sometimes animated, but always principled) debates we have among ourselves. This book is your seat at the SRE family kitchen table.”
—Dave Rensin, Director of Google CRE

“Although Google’s two SRE books have been a force for good in the industry, they primarily frame the SRE narrative in the context of the solutions Google decided upon, and those may or may not work for every organization. Seeking SRE does an excellent job of demonstrating how SRE tenets can be adopted (or adapted) in various contexts across different organizations, while still staying true to the core principles championed by Google. In addition to providing the rationale and technical underpinning behind several of the infrastructural paradigms du jour that are required to build resilient systems, Seeking SRE also underscores the cultural scaffolding needed to ensure their successful implementation. The result is an actionable blueprint that the reader can use to make informed choices about when, why, and how to introduce these changes into existing infrastructures and organizations.”
—Cindy Sridharan, Distributed Systems Engineer

Seeking SRE

Conversations About Running Production Systems at Scale

Curated and edited by David N. Blank-Edelman

Beijing, Boston, Farnham, Sebastopol, Tokyo

Seeking SRE

Curated and edited by David N. Blank-Edelman

Copyright © 2018 David N. Blank-Edelman. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nikki McDonald

Editor: Virginia Wilson

Production Editors: Kristen Brown and Melanie Yarbrough

Copyeditor: Octal Publishing Services, Inc.

Proofreader: Rachel Monaghan

Indexer: WordCo Indexing Services, Inc.

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

September 2018: First Edition

Revision History for the First Edition

2018-08-21: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491978863 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Seeking SRE, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97886-3

[GP]

Table of Contents

Introduction

Part I. SRE Implementation

1. Context Versus Control in SRE
2. Interviewing Site Reliability Engineers
3. So, You Want to Build an SRE Team?
4. Using Incident Metrics to Improve SRE at Scale
5. Working with Third Parties Shouldn’t Suck
6. How to Apply SRE Principles Without Dedicated SRE Teams
7. SRE Without SRE: The Spotify Case Study
8. Introducing SRE in Large Enterprises
9. From SysAdmin to SRE in 8,963 Words
10. Clearing the Way for SRE in the Enterprise
11. SRE Patterns Loved by DevOps People Everywhere
12. DevOps and SRE: Voices from the Community
13. Production Engineering at Facebook

Part II. Near Edge SRE

14. In the Beginning, There Was Chaos
15. The Intersection of Reliability and Privacy
16. Database Reliability Engineering
17. Engineering for Data Durability
18. Introduction to Machine Learning for SRE

Part III. SRE Best Practices and Technologies

19. Do Docs Better: Integrating Documentation into the Engineering Workflow
20. Active Teaching and Learning
21. The Art and Science of the Service-Level Objective
22. SRE as a Success Culture
23. SRE Antipatterns
24. Immutable Infrastructure and SRE
25. Scriptable Load Balancers
26. The Service Mesh: Wrangler of Your Microservices?

Part IV. The Human Side of SRE

27. Psychological Safety in SRE
28. SRE Cognitive Work
29. Beyond Burnout
30. Against On-Call: A Polemic
31. Elegy for Complex Systems
32. Intersections Between Operations and Social Activism
33. Conclusion

Index
