Cover
Copyright
Table of Contents
Part I. Introduction
Chapter 1. Why Do Chaos Engineering?
How Does Chaos Engineering Differ from Testing?
It’s Not Just for Netflix
Prerequisites for Chaos Engineering
Chapter 2. Managing Complexity
Understanding Complex Systems
Example of Systemic Complexity
Takeaway from the Example
Part II. The Principles of Chaos
Chapter 3. Hypothesize about Steady State
Characterizing Steady State
Forming Hypotheses
Chapter 4. Vary Real-World Events
Chapter 5. Run Experiments in Production
State and Services
Input in Production
Other People’s Systems
Agents Making Changes
External Validity
Poor Excuses for Not Practicing Chaos
I’m pretty sure it will break!
If it does break, we’re in big trouble!
Get as Close as You Can
Chapter 6. Automate Experiments to Run Continuously
Automatically Executing Experiments
Automatically Creating Experiments
Chapter 7. Minimize Blast Radius
Part III. Chaos In Practice
Chapter 8. Designing Experiments
1. Pick a Hypothesis
2. Choose the Scope of the Experiment
3. Identify the Metrics You’re Going to Watch
4. Notify the Organization
5. Run the Experiment
6. Analyze the Results
7. Increase the Scope
8. Automate
Chapter 9. Chaos Maturity Model
Sophistication
Adoption
Draw the Map
Chapter 10. Conclusion
Resources
About the Authors
Acknowledgments
Chaos Engineering
Building Confidence in System Behavior through Experiments

Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri

Beijing, Boston, Farnham, Sebastopol, Tokyo
Chaos Engineering
by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri

Copyright © 2017 Netflix, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Brian Anderson
Production Editor: Colleen Cole
Copyeditor: Christina Edwards
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

August 2017: First Edition

Revision History for the First Edition
2017-08-15: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Chaos Engineering, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95306-8
[LSI]
PART I
Introduction

"Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production." (Principles of Chaos)

If you’ve ever run a distributed system in production, you know that unpredictable events are bound to happen. Distributed systems contain so many interacting components that the number of things that can go wrong is enormous. Hard disks can fail, the network can go down, a sudden surge in customer traffic can overload a functional component; the list goes on. All too often, these events trigger outages, poor performance, and other undesirable behaviors.

We’ll never be able to prevent all possible failure modes, but we can identify many of the weaknesses in our system before they are triggered by these events. When we do, we can fix them, preventing those future outages from ever happening. We can make the system more resilient and build confidence in it.

Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems, and builds confidence in the operational behavior of those systems.
Using Chaos Engineering may be as simple as manually running kill -9 on a box inside of your staging environment to simulate failure of a service. Or, it can be as sophisticated as automatically designing and carrying out experiments in a production environment against a small but statistically significant fraction of live traffic.

The History of Chaos Engineering at Netflix

Ever since Netflix began moving out of a datacenter into the cloud in 2008, we have been practicing some form of resiliency testing in production. Only later did our take on it become known as Chaos Engineering. Chaos Monkey started the ball rolling, gaining notoriety for turning off services in the production environment. Chaos Kong transferred those benefits from the small scale to the very large. A tool called Failure Injection Testing (FIT) laid the foundation for tackling the space in between. Principles of Chaos helped formalize the discipline, and our Chaos Automation Platform is fulfilling the potential of running chaos experimentation across the microservice architecture 24/7.

As we developed these tools and experience, we realized that Chaos Engineering isn’t about causing disruptions in a service. Sure, breaking stuff is easy, but it’s not always productive. Chaos Engineering is about surfacing the chaos already inherent in a complex system. Better comprehension of systemic effects leads to better engineering in distributed systems, which improves resiliency.

This book explains the main concepts of Chaos Engineering, and how you can apply these concepts in your organization. While the tools that we have written may be specific to Netflix’s environment, we believe the principles are widely applicable to other contexts.
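
The simplest experiment mentioned above, manually running kill -9 against a service, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tool from the book: the "service" is a stand-in child process, the function names start_stand_in_service and kill_experiment are hypothetical, and the sketch assumes a POSIX system where SIGKILL is available. A real experiment would target an actual service in a staging environment and observe how the rest of the system responds.

```python
import signal
import subprocess
import sys
import time


def start_stand_in_service() -> subprocess.Popen:
    """Start a child process that stands in for a long-running service (hypothetical)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_experiment(service: subprocess.Popen, timeout: float = 5.0) -> dict:
    """Send SIGKILL (the signal that `kill -9` sends) and record what was observed."""
    # Confirm the service is running before we inject the failure.
    alive_before = service.poll() is None
    started = time.monotonic()

    # Inject the failure: SIGKILL cannot be caught or ignored by the process.
    service.send_signal(signal.SIGKILL)

    # On POSIX, a process killed by signal N reports a return code of -N.
    return_code = service.wait(timeout=timeout)
    return {
        "alive_before": alive_before,
        "killed_by_sigkill": return_code == -signal.SIGKILL,
        "seconds_to_exit": time.monotonic() - started,
    }


if __name__ == "__main__":
    print(kill_experiment(start_stand_in_service()))
```

The sketch only confirms that the failure was injected. The useful part of an experiment is what comes next: watching whether the surrounding system keeps behaving normally once the service is gone.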
CHAPTER 1
Why Do Chaos Engineering?

Chaos Engineering is an approach for learning about how your system behaves by applying a discipline of empirical exploration. Just as scientists conduct experiments to study physical and social phenomena, Chaos Engineering uses experiments to learn about a particular system.

Applying Chaos Engineering improves the resilience of a system. By designing and executing Chaos Engineering experiments, you will learn about weaknesses in your system that could potentially lead to outages that cause customer harm. You can then address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.

How Does Chaos Engineering Differ from Testing?

Chaos Engineering, fault injection, and failure testing have a large overlap in concerns and often in tooling as well; for example, many Chaos Engineering experiments at Netflix rely on fault injection to introduce the effect being studied. The primary difference between Chaos Engineering and these other approaches is that Chaos Engineering is a practice for generating new information, while fault injection is a specific approach to testing one condition.

When you want to explore the many ways a complex system can misbehave, injecting communication failures like latency and errors is one good approach. But we also want to explore things like a large