Practical Monitoring
Effective Strategies for the Real World
Mike Julian
Beijing • Boston • Farnham • Sebastopol • Tokyo
Practical Monitoring
by Mike Julian
Copyright © 2018 Mike Julian. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Virginia Wilson and Nikki McDonald
Production Editor: Justin Billing
Copyeditor: Dwight Ramsey
Proofreader: Amanda Kersey
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
November 2017: First Edition
Revision History for the First Edition
2017-10-26: First Release
See http://oreil.ly/2y3s5AB for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Monitoring, the cover image,
and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-95735-6
[LSI]
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Part I. Monitoring Principles
1. Monitoring Anti-Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Anti-Pattern #1: Tool Obsession 3
Monitoring Is Multiple Complex Problems Under One Name 4
Avoid Cargo-Culting Tools 6
Sometimes, You Really Do Have to Build It 7
The Single Pane of Glass Is a Myth 8
Anti-Pattern #2: Monitoring-as-a-Job 8
Anti-Pattern #3: Checkbox Monitoring 9
What Does “Working” Actually Mean? Monitor That. 10
OS Metrics Aren’t Very Useful—for Alerting 10
Collect Your Metrics More Often 11
Anti-Pattern #4: Using Monitoring as a Crutch 11
Anti-Pattern #5: Manual Configuration 12
Wrap-Up 13
2. Monitoring Design Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Pattern #1: Composable Monitoring 15
The Components of a Monitoring Service 16
Pattern #2: Monitor from the User Perspective 24
Pattern #3: Buy, Not Build 25
It’s Cheaper 26
You’re (Probably) Not an Expert at Architecting These Tools 26
SaaS Allows You to Focus on the Company’s Product 27
No, Really, SaaS Is Actually Better 27
Pattern #4: Continual Improvement 28
Wrap-Up 28
3. Alerts, On-Call, and Incident Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
What Makes a Good Alert? 32
Stop Using Email for Alerts 33
Write Runbooks 33
Arbitrary Static Thresholds Aren’t the Only Way 34
Delete and Tune Alerts 35
Use Maintenance Periods 35
Attempt Automated Self-Healing First 36
On-Call 37
Fixing False Alarms 37
Cutting Down on Needless Firefighting 37
Building a Better On-Call Rotation 38
Incident Management 40
Postmortems 42
Wrap-Up 43
4. Statistics Primer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Before Statistics in Systems Operations 45
Math to the Rescue! 46
Statistics Isn’t Magic 47
Mean and Average 47
Median 49
Seasonality 49
Quantiles 50
Standard Deviation 51
Wrap-Up 52
Part II. Monitoring Tactics
5. Monitoring the Business. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Business KPIs 57
Two Real-World Examples 60
Yelp 60
Reddit 61
Tying Business KPIs to Technical Metrics 62
My App Doesn’t Have Those Metrics! 62
Finding Your Company’s Business KPIs 63
Wrap-Up 64
6. Frontend Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
The Cost of a Slow App 66
Two Approaches to Frontend Monitoring 67
Document Object Model (DOM) 68
Frontend Performance Metrics 69
OK, That’s Great, but How Do I Use This? 71
Logging 72
Synthetic Monitoring 72
Wrap-Up 73
7. Application Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Instrumenting Your Apps with Metrics 75
How It Works Under the Hood 77
Monitoring Build and Release Pipelines 79
Health Endpoint Pattern 80
Application Logging 84
Wait a Minute…Should I Have a Metric or a Log Entry? 85
What Should I Be Logging? 85
Write to Disk or Write to Network? 86
Serverless / Function-as-a-Service 87
Monitoring Microservice Architectures 87
Wrap-Up 91
8. Server Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Standard OS Metrics 93
CPU 94
Memory 94
Network 95
Disk 95
Load 96
SSL Certificates 97
SNMP 98
Web Servers 98
Database Servers 100
Load Balancers 101
Message Queues 101
Caching 102
DNS 102
NTP 103
Miscellaneous Corporate Infrastructure 103
DHCP 103
SMTP 104
Monitoring Scheduled Jobs 104
Logging 106
Collection 106
Storage 107
Analysis 107
Wrap-Up 108
9. Network Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
The Pains of SNMP 110
What Is SNMP? 110
How Does It Work? 110
A Word on Security 112
How Do I Use SNMP? 113
Interface Metrics 116
Interface and Logging 118
Recap 118
Configuration Tracking 119
Voice and Video 119
Routing 120
Spanning Tree Protocol (STP) 121
Chassis 121
CPU and Memory 121
Hardware 121
Flow Monitoring 122
Capacity Planning 123
Working Backward 123
Forecasting 123
Wrap-Up 124
10. Security Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Monitoring and Compliance 126
User, Command, and Filesystem Auditing 127
Setting Up auditd 127
auditd and Remote Logs 128
Host Intrusion Detection System (HIDS) 129
rkhunter 129
Network Intrusion Detection System (NIDS) 130
Wrap-Up 132
11. Conducting a Monitoring Assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Business KPIs 133
Frontend Monitoring 134
Application and Server Monitoring 134
Security Monitoring 136
Alerting 136
Wrap-Up 137
A. An Example Runbook: Demo App. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
B. Availability Chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145