
Designing.Data-Intensive.Applications.2017.3.pdf

Copyright
Table of Contents
Preface
Who Should Read This Book?
Scope of This Book
Outline of This Book
References and Further Reading
O’Reilly Safari
How to Contact Us
Acknowledgments
Part I. Foundations of Data Systems
Chapter 1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems
Reliability
Hardware Faults
Software Errors
Human Errors
How Important Is Reliability?
Scalability
Describing Load
Describing Performance
Approaches for Coping with Load
Maintainability
Operability: Making Life Easy for Operations
Simplicity: Managing Complexity
Evolvability: Making Change Easy
Summary
Chapter 2. Data Models and Query Languages
Relational Model Versus Document Model
The Birth of NoSQL
The Object-Relational Mismatch
Many-to-One and Many-to-Many Relationships
Are Document Databases Repeating History?
Relational Versus Document Databases Today
Query Languages for Data
Declarative Queries on the Web
MapReduce Querying
Graph-Like Data Models
Property Graphs
The Cypher Query Language
Graph Queries in SQL
Triple-Stores and SPARQL
The Foundation: Datalog
Summary
Chapter 3. Storage and Retrieval
Data Structures That Power Your Database
Hash Indexes
SSTables and LSM-Trees
B-Trees
Comparing B-Trees and LSM-Trees
Other Indexing Structures
Transaction Processing or Analytics?
Data Warehousing
Stars and Snowflakes: Schemas for Analytics
Column-Oriented Storage
Column Compression
Sort Order in Column Storage
Writing to Column-Oriented Storage
Aggregation: Data Cubes and Materialized Views
Summary
Chapter 4. Encoding and Evolution
Formats for Encoding Data
Language-Specific Formats
JSON, XML, and Binary Variants
Thrift and Protocol Buffers
Avro
The Merits of Schemas
Modes of Dataflow
Dataflow Through Databases
Dataflow Through Services: REST and RPC
Message-Passing Dataflow
Summary
Part II. Distributed Data
Chapter 5. Replication
Leaders and Followers
Synchronous Versus Asynchronous Replication
Setting Up New Followers
Handling Node Outages
Implementation of Replication Logs
Problems with Replication Lag
Reading Your Own Writes
Monotonic Reads
Consistent Prefix Reads
Solutions for Replication Lag
Multi-Leader Replication
Use Cases for Multi-Leader Replication
Handling Write Conflicts
Multi-Leader Replication Topologies
Leaderless Replication
Writing to the Database When a Node Is Down
Limitations of Quorum Consistency
Sloppy Quorums and Hinted Handoff
Detecting Concurrent Writes
Summary
Chapter 6. Partitioning
Partitioning and Replication
Partitioning of Key-Value Data
Partitioning by Key Range
Partitioning by Hash of Key
Skewed Workloads and Relieving Hot Spots
Partitioning and Secondary Indexes
Partitioning Secondary Indexes by Document
Partitioning Secondary Indexes by Term
Rebalancing Partitions
Strategies for Rebalancing
Operations: Automatic or Manual Rebalancing
Request Routing
Parallel Query Execution
Summary
Chapter 7. Transactions
The Slippery Concept of a Transaction
The Meaning of ACID
Single-Object and Multi-Object Operations
Weak Isolation Levels
Read Committed
Snapshot Isolation and Repeatable Read
Preventing Lost Updates
Write Skew and Phantoms
Serializability
Actual Serial Execution
Two-Phase Locking (2PL)
Serializable Snapshot Isolation (SSI)
Summary
Chapter 8. The Trouble with Distributed Systems
Faults and Partial Failures
Cloud Computing and Supercomputing
Unreliable Networks
Network Faults in Practice
Detecting Faults
Timeouts and Unbounded Delays
Synchronous Versus Asynchronous Networks
Unreliable Clocks
Monotonic Versus Time-of-Day Clocks
Clock Synchronization and Accuracy
Relying on Synchronized Clocks
Process Pauses
Knowledge, Truth, and Lies
The Truth Is Defined by the Majority
Byzantine Faults
System Model and Reality
Summary
Chapter 9. Consistency and Consensus
Consistency Guarantees
Linearizability
What Makes a System Linearizable?
Relying on Linearizability
Implementing Linearizable Systems
The Cost of Linearizability
Ordering Guarantees
Ordering and Causality
Sequence Number Ordering
Total Order Broadcast
Distributed Transactions and Consensus
Atomic Commit and Two-Phase Commit (2PC)
Distributed Transactions in Practice
Fault-Tolerant Consensus
Membership and Coordination Services
Summary
Part III. Derived Data
Chapter 10. Batch Processing
Batch Processing with Unix Tools
Simple Log Analysis
The Unix Philosophy
MapReduce and Distributed Filesystems
MapReduce Job Execution
Reduce-Side Joins and Grouping
Map-Side Joins
The Output of Batch Workflows
Comparing Hadoop to Distributed Databases
Beyond MapReduce
Materialization of Intermediate State
Graphs and Iterative Processing
High-Level APIs and Languages
Summary
Chapter 11. Stream Processing
Transmitting Event Streams
Messaging Systems
Partitioned Logs
Databases and Streams
Keeping Systems in Sync
Change Data Capture
Event Sourcing
State, Streams, and Immutability
Processing Streams
Uses of Stream Processing
Reasoning About Time
Stream Joins
Fault Tolerance
Summary
Chapter 12. The Future of Data Systems
Data Integration
Combining Specialized Tools by Deriving Data
Batch and Stream Processing
Unbundling Databases
Composing Data Storage Technologies
Designing Applications Around Dataflow
Observing Derived State
Aiming for Correctness
The End-to-End Argument for Databases
Enforcing Constraints
Timeliness and Integrity
Trust, but Verify
Doing the Right Thing
Predictive Analytics
Privacy and Tracking
Summary
Glossary
Index
About the Author
Colophon
Designing Data-Intensive Applications
The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Martin Kleppmann
Beijing, Boston, Farnham, Sebastopol, Tokyo
Designing Data-Intensive Applications
by Martin Kleppmann

Copyright © 2017 Martin Kleppmann. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Ann Spencer and Marie Beaugureau
Production Editor: Kristen Brown
Copyeditor: Rachel Head
Proofreader: Amanda Kersey
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

March 2017: First Edition

Revision History for the First Edition
2017-03-01: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449373320 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Designing Data-Intensive Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-449-37332-0
[LSI]
Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.
Computing is pop culture. […] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future—it’s living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from]. —Alan Kay, in interview with Dr Dobb’s Journal (2012)