Big Data
brief contents
contents
preface
acknowledgments
about this book
Roadmap
Code downloads and conventions
Author Online
About the cover illustration
Chapter 1: A new paradigm for Big Data
1.1 How this book is structured
1.2 Scaling with a traditional database
1.2.1 Scaling with a queue
1.2.2 Scaling by sharding the database
1.2.3 Fault-tolerance issues begin
1.2.4 Corruption issues
1.2.5 What went wrong?
1.2.6 How will Big Data techniques help?
1.3 NoSQL is not a panacea
1.4 First principles
1.5 Desired properties of a Big Data system
1.5.1 Robustness and fault tolerance
1.5.2 Low latency reads and updates
1.5.3 Scalability
1.5.4 Generalization
1.5.5 Extensibility
1.5.6 Ad hoc queries
1.5.7 Minimal maintenance
1.5.8 Debuggability
1.6 The problems with fully incremental architectures
1.6.1 Operational complexity
1.6.2 Extreme complexity of achieving eventual consistency
1.6.3 Lack of human-fault tolerance
1.6.4 Fully incremental solution vs. Lambda Architecture solution
1.7 Lambda Architecture
1.7.1 Batch layer
1.7.2 Serving layer
1.7.3 Batch and serving layers satisfy almost all properties
1.7.4 Speed layer
1.8 Recent trends in technology
1.8.1 CPUs aren’t getting faster
1.8.2 Elastic clouds
1.8.3 Vibrant open source ecosystem for Big Data
1.9 Example application: SuperWebAnalytics.com
1.10 Summary
Part 1: Batch layer
Chapter 2: Data model for Big Data
2.1 The properties of data
2.1.1 Data is raw
2.1.2 Data is immutable
2.1.3 Data is eternally true
2.2 The fact-based model for representing data
2.2.1 Example facts and their properties
2.2.2 Benefits of the fact-based model
2.3 Graph schemas
2.3.1 Elements of a graph schema
2.3.2 The need for an enforceable schema
2.4 A complete data model for SuperWebAnalytics.com
2.5 Summary
Chapter 3: Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
3.2.1 Nodes
3.2.2 Edges
3.2.3 Properties
3.2.4 Tying everything together into data objects
3.2.5 Evolving your schema
3.3 Limitations of serialization frameworks
3.4 Summary
Chapter 4: Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
4.2.1 Using a key/value store for the master dataset
4.2.2 Distributed filesystems
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary
Chapter 5: Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
5.1.1 The small-files problem
5.1.2 Towards a higher-level abstraction
5.2 Data storage in the batch layer with Pail
5.2.1 Basic Pail operations
5.2.2 Serializing objects into pails
5.2.3 Batch operations using Pail
5.2.4 Vertical partitioning with Pail
5.2.5 Pail file formats and compression
5.2.6 Summarizing the benefits of Pail
5.3 Storing the master dataset for SuperWebAnalytics.com
5.3.1 A structured pail for Thrift objects
5.3.2 A basic pail for SuperWebAnalytics.com
5.3.3 A split pail to vertically partition the dataset
5.4 Summary
Chapter 6: Batch layer
6.1 Motivating examples
6.1.1 Number of pageviews over time
6.1.2 Gender inference
6.1.3 Influence score
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
6.3.1 Performance
6.3.2 Human-fault tolerance
6.3.3 Generality of the algorithms
6.3.4 Choosing a style of algorithm
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
6.5.1 Scalability
6.5.2 Fault-tolerance
6.5.3 Generality of MapReduce
6.6 Low-level nature of MapReduce
6.6.1 Multistep computations are unnatural
6.6.2 Joins are very complicated to implement manually
6.6.3 Logical and physical execution tightly coupled
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
6.7.1 Concepts of pipe diagrams
6.7.2 Executing pipe diagrams via MapReduce
6.7.3 Combiner aggregators
6.7.4 Pipe diagram examples
6.8 Summary
Chapter 7: Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
7.2.1 Custom languages
7.2.2 Poorly composable abstractions
7.3 An introduction to JCascalog
7.3.1 The JCascalog data model
7.3.2 The structure of a JCascalog query
7.3.3 Querying multiple datasets
7.3.4 Grouping and aggregators
7.3.5 Stepping through an example query
7.3.6 Custom predicate operations
7.4 Composition
7.4.1 Combining subqueries
7.4.2 Dynamically created subqueries
7.4.3 Predicate macros
7.4.4 Dynamically created predicate macros
7.5 Summary
Chapter 8: An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
8.1.1 Supported queries
8.1.2 Batch views
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
8.7.1 Pageviews over time
8.7.2 Unique visitors over time
8.7.3 Bounce-rate analysis
8.8 Summary
Chapter 9: An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
9.7.1 Pageviews over time
9.7.2 Uniques over time
9.7.3 Bounce-rate analysis
9.8 Summary
Part 2: Serving layer
Chapter 10: Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for SuperWebAnalytics.com
10.4.1 Pageviews over time
10.4.2 Uniques over time
10.4.3 Bounce-rate analysis
10.5 Contrasting with a fully incremental solution
10.5.1 Fully incremental solution to uniques over time
10.5.2 Comparing to the Lambda Architecture solution
10.6 Summary
Chapter 11: Serving layer: Illustration
11.1 Basics of ElephantDB
11.1.1 View creation in ElephantDB
11.1.2 View serving in ElephantDB
11.1.3 Using ElephantDB
11.2 Building the serving layer for SuperWebAnalytics.com
11.2.1 Pageviews over time
11.2.2 Uniques over time
11.2.3 Bounce-rate analysis
11.3 Summary
Part 3: Speed layer
Chapter 12: Realtime views
12.1 Computing realtime views
12.2 Storing realtime views
12.2.1 Eventual accuracy
12.2.2 Amount of state stored in the speed layer
12.3 Challenges of incremental computation
12.3.1 Validity of the CAP theorem
12.3.2 The complex interaction between the CAP theorem and incremental algorithms
12.4 Asynchronous versus synchronous updates
12.5 Expiring realtime views
12.6 Summary
Chapter 13: Realtime views: Illustration
13.1 Cassandra’s data model
13.2 Using Cassandra
13.2.1 Advanced Cassandra
13.3 Summary
Chapter 14: Queuing and stream processing
14.1 Queuing
14.1.1 Single-consumer queue servers
14.1.2 Multi-consumer queues
14.2 Stream processing
14.2.1 Queues and workers
14.2.2 Queues-and-workers pitfalls
14.3 Higher-level, one-at-a-time stream processing
14.3.1 Storm model
14.3.2 Guaranteeing message processing
14.4 SuperWebAnalytics.com speed layer
14.4.1 Topology structure
14.5 Summary
Chapter 15: Queuing and stream processing: Illustration
15.1 Defining topologies with Apache Storm
15.2 Apache Storm clusters and deployment
15.3 Guaranteeing message processing
15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer
15.5 Summary
Chapter 16: Micro-batch stream processing
16.1 Achieving exactly-once semantics
16.1.1 Strongly ordered processing
16.1.2 Micro-batch stream processing
16.1.3 Micro-batch processing topologies
16.2 Core concepts of micro-batch stream processing
16.3 Extending pipe diagrams for micro-batch processing
16.4 Finishing the speed layer for SuperWebAnalytics.com
16.4.1 Pageviews over time
16.4.2 Bounce-rate analysis
16.5 Another look at the bounce-rate-analysis example
16.6 Summary
Chapter 17: Micro-batch stream processing: Illustration
17.1 Using Trident
17.2 Finishing the SuperWebAnalytics.com speed layer
17.2.1 Pageviews over time
17.2.2 Bounce-rate analysis
17.3 Fully fault-tolerant, in-memory, micro-batch processing
17.4 Summary
Chapter 18: Lambda Architecture in depth
18.1 Defining data systems
18.2 Batch and serving layers
18.2.1 Incremental batch processing
18.2.2 Measuring and optimizing batch layer resource usage
18.3 Speed layer
18.4 Query layer
18.5 Summary
index