Big Data
brief contents
contents
preface
acknowledgments
about this book
Roadmap
Code downloads and conventions
Author Online
About the cover illustration
Chapter 1: A new paradigm for Big Data
1.1 How this book is structured
1.2 Scaling with a traditional database
1.2.1 Scaling with a queue
1.2.2 Scaling by sharding the database
1.2.3 Fault-tolerance issues begin
1.2.4 Corruption issues
1.2.5 What went wrong?
1.2.6 How will Big Data techniques help?
1.3 NoSQL is not a panacea
1.4 First principles
1.5 Desired properties of a Big Data system
1.5.1 Robustness and fault tolerance
1.5.2 Low latency reads and updates
1.5.3 Scalability
1.5.4 Generalization
1.5.5 Extensibility
1.5.6 Ad hoc queries
1.5.7 Minimal maintenance
1.5.8 Debuggability
1.6 The problems with fully incremental architectures
1.6.1 Operational complexity
1.6.2 Extreme complexity of achieving eventual consistency
1.6.3 Lack of human-fault tolerance
1.6.4 Fully incremental solution vs. Lambda Architecture solution
1.7 Lambda Architecture
1.7.1 Batch layer
1.7.2 Serving layer
1.7.3 Batch and serving layers satisfy almost all properties
1.7.4 Speed layer
1.8 Recent trends in technology
1.8.1 CPUs aren’t getting faster
1.8.2 Elastic clouds
1.8.3 Vibrant open source ecosystem for Big Data
1.9 Example application: SuperWebAnalytics.com
1.10 Summary
Part 1: Batch layer
Chapter 2: Data model for Big Data
2.1 The properties of data
2.1.1 Data is raw
2.1.2 Data is immutable
2.1.3 Data is eternally true
2.2 The fact-based model for representing data
2.2.1 Example facts and their properties
2.2.2 Benefits of the fact-based model
2.3 Graph schemas
2.3.1 Elements of a graph schema
2.3.2 The need for an enforceable schema
2.4 A complete data model for SuperWebAnalytics.com
2.5 Summary
Chapter 3: Data model for Big Data: Illustration
3.1 Why a serialization framework?
3.2 Apache Thrift
3.2.1 Nodes
3.2.2 Edges
3.2.3 Properties
3.2.4 Tying everything together into data objects
3.2.5 Evolving your schema
3.3 Limitations of serialization frameworks
3.4 Summary
Chapter 4: Data storage on the batch layer
4.1 Storage requirements for the master dataset
4.2 Choosing a storage solution for the batch layer
4.2.1 Using a key/value store for the master dataset
4.2.2 Distributed filesystems
4.3 How distributed filesystems work
4.4 Storing a master dataset with a distributed filesystem
4.5 Vertical partitioning
4.6 Low-level nature of distributed filesystems
4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem
4.8 Summary
Chapter 5: Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
5.1.1 The small-files problem
5.1.2 Towards a higher-level abstraction
5.2 Data storage in the batch layer with Pail
5.2.1 Basic Pail operations
5.2.2 Serializing objects into pails
5.2.3 Batch operations using Pail
5.2.4 Vertical partitioning with Pail
5.2.5 Pail file formats and compression
5.2.6 Summarizing the benefits of Pail
5.3 Storing the master dataset for SuperWebAnalytics.com
5.3.1 A structured pail for Thrift objects
5.3.2 A basic pail for SuperWebAnalytics.com
5.3.3 A split pail to vertically partition the dataset
5.4 Summary
Chapter 6: Batch layer
6.1 Motivating examples
6.1.1 Number of pageviews over time
6.1.2 Gender inference
6.1.3 Influence score
6.2 Computing on the batch layer
6.3 Recomputation algorithms vs. incremental algorithms
6.3.1 Performance
6.3.2 Human-fault tolerance
6.3.3 Generality of the algorithms
6.3.4 Choosing a style of algorithm
6.4 Scalability in the batch layer
6.5 MapReduce: a paradigm for Big Data computing
6.5.1 Scalability
6.5.2 Fault-tolerance
6.5.3 Generality of MapReduce
6.6 Low-level nature of MapReduce
6.6.1 Multistep computations are unnatural
6.6.2 Joins are very complicated to implement manually
6.6.3 Logical and physical execution tightly coupled
6.7 Pipe diagrams: a higher-level way of thinking about batch computation
6.7.1 Concepts of pipe diagrams
6.7.2 Executing pipe diagrams via MapReduce
6.7.3 Combiner aggregators
6.7.4 Pipe diagram examples
6.8 Summary
Chapter 7: Batch layer: Illustration
7.1 An illustrative example
7.2 Common pitfalls of data-processing tools
7.2.1 Custom languages
7.2.2 Poorly composable abstractions
7.3 An introduction to JCascalog
7.3.1 The JCascalog data model
7.3.2 The structure of a JCascalog query
7.3.3 Querying multiple datasets
7.3.4 Grouping and aggregators
7.3.5 Stepping through an example query
7.3.6 Custom predicate operations
7.4 Composition
7.4.1 Combining subqueries
7.4.2 Dynamically created subqueries
7.4.3 Predicate macros
7.4.4 Dynamically created predicate macros
7.5 Summary
Chapter 8: An example batch layer: Architecture and algorithms
8.1 Design of the SuperWebAnalytics.com batch layer
8.1.1 Supported queries
8.1.2 Batch views
8.2 Workflow overview
8.3 Ingesting new data
8.4 URL normalization
8.5 User-identifier normalization
8.6 Deduplicate pageviews
8.7 Computing batch views
8.7.1 Pageviews over time
8.7.2 Unique visitors over time
8.7.3 Bounce-rate analysis
8.8 Summary
Chapter 9: An example batch layer: Implementation
9.1 Starting point
9.2 Preparing the workflow
9.3 Ingesting new data
9.4 URL normalization
9.5 User-identifier normalization
9.6 Deduplicate pageviews
9.7 Computing batch views
9.7.1 Pageviews over time
9.7.2 Uniques over time
9.7.3 Bounce-rate analysis
9.8 Summary
Part 2: Serving layer
Chapter 10: Serving layer
10.1 Performance metrics for the serving layer
10.2 The serving layer solution to the normalization/denormalization problem
10.3 Requirements for a serving layer database
10.4 Designing a serving layer for SuperWebAnalytics.com
10.4.1 Pageviews over time
10.4.2 Uniques over time
10.4.3 Bounce-rate analysis
10.5 Contrasting with a fully incremental solution
10.5.1 Fully incremental solution to uniques over time
10.5.2 Comparing to the Lambda Architecture solution
10.6 Summary
Chapter 11: Serving layer: Illustration
11.1 Basics of ElephantDB
11.1.1 View creation in ElephantDB
11.1.2 View serving in ElephantDB
11.1.3 Using ElephantDB
11.2 Building the serving layer for SuperWebAnalytics.com
11.2.1 Pageviews over time
11.2.2 Uniques over time
11.2.3 Bounce-rate analysis
11.3 Summary
Part 3: Speed layer
Chapter 12: Realtime views
12.1 Computing realtime views
12.2 Storing realtime views
12.2.1 Eventual accuracy
12.2.2 Amount of state stored in the speed layer
12.3 Challenges of incremental computation
12.3.1 Validity of the CAP theorem
12.3.2 The complex interaction between the CAP theorem and incremental algorithms
12.4 Asynchronous versus synchronous updates
12.5 Expiring realtime views
12.6 Summary
Chapter 13: Realtime views: Illustration
13.1 Cassandra’s data model
13.2 Using Cassandra
13.2.1 Advanced Cassandra
13.3 Summary
Chapter 14: Queuing and stream processing
14.1 Queuing
14.1.1 Single-consumer queue servers
14.1.2 Multi-consumer queues
14.2 Stream processing
14.2.1 Queues and workers
14.2.2 Queues-and-workers pitfalls
14.3 Higher-level, one-at-a-time stream processing
14.3.1 Storm model
14.3.2 Guaranteeing message processing
14.4 SuperWebAnalytics.com speed layer
14.4.1 Topology structure
14.5 Summary
Chapter 15: Queuing and stream processing: Illustration
15.1 Defining topologies with Apache Storm
15.2 Apache Storm clusters and deployment
15.3 Guaranteeing message processing
15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer
15.5 Summary
Chapter 16: Micro-batch stream processing
16.1 Achieving exactly-once semantics
16.1.1 Strongly ordered processing
16.1.2 Micro-batch stream processing
16.1.3 Micro-batch processing topologies
16.2 Core concepts of micro-batch stream processing
16.3 Extending pipe diagrams for micro-batch processing
16.4 Finishing the speed layer for SuperWebAnalytics.com
16.4.1 Pageviews over time
16.4.2 Bounce-rate analysis
16.5 Another look at the bounce-rate-analysis example
16.6 Summary
Chapter 17: Micro-batch stream processing: Illustration
17.1 Using Trident
17.2 Finishing the SuperWebAnalytics.com speed layer
17.2.1 Pageviews over time
17.2.2 Bounce-rate analysis
17.3 Fully fault-tolerant, in-memory, micro-batch processing
17.4 Summary
Chapter 18: Lambda Architecture in depth
18.1 Defining data systems
18.2 Batch and serving layers
18.2.1 Incremental batch processing
18.2.2 Measuring and optimizing batch layer resource usage
18.3 Speed layer
18.4 Query layer
18.5 Summary
index