
Data.Analytics.with.Hadoop.An.Introduction.for.Data.Scientists.pdf

Copyright
Table of Contents
Preface
What to Expect from This Book
Who This Book Is For
How to Read This Book
Overview of Chapters
Programming and Code Examples
GitHub Repository
Executing Distributed Jobs
Permissions and Citation
Feedback and How to Contact Us
Safari® Books Online
How to Contact Us
Acknowledgments
Part I. Introduction to Distributed Computing
Chapter 1. The Age of the Data Product
What Is a Data Product?
Building Data Products at Scale with Hadoop
Leveraging Large Datasets
Hadoop for Data Products
The Data Science Pipeline and the Hadoop Ecosystem
Big Data Workflows
Conclusion
Chapter 2. An Operating System for Big Data
Basic Concepts
Hadoop Architecture
A Hadoop Cluster
HDFS
YARN
Working with a Distributed File System
Basic File System Operations
File Permissions in HDFS
Other HDFS Interfaces
Working with Distributed Computation
MapReduce: A Functional Programming Model
MapReduce: Implemented on a Cluster
Beyond a Map and Reduce: Job Chaining
Submitting a MapReduce Job to YARN
Conclusion
Chapter 3. A Framework for Python and Hadoop Streaming
Hadoop Streaming
Computing on CSV Data with Streaming
Executing Streaming Jobs
A Framework for MapReduce with Python
Counting Bigrams
Other Frameworks
Advanced MapReduce
Combiners
Partitioners
Job Chaining
Conclusion
Chapter 4. In-Memory Computing with Spark
Spark Basics
The Spark Stack
Resilient Distributed Datasets
Programming with RDDs
Interactive Spark Using PySpark
Writing Spark Applications
Visualizing Airline Delays with Spark
Conclusion
Chapter 5. Distributed Analysis and Patterns
Computing with Keys
Compound Keys
Keyspace Patterns
Pairs versus Stripes
Design Patterns
Summarization
Indexing
Filtering
Toward Last-Mile Analytics
Fitting a Model
Validating Models
Conclusion
Part II. Workflows and Tools for Big Data Science
Chapter 6. Data Mining and Warehousing
Structured Data Queries with Hive
The Hive Command-Line Interface (CLI)
Hive Query Language (HQL)
Data Analysis with Hive
HBase
NoSQL and Column-Oriented Databases
Real-Time Analytics with HBase
Conclusion
Chapter 7. Data Ingestion
Importing Relational Data with Sqoop
Importing from MySQL to HDFS
Importing from MySQL to Hive
Importing from MySQL to HBase
Ingesting Streaming Data with Flume
Flume Data Flows
Ingesting Product Impression Data with Flume
Conclusion
Chapter 8. Analytics with Higher-Level APIs
Pig
Pig Latin
Data Types
Relational Operators
User-Defined Functions
Wrapping Up
Spark’s Higher-Level APIs
Spark SQL
DataFrames
Conclusion
Chapter 9. Machine Learning
Scalable Machine Learning with Spark
Collaborative Filtering
Classification
Clustering
Conclusion
Chapter 10. Summary: Doing Distributed Data Science
Data Product Lifecycle
Data Lakes
Data Ingestion
Computational Data Stores
Machine Learning Lifecycle
Conclusion
Appendix A. Creating a Hadoop Pseudo-Distributed Development Environment
Quick Start
Setting Up Linux
Creating a Hadoop User
Configuring SSH
Installing Java
Disabling IPv6
Installing Hadoop
Unpacking
Environment
Hadoop Configuration
Formatting the Namenode
Starting Hadoop
Restarting Hadoop
Appendix B. Installing Hadoop Ecosystem Products
Packaged Hadoop Distributions
Self-Installation of Apache Hadoop Ecosystem Products
Basic Installation and Configuration Steps
Sqoop-Specific Configurations
Hive-Specific Configuration
HBase-Specific Configurations
Installing Spark
Glossary
Index
About the Authors
Colophon
Data Analytics with Hadoop
An Introduction for Data Scientists
Benjamin Bengfort and Jenny Kim
Beijing, Boston, Farnham, Sebastopol, Tokyo
Data Analytics with Hadoop
by Benjamin Bengfort and Jenny Kim

Copyright © 2016 Jenny Kim and Benjamin Bengfort. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Melanie Yarbrough
Copyeditor: Colleen Toporek
Proofreader: Jasmine Kwityn
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

June 2016: First Edition

Revision History for the First Edition
2016-05-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491913703 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analytics with Hadoop, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91370-3
[LSI]