Data.Analytics.with.Hadoop.An.Introduction.for.Data.Scientists.pdf

Published: 2022-06-16 Posted by: admin Category: Manuals File size: 6.62M Format: pdf

Total pages: 288

Table of Contents

Preface

What to Expect from This Book

Who This Book Is For

How to Read This Book

Overview of Chapters

Programming and Code Examples

GitHub Repository

Executing Distributed Jobs

Permissions and Citation

Feedback and How to Contact Us

Safari® Books Online

How to Contact Us

Acknowledgments

Part I. Introduction to Distributed Computing

Chapter 1. The Age of the Data Product

What Is a Data Product?

Building Data Products at Scale with Hadoop

Leveraging Large Datasets

Hadoop for Data Products

The Data Science Pipeline and the Hadoop Ecosystem

Big Data Workflows

Conclusion

Chapter 2. An Operating System for Big Data

Basic Concepts

Hadoop Architecture

A Hadoop Cluster

HDFS

YARN

Working with a Distributed File System

Basic File System Operations

File Permissions in HDFS

Other HDFS Interfaces

Working with Distributed Computation

MapReduce: A Functional Programming Model

MapReduce: Implemented on a Cluster

Beyond a Map and Reduce: Job Chaining

Submitting a MapReduce Job to YARN

Conclusion

Chapter 3. A Framework for Python and Hadoop Streaming

Hadoop Streaming

Computing on CSV Data with Streaming

Executing Streaming Jobs

A Framework for MapReduce with Python

Counting Bigrams

Other Frameworks

Advanced MapReduce

Combiners

Partitioners

Job Chaining

Conclusion

Chapter 4. In-Memory Computing with Spark

Spark Basics

The Spark Stack

Resilient Distributed Datasets

Programming with RDDs

Interactive Spark Using PySpark

Writing Spark Applications

Visualizing Airline Delays with Spark

Conclusion

Chapter 5. Distributed Analysis and Patterns

Computing with Keys

Compound Keys

Keyspace Patterns

Pairs versus Stripes

Design Patterns

Summarization

Indexing

Filtering

Toward Last-Mile Analytics

Fitting a Model

Validating Models

Conclusion

Part II. Workflows and Tools for Big Data Science

Chapter 6. Data Mining and Warehousing

Structured Data Queries with Hive

The Hive Command-Line Interface (CLI)

Hive Query Language (HQL)

Data Analysis with Hive

HBase

NoSQL and Column-Oriented Databases

Real-Time Analytics with HBase

Conclusion

Chapter 7. Data Ingestion

Importing Relational Data with Sqoop

Importing from MySQL to HDFS

Importing from MySQL to Hive

Importing from MySQL to HBase

Ingesting Streaming Data with Flume

Flume Data Flows

Ingesting Product Impression Data with Flume

Conclusion

Chapter 8. Analytics with Higher-Level APIs

Pig

Pig Latin

Data Types

Relational Operators

User-Defined Functions

Wrapping Up

Spark’s Higher-Level APIs

Spark SQL

DataFrames

Conclusion

Chapter 9. Machine Learning

Scalable Machine Learning with Spark

Collaborative Filtering

Classification

Clustering

Conclusion

Chapter 10. Summary: Doing Distributed Data Science

Data Product Lifecycle

Data Lakes

Data Ingestion

Computational Data Stores

Machine Learning Lifecycle

Conclusion

Appendix A. Creating a Hadoop Pseudo-Distributed Development Environment

Quick Start

Setting Up Linux

Creating a Hadoop User

Configuring SSH

Installing Java

Disabling IPv6

Installing Hadoop

Unpacking

Environment

Hadoop Configuration

Formatting the Namenode

Starting Hadoop

Restarting Hadoop

Appendix B. Installing Hadoop Ecosystem Products

Packaged Hadoop Distributions

Self-Installation of Apache Hadoop Ecosystem Products

Basic Installation and Configuration Steps

Sqoop-Specific Configurations

Hive-Specific Configuration

HBase-Specific Configurations

Installing Spark

Glossary

Index

About the Authors

Colophon

Data Analytics with Hadoop AN INTRODUCTION FOR DATA SCIENTISTS Benjamin Bengfort & Jenny Kim

Data Analytics with Hadoop: An Introduction for Data Scientists
Benjamin Bengfort and Jenny Kim
Beijing, Boston, Farnham, Sebastopol, Tokyo

Data Analytics with Hadoop
by Benjamin Bengfort and Jenny Kim

Copyright © 2016 Jenny Kim and Benjamin Bengfort. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Melanie Yarbrough
Copyeditor: Colleen Toporek
Proofreader: Jasmine Kwityn
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

June 2016: First Edition

Revision History for the First Edition
2016-05-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491913703 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Analytics with Hadoop, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91370-3
[LSI]
