Hadoop in Practice
brief contents
contents
preface
acknowledgments
about this book
Roadmap
Code conventions and downloads
Third-party libraries
Datasets
Getting help
Author Online
About the author
About the cover illustration
Part 1 Background and fundamentals
Chapter 1 Hadoop in a heartbeat
1.1 What is Hadoop?
1.1.1 Core Hadoop components
1.1.2 The Hadoop ecosystem
1.1.3 Physical architecture
1.1.4 Who’s using Hadoop?
1.1.5 Hadoop limitations
1.2 Running Hadoop
1.2.1 Downloading and installing Hadoop
1.2.2 Hadoop configuration
1.2.3 Basic CLI commands
1.2.4 Running a MapReduce job
1.3 Chapter summary
Part 2 Data logistics
Chapter 2 Moving data in and out of Hadoop
2.1 Key elements of ingress and egress
2.2 Moving data into Hadoop
2.2.1 Pushing log files into Hadoop
Technique 1 Pushing system log messages into HDFS with Flume
2.2.2 Pushing and pulling semistructured and binary files
Technique 2 An automated mechanism to copy files into HDFS
Technique 3 Scheduling regular ingress activities with Oozie
2.2.3 Pulling data from databases
Technique 4 Database ingress with MapReduce
Technique 5 Using Sqoop to import data from MySQL
2.2.4 HBase
Technique 6 HBase ingress into HDFS
Technique 7 MapReduce with HBase as a data source
2.3 Moving data out of Hadoop
2.3.1 Egress to a local filesystem
Technique 8 Automated file copying from HDFS
2.3.2 Databases
Technique 9 Using Sqoop to export data to MySQL
2.3.3 HBase
Technique 10 HDFS egress to HBase
Technique 11 Using HBase as a data sink in MapReduce
2.4 Chapter summary
Chapter 3 Data serialization—working with text and beyond
3.1 Understanding inputs and outputs in MapReduce
3.1.1 Data input
3.1.2 Data output
3.2 Processing common serialization formats
3.2.1 XML
Technique 12 MapReduce and XML
3.2.2 JSON
Technique 13 MapReduce and JSON
3.3 Big data serialization formats
3.3.1 Comparing SequenceFiles, Protocol Buffers, Thrift, and Avro
3.3.2 SequenceFiles
Technique 14 Working with SequenceFiles
3.3.3 Protocol Buffers
Technique 15 Integrating Protocol Buffers with MapReduce
3.3.4 Thrift
Technique 16 Working with Thrift
3.3.5 Avro
Technique 17 Next-generation data serialization with MapReduce
3.4 Custom file formats
3.4.1 Input and output formats
Technique 18 Writing input and output formats for CSV
3.4.2 The importance of output committing
3.5 Chapter summary
Part 3 Big data patterns
Chapter 4 Applying MapReduce patterns to big data
4.1 Joining
4.1.1 Repartition join
Technique 19 Optimized repartition joins
4.1.2 Replicated joins
4.1.3 Semi-joins
Technique 20 Implementing a semi-join
4.1.4 Picking the best join strategy for your data
4.2 Sorting
4.2.1 Secondary sort
Technique 21 Implementing a secondary sort
4.2.2 Total order sorting
Technique 22 Sorting keys across multiple reducers
4.3 Sampling
Technique 23 Reservoir sampling
4.4 Chapter summary
Chapter 5 Streamlining HDFS for big data
5.1 Working with small files
Technique 24 Using Avro to store multiple small files
5.2 Efficient storage with compression
Technique 25 Picking the right compression codec for your data
Technique 26 Compression with HDFS, MapReduce, Pig, and Hive
Technique 27 Splittable LZOP with MapReduce, Hive, and Pig
5.3 Chapter summary
Chapter 6 Diagnosing and tuning performance problems
6.1 Measuring MapReduce and your environment
6.1.1 Tools to extract job statistics
6.1.2 Monitoring
6.2 Determining the cause of your performance woes
6.2.1 Understanding what can impact MapReduce job performance
6.2.2 Map woes
Technique 28 Investigating spikes in input data
Technique 29 Identifying map-side data skew problems
Technique 30 Determining if map tasks have an overall low throughput
Technique 31 Small files
Technique 32 Unsplittable files
6.2.3 Reducer woes
Technique 33 Too few or too many reducers
Technique 34 Identifying reduce-side data skew problems
Technique 35 Determining if reduce tasks have an overall low throughput
Technique 36 Slow shuffle and sort
6.2.4 General task woes
Technique 37 Competing jobs and scheduler throttling
Technique 38 Using stack dumps to discover unoptimized user code
6.2.5 Hardware woes
Technique 39 Discovering hardware failures
Technique 40 CPU contention
Technique 41 Memory swapping
Technique 42 Disk health
Technique 43 Networking
6.3 Visualization
Technique 44 Extracting and visualizing task execution times
6.4 Tuning
6.4.1 Profiling MapReduce user code
Technique 45 Profiling your map and reduce tasks
6.4.2 Configuration
6.4.3 Optimizing the shuffle and sort phase
Technique 46 Avoid the reducer
Technique 47 Filter and project
Technique 48 Using the combiner
Technique 49 Blazingly fast sorting with comparators
6.4.4 Skew mitigation
Technique 50 Collecting skewed data
Technique 51 Reduce skew mitigation
6.4.5 Optimizing user space Java in MapReduce
6.4.6 Data serialization
6.5 Chapter summary
Part 4 Data science
Chapter 7 Utilizing data structures and algorithms
7.1 Modeling data and solving problems with graphs
7.1.1 Modeling graphs
7.1.2 Shortest path algorithm
Technique 52 Find the shortest distance between two users
7.1.3 Friends-of-friends
Technique 53 Calculating FoFs
7.1.4 PageRank
Technique 54 Calculate PageRank over a web graph
7.2 Bloom filters
Technique 55 Parallelized Bloom filter creation in MapReduce
Technique 56 MapReduce semi-join with Bloom filters
7.3 Chapter summary
Chapter 8 Integrating R and Hadoop for statistics and more
8.1 Comparing R and MapReduce integrations
8.2 R fundamentals
8.3 R and Streaming
8.3.1 Streaming and map-only R
Technique 57 Calculate the daily mean for stocks
8.3.2 Streaming, R, and full MapReduce
Technique 58 Calculate the cumulative moving average for stocks
8.4 Rhipe—client-side R and Hadoop working together
Technique 59 Calculating the CMA using Rhipe
8.5 RHadoop—a simpler integration of client-side R and Hadoop
Technique 60 Calculating CMA with RHadoop
8.6 Chapter summary
Chapter 9 Predictive analytics with Mahout
9.1 Using recommenders to make product suggestions
9.1.1 Visualizing similarity metrics
9.1.2 The GroupLens dataset
9.1.3 User-based recommenders
9.1.4 Item-based recommenders
Technique 61 Item-based recommenders using movie ratings
9.2 Classification
9.2.1 Writing a homemade naïve Bayesian classifier
9.2.2 A scalable spam detection classification system
Technique 62 Using Mahout to train and test a spam classifier
9.2.3 Additional classification algorithms
9.3 Clustering with K-means
9.3.1 A gentle introduction
9.3.2 Parallel K-means
Technique 63 K-means with a synthetic 2D dataset
9.3.3 K-means and text
9.3.4 Other Mahout clustering algorithms
9.4 Chapter summary
Part 5 Taming the elephant
Chapter 10 Hacking with Hive
10.1 Hive fundamentals
10.1.1 Installation
10.1.2 Metastore
10.1.3 Databases, tables, partitions, and storage
10.1.4 Data model
10.1.5 Query language
10.1.6 Interactive and noninteractive Hive
10.2 Data analytics with Hive
10.2.1 Serialization and deserialization
Technique 64 Loading log files
10.2.2 UDFs, partitions, bucketing, and compression
Technique 65 Writing UDFs and compressed partitioned tables
10.2.3 Joining data together
Technique 66 Tuning Hive joins
10.2.4 Grouping, sorting, and explains
10.3 Chapter summary
Chapter 11 Programming pipelines with Pig
11.1 Pig fundamentals
11.1.1 Installation
11.1.2 Architecture
11.1.3 PigLatin
11.1.4 Data types
11.1.5 Operators and functions
11.1.6 Interactive and noninteractive Pig
11.2 Using Pig to find malicious actors in log data
11.2.1 Loading data
Technique 67 Schema-rich Apache log loading
11.2.2 Filtering and projection
Technique 68 Reducing your data with filters and projection
11.2.3 Grouping and aggregate UDFs
Technique 69 Grouping and counting IP addresses
11.2.4 Geolocation with UDFs
Technique 70 IP Geolocation using the distributed cache
11.2.5 Streaming
Technique 71 Combining Pig with your scripts
11.2.6 Joining
Technique 72 Combining data in Pig
11.2.7 Sorting
Technique 73 Sorting tuples
11.2.8 Storing data
Technique 74 Storing data in SequenceFiles
11.3 Optimizing user workflows with Pig
Technique 75 A four-step process to working rapidly with big data
11.4 Performance
Technique 76 Pig optimizations
11.5 Chapter summary
Chapter 12 Crunch and other technologies
12.1 What is Crunch?
12.1.1 Background and concepts
12.1.2 Fundamentals
12.1.3 Simple examples
12.2 Finding the most popular URLs in your logs
Technique 77 Crunch log parsing and basic analytics
12.3 Joins
Technique 78 Crunch’s repartition join
12.4 Cascading
12.5 Chapter summary
Chapter 13 Testing and debugging
13.1 Testing
13.1.1 Essential ingredients for effective unit testing
13.1.2 MRUnit
Technique 79 Unit testing MapReduce functions, jobs, and pipelines
13.1.3 LocalJobRunner
Technique 80 Heavyweight job testing with the LocalJobRunner
13.1.4 Integration and QA testing
13.2 Debugging user space problems
13.2.1 Accessing task log output
Technique 81 Examining task logs
13.2.2 Debugging unexpected inputs
Technique 82 Pinpointing a problem Input Split
13.2.3 Debugging JVM settings
Technique 83 Figuring out the JVM startup arguments for a task
13.2.4 Coding guidelines for effective debugging
Technique 84 Debugging and error handling
13.3 MapReduce gotchas
Technique 85 MapReduce anti-patterns
13.4 Chapter summary
appendix A Related technologies
A.1 Hadoop 1.0.x and 0.20.x
A.1.1 Getting more information
A.1.2 Apache and CDH tarball installation
A.1.3 Hadoop UI ports
A.2 Flume
A.2.1 Getting more information
A.2.2 Installation on CDH
A.2.3 Installation on non-CDH
A.3 Oozie
A.3.1 Getting more information
A.3.2 Installation on CDH
A.3.3 Installation on non-CDH
A.4 Sqoop
A.4.1 Getting more information
A.4.2 Installation on CDH
A.4.3 Installation on Apache Hadoop
A.5 HBase
A.5.1 Getting more information
A.5.2 Installation on CDH
A.5.3 Installation on non-CDH
A.6 Avro
A.6.1 Getting more information
A.6.2 Installation
A.7 Protocol Buffers
A.7.1 Getting more information
A.7.2 Building Protocol Buffers
A.8 Apache Thrift
A.8.1 Getting more information
A.8.2 Building Thrift 0.5
A.9 Snappy
A.9.1 Getting more information
A.9.2 Install Hadoop native libraries on CDH
A.9.3 Building Snappy for non-CDH
A.10 LZOP
A.10.1 Getting more information
A.10.2 Building LZOP
A.11 Elephant Bird
A.11.1 Getting more information
A.11.2 Installation
A.12 Hoop
A.12.1 Getting more information
A.12.2 Installation
A.13 MySQL
A.13.1 MySQL JDBC drivers
A.13.2 MySQL server installation
A.14 Hive
A.14.1 Getting more information
A.14.2 Installation on CDH
A.14.3 Installation on non-CDH
A.14.4 Configuring MySQL for metastore storage
A.14.5 Hive warehouse directory permissions
A.14.6 Testing your Hive installation
A.15 Pig
A.15.1 Getting more information
A.15.2 Installation on CDH
A.15.3 Installation on non-CDH
A.15.4 Building PiggyBank
A.15.5 Testing your Pig installation
A.16 Crunch
A.16.1 Getting more information
A.16.2 Installation
A.17 R
A.17.1 Getting more information
A.17.2 Installation on RedHat-based systems
A.17.3 Installation on non-RedHat systems
A.18 RHIPE
A.18.1 Getting more information
A.18.2 Dependencies
A.18.3 Installation on CentOS
A.19 RHadoop
A.19.1 Getting more information
A.19.2 Dependencies
A.19.3 rmr/rhdfs installation
A.20 Mahout
A.20.1 Getting more information
A.20.2 Mahout installation
appendix B Hadoop built-in ingress and egress tools
B.1 Command line
B.2 Java API
B.3 Python/Perl/Ruby with Thrift
B.4 Hadoop FUSE
B.5 NameNode embedded HTTP
B.6 HDFS proxy
B.7 Hoop
B.8 WebHDFS
B.9 Distributed copy
B.10 WebDAV
B.11 MapReduce
appendix C HDFS dissected
C.1 What is HDFS?
C.2 How HDFS writes files
C.3 How HDFS reads files
appendix D Optimized MapReduce join frameworks
D.1 An optimized repartition join framework
D.2 A replicated join framework
index