Hadoop in Practice
brief contents
contents
preface
acknowledgments
about this book
Roadmap
Code conventions and downloads
Third-party libraries
Datasets
Getting help
Author Online
About the author
About the cover illustration
Part 1 Background and fundamentals
Chapter 1 Hadoop in a heartbeat
1.1 What is Hadoop?
1.1.1 Core Hadoop components
1.1.2 The Hadoop ecosystem
1.1.3 Physical architecture
1.1.4 Who’s using Hadoop?
1.1.5 Hadoop limitations
1.2 Running Hadoop
1.2.1 Downloading and installing Hadoop
1.2.2 Hadoop configuration
1.2.3 Basic CLI commands
1.2.4 Running a MapReduce job
1.3 Chapter summary
Part 2 Data logistics
Chapter 2 Moving data in and out of Hadoop
2.1 Key elements of ingress and egress
2.2 Moving data into Hadoop
2.2.1 Pushing log files into Hadoop
Technique 1 Pushing system log messages into HDFS with Flume
2.2.2 Pushing and pulling semistructured and binary files
Technique 2 An automated mechanism to copy files into HDFS
Technique 3 Scheduling regular ingress activities with Oozie
2.2.3 Pulling data from databases
Technique 4 Database ingress with MapReduce
Technique 5 Using Sqoop to import data from MySQL
2.2.4 HBase
Technique 6 HBase ingress into HDFS
Technique 7 MapReduce with HBase as a data source
2.3 Moving data out of Hadoop
2.3.1 Egress to a local filesystem
Technique 8 Automated file copying from HDFS
2.3.2 Databases
Technique 9 Using Sqoop to export data to MySQL
2.3.3 HBase
Technique 10 HDFS egress to HBase
Technique 11 Using HBase as a data sink in MapReduce
2.4 Chapter summary
Chapter 3 Data serialization—working with text and beyond
3.1 Understanding inputs and outputs in MapReduce
3.1.1 Data input
3.1.2 Data output
3.2 Processing common serialization formats
3.2.1 XML
Technique 12 MapReduce and XML
3.2.2 JSON
Technique 13 MapReduce and JSON
3.3 Big data serialization formats
3.3.1 Comparing SequenceFiles, Protocol Buffers, Thrift, and Avro
3.3.2 SequenceFiles
Technique 14 Working with SequenceFiles
3.3.3 Protocol Buffers
Technique 15 Integrating Protocol Buffers with MapReduce
3.3.4 Thrift
Technique 16 Working with Thrift
3.3.5 Avro
Technique 17 Next-generation data serialization with MapReduce
3.4 Custom file formats
3.4.1 Input and output formats
Technique 18 Writing input and output formats for CSV
3.4.2 The importance of output committing
3.5 Chapter summary
Part 3 Big data patterns
Chapter 4 Applying MapReduce patterns to big data
4.1 Joining
4.1.1 Repartition join
Technique 19 Optimized repartition joins
4.1.2 Replicated joins
4.1.3 Semi-joins
Technique 20 Implementing a semi-join
4.1.4 Picking the best join strategy for your data
4.2 Sorting
4.2.1 Secondary sort
Technique 21 Implementing a secondary sort
4.2.2 Total order sorting
Technique 22 Sorting keys across multiple reducers
4.3 Sampling
Technique 23 Reservoir sampling
4.4 Chapter summary
Chapter 5 Streamlining HDFS for big data
5.1 Working with small files
Technique 24 Using Avro to store multiple small files
5.2 Efficient storage with compression
Technique 25 Picking the right compression codec for your data
Technique 26 Compression with HDFS, MapReduce, Pig, and Hive
Technique 27 Splittable LZOP with MapReduce, Hive, and Pig
5.3 Chapter summary
Chapter 6 Diagnosing and tuning performance problems
6.1 Measuring MapReduce and your environment
6.1.1 Tools to extract job statistics
6.1.2 Monitoring
6.2 Determining the cause of your performance woes
6.2.1 Understanding what can impact MapReduce job performance
6.2.2 Map woes
Technique 28 Investigating spikes in input data
Technique 29 Identifying map-side data skew problems
Technique 30 Determining if map tasks have an overall low throughput
Technique 31 Small files
Technique 32 Unsplittable files
6.2.3 Reducer woes
Technique 33 Too few or too many reducers
Technique 34 Identifying reduce-side data skew problems
Technique 35 Determining if reduce tasks have an overall low throughput
Technique 36 Slow shuffle and sort
6.2.4 General task woes
Technique 37 Competing jobs and scheduler throttling
Technique 38 Using stack dumps to discover unoptimized user code
6.2.5 Hardware woes
Technique 39 Discovering hardware failures
Technique 40 CPU contention
Technique 41 Memory swapping
Technique 42 Disk health
Technique 43 Networking
6.3 Visualization
Technique 44 Extracting and visualizing task execution times
6.4 Tuning
6.4.1 Profiling MapReduce user code
Technique 45 Profiling your map and reduce tasks
6.4.2 Configuration
6.4.3 Optimizing the shuffle and sort phase
Technique 46 Avoid the reducer
Technique 47 Filter and project
Technique 48 Using the combiner
Technique 49 Blazingly fast sorting with comparators
6.4.4 Skew mitigation
Technique 50 Collecting skewed data
Technique 51 Reduce skew mitigation
6.4.5 Optimizing user space Java in MapReduce
6.4.6 Data serialization
6.5 Chapter summary
Part 4 Data science
Chapter 7 Utilizing data structures and algorithms
7.1 Modeling data and solving problems with graphs
7.1.1 Modeling graphs
7.1.2 Shortest path algorithm
Technique 52 Find the shortest distance between two users
7.1.3 Friends-of-friends
Technique 53 Calculating FoFs
7.1.4 PageRank
Technique 54 Calculate PageRank over a web graph
7.2 Bloom filters
Technique 55 Parallelized Bloom filter creation in MapReduce
Technique 56 MapReduce semi-join with Bloom filters
7.3 Chapter summary
Chapter 8 Integrating R and Hadoop for statistics and more
8.1 Comparing R and MapReduce integrations
8.2 R fundamentals
8.3 R and Streaming
8.3.1 Streaming and map-only R
Technique 57 Calculate the daily mean for stocks
8.3.2 Streaming, R, and full MapReduce
Technique 58 Calculate the cumulative moving average for stocks
8.4 Rhipe—client-side R and Hadoop working together
Technique 59 Calculating the CMA using Rhipe
8.5 RHadoop—a simpler integration of client-side R and Hadoop
Technique 60 Calculating CMA with RHadoop
8.6 Chapter summary
Chapter 9 Predictive analytics with Mahout
9.1 Using recommenders to make product suggestions
9.1.1 Visualizing similarity metrics
9.1.2 The GroupLens dataset
9.1.3 User-based recommenders
9.1.4 Item-based recommenders
Technique 61 Item-based recommenders using movie ratings
9.2 Classification
9.2.1 Writing a homemade naïve Bayesian classifier
9.2.2 A scalable spam detection classification system
Technique 62 Using Mahout to train and test a spam classifier
9.2.3 Additional classification algorithms
9.3 Clustering with K-means
9.3.1 A gentle introduction
9.3.2 Parallel K-means
Technique 63 K-means with a synthetic 2D dataset
9.3.3 K-means and text
9.3.4 Other Mahout clustering algorithms
9.4 Chapter summary
Part 5 Taming the elephant
Chapter 10 Hacking with Hive
10.1 Hive fundamentals
10.1.1 Installation
10.1.2 Metastore
10.1.3 Databases, tables, partitions, and storage
10.1.4 Data model
10.1.5 Query language
10.1.6 Interactive and noninteractive Hive
10.2 Data analytics with Hive
10.2.1 Serialization and deserialization
Technique 64 Loading log files
10.2.2 UDFs, partitions, bucketing, and compression
Technique 65 Writing UDFs and compressed partitioned tables
10.2.3 Joining data together
Technique 66 Tuning Hive joins
10.2.4 Grouping, sorting, and explains
10.3 Chapter summary
Chapter 11 Programming pipelines with Pig
11.1 Pig fundamentals
11.1.1 Installation
11.1.2 Architecture
11.1.3 PigLatin
11.1.4 Data types
11.1.5 Operators and functions
11.1.6 Interactive and noninteractive Pig
11.2 Using Pig to find malicious actors in log data
11.2.1 Loading data
Technique 67 Schema-rich Apache log loading
11.2.2 Filtering and projection
Technique 68 Reducing your data with filters and projection
11.2.3 Grouping and aggregate UDFs
Technique 69 Grouping and counting IP addresses
11.2.4 Geolocation with UDFs
Technique 70 IP Geolocation using the distributed cache
11.2.5 Streaming
Technique 71 Combining Pig with your scripts
11.2.6 Joining
Technique 72 Combining data in Pig
11.2.7 Sorting
Technique 73 Sorting tuples
11.2.8 Storing data
Technique 74 Storing data in SequenceFiles
11.3 Optimizing user workflows with Pig
Technique 75 A four-step process to working rapidly with big data
11.4 Performance
Technique 76 Pig optimizations
11.5 Chapter summary
Chapter 12 Crunch and other technologies
12.1 What is Crunch?
12.1.1 Background and concepts
12.1.2 Fundamentals
12.1.3 Simple examples
12.2 Finding the most popular URLs in your logs
Technique 77 Crunch log parsing and basic analytics
12.3 Joins
Technique 78 Crunch’s repartition join
12.4 Cascading
12.5 Chapter summary
Chapter 13 Testing and debugging
13.1 Testing
13.1.1 Essential ingredients for effective unit testing
13.1.2 MRUnit
Technique 79 Unit testing MapReduce functions, jobs, and pipelines
13.1.3 LocalJobRunner
Technique 80 Heavyweight job testing with the LocalJobRunner
13.1.4 Integration and QA testing
13.2 Debugging user space problems
13.2.1 Accessing task log output
Technique 81 Examining task logs
13.2.2 Debugging unexpected inputs
Technique 82 Pinpointing a problem Input Split
13.2.3 Debugging JVM settings
Technique 83 Figuring out the JVM startup arguments for a task
13.2.4 Coding guidelines for effective debugging
Technique 84 Debugging and error handling
13.3 MapReduce gotchas
Technique 85 MapReduce anti-patterns
13.4 Chapter summary
appendix A Related technologies
A.1 Hadoop 1.0.x and 0.20.x
A.1.1 Getting more information
A.1.2 Apache and CDH tarball installation
A.1.3 Hadoop UI ports
A.2 Flume
A.2.1 Getting more information
A.2.2 Installation on CDH
A.2.3 Installation on non-CDH
A.3 Oozie
A.3.1 Getting more information
A.3.2 Installation on CDH
A.3.3 Installation on non-CDH
A.4 Sqoop
A.4.1 Getting more information
A.4.2 Installation on CDH
A.4.3 Installation on Apache Hadoop
A.5 HBase
A.5.1 Getting more information
A.5.2 Installation on CDH
A.5.3 Installation on non-CDH
A.6 Avro
A.6.1 Getting more information
A.6.2 Installation
A.7 Protocol Buffers
A.7.1 Getting more information
A.7.2 Building Protocol Buffers
A.8 Apache Thrift
A.8.1 Getting more information
A.8.2 Building Thrift 0.5
A.9 Snappy
A.9.1 Getting more information
A.9.2 Install Hadoop native libraries on CDH
A.9.3 Building Snappy for non-CDH
A.10 LZOP
A.10.1 Getting more information
A.10.2 Building LZOP
A.11 Elephant Bird
A.11.1 Getting more information
A.11.2 Installation
A.12 Hoop
A.12.1 Getting more information
A.12.2 Installation
A.13 MySQL
A.13.1 MySQL JDBC drivers
A.13.2 MySQL server installation
A.14 Hive
A.14.1 Getting more information
A.14.2 Installation on CDH
A.14.3 Installation on non-CDH
A.14.4 Configuring MySQL for metastore storage
A.14.5 Hive warehouse directory permissions
A.14.6 Testing your Hive installation
A.15 Pig
A.15.1 Getting more information
A.15.2 Installation on CDH
A.15.3 Installation on non-CDH
A.15.4 Building PiggyBank
A.15.5 Testing your Pig installation
A.16 Crunch
A.16.1 Getting more information
A.16.2 Installation
A.17 R
A.17.1 Getting more information
A.17.2 Installation on RedHat-based systems
A.17.3 Installation on non-RedHat systems
A.18 RHIPE
A.18.1 Getting more information
A.18.2 Dependencies
A.18.3 Installation on CentOS
A.19 RHadoop
A.19.1 Getting more information
A.19.2 Dependencies
A.19.3 rmr/rhdfs installation
A.20 Mahout
A.20.1 Getting more information
A.20.2 Mahout installation
appendix B Hadoop built-in ingress and egress tools
B.1 Command line
B.2 Java API
B.3 Python/Perl/Ruby with Thrift
B.4 Hadoop FUSE
B.5 NameNode embedded HTTP
B.6 HDFS proxy
B.7 Hoop
B.8 WebHDFS
B.9 Distributed copy
B.10 WebDAV
B.11 MapReduce
appendix C HDFS dissected
C.1 What is HDFS?
C.2 How HDFS writes files
C.3 How HDFS reads files
appendix D Optimized MapReduce join frameworks
D.1 An optimized repartition join framework
D.2 A replicated join framework
index