Hadoop: The Definitive Guide
Dedication
Foreword
Preface
Administrative Notes
What’s New in the Fourth Edition?
What’s New in the Third Edition?
What’s New in the Second Edition?
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
I. Hadoop Fundamentals
1. Meet Hadoop
Data!
Data Storage and Analysis
Querying All Your Data
Beyond Batch
Comparison with Other Systems
Relational Database Management Systems
Grid Computing
Volunteer Computing
A Brief History of Apache Hadoop
What’s in This Book?
2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools
Analyzing the Data with Hadoop
Map and Reduce
Java MapReduce
A test run
Scaling Out
Data Flow
Combiner Functions
Specifying a combiner function
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby
Python
3. The Hadoop Distributed Filesystem
The Design of HDFS
HDFS Concepts
Blocks
Namenodes and Datanodes
Block Caching
HDFS Federation
HDFS High Availability
Failover and fencing
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
HTTP
C
NFS
FUSE
The Java Interface
Reading Data from a Hadoop URL
Reading Data Using the FileSystem API
FSDataInputStream
Writing Data
FSDataOutputStream
Directories
Querying the Filesystem
File metadata: FileStatus
Listing files
File patterns
PathFilter
Deleting Data
Data Flow
Anatomy of a File Read
Anatomy of a File Write
Coherency Model
Consequences for application design
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
4. YARN
Anatomy of a YARN Application Run
Resource Requests
Application Lifespan
Building YARN Applications
YARN Compared to MapReduce 1
Scheduling in YARN
Scheduler Options
Capacity Scheduler Configuration
Queue placement
Fair Scheduler Configuration
Enabling the Fair Scheduler
Queue configuration
Queue placement
Preemption
Delay Scheduling
Dominant Resource Fairness
Further Reading
5. Hadoop I/O
Data Integrity
Data Integrity in HDFS
LocalFileSystem
ChecksumFileSystem
Compression
Codecs
Compressing and decompressing streams with CompressionCodec
Inferring CompressionCodecs using CompressionCodecFactory
Native libraries
CodecPool
Compression and Input Splits
Using Compression in MapReduce
Compressing map output
Serialization
The Writable Interface
WritableComparable and comparators
Writable Classes
Writable wrappers for Java primitives
Text
Indexing
Unicode
Iteration
Mutability
Resorting to String
BytesWritable
NullWritable
ObjectWritable and GenericWritable
Writable collections
Implementing a Custom Writable
Implementing a RawComparator for speed
Custom comparators
Serialization Frameworks
Serialization IDL
File-Based Data Structures
SequenceFile
Writing a SequenceFile
Reading a SequenceFile
Displaying a SequenceFile with the command-line interface
Sorting and merging SequenceFiles
The SequenceFile format
MapFile
MapFile variants
Other File Formats and Column-Oriented Formats
II. MapReduce
6. Developing a MapReduce Application
The Configuration API
Combining Resources
Variable Expansion
Setting Up the Development Environment
Managing Configuration
GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test with MRUnit
Mapper
Reducer
Running Locally on Test Data
Running a Job in a Local Job Runner
Testing the Driver
Running on a Cluster
Packaging a Job
The client classpath
The task classpath
Packaging dependencies
Task classpath precedence
Launching a Job
The MapReduce Web UI
The resource manager page
The MapReduce job page
Retrieving the Results
Debugging a Job
The tasks and task attempts pages
Handling malformed data
Hadoop Logs
Remote Debugging
Tuning a Job
Profiling Tasks
The HPROF profiler
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
JobControl
Apache Oozie
Defining an Oozie workflow
Packaging and deploying an Oozie workflow application
Running an Oozie workflow job
7. How MapReduce Works
Anatomy of a MapReduce Job Run
Job Submission
Job Initialization
Task Assignment
Task Execution
Streaming
Progress and Status Updates
Job Completion
Failures
Task Failure
Application Master Failure
Node Manager Failure
Resource Manager Failure
Shuffle and Sort
The Map Side
The Reduce Side
Configuration Tuning
Task Execution
The Task Execution Environment
Streaming environment variables
Speculative Execution
Output Committers
Task side-effect files
8. MapReduce Types and Formats
MapReduce Types
The Default MapReduce Job
The default Streaming job
Keys and values in Streaming
Input Formats
Input Splits and Records
FileInputFormat
FileInputFormat input paths
FileInputFormat input splits
Small files and CombineFileInputFormat
Preventing splitting
File information in the mapper
Processing a whole file as a record
Text Input
TextInputFormat
Controlling the maximum line length
KeyValueTextInputFormat
NLineInputFormat
XML
Binary Input
SequenceFileInputFormat
SequenceFileAsTextInputFormat
SequenceFileAsBinaryInputFormat
FixedLengthInputFormat
Multiple Inputs
Database Input (and Output)
Output Formats
Text Output
Binary Output
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
MapFileOutputFormat
Multiple Outputs
An example: Partitioning data
MultipleOutputs
Lazy Output
Database Output
9. MapReduce Features
Counters
Built-in Counters
Task counters
Job counters
User-Defined Java Counters
Dynamic counters
Retrieving counters
User-Defined Streaming Counters
Sorting
Preparation
Partial Sort
Total Sort
Secondary Sort
Java code
Streaming
Joins
Map-Side Joins
Reduce-Side Joins
Side Data Distribution
Using the Job Configuration
Distributed Cache
Usage
How it works
The distributed cache API
MapReduce Library Classes
III. Hadoop Operations
10. Setting Up a Hadoop Cluster
Cluster Specification
Cluster Sizing
Master node scenarios
Network Topology
Rack awareness
Cluster Setup and Installation
Installing Java
Creating Unix User Accounts
Installing Hadoop
Configuring SSH
Configuring Hadoop
Formatting the HDFS Filesystem
Starting and Stopping the Daemons
Creating User Directories
Hadoop Configuration
Configuration Management
Environment Settings
Java
Memory heap size
System logfiles
SSH settings
Important Hadoop Daemon Properties
HDFS
YARN
Memory settings in YARN and MapReduce
CPU settings in YARN and MapReduce
Hadoop Daemon Addresses and Ports
Other Hadoop Properties
Cluster membership
Buffer size
HDFS block size
Reserved storage space
Trash
Job scheduler
Reduce slow start
Short-circuit local reads
Security
Kerberos and Hadoop
An example
Delegation Tokens
Other Security Enhancements
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
Benchmarking MapReduce with TeraSort
Other benchmarks
User Jobs
11. Administering Hadoop
HDFS
Persistent Data Structures
Namenode directory structure
The filesystem image and edit log
Secondary namenode directory structure
Datanode directory structure
Safe Mode
Entering and leaving safe mode
Audit Logging
Tools
dfsadmin
Filesystem check (fsck)
Finding the blocks for a file
Datanode block scanner
Balancer
Monitoring
Logging
Setting log levels
Getting stack traces
Metrics and JMX
Maintenance
Routine Administration Procedures
Metadata backups
Data backups
Filesystem check (fsck)
Filesystem balancer
Commissioning and Decommissioning Nodes
Commissioning new nodes
Decommissioning old nodes
Upgrades
HDFS data and metadata upgrades
Start the upgrade
Wait until the upgrade is complete
Check the upgrade
Roll back the upgrade (optional)
Finalize the upgrade (optional)
IV. Related Projects
12. Avro
Avro Data Types and Schemas
In-Memory Serialization and Deserialization
The Specific API
Avro Datafiles
Interoperability
Python API
Avro Tools
Schema Resolution
Sort Order
Avro MapReduce
Sorting Using Avro MapReduce
Avro in Other Languages
13. Parquet
Data Model
Nested Encoding
Parquet File Format
Parquet Configuration
Writing and Reading Parquet Files
Avro, Protocol Buffers, and Thrift
Projection and read schemas
Parquet MapReduce
14. Flume
Installing Flume
An Example
Transactions and Reliability
Batching
The HDFS Sink
Partitioning and Interceptors
File Formats
Fan Out
Delivery Guarantees
Replicating and Multiplexing Selectors
Distribution: Agent Tiers
Delivery Guarantees
Sink Groups
Integrating Flume with Applications
Component Catalog
Further Reading
15. Sqoop
Getting Sqoop
Sqoop Connectors
A Sample Import
Text and Binary File Formats
Generated Code
Additional Serialization Systems
Imports: A Deeper Look
Controlling the Import
Imports and Consistency
Incremental Imports
Direct-Mode Imports
Working with Imported Data
Imported Data and Hive
Importing Large Objects
Performing an Export
Exports: A Deeper Look
Exports and Transactionality
Exports and SequenceFiles
Further Reading
16. Pig
Installing and Running Pig
Execution Types
Local mode
MapReduce mode
Running Pig Programs
Grunt
Pig Latin Editors
An Example
Generating Examples
Comparison with Databases
Pig Latin
Structure
Statements
Expressions
Types
Schemas
Using Hive tables with HCatalog
Validation and nulls
Schema merging
Functions
Other libraries
Macros
User-Defined Functions
A Filter UDF
Leveraging types
An Eval UDF
Dynamic invokers
A Load UDF
Using a schema
Data Processing Operators
Loading and Storing Data
Filtering Data
FOREACH...GENERATE
STREAM
Grouping and Joining Data
JOIN
COGROUP
CROSS
GROUP
Sorting Data
Combining and Splitting Data
Pig in Practice
Parallelism
Anonymous Relations
Parameter Substitution
Dynamic parameters
Parameter substitution processing
Further Reading
17. Hive
Installing Hive
The Hive Shell
An Example
Running Hive
Configuring Hive
Execution engines
Logging
Hive Services
Hive clients
The Metastore
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
Updates, Transactions, and Indexes
SQL-on-Hadoop Alternatives
HiveQL
Data Types
Primitive types
Complex types
Operators and Functions
Conversions
Tables
Managed Tables and External Tables
Partitions and Buckets
Partitions
Buckets
Storage Formats
The default storage format: Delimited text
Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
Using a custom SerDe: RegexSerDe
Storage handlers
Importing Data
Inserts
Multitable insert
CREATE TABLE...AS SELECT
Altering Tables
Dropping Tables
Querying Data
Sorting and Aggregating
MapReduce Scripts
Joins
Inner joins
Outer joins
Semi joins
Map joins
Subqueries
Views
User-Defined Functions
Writing a UDF
Writing a UDAF
A more complex UDAF
Further Reading
18. Crunch
An Example
The Core Crunch API
Primitive Operations
union()
parallelDo()
groupByKey()
combineValues()
Types
Records and tuples
Sources and Targets
Reading from a source
Writing to a target
Existing outputs
Combined sources and targets
Functions
Serialization of functions
Object reuse
Materialization
PObject
Pipeline Execution
Running a Pipeline
Asynchronous execution
Debugging
Stopping a Pipeline
Inspecting a Crunch Plan
Iterative Algorithms
Checkpointing a Pipeline
Crunch Libraries
Further Reading
19. Spark
Installing Spark
An Example
Spark Applications, Jobs, Stages, and Tasks
A Scala Standalone Application
A Java Example
A Python Example
Resilient Distributed Datasets
Creation
Transformations and Actions
Aggregation transformations
Persistence
Persistence levels
Serialization
Data
Functions
Shared Variables
Broadcast Variables
Accumulators
Anatomy of a Spark Job Run
Job Submission
DAG Construction
Task Scheduling
Task Execution
Executors and Cluster Managers
Spark on YARN
YARN client mode
YARN cluster mode
Further Reading
20. HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
Regions
Locking
Implementation
HBase in operation
Installation
Test Drive
Clients
Java
MapReduce
REST and Thrift
Building an Online Query Application
Schema Design
Loading Data
Load distribution
Bulk load
Online Queries
Station queries
Observation queries
HBase Versus RDBMS
Successful Service
HBase
Praxis
HDFS
UI
Metrics
Counters
Further Reading
21. ZooKeeper
Installing and Running ZooKeeper
An Example
Group Membership in ZooKeeper
Creating the Group
Joining a Group
Listing Members in a Group
ZooKeeper command-line tools
Deleting a Group
The ZooKeeper Service
Data Model
Ephemeral znodes
Sequence numbers
Watches
Operations
Multiupdate
APIs
Watch triggers
ACLs
Implementation
Consistency
Sessions
Time
States
Building Applications with ZooKeeper
A Configuration Service
The Resilient ZooKeeper Application
InterruptedException
KeeperException
State exceptions
Recoverable exceptions
Unrecoverable exceptions
A reliable configuration service
A Lock Service
The herd effect
Recoverable exceptions
Unrecoverable exceptions
Implementation
More Distributed Data Structures and Protocols
BookKeeper and Hedwig
ZooKeeper in Production
Resilience and Performance
Configuration
Further Reading
V. Case Studies
22. Composable Data at Cerner
From CPUs to Semantic Integration
Enter Apache Crunch
Building a Complete Picture
Integrating Healthcare Data
Composability over Frameworks
Moving Forward
23. Biological Data Science: Saving Lives with Software
The Structure of DNA
The Genetic Code: Turning DNA Letters into Proteins
Thinking of DNA as Source Code
The Human Genome Project and Reference Genomes
Sequencing and Aligning DNA
ADAM, A Scalable Genome Analysis Platform
Literate programming with the Avro interface description language (IDL)
Column-oriented access with Parquet
A simple example: k-mer counting using Spark and ADAM
From Personalized Ads to Personalized Medicine
Join In
24. Cascading
Fields, Tuples, and Pipes
Operations
Taps, Schemes, and Flows
Cascading in Practice
Flexibility
Hadoop and Cascading at ShareThis
Summary
A. Installing Apache Hadoop
Prerequisites
Installation
Configuration
Standalone Mode
Pseudodistributed Mode
Configuring SSH
Formatting the HDFS filesystem
Starting and stopping the daemons
Creating a user directory
Fully Distributed Mode
B. Cloudera’s Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
D. The Old and New Java MapReduce APIs
Index
Colophon
Copyright
Hadoop: The Definitive Guide
Tom White
For Eliane, Emilia, and Lottie
Foreword

Doug Cutting
April 2009, Shed in the Yard, California

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines, and moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master — not only of the technology, but also of common sense and plain talk.
Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.[1]

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction — to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Since the publication of the first edition of this book, the Hadoop project has blossomed. “Big data” has become a household term.[2] In this time, the software has made great leaps in adoption, performance, reliability, scalability, and manageability. The number of things being built and run on the Hadoop platform has grown enormously. In fact, it’s difficult for one person to keep track.

To gain even wider adoption, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with even more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.