Hadoop: The Definitive Guide
Dedication
Foreword
Preface
Administrative Notes
What’s New in the Fourth Edition?
What’s New in the Third Edition?
What’s New in the Second Edition?
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
I. Hadoop Fundamentals
1. Meet Hadoop
Data!
Data Storage and Analysis
Querying All Your Data
Beyond Batch
Comparison with Other Systems
Relational Database Management Systems
Grid Computing
Volunteer Computing
A Brief History of Apache Hadoop
What’s in This Book?
2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools
Analyzing the Data with Hadoop
Map and Reduce
Java MapReduce
A test run
Scaling Out
Data Flow
Combiner Functions
Specifying a combiner function
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby
Python
3. The Hadoop Distributed Filesystem
The Design of HDFS
HDFS Concepts
Blocks
Namenodes and Datanodes
Block Caching
HDFS Federation
HDFS High Availability
Failover and fencing
The Command-Line Interface
Basic Filesystem Operations
Hadoop Filesystems
Interfaces
HTTP
C
NFS
FUSE
The Java Interface
Reading Data from a Hadoop URL
Reading Data Using the FileSystem API
FSDataInputStream
Writing Data
FSDataOutputStream
Directories
Querying the Filesystem
File metadata: FileStatus
Listing files
File patterns
PathFilter
Deleting Data
Data Flow
Anatomy of a File Read
Anatomy of a File Write
Coherency Model
Consequences for application design
Parallel Copying with distcp
Keeping an HDFS Cluster Balanced
4. YARN
Anatomy of a YARN Application Run
Resource Requests
Application Lifespan
Building YARN Applications
YARN Compared to MapReduce 1
Scheduling in YARN
Scheduler Options
Capacity Scheduler Configuration
Queue placement
Fair Scheduler Configuration
Enabling the Fair Scheduler
Queue configuration
Queue placement
Preemption
Delay Scheduling
Dominant Resource Fairness
Further Reading
5. Hadoop I/O
Data Integrity
Data Integrity in HDFS
LocalFileSystem
ChecksumFileSystem
Compression
Codecs
Compressing and decompressing streams with CompressionCodec
Inferring CompressionCodecs using CompressionCodecFactory
Native libraries
CodecPool
Compression and Input Splits
Using Compression in MapReduce
Compressing map output
Serialization
The Writable Interface
WritableComparable and comparators
Writable Classes
Writable wrappers for Java primitives
Text
Indexing
Unicode
Iteration
Mutability
Resorting to String
BytesWritable
NullWritable
ObjectWritable and GenericWritable
Writable collections
Implementing a Custom Writable
Implementing a RawComparator for speed
Custom comparators
Serialization Frameworks
Serialization IDL
File-Based Data Structures
SequenceFile
Writing a SequenceFile
Reading a SequenceFile
Displaying a SequenceFile with the command-line interface
Sorting and merging SequenceFiles
The SequenceFile format
MapFile
MapFile variants
Other File Formats and Column-Oriented Formats
II. MapReduce
6. Developing a MapReduce Application
The Configuration API
Combining Resources
Variable Expansion
Setting Up the Development Environment
Managing Configuration
GenericOptionsParser, Tool, and ToolRunner
Writing a Unit Test with MRUnit
Mapper
Reducer
Running Locally on Test Data
Running a Job in a Local Job Runner
Testing the Driver
Running on a Cluster
Packaging a Job
The client classpath
The task classpath
Packaging dependencies
Task classpath precedence
Launching a Job
The MapReduce Web UI
The resource manager page
The MapReduce job page
Retrieving the Results
Debugging a Job
The tasks and task attempts pages
Handling malformed data
Hadoop Logs
Remote Debugging
Tuning a Job
Profiling Tasks
The HPROF profiler
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
JobControl
Apache Oozie
Defining an Oozie workflow
Packaging and deploying an Oozie workflow application
Running an Oozie workflow job
7. How MapReduce Works
Anatomy of a MapReduce Job Run
Job Submission
Job Initialization
Task Assignment
Task Execution
Streaming
Progress and Status Updates
Job Completion
Failures
Task Failure
Application Master Failure
Node Manager Failure
Resource Manager Failure
Shuffle and Sort
The Map Side
The Reduce Side
Configuration Tuning
Task Execution
The Task Execution Environment
Streaming environment variables
Speculative Execution
Output Committers
Task side-effect files
8. MapReduce Types and Formats
MapReduce Types
The Default MapReduce Job
The default Streaming job
Keys and values in Streaming
Input Formats
Input Splits and Records
FileInputFormat
FileInputFormat input paths
FileInputFormat input splits
Small files and CombineFileInputFormat
Preventing splitting
File information in the mapper
Processing a whole file as a record
Text Input
TextInputFormat
Controlling the maximum line length
KeyValueTextInputFormat
NLineInputFormat
XML
Binary Input
SequenceFileInputFormat
SequenceFileAsTextInputFormat
SequenceFileAsBinaryInputFormat
FixedLengthInputFormat
Multiple Inputs
Database Input (and Output)
Output Formats
Text Output
Binary Output
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
MapFileOutputFormat
Multiple Outputs
An example: Partitioning data
MultipleOutputs
Lazy Output
Database Output
9. MapReduce Features
Counters
Built-in Counters
Task counters
Job counters
User-Defined Java Counters
Dynamic counters
Retrieving counters
User-Defined Streaming Counters
Sorting
Preparation
Partial Sort
Total Sort
Secondary Sort
Java code
Streaming
Joins
Map-Side Joins
Reduce-Side Joins
Side Data Distribution
Using the Job Configuration
Distributed Cache
Usage
How it works
The distributed cache API
MapReduce Library Classes
III. Hadoop Operations
10. Setting Up a Hadoop Cluster
Cluster Specification
Cluster Sizing
Master node scenarios
Network Topology
Rack awareness
Cluster Setup and Installation
Installing Java
Creating Unix User Accounts
Installing Hadoop
Configuring SSH
Configuring Hadoop
Formatting the HDFS Filesystem
Starting and Stopping the Daemons
Creating User Directories
Hadoop Configuration
Configuration Management
Environment Settings
Java
Memory heap size
System logfiles
SSH settings
Important Hadoop Daemon Properties
HDFS
YARN
Memory settings in YARN and MapReduce
CPU settings in YARN and MapReduce
Hadoop Daemon Addresses and Ports
Other Hadoop Properties
Cluster membership
Buffer size
HDFS block size
Reserved storage space
Trash
Job scheduler
Reduce slow start
Short-circuit local reads
Security
Kerberos and Hadoop
An example
Delegation Tokens
Other Security Enhancements
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
Benchmarking MapReduce with TeraSort
Other benchmarks
User Jobs
11. Administering Hadoop
HDFS
Persistent Data Structures
Namenode directory structure
The filesystem image and edit log
Secondary namenode directory structure
Datanode directory structure
Safe Mode
Entering and leaving safe mode
Audit Logging
Tools
dfsadmin
Filesystem check (fsck)
Finding the blocks for a file
Datanode block scanner
Balancer
Monitoring
Logging
Setting log levels
Getting stack traces
Metrics and JMX
Maintenance
Routine Administration Procedures
Metadata backups
Data backups
Filesystem check (fsck)
Filesystem balancer
Commissioning and Decommissioning Nodes
Commissioning new nodes
Decommissioning old nodes
Upgrades
HDFS data and metadata upgrades
Start the upgrade
Wait until the upgrade is complete
Check the upgrade
Roll back the upgrade (optional)
Finalize the upgrade (optional)
IV. Related Projects
12. Avro
Avro Data Types and Schemas
In-Memory Serialization and Deserialization
The Specific API
Avro Datafiles
Interoperability
Python API
Avro Tools
Schema Resolution
Sort Order
Avro MapReduce
Sorting Using Avro MapReduce
Avro in Other Languages
13. Parquet
Data Model
Nested Encoding
Parquet File Format
Parquet Configuration
Writing and Reading Parquet Files
Avro, Protocol Buffers, and Thrift
Projection and read schemas
Parquet MapReduce
14. Flume
Installing Flume
An Example
Transactions and Reliability
Batching
The HDFS Sink
Partitioning and Interceptors
File Formats
Fan Out
Delivery Guarantees
Replicating and Multiplexing Selectors
Distribution: Agent Tiers
Delivery Guarantees
Sink Groups
Integrating Flume with Applications
Component Catalog
Further Reading
15. Sqoop
Getting Sqoop
Sqoop Connectors
A Sample Import
Text and Binary File Formats
Generated Code
Additional Serialization Systems
Imports: A Deeper Look
Controlling the Import
Imports and Consistency
Incremental Imports
Direct-Mode Imports
Working with Imported Data
Imported Data and Hive
Importing Large Objects
Performing an Export
Exports: A Deeper Look
Exports and Transactionality
Exports and SequenceFiles
Further Reading
16. Pig
Installing and Running Pig
Execution Types
Local mode
MapReduce mode
Running Pig Programs
Grunt
Pig Latin Editors
An Example
Generating Examples
Comparison with Databases
Pig Latin
Structure
Statements
Expressions
Types
Schemas
Using Hive tables with HCatalog
Validation and nulls
Schema merging
Functions
Other libraries
Macros
User-Defined Functions
A Filter UDF
Leveraging types
An Eval UDF
Dynamic invokers
A Load UDF
Using a schema
Data Processing Operators
Loading and Storing Data
Filtering Data
FOREACH...GENERATE
STREAM
Grouping and Joining Data
JOIN
COGROUP
CROSS
GROUP
Sorting Data
Combining and Splitting Data
Pig in Practice
Parallelism
Anonymous Relations
Parameter Substitution
Dynamic parameters
Parameter substitution processing
Further Reading
17. Hive
Installing Hive
The Hive Shell
An Example
Running Hive
Configuring Hive
Execution engines
Logging
Hive Services
Hive clients
The Metastore
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
Updates, Transactions, and Indexes
SQL-on-Hadoop Alternatives
HiveQL
Data Types
Primitive types
Complex types
Operators and Functions
Conversions
Tables
Managed Tables and External Tables
Partitions and Buckets
Partitions
Buckets
Storage Formats
The default storage format: Delimited text
Binary storage formats: Sequence files, Avro datafiles, Parquet files, RCFiles, and ORCFiles
Using a custom SerDe: RegexSerDe
Storage handlers
Importing Data
Inserts
Multitable insert
CREATE TABLE...AS SELECT
Altering Tables
Dropping Tables
Querying Data
Sorting and Aggregating
MapReduce Scripts
Joins
Inner joins
Outer joins
Semi joins
Map joins
Subqueries
Views
User-Defined Functions
Writing a UDF
Writing a UDAF
A more complex UDAF
Further Reading
18. Crunch
An Example
The Core Crunch API
Primitive Operations
union()
parallelDo()
groupByKey()
combineValues()
Types
Records and tuples
Sources and Targets
Reading from a source
Writing to a target
Existing outputs
Combined sources and targets
Functions
Serialization of functions
Object reuse
Materialization
PObject
Pipeline Execution
Running a Pipeline
Asynchronous execution
Debugging
Stopping a Pipeline
Inspecting a Crunch Plan
Iterative Algorithms
Checkpointing a Pipeline
Crunch Libraries
Further Reading
19. Spark
Installing Spark
An Example
Spark Applications, Jobs, Stages, and Tasks
A Scala Standalone Application
A Java Example
A Python Example
Resilient Distributed Datasets
Creation
Transformations and Actions
Aggregation transformations
Persistence
Persistence levels
Serialization
Data
Functions
Shared Variables
Broadcast Variables
Accumulators
Anatomy of a Spark Job Run
Job Submission
DAG Construction
Task Scheduling
Task Execution
Executors and Cluster Managers
Spark on YARN
YARN client mode
YARN cluster mode
Further Reading
20. HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
Regions
Locking
Implementation
HBase in operation
Installation
Test Drive
Clients
Java
MapReduce
REST and Thrift
Building an Online Query Application
Schema Design
Loading Data
Load distribution
Bulk load
Online Queries
Station queries
Observation queries
HBase Versus RDBMS
Successful Service
HBase
Praxis
HDFS
UI
Metrics
Counters
Further Reading
21. ZooKeeper
Installing and Running ZooKeeper
An Example
Group Membership in ZooKeeper
Creating the Group
Joining a Group
Listing Members in a Group
ZooKeeper command-line tools
Deleting a Group
The ZooKeeper Service
Data Model
Ephemeral znodes
Sequence numbers
Watches
Operations
Multiupdate
APIs
Watch triggers
ACLs
Implementation
Consistency
Sessions
Time
States
Building Applications with ZooKeeper
A Configuration Service
The Resilient ZooKeeper Application
InterruptedException
KeeperException
State exceptions
Recoverable exceptions
Unrecoverable exceptions
A reliable configuration service
A Lock Service
The herd effect
Recoverable exceptions
Unrecoverable exceptions
Implementation
More Distributed Data Structures and Protocols
BookKeeper and Hedwig
ZooKeeper in Production
Resilience and Performance
Configuration
Further Reading
V. Case Studies
22. Composable Data at Cerner
From CPUs to Semantic Integration
Enter Apache Crunch
Building a Complete Picture
Integrating Healthcare Data
Composability over Frameworks
Moving Forward
23. Biological Data Science: Saving Lives with Software
The Structure of DNA
The Genetic Code: Turning DNA Letters into Proteins
Thinking of DNA as Source Code
The Human Genome Project and Reference Genomes
Sequencing and Aligning DNA
ADAM, A Scalable Genome Analysis Platform
Literate programming with the Avro interface description language (IDL)
Column-oriented access with Parquet
A simple example: k-mer counting using Spark and ADAM
From Personalized Ads to Personalized Medicine
Join In
24. Cascading
Fields, Tuples, and Pipes
Operations
Taps, Schemes, and Flows
Cascading in Practice
Flexibility
Hadoop and Cascading at ShareThis
Summary
A. Installing Apache Hadoop
Prerequisites
Installation
Configuration
Standalone Mode
Pseudodistributed Mode
Configuring SSH
Formatting the HDFS filesystem
Starting and stopping the daemons
Creating a user directory
Fully Distributed Mode
B. Cloudera’s Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
D. The Old and New Java MapReduce APIs
Index
Colophon
Copyright
Hadoop: The Definitive Guide
Tom White
For Eliane, Emilia, and Lottie
Foreword

Doug Cutting
April 2009, Shed in the Yard, California

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines, and moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master — not only of the technology, but also of common sense and plain talk.
Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.[1]

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction — to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Since the publication of the first edition of this book, the Hadoop project has blossomed. “Big data” has become a household term.[2] In this time, the software has made great leaps in adoption, performance, reliability, scalability, and manageability. The number of things being built and run on the Hadoop platform has grown enormously. In fact, it’s difficult for one person to keep track.

To gain even wider adoption, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with even more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.