Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it...
How it works...
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it...
How it works...
There's more...
See also
Adding a combiner step to the WordCount MapReduce program
How to do it...
How it works...
There's more...
Setting up HDFS
Getting ready
How to do it...
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it...
How it works...
See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it...
There's more...
HDFS command-line file operations
Getting ready
How to do it...
How it works...
There's more...
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Benchmarking HDFS using DFSIO
Getting ready
How to do it...
How it works...
There's more...
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it...
How it works...
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it...
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it...
There's more...
See also
Executing a Pig script using EMR
How to do it...
There's more...
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it...
There's more...
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it...
There's more...
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it...
See also
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it...
There's more...
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it...
How it works...
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it...
How it works...
There's more...
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it...
How it works...
There's more...
Setting classpath precedence to user-provided JARs
How to do it...
How it works...
Speculative execution of straggling tasks
How to do it...
There's more...
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it...
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it...
See also
Adding a new DataNode
Getting ready
How to do it...
There's more...
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it...
How it works...
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it...
Setting the HDFS block size
How to do it...
There's more...
See also
Setting the file replication factor
How to do it...
How it works...
There's more...
See also
Using the HDFS Java API
How to do it...
How it works...
There's more...
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it...
There's more...
See also
Implementing a custom Hadoop Writable data type
How to do it...
How it works...
There's more...
See also
Implementing a custom Hadoop key type
How to do it...
How it works...
See also
Emitting data of different value types from a Mapper
How to do it...
How it works...
There's more...
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it...
How it works...
There's more...
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it...
How it works...
There's more...
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it...
How it works...
There's more...
Writing multiple outputs from a MapReduce computation
How to do it...
How it works...
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it...
How it works...
There's more...
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it...
How it works...
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it...
How it works...
There's more...
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it...
How it works...
There's more...
See also
Adding dependencies between MapReduce jobs
How to do it...
How it works...
There's more...
Hadoop counters to report custom metrics
How to do it...
How it works...
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Performing GROUP BY using MapReduce
Getting ready
How to do it...
How it works...
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it...
How it works...
There's more...
Calculating histograms using MapReduce
Getting ready
How to do it...
How it works...
Calculating scatter plots using MapReduce
Getting ready
How to do it...
How it works...
Parsing a complex dataset with Hadoop
Getting ready
How to do it...
How it works...
There's more...
Joining two datasets using MapReduce
Getting ready
How to do it...
How it works...
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it...
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it...
How it works...
There's more...
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it...
How it works...
There's more...
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it...
Utilizing different storage formats in Hive – storing table data using ORC files
Getting ready
How to do it...
How it works...
Using Hive built-in functions
Getting ready
How to do it...
How it works...
There's more...
See also
Hive batch mode – using a query file
How to do it...
How it works...
There's more...
See also
Performing a join with Hive
Getting ready
How to do it...
How it works...
See also
Creating partitioned Hive tables
Getting ready
How to do it...
Writing Hive User-defined Functions (UDF)
Getting ready
How to do it...
How it works...
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it...
How it works...
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it...
How it works...
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it...
How it works...
There's more...
See also
Joining two datasets using Pig
How to do it...
How it works...
There's more...
Accessing Hive table data in Pig using HCatalog
Getting ready
How to do it...
There's more...
See also
Getting started with Apache HBase
Getting ready
How to do it...
There's more...
See also
Data random access using Java client APIs
Getting ready
How to do it...
How it works...
Running MapReduce jobs on HBase
Getting ready
How to do it...
How it works...
Using Hive to insert data into HBase tables
Getting ready
How to do it...
See also
Getting started with Apache Mahout
How to do it...
How it works...
There's more...
Running K-means with Mahout
Getting ready
How to do it...
How it works...
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it...
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it...
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it...
How it works...
There's more...
Outputting a random accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it...
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it...
How it works...
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it...
How it works...
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it...
How it works...
See also
Elasticsearch for indexing and searching
Getting ready
How to do it...
How it works...
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it...
How it works...
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it...
How it works...
There's more...
Classification using the naïve Bayes classifier
How to do it...
How it works...
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it...
How it works...
There's more...
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it...
How it works...
There's more...
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it...
How it works...
See also
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it...
How it works...
There's more...
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it...
How it works...
See also
Clustering text data using Apache Mahout
Getting ready
How to do it...
How it works...
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it...
How it works...
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it...
How it works...
See also
Index