Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it...
How it works...
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it...
How it works...
There's more...
See also
Adding a combiner step to the WordCount MapReduce program
How to do it...
How it works...
There's more...
Setting up HDFS
Getting ready
How to do it...
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it...
How it works...
See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it...
There's more...
HDFS command-line file operations
Getting ready
How to do it...
How it works...
There's more...
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Benchmarking HDFS using DFSIO
Getting ready
How to do it...
How it works...
There's more...
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it...
How it works...
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it...
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it...
There's more...
See also
Executing a Pig script using EMR
How to do it...
There's more...
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it...
There's more...
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it...
There's more...
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it...
See also
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it...
There's more...
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it...
How it works...
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it...
How it works...
There's more...
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it...
How it works...
There's more...
Setting classpath precedence to user-provided JARs
How to do it...
How it works...
Speculative execution of straggling tasks
How to do it...
There's more...
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it...
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it...
See also
Adding a new DataNode
Getting ready
How to do it...
There's more...
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it...
How it works...
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it...
Setting the HDFS block size
How to do it...
There's more...
See also
Setting the file replication factor
How to do it...
How it works...
There's more...
See also
Using the HDFS Java API
How to do it...
How it works...
There's more...
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it...
There's more...
See also
Implementing a custom Hadoop Writable data type
How to do it...
How it works...
There's more...
See also
Implementing a custom Hadoop key type
How to do it...
How it works...
See also
Emitting data of different value types from a Mapper
How to do it...
How it works...
There's more...
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it...
How it works...
There's more...
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it...
How it works...
There's more...
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it...
How it works...
There's more...
Writing multiple outputs from a MapReduce computation
How to do it...
How it works...
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it...
How it works...
There's more...
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it...
How it works...
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it...
How it works...
There's more...
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it...
How it works...
There's more...
See also
Adding dependencies between MapReduce jobs
How to do it...
How it works...
There's more...
Hadoop counters to report custom metrics
How to do it...
How it works...
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Performing GROUP BY using MapReduce
Getting ready
How to do it...
How it works...
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it...
How it works...
There's more...
Calculating histograms using MapReduce
Getting ready
How to do it...
How it works...
Calculating scatter plots using MapReduce
Getting ready
How to do it...
How it works...
Parsing a complex dataset with Hadoop
Getting ready
How to do it...
How it works...
There's more...
Joining two datasets using MapReduce
Getting ready
How to do it...
How it works...
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it...
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it...
How it works...
There's more...
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it...
How it works...
There's more...
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it...
Utilizing different storage formats in Hive – storing table data using ORC files
Getting ready
How to do it...
How it works...
Using Hive built-in functions
Getting ready
How to do it...
How it works...
There's more...
See also
Hive batch mode – using a query file
How to do it...
How it works...
There's more...
See also
Performing a join with Hive
Getting ready
How to do it...
How it works...
See also
Creating partitioned Hive tables
Getting ready
How to do it...
Writing Hive User-defined Functions (UDF)
Getting ready
How to do it...
How it works...
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it...
How it works...
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it...
How it works...
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it...
How it works...
There's more...
See also
Joining two datasets using Pig
How to do it...
How it works...
There's more...
Accessing Hive table data in Pig using HCatalog
Getting ready
How to do it...
There's more...
See also
Getting started with Apache HBase
Getting ready
How to do it...
There's more...
See also
Data random access using Java client APIs
Getting ready
How to do it...
How it works...
Running MapReduce jobs on HBase
Getting ready
How to do it...
How it works...
Using Hive to insert data into HBase tables
Getting ready
How to do it...
See also
Getting started with Apache Mahout
How to do it...
How it works...
There's more...
Running K-means with Mahout
Getting ready
How to do it...
How it works...
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it...
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it...
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it...
How it works...
There's more...
Outputting a random accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it...
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it...
How it works...
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it...
How it works...
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it...
How it works...
See also
Elasticsearch for indexing and searching
Getting ready
How to do it...
How it works...
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it...
How it works...
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it...
How it works...
There's more...
Classification using the naïve Bayes classifier
How to do it...
How it works...
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it...
How it works...
There's more...
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it...
How it works...
There's more...
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it...
How it works...
See also
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it...
How it works...
There's more...
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it...
How it works...
See also
Clustering text data using Apache Mahout
Getting ready
How to do it...
How it works...
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it...
How it works...
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it...
How it works...
See also
Index