
Apache Hadoop YARN.pdf (complete electronic edition)

Contents
Foreword by Raymie Stata
Foreword by Paul Dix
Preface
Acknowledgments
About the Authors
1 Apache Hadoop YARN: A Brief History and Rationale
Introduction
Apache Hadoop
Phase 0: The Era of Ad Hoc Clusters
Phase 1: Hadoop on Demand
HDFS in the HOD World
Features and Advantages of HOD
Shortcomings of Hadoop on Demand
Phase 2: Dawn of the Shared Compute Clusters
Evolution of Shared Clusters
Issues with Shared MapReduce Clusters
Phase 3: Emergence of YARN
Conclusion
2 Apache Hadoop YARN Install Quick Start
Getting Started
Steps to Configure a Single-Node YARN Cluster
Step 1: Download Apache Hadoop
Step 2: Set JAVA_HOME
Step 3: Create Users and Groups
Step 4: Make Data and Log Directories
Step 5: Configure core-site.xml
Step 6: Configure hdfs-site.xml
Step 7: Configure mapred-site.xml
Step 8: Configure yarn-site.xml
Step 9: Modify Java Heap Sizes
Step 10: Format HDFS
Step 11: Start the HDFS Services
Step 12: Start YARN Services
Step 13: Verify the Running Services Using the Web Interface
Run Sample MapReduce Examples
Wrap-up
3 Apache Hadoop YARN Core Concepts
Beyond MapReduce
The MapReduce Paradigm
Apache Hadoop MapReduce
The Need for Non-MapReduce Workloads
Addressing Scalability
Improved Utilization
User Agility
Apache Hadoop YARN
YARN Components
ResourceManager
ApplicationMaster
Resource Model
ResourceRequests and Containers
Container Specification
Wrap-up
4 Functional Overview of YARN Components
Architecture Overview
ResourceManager
YARN Scheduling Components
FIFO Scheduler
Capacity Scheduler
Fair Scheduler
Containers
NodeManager
ApplicationMaster
YARN Resource Model
Client Resource Request
ApplicationMaster Container Allocation
ApplicationMaster–Container Manager Communication
Managing Application Dependencies
LocalResources Definitions
LocalResource Timestamps
LocalResource Types
LocalResource Visibilities
Lifetime of LocalResources
Wrap-up
5 Installing Apache Hadoop YARN
The Basics
System Preparation
Step 1: Install EPEL and pdsh
Step 2: Generate and Distribute ssh Keys
Script-based Installation of Hadoop 2
JDK Options
Step 1: Download and Extract the Scripts
Step 2: Set the Script Variables
Step 3: Provide Node Names
Step 4: Run the Script
Step 5: Verify the Installation
Script-based Uninstall
Configuration File Processing
Configuration File Settings
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Start-up Scripts
Installing Hadoop with Apache Ambari
Performing an Ambari-based Hadoop Installation
Step 1: Check Requirements
Step 2: Install the Ambari Server
Step 3: Install and Start Ambari Agents
Step 4: Start the Ambari Server
Step 5: Install an HDP2.X Cluster
Wrap-up
6 Apache Hadoop YARN Administration
Script-based Configuration
Monitoring Cluster Health: Nagios
Monitoring Basic Hadoop Services
Monitoring the JVM
Real-time Monitoring: Ganglia
Administration with Ambari
JVM Analysis
Basic YARN Administration
YARN Administrative Tools
Adding and Decommissioning YARN Nodes
Capacity Scheduler Configuration
YARN WebProxy
Using the JobHistoryServer
Refreshing User-to-Groups Mappings
Refreshing Superuser Proxy Groups Mappings
Refreshing ACLs for Administration of ResourceManager
Reloading the Service-level Authorization Policy File
Managing YARN Jobs
Setting Container Memory
Setting Container Cores
Setting MapReduce Properties
User Log Management
Wrap-up
7 Apache Hadoop YARN Architecture Guide
Overview
ResourceManager
Overview of the ResourceManager Components
Client Interaction with the ResourceManager
Application Interaction with the ResourceManager
Interaction of Nodes with the ResourceManager
Core ResourceManager Components
Security-related Components in the ResourceManager
NodeManager
Overview of the NodeManager Components
NodeManager Components
NodeManager Security Components
Important NodeManager Functions
ApplicationMaster
Overview
Liveliness
Resource Requirements
Scheduling
Scheduling Protocol and Locality
Launching Containers
Completed Containers
ApplicationMaster Failures and Recovery
Coordination and Output Commit
Information for Clients
Security
Cleanup on ApplicationMaster Exit
YARN Containers
Container Environment
Communication with the ApplicationMaster
Summary for Application-writers
Wrap-up
8 Capacity Scheduler in YARN
Introduction to the Capacity Scheduler
Elasticity with Multitenancy
Security
Resource Awareness
Granular Scheduling
Locality
Scheduling Policies
Capacity Scheduler Configuration
Queues
Hierarchical Queues
Key Characteristics
Scheduling Among Queues
Defining Hierarchical Queues
Queue Access Control
Capacity Management with Queues
User Limits
Reservations
State of the Queues
Limits on Applications
User Interface
Wrap-up
9 MapReduce with Apache Hadoop YARN
Running Hadoop YARN MapReduce Examples
Listing Available Examples
Running the Pi Example
Using the Web GUI to Monitor Examples
Running the Terasort Test
Run the TestDFSIO Benchmark
MapReduce Compatibility
The MapReduce ApplicationMaster
Enabling Application Master Restarts
Enabling Recovery of Completed Tasks
The JobHistory Server
Calculating the Capacity of a Node
Changes to the Shuffle Service
Running Existing Hadoop Version 1 Applications
Binary Compatibility of org.apache.hadoop.mapred APIs
Source Compatibility of org.apache.hadoop.mapreduce APIs
Compatibility of Command-line Scripts
Compatibility Tradeoff Between MRv1 and Early MRv2 (0.23.x) Applications
Running MapReduce Version 1 Existing Code
Running Apache Pig Scripts on YARN
Running Apache Hive Queries on YARN
Running Apache Oozie Workflows on YARN
Advanced Features
Uber Jobs
Pluggable Shuffle and Sort
Wrap-up
10 Apache Hadoop YARN Application Example
The YARN Client
The ApplicationMaster
Wrap-up
11 Using Apache Hadoop YARN Distributed-Shell
Using the YARN Distributed-Shell
A Simple Example
Using More Containers
Distributed-Shell Examples with Shell Arguments
Internals of the Distributed-Shell
Application Constants
Client
ApplicationMaster
Final Containers
Wrap-up
12 Apache Hadoop YARN Frameworks
Distributed-Shell
Hadoop MapReduce
Apache Tez
Apache Giraph
Hoya: HBase on YARN
Dryad on YARN
Apache Spark
Apache Storm
REEF: Retainable Evaluator Execution Framework
Hamster: Hadoop and MPI on the Same Cluster
Wrap-up
A: Supplemental Content and Code Downloads
Available Downloads
B: YARN Installation Scripts
install-hadoop2.sh
uninstall-hadoop2.sh
hadoop-xml-conf.sh
C: YARN Administration Scripts
configure-hadoop2.sh
D: Nagios Modules
check_resource_manager.sh
check_data_node.sh
check_resource_manager_old_space_pct.sh
E: Resources and Additional Information
F: HDFS Quick Reference
Quick Command Reference
Starting HDFS and the HDFS Web GUI
Get an HDFS Status Report
Perform an FSCK on HDFS
General HDFS Commands
List Files in HDFS
Make a Directory in HDFS
Copy Files to HDFS
Copy Files from HDFS
Copy Files within HDFS
Delete a File within HDFS
Delete a Directory in HDFS
Decommissioning HDFS Nodes
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Apache Hadoop™ YARN
The Addison-Wesley Data and Analytics Series
Visit informit.com/awdataseries for a complete list of available publications.
The Addison-Wesley Data and Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:
1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way
The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.
Make sure to connect with us! informit.com/socialconnect
Apache Hadoop™ YARN
Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2
Arun C. Murthy
Vinod Kumar Vavilapalli
Doug Eadline
Joseph Niemiec
Jeff Markham
Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.
For government sales inquiries, please contact governmentsales@pearsoned.com.
For questions about sales outside the United States, please contact international@pearsoned.com.
Visit us on the Web: informit.com/aw
Library of Congress Cataloging-in-Publication Data
Murthy, Arun C.
Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2 / Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, Jeff Markham.
pages cm
Includes index.
ISBN 978-0-321-93450-5 (pbk. : alk. paper)
1. Apache Hadoop. 2. Electronic data processing--Distributed processing. I. Title.
QA76.9.D5M97 2014
004'.36--dc23
2014003391
Copyright © 2014 Hortonworks Inc.
Apache, Apache Hadoop, Hadoop, and the Hadoop elephant logo are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Hortonworks is a trademark of Hortonworks, Inc., registered in the U.S. and other countries.
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290.
ISBN-13: 978-0-321-93450-5
ISBN-10: 0-321-93450-4
Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, March 2014