Big Data Analytics with Hadoop 3
Cover
Title Page
Copyright and Credits
Contributors
Table of Contents
Preface
Chapter 1: Introduction to Hadoop
Hadoop Distributed File System
High availability
Intra-DataNode balancer
Erasure coding
Port numbers
MapReduce framework
Task-level native optimization
YARN
Opportunistic containers
Types of container execution 
YARN timeline service v.2
Enhancing scalability and reliability
Usability improvements
Architecture
Other changes
Minimum required Java version 
Shell script rewrite
Shaded-client JARs
Installing Hadoop 3 
Prerequisites
Downloading
Installation
Setting up password-less SSH
Setting up the NameNode
Starting HDFS
Setting up the YARN service
Erasure coding
Intra-DataNode balancer
Installing YARN timeline service v.2
Setting up the HBase cluster
Simple deployment for HBase
Enabling the co-processor
Enabling timeline service v.2
Running timeline service v.2
Enabling MapReduce to write to timeline service v.2
Summary
Chapter 2: Overview of Big Data Analytics
Introduction to data analytics
Inside the data analytics process
Introduction to big data
Variety of data
Velocity of data
Volume of data
Veracity of data
Variability of data
Visualization
Value
Distributed computing using Apache Hadoop
The MapReduce framework
Hive
Downloading and extracting the Hive binaries
Installing Derby
Using Hive
Creating a database
Creating a table
SELECT statement syntax
WHERE clauses
INSERT statement syntax
Primitive types
Complex types
Built-in operators and functions
Built-in operators
Built-in functions
Language capabilities
A cheat sheet on retrieving information 
Apache Spark
Visualization using Tableau
Summary
Chapter 3: Big Data Processing with MapReduce
The MapReduce framework
Dataset
Record reader
Map
Combiner
Partitioner
Shuffle and sort
Reduce
Output format
MapReduce job types
Single mapper job
Single mapper reducer job
Multiple mappers reducer job
SingleMapperCombinerReducer job
Scenario
MapReduce patterns
Aggregation patterns
Average temperature by city
Record count
Min/max/count
Average/median/standard deviation
Filtering patterns
Join patterns
Inner join
Left anti join
Left outer join
Right outer join
Full outer join
Left semi join
Cross join
Summary
Chapter 4: Scientific Computing and Big Data Analysis with Python and Hadoop
Installation
Installing standard Python
Installing Anaconda
Using Conda
Data analysis
Summary
Chapter 5: Statistical Big Data Computing with R and Hadoop
Introduction
Install R on workstations and connect to the data in Hadoop
Install R on a shared server and connect to Hadoop
Utilize Revolution R Open
Execute R inside of MapReduce using RMR2
Summary and outlook for pure open source options
Methods of integrating R and Hadoop
RHADOOP – install R on workstations and connect to data in Hadoop
RHIPE – execute R inside Hadoop MapReduce
R and Hadoop Streaming
RHIVE – install R on workstations and connect to data in Hadoop
ORCH – Oracle connector for Hadoop
Data analytics
Summary
Chapter 6: Batch Analytics with Apache Spark
SparkSQL and DataFrames
DataFrame APIs and the SQL API
Pivots
Filters
User-defined functions
Schema – structure of data
Implicit schema
Explicit schema
Encoders
Loading datasets
Saving datasets
Aggregations
Aggregate functions
count
first
last
approx_count_distinct
min
max
avg
sum
kurtosis
skewness
variance
stddev
covariance
groupBy
Rollup
Cube
Window functions
ntiles
Joins
Inner workings of join
Shuffle join
Broadcast join
Join types
Inner join
Left outer join
Right outer join
Outer join
Left anti join
Left semi join
Cross join
Performance implications of join
Summary
Chapter 7: Real-Time Analytics with Apache Spark
Streaming
At-least-once processing
At-most-once processing
Exactly-once processing
Spark Streaming
StreamingContext
Creating StreamingContext
Starting StreamingContext
Stopping StreamingContext
Input streams
receiverStream
socketTextStream
rawSocketStream
fileStream
textFileStream
binaryRecordsStream
queueStream
textFileStream example
twitterStream example
Discretized Streams
Transformations
Window operations
Stateful/stateless transformations
Stateless transformations
Stateful transformations
Checkpointing
Metadata checkpointing
Data checkpointing
Driver failure recovery
Interoperability with streaming platforms (Apache Kafka)
Receiver-based
Direct Stream
Structured Streaming
Getting deeper into Structured Streaming
Handling event time and late data
Fault-tolerance semantics
Summary
Chapter 8: Batch Analytics with Apache Flink
Introduction to Apache Flink
Continuous processing for unbounded datasets
Flink, the streaming model, and bounded datasets
Installing Flink
Downloading Flink
Installing Flink
Starting a local Flink cluster
Using the Flink cluster UI
Batch analytics
Reading a file
File-based
Collection-based
Generic
Transformations
GroupBy
Aggregation
Joins
Inner join
Left outer join
Right outer join
Full outer join
Writing to a file
Summary
Chapter 9: Stream Processing with Apache Flink
Introduction to streaming execution model
Data processing using the DataStream API
Execution environment
Data sources
Socket-based
File-based
Transformations
map
flatMap
filter
keyBy
reduce
fold
Aggregations
window
Global windows
Tumbling windows
Sliding windows
Session windows
windowAll
union
Window join
split
select
project
Physical partitioning
Custom partitioning
Random partitioning
Rebalancing partitioning
Rescaling
Broadcasting
Event time and watermarks
Connectors
Kafka connector
Twitter connector
RabbitMQ connector
Elasticsearch connector
Cassandra connector
Summary
Chapter 10: Visualizing Big Data
Introduction
Tableau
Chart types
Line charts
Pie chart
Bar chart
Heat map
Using Python to visualize data
Using R to visualize data
Big data visualization tools
Summary
Chapter 11: Introduction to Cloud Computing
Concepts and terminology
Cloud
IT resource
On-premise
Cloud consumers and Cloud providers
Scaling
Types of scaling
Horizontal scaling
Vertical scaling
Cloud service
Cloud service consumer
Goals and benefits
Increased scalability
Increased availability and reliability
Risks and challenges
Increased security vulnerabilities
Reduced operational governance control
Limited portability between Cloud providers
Roles and boundaries
Cloud provider
Cloud consumer
Cloud service owner
Cloud resource administrator
Additional roles
Organizational boundary
Trust boundary
Cloud characteristics
On-demand usage
Ubiquitous access
Multi-tenancy (and resource pooling)
Elasticity
Measured usage
Resiliency
Cloud delivery models
Infrastructure as a Service
Platform as a Service
Software as a Service
Combining Cloud delivery models
IaaS + PaaS
IaaS + PaaS + SaaS
Cloud deployment models
Public Clouds
Community Clouds
Private Clouds
Hybrid Clouds
Summary
Chapter 12: Using Amazon Web Services
Amazon Elastic Compute Cloud
Elastic web-scale computing
Complete control of operations
Flexible Cloud hosting services
Integration
High reliability
Security
Inexpensive
Easy to start
Instances and Amazon Machine Images
Launching multiple instances of an AMI
Instances
AMIs
Regions and availability zones
Region and availability zone concepts
Regions
Availability zones
Available regions
Regions and endpoints
Instance types
Tag basics
Amazon EC2 key pairs
Amazon EC2 security groups for Linux instances
Elastic IP addresses
Amazon EC2 and Amazon Virtual Private Cloud
Amazon Elastic Block Store
Amazon EC2 instance store
What is AWS Lambda?
When should I use AWS Lambda?
Introduction to Amazon S3
Getting started with Amazon S3
Comprehensive security and compliance capabilities
Query in place
Flexible management
Most supported platform with the largest ecosystem
Easy and flexible data transfer
Backup and recovery
Data archiving
Data lakes and big data analytics
Hybrid Cloud storage
Cloud-native application data
Disaster recovery
Amazon DynamoDB
Amazon Kinesis Data Streams
What can I do with Kinesis Data Streams?
Accelerated log and data feed intake and processing
Real-time metrics and reporting
Real-time data analytics
Complex stream processing
Benefits of using Kinesis Data Streams
AWS Glue
When should I use AWS Glue?
Amazon EMR
Practical AWS EMR cluster
Summary
Index
Big Data Analytics with Hadoop 3

Build highly effective analytics solutions to gain valuable insight into your big data

Sridhar Alla

BIRMINGHAM - MUMBAI
Big Data Analytics with Hadoop 3
Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing, its dealers, or its distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: May 2018
Production reference: 1280518

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78862-884-6
www.packtpub.com
Contributors

About the author

Sridhar Alla is a big data expert who helps companies solve complex problems in distributed computing and large-scale data science and analytics. He presents regularly at several prestigious conferences and provides training and consulting to companies. He holds a bachelor's degree in computer science from JNTU, India. He loves writing code in Python, Scala, and Java, and has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.

I thank my loving wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book. I thank my parents, Ravi and Lakshmi Alla, for all their support and encouragement. I am very grateful to my wonderful niece Niharika and nephew Suman Kalyan, who helped me with screenshots, proofreading, and testing the code snippets.
About the reviewers

V. Naresh Kumar has more than a decade of professional experience in designing, implementing, and running very large-scale internet applications at Fortune 500 companies. He is a full-stack architect with hands-on experience in e-commerce, web hosting, healthcare, big data, analytics, data streaming, advertising, and databases. He admires open source and contributes to it actively, and he keeps himself updated with emerging technologies, from Linux system internals to frontend technologies. He studied at BITS Pilani, Rajasthan, with a joint degree in computer science and economics.

Manoj R. Patil is a big data architect at TatvaSoft, an IT services and consulting firm. He has a bachelor's degree in engineering from COEP, Pune. He is a proven and highly skilled business intelligence professional with 18 years' experience in IT, and a seasoned BI and big data consultant with exposure to all the leading platforms. Previously, he worked for numerous organizations, including Tech Mahindra and Persistent Systems. Apart from authoring a book on Pentaho and big data, he has been an avid reviewer of various titles in the respective fields from Packt and other leading publishers.

Manoj would like to thank his entire family, especially his two beautiful angels, Ayushee and Ananyaa, for their understanding during the review process. He would also like to thank Packt, the project coordinator, and the author for this opportunity.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.