Big Data Analytics with Hadoop 3
Cover
Title Page
Copyright and Credits
Contributors
Table of Contents
Preface
Chapter 1: Introduction to Hadoop
Hadoop Distributed File System
High availability
Intra-DataNode balancer
Erasure coding
Port numbers
MapReduce framework
Task-level native optimization
YARN
Opportunistic containers
Types of container execution 
YARN timeline service v.2
Enhancing scalability and reliability
Usability improvements
Architecture
Other changes
Minimum required Java version 
Shell script rewrite
Shaded-client JARs
Installing Hadoop 3 
Prerequisites
Downloading
Installation
Setting up password-less SSH
Setting up the NameNode
Starting HDFS
Setting up the YARN service
Erasure coding
Intra-DataNode balancer
Installing YARN timeline service v.2
Setting up the HBase cluster
Simple deployment for HBase
Enabling the co-processor
Enabling timeline service v.2
Running timeline service v.2
Enabling MapReduce to write to timeline service v.2
Summary
Chapter 2: Overview of Big Data Analytics
Introduction to data analytics
Inside the data analytics process
Introduction to big data
Variety of data
Velocity of data
Volume of data
Veracity of data
Variability of data
Visualization
Value
Distributed computing using Apache Hadoop
The MapReduce framework
Hive
Downloading and extracting the Hive binaries
Installing Derby
Using Hive
Creating a database
Creating a table
SELECT statement syntax
WHERE clauses
INSERT statement syntax
Primitive types
Complex types
Built-in operators and functions
Built-in operators
Built-in functions
Language capabilities
A cheat sheet on retrieving information 
Apache Spark
Visualization using Tableau
Summary
Chapter 3: Big Data Processing with MapReduce
The MapReduce framework
Dataset
Record reader
Map
Combiner
Partitioner
Shuffle and sort
Reduce
Output format
MapReduce job types
Single mapper job
Single mapper reducer job
Multiple mappers reducer job
SingleMapperCombinerReducer job
Scenario
MapReduce patterns
Aggregation patterns
Average temperature by city
Record count
Min/max/count
Average/median/standard deviation
Filtering patterns
Join patterns
Inner join
Left anti join
Left outer join
Right outer join
Full outer join
Left semi join
Cross join
Summary
Chapter 4: Scientific Computing and Big Data Analysis with Python and Hadoop
Installation
Installing standard Python
Installing Anaconda
Using Conda
Data analysis
Summary
Chapter 5: Statistical Big Data Computing with R and Hadoop
Introduction
Install R on workstations and connect to the data in Hadoop
Install R on a shared server and connect to Hadoop
Utilize Revolution R Open
Execute R inside of MapReduce using RMR2
Summary and outlook for pure open source options
Methods of integrating R and Hadoop
RHADOOP – install R on workstations and connect to data in Hadoop
RHIPE – execute R inside Hadoop MapReduce
R and Hadoop Streaming
RHIVE – install R on workstations and connect to data in Hadoop
ORCH – Oracle connector for Hadoop
Data analytics
Summary
Chapter 6: Batch Analytics with Apache Spark
SparkSQL and DataFrames
DataFrame APIs and the SQL API
Pivots
Filters
User-defined functions
Schema – structure of data
Implicit schema
Explicit schema
Encoders
Loading datasets
Saving datasets
Aggregations
Aggregate functions
count
first
last
approx_count_distinct
min
max
avg
sum
kurtosis
skewness
variance
stddev
covariance
groupBy
Rollup
Cube
Window functions
ntiles
Joins
Inner workings of join
Shuffle join
Broadcast join
Join types
Inner join
Left outer join
Right outer join
Outer join
Left anti join
Left semi join
Cross join
Performance implications of join
Summary
Chapter 7: Real-Time Analytics with Apache Spark
Streaming
At-least-once processing
At-most-once processing
Exactly-once processing
Spark Streaming
StreamingContext
Creating StreamingContext
Starting StreamingContext
Stopping StreamingContext
Input streams
receiverStream
socketTextStream
rawSocketStream
fileStream
textFileStream
binaryRecordsStream
queueStream
textFileStream example
twitterStream example
Discretized Streams
Transformations
Window operations
Stateful/stateless transformations
Stateless transformations
Stateful transformations
Checkpointing
Metadata checkpointing
Data checkpointing
Driver failure recovery
Interoperability with streaming platforms (Apache Kafka)
Receiver-based
Direct Stream
Structured Streaming
Getting deeper into Structured Streaming
Handling event time and late data
Fault-tolerance semantics
Summary
Chapter 8: Batch Analytics with Apache Flink
Introduction to Apache Flink
Continuous processing for unbounded datasets
Flink, the streaming model, and bounded datasets
Installing Flink
Downloading Flink
Installing Flink
Starting a local Flink cluster
Using the Flink cluster UI
Batch analytics
Reading a file
File-based
Collection-based
Generic
Transformations
GroupBy
Aggregation
Joins
Inner join
Left outer join
Right outer join
Full outer join
Writing to a file
Summary
Chapter 9: Stream Processing with Apache Flink
Introduction to streaming execution model
Data processing using the DataStream API
Execution environment
Data sources
Socket-based
File-based
Transformations
map
flatMap
filter
keyBy
reduce
fold
Aggregations
window
Global windows
Tumbling windows
Sliding windows
Session windows
windowAll
union
Window join
split
select
project
Physical partitioning
Custom partitioning
Random partitioning
Rebalancing partitioning
Rescaling
Broadcasting
Event time and watermarks
Connectors
Kafka connector
Twitter connector
RabbitMQ connector
Elasticsearch connector
Cassandra connector
Summary
Chapter 10: Visualizing Big Data
Introduction
Tableau
Chart types
Line charts
Pie chart
Bar chart
Heat map
Using Python to visualize data
Using R to visualize data
Big data visualization tools
Summary
Chapter 11: Introduction to Cloud Computing
Concepts and terminology
Cloud
IT resource
On-premise
Cloud consumers and Cloud providers
Scaling
Types of scaling
Horizontal scaling
Vertical scaling
Cloud service
Cloud service consumer
Goals and benefits
Increased scalability
Increased availability and reliability
Risks and challenges
Increased security vulnerabilities
Reduced operational governance control
Limited portability between Cloud providers
Roles and boundaries
Cloud provider
Cloud consumer
Cloud service owner
Cloud resource administrator
Additional roles
Organizational boundary
Trust boundary
Cloud characteristics
On-demand usage
Ubiquitous access
Multi-tenancy (and resource pooling)
Elasticity
Measured usage
Resiliency
Cloud delivery models
Infrastructure as a Service
Platform as a Service
Software as a Service
Combining Cloud delivery models
IaaS + PaaS
IaaS + PaaS + SaaS
Cloud deployment models
Public Clouds
Community Clouds
Private Clouds
Hybrid Clouds
Summary
Chapter 12: Using Amazon Web Services
Amazon Elastic Compute Cloud
Elastic web-scale computing
Complete control of operations
Flexible Cloud hosting services
Integration
High reliability
Security
Inexpensive
Easy to start
Instances and Amazon Machine Images
Launching multiple instances of an AMI
Instances
AMIs
Regions and availability zones
Region and availability zone concepts
Regions
Availability zones
Available regions
Regions and endpoints
Instance types
Tag basics
Amazon EC2 key pairs
Amazon EC2 security groups for Linux instances
Elastic IP addresses
Amazon EC2 and Amazon Virtual Private Cloud
Amazon Elastic Block Store
Amazon EC2 instance store
What is AWS Lambda?
When should I use AWS Lambda?
Introduction to Amazon S3
Getting started with Amazon S3
Comprehensive security and compliance capabilities
Query in place
Flexible management
Most supported platform with the largest ecosystem
Easy and flexible data transfer
Backup and recovery
Data archiving
Data lakes and big data analytics
Hybrid Cloud storage
Cloud-native application data
Disaster recovery
Amazon DynamoDB
Amazon Kinesis Data Streams
What can I do with Kinesis Data Streams?
Accelerated log and data feed intake and processing
Real-time metrics and reporting
Real-time data analytics
Complex stream processing
Benefits of using Kinesis Data Streams
AWS Glue
When should I use AWS Glue?
Amazon EMR
Practical AWS EMR cluster
Summary
Index
Big Data Analytics with Hadoop 3

Build highly effective analytics solutions to gain valuable insight into your big data

Sridhar Alla

BIRMINGHAM - MUMBAI
Big Data Analytics with Hadoop 3
Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing, its dealers, or its distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: May 2018
Production reference: 1280518

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78862-884-6
www.packtpub.com
Contributors

About the author

Sridhar Alla is a big data expert who helps companies solve complex problems in distributed computing and large-scale data science and analytics. He presents regularly at several prestigious conferences and provides training and consulting to companies. He holds a bachelor's degree in computer science from JNTU, India. He loves writing code in Python, Scala, and Java, and has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.

I thank my loving wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book. I thank my parents, Ravi and Lakshmi Alla, for all their support and encouragement. I am very grateful to my wonderful niece Niharika and nephew Suman Kalyan, who helped me with screenshots, proofreading, and testing the code snippets.
About the reviewers

V. Naresh Kumar has more than a decade of professional experience in designing, implementing, and running very large-scale internet applications at Fortune 500 companies. He is a full-stack architect with hands-on experience in e-commerce, web hosting, healthcare, big data, analytics, data streaming, advertising, and databases. He admires open source and contributes to it actively, and he keeps himself updated with emerging technologies, from Linux system internals to frontend technologies. He studied at BITS Pilani, Rajasthan, with a joint degree in computer science and economics.

Manoj R. Patil is a big data architect at TatvaSoft, an IT services and consulting firm. He has a bachelor's degree in engineering from COEP, Pune. He is a proven and highly skilled business intelligence professional with 18 years' experience in IT, and a seasoned BI and big data consultant with exposure to all the leading platforms. Previously, he worked for numerous organizations, including Tech Mahindra and Persistent Systems. Apart from authoring a book on Pentaho and big data, he has been an avid reviewer of various titles in the respective fields from Packt and other leading publishers.

Manoj would like to thank his entire family, especially his two beautiful angels, Ayushee and Ananyaa, for their understanding during the review process. He would also like to thank Packt, the project coordinator, and the author for this opportunity.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.