Data-Algorithms-Recipes-for-Scaling-Up-with-Hadoop-and-Spark.pdf

发布时间：2022-06-20 发布人：admin 分类：说明书资料大小：9.74M 资料格式：pdf 举报版权申诉

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第1页.png

第1页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第2页.png

第2页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第3页.png

第3页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第4页.png

第4页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第5页.png

第5页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第6页.png

第6页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第7页.png

第7页 / 共778页

7c8cce42-d0eb-4a4b-8b65-c29e1e1a22d7.pdf-第8页.png

第8页 / 共778页

Cover

Table of Contents

Foreword

Preface

What Is MapReduce?

Simple Explanation of MapReduce

When to Use MapReduce

What MapReduce Isn’t

Why Use MapReduce?

Hadoop and Spark

What Is in This Book?

What Is the Focus of This Book?

Who Is This Book For?

Online Resources

What Software Is Used in This Book?

Conventions Used in This Book

Using Code Examples

Safari® Books Online

How to Contact Us

Acknowledgments

Comments and Questions for This Book

Chapter 1. Secondary Sort: Introduction

Solutions to the Secondary Sort Problem

Implementation Details

Data Flow Using Plug-in Classes

MapReduce/Hadoop Solution to Secondary Sort

Input

Expected Output

map() Function

reduce() Function

Hadoop Implementation Classes

Sample Run of Hadoop Implementation

How to Sort in Ascending or Descending Order

Spark Solution to Secondary Sort

Time Series as Input

Expected Output

Option 1: Secondary Sorting in Memory

Spark Sample Run

Option #2: Secondary Sorting Using the Spark Framework

Further Reading on Secondary Sorting

Chapter 2. Secondary Sort: A Detailed Example

Secondary Sorting Technique

Complete Example of Secondary Sorting

Input Format

Output Format

Composite Key

Sample Run—Old Hadoop API

Input

Running the MapReduce Job

Output

Sample Run—New Hadoop API

Input

Running the MapReduce Job

Output

Chapter 3. Top 10 List

Top N, Formalized

MapReduce/Hadoop Implementation: Unique Keys

Implementation Classes in MapReduce/Hadoop

Top 10 Sample Run

Finding the Top 5

Finding the Bottom 10

Spark Implementation: Unique Keys

RDD Refresher

Spark’s Function Classes

Review of the Top N Pattern for Spark

Complete Spark Top 10 Solution

Sample Run: Finding the Top 10

Parameterizing Top N

Finding the Bottom N

Spark Implementation: Nonunique Keys

Complete Spark Top 10 Solution

Sample Run

Spark Top 10 Solution Using takeOrdered()

Complete Spark Implementation

Finding the Bottom N

Alternative to Using takeOrdered()

MapReduce/Hadoop Top 10 Solution: Nonunique Keys

Sample Run

Chapter 4. Left Outer Join

Left Outer Join Example

Example Queries

Implementation of Left Outer Join in MapReduce

MapReduce Phase 1: Finding Product Locations

MapReduce Phase 2: Counting Unique Locations

Implementation Classes in Hadoop

Sample Run

Spark Implementation of Left Outer Join

Spark Program

Running the Spark Solution

Running Spark on YARN

Spark Implementation with leftOuterJoin()

Spark Program

Sample Run on YARN

Chapter 5. Order Inversion

Example of the Order Inversion Pattern

MapReduce/Hadoop Implementation of the Order Inversion Pattern

Custom Partitioner

Relative Frequency Mapper

Relative Frequency Reducer

Implementation Classes in Hadoop

Sample Run

Input

Running the MapReduce Job

Generated Output

Chapter 6. Moving Average

Example 1: Time Series Data (Stock Prices)

Example 2: Time Series Data (URL Visits)

Formal Definition

POJO Moving Average Solutions

Solution 1: Using a Queue

Solution 2: Using an Array

Testing the Moving Average

Sample Run

MapReduce/Hadoop Moving Average Solution

Input

Output

Option #1: Sorting in Memory

Sample Run

Option #2: Sorting Using the MapReduce Framework

Sample Run

Chapter 7. Market Basket Analysis

MBA Goals

Application Areas for MBA

Market Basket Analysis Using MapReduce

Input

Expected Output for Tuple2 (Order of 2)

Expected Output for Tuple3 (Order of 3)

Informal Mapper

Formal Mapper

Reducer

MapReduce/Hadoop Implementation Classes

Sample Run

Spark Solution

MapReduce Algorithm Workflow

Input

Spark Implementation

YARN Script for Spark

Creating Item Sets from Transactions

Chapter 8. Common Friends

Input

POJO Common Friends Solution

MapReduce Algorithm

The MapReduce Algorithm in Action

Solution 1: Hadoop Implementation Using Text

Sample Run for Solution 1

Solution 2: Hadoop Implementation Using ArrayListOfLongsWritable

Sample Run for Solution 2

Spark Solution

Spark Program

Sample Run of Spark Program

Chapter 9. Recommendation Engines Using MapReduce

Customers Who Bought This Item Also Bought

Input

Expected Output

MapReduce Solution

Frequently Bought Together

Input and Expected Output

MapReduce Solution

Recommend Connection

Input

Output

MapReduce Solution

Spark Implementation

Sample Run of Spark Program

Chapter 10. Content-Based Recommendation: Movies

Input

MapReduce Phase 1

MapReduce Phases 2 and 3

MapReduce Phase 2: Mapper

MapReduce Phase 2: Reducer

MapReduce Phase 3: Mapper

MapReduce Phase 3: Reducer

Similarity Measures

Movie Recommendation Implementation in Spark

High-Level Solution in Spark

Sample Run of Spark Program

Chapter 11. Smarter Email Marketing with the Markov Model

Markov Chains in a Nutshell

Markov Model Using MapReduce

Generating Time-Ordered Transactions with MapReduce

Hadoop Solution 1: Time-Ordered Transactions

Hadoop Solution 2: Time-Ordered Transactions

Generating State Sequences

Generating a Markov State Transition Matrix with MapReduce

Using the Markov Model to Predict the Next Smart Email Marketing Date

Spark Solution

Input Format

High-Level Steps

Spark Program

Script to Run the Spark Program

Sample Run

Chapter 12. K-Means Clustering

What Is K-Means Clustering?

Application Areas for Clustering

Informal K-Means Clustering Method: Partitioning Approach

K-Means Distance Function

K-Means Clustering Formalized

MapReduce Solution for K-Means Clustering

MapReduce Solution: map()

MapReduce Solution: combine()

MapReduce Solution: reduce()

K-Means Implementation by Spark

Sample Run of Spark K-Means Implementation

Chapter 13. k-Nearest Neighbors

kNN Classification

Distance Functions

kNN Example

An Informal kNN Algorithm

Formal kNN Algorithm

Java-like Non-MapReduce Solution for kNN

kNN Implementation in Spark

Formalizing kNN for the Spark Implementation

Input Data Set Formats

Spark Implementation

YARN shell script

Chapter 14. Naive Bayes

Training and Learning Examples

Numeric Training Data

Symbolic Training Data

Conditional Probability

The Naive Bayes Classifier in Depth

Naive Bayes Classifier Example

The Naive Bayes Classifier: MapReduce Solution for Symbolic Data

Stage 1: Building a Classifier Using Symbolic Training Data

Stage 2: Using the Classifier to Classify New Symbolic Data

The Naive Bayes Classifier: MapReduce Solution for Numeric Data

Naive Bayes Classifier Implementation in Spark

Stage 1: Building a Classifier Using Training Data

Stage 2: Using the Classifier to Classify New Data

Using Spark and Mahout

Apache Spark

Apache Mahout

Chapter 15. Sentiment Analysis

Sentiment Examples

Sentiment Scores: Positive or Negative

A Simple MapReduce Sentiment Analysis Example

map() Function for Sentiment Analysis

reduce() Function for Sentiment Analysis

Sentiment Analysis in the Real World

Chapter 16. Finding, Counting, and Listing All Triangles in Large Graphs

Basic Graph Concepts

Importance of Counting Triangles

MapReduce/Hadoop Solution

Step 1: MapReduce in Action

Step 2: Identify Triangles

Step 3: Remove Duplicate Triangles

Hadoop Implementation Classes

Sample Run

Spark Solution

High-Level Steps

Sample Run

Chapter 17. K-mer Counting

Input Data for K-mer Counting

Sample Data for K-mer Counting

Applications of K-mer Counting

K-mer Counting Solution in MapReduce/Hadoop

The map() Function

The reduce() Function

Hadoop Implementation Classes

K-mer Counting Solution in Spark

Spark Solution

Sample Run

Chapter 18. DNA Sequencing

Input Data for DNA Sequencing

Input Data Validation

DNA Sequence Alignment

MapReduce Algorithms for DNA Sequencing

Step 1: Alignment

Step 2: Recalibration

Step 3: Variant Detection

Chapter 19. Cox Regression

The Cox Model in a Nutshell

Cox Regression Basic Terminology

Cox Regression Using R

Expression Data

Cox Regression Application

Cox Regression POJO Solution

Input for MapReduce

Input Format

Cox Regression Using MapReduce

Cox Regression Phase 1: map()

Cox Regression Phase 1: reduce()

Cox Regression Phase 2: map()

Sample Output Generated by Phase 1 reduce() Function

Sample Output Generated by the Phase 2 map() Function

Cox Regression Script for MapReduce

Chapter 20. Cochran-Armitage Test for Trend

Cochran-Armitage Algorithm

Application of Cochran-Armitage

MapReduce Solution

Input

Expected Output

Mapper

Reducer

MapReduce/Hadoop Implementation Classes

Sample Run

Chapter 21. Allelic Frequency

Basic Definitions

Chromosome

Bioset

Allele and Allelic Frequency

Source of Data for Allelic Frequency

Allelic Frequency Analysis Using Fisher’s Exact Test

Fisher’s Exact Test

Formal Problem Statement

MapReduce Solution for Allelic Frequency

MapReduce Solution, Phase 1

Input

Output/Result

Phase 1 Mapper

Phase 1 Reducer

Sample Run of Phase 1 MapReduce/Hadoop Implementation

Sample Plot of P-Values

MapReduce Solution, Phase 2

Phase 2 Mapper for Bottom 100 P-Values

Phase 2 Reducer for Bottom 100 P-Values

Is Our Bottom 100 List a Monoid?

Hadoop Implementation Classes for Bottom 100 List

MapReduce Solution, Phase 3

Phase 3 Mapper for Bottom 100 P-Values

Phase 3 Reducer for Bottom 100 P-Values

Hadoop Implementation Classes for Bottom 100 List for Each Chromosome

Special Handling of Chromosomes X and Y

Chapter 22. The T-Test

Performing the T-Test on Biosets

MapReduce Problem Statement

Input

Expected Output

MapReduce Solution

Hadoop Implementation Classes

Spark Implementation

High-Level Steps

T-Test Algorithm

Sample Run

Chapter 23. Pearson Correlation

Pearson Correlation Formula

Pearson Correlation Example

Data Set for Pearson Correlation

POJO Solution for Pearson Correlation

MapReduce Solution for Pearson Correlation

map() Function for Pearson Correlation

reduce() Function for Pearson Correlation

Spark Solution for Pearson Correlation

Input

Output

Spark Solution

High-Level Steps

Pearson Correlation Wrapper Class

Testing the Pearson Class

Pearson Correlation Using R

YARN Script to Run Spark Program

Spearman Correlation Using Spark

Spearman Correlation Wrapper Class

Testing the Spearman Correlation Wrapper Class

Chapter 24. DNA Base Count

FASTA Format

FASTA Format Example

FASTQ Format

FASTQ Format Example

MapReduce Solution: FASTA Format

Reading FASTA Files

MapReduce FASTA Solution: map()

MapReduce FASTA Solution: reduce()

Sample Run

Log of sample run

Generated output

Custom Sorting

Custom Partitioning

MapReduce Solution: FASTQ Format

Reading FASTQ Files

MapReduce FASTQ Solution: map()

MapReduce FASTQ Solution: reduce()

Hadoop Implementation Classes: FASTQ Format

Sample Run

Spark Solution: FASTA Format

High-Level Steps

Sample Run

Spark Solution: FASTQ Format

High-Level Steps

Step 1: Import required classes and interfaces

Step 2: Handle input parameters

Step 3: Create a JavaPairRDD from FASTQ input

Step 4: Map partitions

Step 5: Collect all DNA base counts

Step 6: Emit Final Counts

Sample Run

Chapter 25. RNA Sequencing

Data Size and Format

MapReduce Workflow

Input Data Validation

RNA Sequencing Analysis Overview

MapReduce Algorithms for RNA Sequencing

Step 1: MapReduce TopHat Mapping

Step 2: MapReduce Calling Cuffdiff

Chapter 26. Gene Aggregation

Input

Output

MapReduce Solutions (Filter by Individual and by Average)

Mapper: Filter by Individual

Reducer: Filter by Individual

Mapper: Filter by Average

Reducer: Filter by Average

Computing Gene Aggregation

Hadoop Implementation Classes

Analysis of Output

Gene Aggregation in Spark

Spark Solution: Filter by Individual

Sharing Data Between Cluster Nodes

High-Level Steps

Utility Functions

Sample Run

Spark Solution: Filter by Average

High-Level Steps

Utility Functions

Sample Run

Chapter 27. Linear Regression

Basic Definitions

Simple Example

Problem Statement

Input Data

Expected Output

MapReduce Solution Using SimpleRegression

Hadoop Implementation Classes

MapReduce Solution Using R’s Linear Model

Phase 1

Phase 2

Hadoop Implementation Using Classes

Chapter 28. MapReduce and Monoids

Introduction

Definition of Monoid

How to Form a Monoid

Monoidic and Non-Monoidic Examples

Maximum over a Set of Integers

Subtraction over a Set of Integers

Addition over a Set of Integers

Multiplication over a Set of Integers

Mean over a Set of Integers

Non-Commutative Example

Median over a Set of Integers

Concatenation over Lists

Union/Intersection over Integers

Functional Example

Matrix Example

MapReduce Example: Not a Monoid

MapReduce Example: Monoid

Hadoop Implementation Classes

Sample Run

View Hadoop output

Spark Example Using Monoids

High-Level Steps

Sample Run

Conclusion on Using Monoids

Functors and Monoids

Chapter 29. The Small Files Problem

Solution 1: Merging Small Files Client-Side

Input Data

Solution with SmallFilesConsolidator

Solution Without SmallFilesConsolidator

Solution 2: Solving the Small Files Problem with CombineFileInputFormat

Custom CombineFileInputFormat

Sample Run Using CustomCFIF

Alternative Solutions

Chapter 30. Huge Cache for MapReduce

Implementation Options

Formalizing the Cache Problem

An Elegant, Scalable Solution

Implementing the LRUMap Cache

Extending the LRUMap Class

Testing the Custom Class

The MapDBEntry Class

Using MapDB

Testing MapDB: put()

Testing MapDB: get()

MapReduce Using the LRUMap Cache

CacheManager Definition

Initializing the Cache

Using the Cache

Closing the Cache

Chapter 31. The Bloom Filter

Bloom Filter Properties

A Simple Bloom Filter Example

Bloom Filters in Guava Library

Using Bloom Filters in MapReduce

Appendix A. Bioset

Appendix B. Spark RDDs

Spark Operations

Tuple

RDDs

How to Create RDDs

Creating RDDs Using Collection Objects

Collecting Elements of an RDD

Transforming an Existing RDD into a New RDD

Creating RDDs by Reading Files

Grouping by Key

Mapping Values

Reducing by Key

Combining by Key

Filtering an RDD

Saving an RDD as an HDFS Text File

Saving an RDD as an HDFS Sequence File

Reading an RDD from an HDFS Sequence File

Counting RDD Items

Spark RDD Examples in Scala

PySpark Examples

How to Package and Run Spark Jobs

Creating the JAR for Data Algorithms

Running a Job in a Spark Cluster

Running a Job in Hadoop’s YARN Environment

Bibliography

Index

About the Author

Data Algorithms RECIPES FOR SCALING UP WITH HADOOP AND SPARK Mahmoud Parsian www.it-ebooks.info

Data Algorithms If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark. Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You’ll learn how to implement the appropriate MapReduce solution with code that you can use in your projects. Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark. Topics include: ■ Market basket analysis for a large set of transactions ■ Data mining algorithms (K-means, KNN, and Naive Bayes) ■ Using huge genomic data to sequence DNA and RNA ■ Naive Bayes theorem and Markov chains for data and market prediction ■ Recommendation algorithms and pairwise document similarity ■ Linear regression, Cox regression, and Pearson correlation ■ Allelic frequency and mining DNA ■ Social network analysis (recommendation systems, counting triangles, sentiment analysis) Mahmoud Parsian, PhD in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. Currently the leader of Illumina’s Big Data team, he’s spent the past 15 years working with Java (server-side), databases, MapReduce, and distributed computing. Mahmoud is the author of JDBC Recipes and JDBC Metadata, MySQL, and Oracle Recipes (both Apress). DATA/MATH US $69.99 CAN $80.99 ISBN: 978-1-491-90618-7 www.it-ebooks.info D a t a l A g o r i t h m s Twitter: @oreillymedia facebook.com/oreilly P a r s i a n Data Algorithms RECIPES FOR SCALING UP WITH HADOOP AND SPARK Mahmoud Parsian

Data Algorithms Mahmoud Parsian www.it-ebooks.info Boston

Data Algorithms by Mahmoud Parsian Copyright © 2015 Mahmoud Parsian. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/ institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Ann Spencer and Marie Beaugureau Production Editor: Matthew Hacker Copyeditor: Rachel Monaghan Proofreader: Rachel Head Indexer: Judith McConville Interior Designer: David Futato Cover Designer: Ellie Volckhausen Illustrator: Rebecca Demarest July 2015: First Edition Revision History for the First Edition 2015-07-10: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491906187 for release details. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-90618-7 [LSI] www.it-ebooks.info

This book is dedicated to my dear family: wife, Behnaz, daughter, Maral, son, Yaseen www.it-ebooks.info

www.it-ebooks.info

Table of Contents Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi 1. Secondary Sort: Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Solutions to the Secondary Sort Problem 3 Implementation Details 3 Data Flow Using Plug-in Classes 6 MapReduce/Hadoop Solution to Secondary Sort 7 Input 7 Expected Output 7 map() Function 8 reduce() Function 8 Hadoop Implementation Classes 9 Sample Run of Hadoop Implementation 10 How to Sort in Ascending or Descending Order 12 Spark Solution to Secondary Sort 12 Time Series as Input 12 Expected Output 13 Option 1: Secondary Sorting in Memory 13 Spark Sample Run 20 Option #2: Secondary Sorting Using the Spark Framework 24 Further Reading on Secondary Sorting 25 2. Secondary Sort: A Detailed Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Secondary Sorting Technique 28 Complete Example of Secondary Sorting 32 Input Format 32 www.it-ebooks.info v

Output Format 33 Composite Key 33 Sample Run—Old Hadoop API 36 Input 36 Running the MapReduce Job 37 Output 37 Sample Run—New Hadoop API 37 Input 38 Running the MapReduce Job 38 Output 39 3. Top 10 List. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Top N, Formalized 42 MapReduce/Hadoop Implementation: Unique Keys 43 Implementation Classes in MapReduce/Hadoop 47 Top 10 Sample Run 47 Finding the Top 5 49 Finding the Bottom 10 49 Spark Implementation: Unique Keys 50 RDD Refresher 50 Spark’s Function Classes 51 Review of the Top N Pattern for Spark 52 Complete Spark Top 10 Solution 53 Sample Run: Finding the Top 10 58 Parameterizing Top N 59 Finding the Bottom N 61 Spark Implementation: Nonunique Keys 62 Complete Spark Top 10 Solution 64 Sample Run 72 Spark Top 10 Solution Using takeOrdered() 73 Complete Spark Implementation 74 Finding the Bottom N 79 Alternative to Using takeOrdered() 80 MapReduce/Hadoop Top 10 Solution: Nonunique Keys 81 Sample Run 82 4. Left Outer Join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Left Outer Join Example 85 Example Queries 87 Implementation of Left Outer Join in MapReduce 88 MapReduce Phase 1: Finding Product Locations 88 MapReduce Phase 2: Counting Unique Locations 92 vi | Table of Contents www.it-ebooks.info

分享到：

赞收藏

资料库

Data-Algorithms-Recipes-for-Scaling-Up-with-Hadoop-and-Spark.pdf

相关推荐

大数据

热门标签

最新资料