logo资料库

Data-Algorithms-Recipes-for-Scaling-Up-with-Hadoop-and-Spark.pdf

第1页 / 共778页
第2页 / 共778页
第3页 / 共778页
第4页 / 共778页
第5页 / 共778页
第6页 / 共778页
第7页 / 共778页
第8页 / 共778页
资料共778页,剩余部分请下载后查看
Cover
Copyright
Table of Contents
Foreword
Preface
What Is MapReduce?
Simple Explanation of MapReduce
When to Use MapReduce
What MapReduce Isn’t
Why Use MapReduce?
Hadoop and Spark
What Is in This Book?
What Is the Focus of This Book?
Who Is This Book For?
Online Resources
What Software Is Used in This Book?
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Comments and Questions for This Book
Chapter 1. Secondary Sort: Introduction
Solutions to the Secondary Sort Problem
Implementation Details
Data Flow Using Plug-in Classes
MapReduce/Hadoop Solution to Secondary Sort
Input
Expected Output
map() Function
reduce() Function
Hadoop Implementation Classes
Sample Run of Hadoop Implementation
How to Sort in Ascending or Descending Order
Spark Solution to Secondary Sort
Time Series as Input
Expected Output
Option 1: Secondary Sorting in Memory
Spark Sample Run
Option #2: Secondary Sorting Using the Spark Framework
Further Reading on Secondary Sorting
Chapter 2. Secondary Sort: A Detailed Example
Secondary Sorting Technique
Complete Example of Secondary Sorting
Input Format
Output Format
Composite Key
Sample Run—Old Hadoop API
Input
Running the MapReduce Job
Output
Sample Run—New Hadoop API
Input
Running the MapReduce Job
Output
Chapter 3. Top 10 List
Top N, Formalized
MapReduce/Hadoop Implementation: Unique Keys
Implementation Classes in MapReduce/Hadoop
Top 10 Sample Run
Finding the Top 5
Finding the Bottom 10
Spark Implementation: Unique Keys
RDD Refresher
Spark’s Function Classes
Review of the Top N Pattern for Spark
Complete Spark Top 10 Solution
Sample Run: Finding the Top 10
Parameterizing Top N
Finding the Bottom N
Spark Implementation: Nonunique Keys
Complete Spark Top 10 Solution
Sample Run
Spark Top 10 Solution Using takeOrdered()
Complete Spark Implementation
Finding the Bottom N
Alternative to Using takeOrdered()
MapReduce/Hadoop Top 10 Solution: Nonunique Keys
Sample Run
Chapter 4. Left Outer Join
Left Outer Join Example
Example Queries
Implementation of Left Outer Join in MapReduce
MapReduce Phase 1: Finding Product Locations
MapReduce Phase 2: Counting Unique Locations
Implementation Classes in Hadoop
Sample Run
Spark Implementation of Left Outer Join
Spark Program
Running the Spark Solution
Running Spark on YARN
Spark Implementation with leftOuterJoin()
Spark Program
Sample Run on YARN
Chapter 5. Order Inversion
Example of the Order Inversion Pattern
MapReduce/Hadoop Implementation of the Order Inversion Pattern
Custom Partitioner
Relative Frequency Mapper
Relative Frequency Reducer
Implementation Classes in Hadoop
Sample Run
Input
Running the MapReduce Job
Generated Output
Chapter 6. Moving Average
Example 1: Time Series Data (Stock Prices)
Example 2: Time Series Data (URL Visits)
Formal Definition
POJO Moving Average Solutions
Solution 1: Using a Queue
Solution 2: Using an Array
Testing the Moving Average
Sample Run
MapReduce/Hadoop Moving Average Solution
Input
Output
Option #1: Sorting in Memory
Sample Run
Option #2: Sorting Using the MapReduce Framework
Sample Run
Chapter 7. Market Basket Analysis
MBA Goals
Application Areas for MBA
Market Basket Analysis Using MapReduce
Input
Expected Output for Tuple2 (Order of 2)
Expected Output for Tuple3 (Order of 3)
Informal Mapper
Formal Mapper
Reducer
MapReduce/Hadoop Implementation Classes
Sample Run
Spark Solution
MapReduce Algorithm Workflow
Input
Spark Implementation
YARN Script for Spark
Creating Item Sets from Transactions
Chapter 8. Common Friends
Input
POJO Common Friends Solution
MapReduce Algorithm
The MapReduce Algorithm in Action
Solution 1: Hadoop Implementation Using Text
Sample Run for Solution 1
Solution 2: Hadoop Implementation Using ArrayListOfLongsWritable
Sample Run for Solution 2
Spark Solution
Spark Program
Sample Run of Spark Program
Chapter 9. Recommendation Engines Using MapReduce
Customers Who Bought This Item Also Bought
Input
Expected Output
MapReduce Solution
Frequently Bought Together
Input and Expected Output
MapReduce Solution
Recommend Connection
Input
Output
MapReduce Solution
Spark Implementation
Sample Run of Spark Program
Chapter 10. Content-Based Recommendation: Movies
Input
MapReduce Phase 1
MapReduce Phases 2 and 3
MapReduce Phase 2: Mapper
MapReduce Phase 2: Reducer
MapReduce Phase 3: Mapper
MapReduce Phase 3: Reducer
Similarity Measures
Movie Recommendation Implementation in Spark
High-Level Solution in Spark
Sample Run of Spark Program
Chapter 11. Smarter Email Marketing with the Markov Model
Markov Chains in a Nutshell
Markov Model Using MapReduce
Generating Time-Ordered Transactions with MapReduce
Hadoop Solution 1: Time-Ordered Transactions
Hadoop Solution 2: Time-Ordered Transactions
Generating State Sequences
Generating a Markov State Transition Matrix with MapReduce
Using the Markov Model to Predict the Next Smart Email Marketing Date
Spark Solution
Input Format
High-Level Steps
Spark Program
Script to Run the Spark Program
Sample Run
Chapter 12. K-Means Clustering
What Is K-Means Clustering?
Application Areas for Clustering
Informal K-Means Clustering Method: Partitioning Approach
K-Means Distance Function
K-Means Clustering Formalized
MapReduce Solution for K-Means Clustering
MapReduce Solution: map()
MapReduce Solution: combine()
MapReduce Solution: reduce()
K-Means Implementation by Spark
Sample Run of Spark K-Means Implementation
Chapter 13. k-Nearest Neighbors
kNN Classification
Distance Functions
kNN Example
An Informal kNN Algorithm
Formal kNN Algorithm
Java-like Non-MapReduce Solution for kNN
kNN Implementation in Spark
Formalizing kNN for the Spark Implementation
Input Data Set Formats
Spark Implementation
YARN shell script
Chapter 14. Naive Bayes
Training and Learning Examples
Numeric Training Data
Symbolic Training Data
Conditional Probability
The Naive Bayes Classifier in Depth
Naive Bayes Classifier Example
The Naive Bayes Classifier: MapReduce Solution for Symbolic Data
Stage 1: Building a Classifier Using Symbolic Training Data
Stage 2: Using the Classifier to Classify New Symbolic Data
The Naive Bayes Classifier: MapReduce Solution for Numeric Data
Naive Bayes Classifier Implementation in Spark
Stage 1: Building a Classifier Using Training Data
Stage 2: Using the Classifier to Classify New Data
Using Spark and Mahout
Apache Spark
Apache Mahout
Chapter 15. Sentiment Analysis
Sentiment Examples
Sentiment Scores: Positive or Negative
A Simple MapReduce Sentiment Analysis Example
map() Function for Sentiment Analysis
reduce() Function for Sentiment Analysis
Sentiment Analysis in the Real World
Chapter 16. Finding, Counting, and Listing All Triangles in Large Graphs
Basic Graph Concepts
Importance of Counting Triangles
MapReduce/Hadoop Solution
Step 1: MapReduce in Action
Step 2: Identify Triangles
Step 3: Remove Duplicate Triangles
Hadoop Implementation Classes
Sample Run
Spark Solution
High-Level Steps
Sample Run
Chapter 17. K-mer Counting
Input Data for K-mer Counting
Sample Data for K-mer Counting
Applications of K-mer Counting
K-mer Counting Solution in MapReduce/Hadoop
The map() Function
The reduce() Function
Hadoop Implementation Classes
K-mer Counting Solution in Spark
Spark Solution
Sample Run
Chapter 18. DNA Sequencing
Input Data for DNA Sequencing
Input Data Validation
DNA Sequence Alignment
MapReduce Algorithms for DNA Sequencing
Step 1: Alignment
Step 2: Recalibration
Step 3: Variant Detection
Chapter 19. Cox Regression
The Cox Model in a Nutshell
Cox Regression Basic Terminology
Cox Regression Using R
Expression Data
Cox Regression Application
Cox Regression POJO Solution
Input for MapReduce
Input Format
Cox Regression Using MapReduce
Cox Regression Phase 1: map()
Cox Regression Phase 1: reduce()
Cox Regression Phase 2: map()
Sample Output Generated by Phase 1 reduce() Function
Sample Output Generated by the Phase 2 map() Function
Cox Regression Script for MapReduce
Chapter 20. Cochran-Armitage Test for Trend
Cochran-Armitage Algorithm
Application of Cochran-Armitage
MapReduce Solution
Input
Expected Output
Mapper
Reducer
MapReduce/Hadoop Implementation Classes
Sample Run
Chapter 21. Allelic Frequency
Basic Definitions
Chromosome
Bioset
Allele and Allelic Frequency
Source of Data for Allelic Frequency
Allelic Frequency Analysis Using Fisher’s Exact Test
Fisher’s Exact Test
Formal Problem Statement
MapReduce Solution for Allelic Frequency
MapReduce Solution, Phase 1
Input
Output/Result
Phase 1 Mapper
Phase 1 Reducer
Sample Run of Phase 1 MapReduce/Hadoop Implementation
Sample Plot of P-Values
MapReduce Solution, Phase 2
Phase 2 Mapper for Bottom 100 P-Values
Phase 2 Reducer for Bottom 100 P-Values
Is Our Bottom 100 List a Monoid?
Hadoop Implementation Classes for Bottom 100 List
MapReduce Solution, Phase 3
Phase 3 Mapper for Bottom 100 P-Values
Phase 3 Reducer for Bottom 100 P-Values
Hadoop Implementation Classes for Bottom 100 List for Each Chromosome
Special Handling of Chromosomes X and Y
Chapter 22. The T-Test
Performing the T-Test on Biosets
MapReduce Problem Statement
Input
Expected Output
MapReduce Solution
Hadoop Implementation Classes
Spark Implementation
High-Level Steps
T-Test Algorithm
Sample Run
Chapter 23. Pearson Correlation
Pearson Correlation Formula
Pearson Correlation Example
Data Set for Pearson Correlation
POJO Solution for Pearson Correlation
MapReduce Solution for Pearson Correlation
map() Function for Pearson Correlation
reduce() Function for Pearson Correlation
Spark Solution for Pearson Correlation
Input
Output
Spark Solution
High-Level Steps
Pearson Correlation Wrapper Class
Testing the Pearson Class
Pearson Correlation Using R
YARN Script to Run Spark Program
Spearman Correlation Using Spark
Spearman Correlation Wrapper Class
Testing the Spearman Correlation Wrapper Class
Chapter 24. DNA Base Count
FASTA Format
FASTA Format Example
FASTQ Format
FASTQ Format Example
MapReduce Solution: FASTA Format
Reading FASTA Files
MapReduce FASTA Solution: map()
MapReduce FASTA Solution: reduce()
Sample Run
Log of sample run
Generated output
Custom Sorting
Custom Partitioning
MapReduce Solution: FASTQ Format
Reading FASTQ Files
MapReduce FASTQ Solution: map()
MapReduce FASTQ Solution: reduce()
Hadoop Implementation Classes: FASTQ Format
Sample Run
Spark Solution: FASTA Format
High-Level Steps
Sample Run
Spark Solution: FASTQ Format
High-Level Steps
Step 1: Import required classes and interfaces
Step 2: Handle input parameters
Step 3: Create a JavaPairRDD from FASTQ input
Step 4: Map partitions
Step 5: Collect all DNA base counts
Step 6: Emit Final Counts
Sample Run
Chapter 25. RNA Sequencing
Data Size and Format
MapReduce Workflow
Input Data Validation
RNA Sequencing Analysis Overview
MapReduce Algorithms for RNA Sequencing
Step 1: MapReduce TopHat Mapping
Step 2: MapReduce Calling Cuffdiff
Chapter 26. Gene Aggregation
Input
Output
MapReduce Solutions (Filter by Individual and by Average)
Mapper: Filter by Individual
Reducer: Filter by Individual
Mapper: Filter by Average
Reducer: Filter by Average
Computing Gene Aggregation
Hadoop Implementation Classes
Analysis of Output
Gene Aggregation in Spark
Spark Solution: Filter by Individual
Sharing Data Between Cluster Nodes
High-Level Steps
Utility Functions
Sample Run
Spark Solution: Filter by Average
High-Level Steps
Utility Functions
Sample Run
Chapter 27. Linear Regression
Basic Definitions
Simple Example
Problem Statement
Input Data
Expected Output
MapReduce Solution Using SimpleRegression
Hadoop Implementation Classes
MapReduce Solution Using R’s Linear Model
Phase 1
Phase 2
Hadoop Implementation Using Classes
Chapter 28. MapReduce and Monoids
Introduction
Definition of Monoid
How to Form a Monoid
Monoidic and Non-Monoidic Examples
Maximum over a Set of Integers
Subtraction over a Set of Integers
Addition over a Set of Integers
Multiplication over a Set of Integers
Mean over a Set of Integers
Non-Commutative Example
Median over a Set of Integers
Concatenation over Lists
Union/Intersection over Integers
Functional Example
Matrix Example
MapReduce Example: Not a Monoid
MapReduce Example: Monoid
Hadoop Implementation Classes
Sample Run
View Hadoop output
Spark Example Using Monoids
High-Level Steps
Sample Run
Conclusion on Using Monoids
Functors and Monoids
Chapter 29. The Small Files Problem
Solution 1: Merging Small Files Client-Side
Input Data
Solution with SmallFilesConsolidator
Solution Without SmallFilesConsolidator
Solution 2: Solving the Small Files Problem with CombineFileInputFormat
Custom CombineFileInputFormat
Sample Run Using CustomCFIF
Alternative Solutions
Chapter 30. Huge Cache for MapReduce
Implementation Options
Formalizing the Cache Problem
An Elegant, Scalable Solution
Implementing the LRUMap Cache
Extending the LRUMap Class
Testing the Custom Class
The MapDBEntry Class
Using MapDB
Testing MapDB: put()
Testing MapDB: get()
MapReduce Using the LRUMap Cache
CacheManager Definition
Initializing the Cache
Using the Cache
Closing the Cache
Chapter 31. The Bloom Filter
Bloom Filter Properties
A Simple Bloom Filter Example
Bloom Filters in Guava Library
Using Bloom Filters in MapReduce
Appendix A. Bioset
Appendix B. Spark RDDs
Spark Operations
Tuple
RDDs
How to Create RDDs
Creating RDDs Using Collection Objects
Collecting Elements of an RDD
Transforming an Existing RDD into a New RDD
Creating RDDs by Reading Files
Grouping by Key
Mapping Values
Reducing by Key
Combining by Key
Filtering an RDD
Saving an RDD as an HDFS Text File
Saving an RDD as an HDFS Sequence File
Reading an RDD from an HDFS Sequence File
Counting RDD Items
Spark RDD Examples in Scala
PySpark Examples
How to Package and Run Spark Jobs
Creating the JAR for Data Algorithms
Running a Job in a Spark Cluster
Running a Job in Hadoop’s YARN Environment
Bibliography
Index
About the Author
Data Algorithms RECIPES FOR SCALING UP WITH HADOOP AND SPARK Mahmoud Parsian www.it-ebooks.info
Data Algorithms If you are ready to dive into the MapReduce framework for processing large datasets, this practical book takes you step by step through the algorithms and tools you need to build distributed MapReduce applications with Apache Hadoop or Apache Spark. Each chapter provides a recipe for solving a massive computational problem, such as building a recommendation system. You’ll learn how to implement the appropriate MapReduce solution with code that you can use in your projects. Dr. Mahmoud Parsian covers basic design patterns, optimization techniques, and data mining and machine learning solutions for problems in bioinformatics, genomics, statistics, and social network analysis. This book also includes an overview of MapReduce, Hadoop, and Spark. Topics include: ■ Market basket analysis for a large set of transactions ■ Data mining algorithms (K-means, KNN, and Naive Bayes) ■ Using huge genomic data to sequence DNA and RNA ■ Naive Bayes theorem and Markov chains for data and market prediction ■ Recommendation algorithms and pairwise document similarity ■ Linear regression, Cox regression, and Pearson correlation ■ Allelic frequency and mining DNA ■ Social network analysis (recommendation systems, counting triangles, sentiment analysis) Mahmoud Parsian, PhD in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. Currently the leader of Illumina’s Big Data team, he’s spent the past 15 years working with Java (server-side), databases, MapReduce, and distributed computing. Mahmoud is the author of JDBC Recipes and JDBC Metadata, MySQL, and Oracle Recipes (both Apress). DATA/MATH US $69.99 CAN $80.99 ISBN: 978-1-491-90618-7 www.it-ebooks.info D a t a l A g o r i t h m s Twitter: @oreillymedia facebook.com/oreilly P a r s i a n Data Algorithms RECIPES FOR SCALING UP WITH HADOOP AND SPARK Mahmoud Parsian
Data Algorithms Mahmoud Parsian www.it-ebooks.info Boston
Data Algorithms by Mahmoud Parsian Copyright © 2015 Mahmoud Parsian. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/ institutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Ann Spencer and Marie Beaugureau Production Editor: Matthew Hacker Copyeditor: Rachel Monaghan Proofreader: Rachel Head Indexer: Judith McConville Interior Designer: David Futato Cover Designer: Ellie Volckhausen Illustrator: Rebecca Demarest July 2015: First Edition Revision History for the First Edition 2015-07-10: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491906187 for release details. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-90618-7 [LSI] www.it-ebooks.info
This book is dedicated to my dear family: wife, Behnaz, daughter, Maral, son, Yaseen www.it-ebooks.info
www.it-ebooks.info
Table of Contents Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi 1. Secondary Sort: Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Solutions to the Secondary Sort Problem 3 Implementation Details 3 Data Flow Using Plug-in Classes 6 MapReduce/Hadoop Solution to Secondary Sort 7 Input 7 Expected Output 7 map() Function 8 reduce() Function 8 Hadoop Implementation Classes 9 Sample Run of Hadoop Implementation 10 How to Sort in Ascending or Descending Order 12 Spark Solution to Secondary Sort 12 Time Series as Input 12 Expected Output 13 Option 1: Secondary Sorting in Memory 13 Spark Sample Run 20 Option #2: Secondary Sorting Using the Spark Framework 24 Further Reading on Secondary Sorting 25 2. Secondary Sort: A Detailed Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Secondary Sorting Technique 28 Complete Example of Secondary Sorting 32 Input Format 32 www.it-ebooks.info v
Output Format 33 Composite Key 33 Sample Run—Old Hadoop API 36 Input 36 Running the MapReduce Job 37 Output 37 Sample Run—New Hadoop API 37 Input 38 Running the MapReduce Job 38 Output 39 3. Top 10 List. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Top N, Formalized 42 MapReduce/Hadoop Implementation: Unique Keys 43 Implementation Classes in MapReduce/Hadoop 47 Top 10 Sample Run 47 Finding the Top 5 49 Finding the Bottom 10 49 Spark Implementation: Unique Keys 50 RDD Refresher 50 Spark’s Function Classes 51 Review of the Top N Pattern for Spark 52 Complete Spark Top 10 Solution 53 Sample Run: Finding the Top 10 58 Parameterizing Top N 59 Finding the Bottom N 61 Spark Implementation: Nonunique Keys 62 Complete Spark Top 10 Solution 64 Sample Run 72 Spark Top 10 Solution Using takeOrdered() 73 Complete Spark Implementation 74 Finding the Bottom N 79 Alternative to Using takeOrdered() 80 MapReduce/Hadoop Top 10 Solution: Nonunique Keys 81 Sample Run 82 4. Left Outer Join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Left Outer Join Example 85 Example Queries 87 Implementation of Left Outer Join in MapReduce 88 MapReduce Phase 1: Finding Product Locations 88 MapReduce Phase 2: Counting Unique Locations 92 vi | Table of Contents www.it-ebooks.info
分享到:
收藏