Copyright
Table of Contents
Foreword
Preface
What’s in This Book
The Second Edition
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Chapter 1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
The Second Edition
Chapter 2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
From RDDs to Data Frames
Analyzing Data with the DataFrame API
Fast Summary Statistics for DataFrames
Pivoting and Reshaping DataFrames
Joining DataFrames and Selecting Features
Preparing Models for Production Environments
Model Evaluation
Where to Go from Here
Chapter 3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
Chapter 4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization with SparkR
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
The Document-Term Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with a Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Document-Term Relevance
Multiple-Term Queries
Where to Go from Here
Chapter 7. Analyzing Co-Occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala’s XML Library
Analyzing the MeSH Major Topics and Their Co-Occurrences
Constructing a Co-Occurrence Network with GraphX
Understanding the Structure of Networks
Connected Components
Degree Distribution
Filtering Out Noisy Edges
Processing EdgeTriplets
Analyzing the Filtered Graph
Small-World Networks
Cliques and Clustering Coefficients
Computing Average Path Length with Pregel
Where to Go from Here
Chapter 8. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
Getting the Data
Working with Third-Party Libraries in Spark
Geospatial Data with the Esri Geometry API and Spray
Exploring the Esri Geometry API
Intro to GeoJSON
Preparing the New York City Taxi Trip Data
Handling Invalid Records at Scale
Geospatial Analysis
Sessionization in Spark
Building Sessions: Secondary Sorts in Spark
Where to Go from Here
Chapter 9. Estimating Financial Risk Through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
Chapter 10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Parquet Format and Columnar Storage
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
PySpark Internals
Overview and Installation of the Thunder Library
Loading Data with Thunder
Thunder Core Data Types
Categorizing Neuron Types with Thunder
Where to Go from Here
Index
About the Authors
Colophon