Preface
Who Should Read This Book
Why I Wrote This Book
A Word on Data Science Today
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. Data I/O
What Is Data, Anyway?
Data Models
Univariate Arrays
Multivariate Arrays
Data Objects
Matrices and Vectors
JSON
Dealing with Real Data
Nulls
Blank Spaces
Parse Errors
Outliers
Managing Data Files
Understanding File Contents First
Reading from a Text File
Parsing big strings
Parsing delimited strings
Parsing JSON strings
Reading from a JSON File
Reading from an Image File
Writing to a Text File
Mastering Database Operations
Command-Line Clients
Structured Query Language
Create
Select
Insert
Update
Delete
Drop
Java Database Connectivity
Connections
Statements
Prepared statements
Result sets
Visualizing Data with Plots
Creating Simple Plots
Scatter plots
Bar charts
Plotting multiple series
Basic formatting
Plotting Mixed Chart Types
Saving a Plot to a File
2. Linear Algebra
Building Vectors and Matrices
Array Storage
Block Storage
Map Storage
Accessing Elements
Working with Submatrices
Randomization
Operating on Vectors and Matrices
Scaling
Transposing
Addition and Subtraction
Length
Distances
Multiplication
Inner Product
Outer Product
Entrywise Product
Compound Operations
Affine Transformation
Mapping a Function
Decomposing Matrices
Cholesky Decomposition
LU Decomposition
QR Decomposition
Singular Value Decomposition
Eigen Decomposition
Determinant
Inverse
Solving Linear Systems
3. Statistics
The Probabilistic Origins of Data
Probability Density
Cumulative Probability
Statistical Moments
Entropy
Continuous Distributions
Uniform
Normal
Multivariate normal
Log normal
Empirical
Discrete Distributions
Bernoulli
Binomial
Poisson
Characterizing Datasets
Calculating Moments
Sample moments
Updating moments
Descriptive Statistics
Count
Sum
Min
Max
Mean
Median
Mode
Variance
Standard deviation
Error on the mean
Skewness
Kurtosis
Multivariate Statistics
Covariance and Correlation
Covariance
Pearson’s correlation
Regression
Simple regression
Multiple regression
Working with Large Datasets
Accumulating Statistics
Merging Statistics
Regression
Using Built-in Database Functions
4. Data Operations
Transforming Text Data
Extracting Tokens from a Document
Utilizing Dictionaries
Vectorizing a Document
Scaling and Regularizing Numeric Data
Scaling Columns
Min-max scaling
Centering the data
Unit normal scaling
Scaling Rows
L1 regularization
L2 regularization
Matrix Scaling Operator
Reducing Data to Principal Components
Covariance Method
SVD Method
Creating Training, Validation, and Test Sets
Index-Based Resampling
List-Based Resampling
Mini-Batches
Encoding Labels
A Generic Encoder
One-Hot Encoding
5. Learning and Prediction
Learning Algorithms
Iterative Learning Procedure
Gradient Descent Optimizer
Evaluating Learning Processes
Minimizing a Loss Function
Linear loss
Quadratic loss
Cross-entropy loss
Bernoulli
Multinomial
Two-Point
Minimizing the Sum of Variances
Silhouette Coefficient
Log-Likelihood
Classifier Accuracy
Unsupervised Learning
k-Means Clustering
DBSCAN
Dealing with outliers
Optimizing radius of capture and minPoints
Inference from DBSCAN
Gaussian Mixtures
Gaussian mixture model
Fitting with the EM algorithm
Optimizing the number of clusters
Supervised Learning
Naive Bayes
Gaussian
Multinomial
Bernoulli
Iris example
Linear Models
Linear
Logistic
Softmax
Tanh
Linear model estimator
Iris example
Deep Networks
A network layer
Feed forward
Back propagation
Deep network estimator
MNIST example
6. Hadoop MapReduce
Hadoop Distributed File System
MapReduce Architecture
Writing MapReduce Applications
Anatomy of a MapReduce Job
Hadoop Data Types
Writable and WritableComparable types
Custom Writable and WritableComparable types
Writable
WritableComparable
Mappers
Generic mappers
Customizing a mapper
Reducers
Generic reducers
Customizing a reducer
The Simplicity of a JSON String as Text
Deployment Wizardry
Running a standalone program
Deploying a JAR application
Including dependencies
Simplifying with a BASH script
MapReduce Examples
Word Count
Custom Word Count
Sparse Linear Algebra
A. Datasets
Anscombe’s Quartet
Sentiment
Gaussian Mixtures
Iris
MNIST
Index