How to Go Really Big in AI:
Strategies & Principles for Distributed Machine Learning
Eric Xing
epxing@cs.cmu.edu
School of Computer Science
Carnegie Mellon University
Wei Dai, Qirong Ho, Jin Kyu Kim, Abhimanu Kumar, Seunghak Lee, Jinliang Wei, Pengtao Xie, Yaoliang Yu, Hao Zhang, Xun Zheng
Acknowledgement:
James Cipar, Henggang Cui,
and Phil Gibbons, Greg Ganger, Garth Gibson
1
Machine Learning:
-- a view from outside
2
Inside ML …
• Graphical Models
• Nonparametric Bayesian Models
• Regularized Bayesian Methods
• Large-Margin
• Deep Learning
• Sparse Coding
• Sparse Structured I/O Regression
• Spectral/Matrix Methods

Hardware and infrastructure
• Network switches
• Infiniband
• Network attached storage
• Flash storage
• Server machines
• Desktops/Laptops
• NUMA machines
• GPUs
• Cloud compute (e.g. Amazon EC2)
• Virtual Machines
3
Massive Data
• 1B+ USERS
• 30+ PETABYTES
• 32 million pages
• 100+ hours of video uploaded every minute
• 645 million users
• 500 million tweets / day
4
Growing Model Complexity
• Google Brain deep learning for images: 1~10 billion model parameters
• Multi-task regression for the simplest whole-genome analysis: 100 million ~ 1 billion model parameters
• Topic models for news article analysis: up to 1 trillion model parameters
• Collaborative filtering for video recommendation: 1~10 billion model parameters
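For a sense of where a trillion-parameter count can come from (illustrative numbers, not from the slide): a topic model keeps one parameter per (word, topic) pair, so with a vocabulary of $|\mathcal{V}| = 10^{6}$ words and $K = 10^{6}$ topics,
$|\mathcal{V}| \times K = 10^{6} \times 10^{6} = 10^{12}$ parameters.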
5
The Scalability Challenge
[Figure: processing power/speed vs. number of "machines", contrasting an ideal "Good!" scaling curve with a "Pathetic" one]
6
Why do we need new Big ML systems?
Today’s AI & ML imposes high CAPEX and OPEX
Example: The Google Brain AI & ML system
High CAPEX:
• 1000 machines
• $10m+ capital cost (hardware)
High OPEX:
• $500k+/yr electricity and other running costs
• 3 key scientists ($1m/year)
• 10+ engineers ($2.5m/year)
Total 3-year cost = $20m+
Small to mid-sized companies and academia do not have such luxury.
1000 machines only 100x as good as 1 machine!
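Put differently (a back-of-the-envelope reading of that claim, not a number from the slide), the parallel efficiency is
$\text{efficiency} = \frac{\text{speedup}}{\#\text{machines}} = \frac{100}{1000} = 10\%$,
i.e. roughly 90% of the cluster's capacity is effectively wasted.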
7
Why do we need some new thinking?
The MLer's view: focus on correctness
• fewer iterations to converge,
• but assuming an ideal system, e.g., zero-cost sync and uniform local progress
Compute vs. Network: LDA on 32 machines (256 cores)
[Figure: bar chart of seconds (0 to 8000) split into network waiting time vs. compute time, as the number of machines grows from 0 to 32]
for (t = 1 to T) {
    doThings()              // local computation
    parallelUpdate(x, θ)    // apply updates to the shared model parameters θ using data x
    doOtherThings()
}
[Diagram: the loop is parallelized over worker threads, which share the global model parameters θ via RAM]
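A minimal sketch of this idealized shared-memory view (illustrative code, not from the tutorial; the data, update rule, and thread count are made up): worker threads run the update loop over a NumPy parameter vector theta shared in RAM, with a lock standing in for the assumed zero-cost synchronization and a barrier enforcing uniform local progress.

import threading
import numpy as np

NUM_THREADS = 8      # worker threads sharing one machine's RAM
NUM_ITERS = 100      # the outer loop "for t = 1 to T"

def compute_update(theta, shard):
    # Placeholder for the local computation ("doThings"); a real learner
    # (e.g., an SGD or Gibbs-sampling step) would go here.
    return -0.01 * (theta - shard.mean(axis=0))

def worker(theta, shard, lock, barrier):
    for t in range(NUM_ITERS):
        delta = compute_update(theta, shard)
        with lock:               # "parallelUpdate(x, θ)": write to the shared parameters
            theta += delta       # in-place update of the shared NumPy array
        barrier.wait()           # idealized assumption: sync is free and uniform

theta = np.zeros(100)                      # global model parameters, shared via RAM
data = np.random.randn(8000, 100)          # toy data, split into per-thread shards
shards = np.array_split(data, NUM_THREADS)
lock, barrier = threading.Lock(), threading.Barrier(NUM_THREADS)
threads = [threading.Thread(target=worker, args=(theta, s, lock, barrier)) for s in shards]
for th in threads:
    th.start()
for th in threads:
    th.join()

On a single machine this picture is defensible; the bar chart above shows why it breaks down across a cluster, where network waiting time can dominate compute time.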
8