How to Go Really Big in AI:
Strategies & Principles for Distributed Machine Learning
Eric Xing
epxing@cs.cmu.edu
School of Computer Science
Carnegie Mellon University
Wei Dai, Qirong Ho, Jin Kyu Kim, Abhimanu Kumar, Seunghak Lee, Jinliang Wei, Pengtao Xie, Yaoliang Yu, Hao Zhang, Xun Zheng
Acknowledgement:
James Cipar, Henggang Cui,
and Phil Gibbons, Greg Ganger, Garth Gibson
1
Machine Learning:
-- a view from outside
2
Inside ML …
• Graphical Models
• Nonparametric Bayesian Models
• Regularized Bayesian Methods
• Large-Margin
• Deep Learning
• Sparse Coding
• Sparse Structured I/O Regression
• Spectral/Matrix Methods

Hardware and infrastructure
• Network switches
• Infiniband
• Network attached storage
• Flash storage
• Server machines
• Desktops/Laptops
• NUMA machines
• GPUs
• Cloud compute (e.g. Amazon EC2)
• Virtual Machines
3
Massive Data
• 1B+ USERS
• 30+ PETABYTES
• 32 million pages
• 100+ hours of video uploaded every minute
• 645 million users
• 500 million tweets / day
4
Growing Model Complexity
• Google Brain deep learning for images: 1~10 billion model parameters
• Multi-task regression for the simplest whole-genome analysis: 100 million ~ 1 billion model parameters
• Topic models for news article analysis: up to 1 trillion model parameters
• Collaborative filtering for video recommendation: 1~10 billion model parameters
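For a sense of where a trillion-parameter count can come from (illustrative numbers, not from the slide): a topic model keeps one parameter per (word, topic) pair, so with a vocabulary of $|\mathcal{V}| = 10^{6}$ words and $K = 10^{6}$ topics,
$|\mathcal{V}| \times K = 10^{6} \times 10^{6} = 10^{12}$ parameters.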
5
The Scalability Challenge
[Figure: processing power/speed vs. number of "machines", contrasting an ideal "Good!" scaling curve with a "Pathetic" one]
6
Why do we need new Big ML systems?
Today’s AI & ML imposes high CAPEX and OPEX
Example: The Google Brain AI & ML system
High CAPEX:
• 1000 machines
• $10m+ capital cost (hardware)
High OPEX:
• $500k+/yr electricity and other running costs
• 3 key scientists ($1m/year)
• 10+ engineers ($2.5m/year)
Total 3-year cost = $20m+
Small to mid-sized companies and academia do not have such luxury.
1000 machines only 100x as good as 1 machine!
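Put differently (a back-of-the-envelope reading of that claim, not a number from the slide), the parallel efficiency is
$\text{efficiency} = \frac{\text{speedup}}{\#\text{machines}} = \frac{100}{1000} = 10\%$,
i.e. roughly 90% of the cluster's capacity is effectively wasted.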
7
Why do we need some new thinking?
The MLer's view: focus on correctness
• fewer iterations to converge,
• but assuming an ideal system, e.g., zero-cost sync and uniform local progress
Compute vs. Network: LDA on 32 machines (256 cores)
[Figure: bar chart of seconds (0 to 8000) split into network waiting time vs. compute time, as the number of machines grows from 0 to 32]
for (t = 1 to T) {
    doThings()              // local computation
    parallelUpdate(x, θ)    // apply updates to the shared model parameters θ using data x
    doOtherThings()
}
[Diagram: the loop is parallelized over worker threads, which share the global model parameters θ via RAM]
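A minimal sketch of this idealized shared-memory view (illustrative code, not from the tutorial; the data, update rule, and thread count are made up): worker threads run the update loop over a NumPy parameter vector theta shared in RAM, with a lock standing in for the assumed zero-cost synchronization and a barrier enforcing uniform local progress.

import threading
import numpy as np

NUM_THREADS = 8      # worker threads sharing one machine's RAM
NUM_ITERS = 100      # the outer loop "for t = 1 to T"

def compute_update(theta, shard):
    # Placeholder for the local computation ("doThings"); a real learner
    # (e.g., an SGD or Gibbs-sampling step) would go here.
    return -0.01 * (theta - shard.mean(axis=0))

def worker(theta, shard, lock, barrier):
    for t in range(NUM_ITERS):
        delta = compute_update(theta, shard)
        with lock:               # "parallelUpdate(x, θ)": write to the shared parameters
            theta += delta       # in-place update of the shared NumPy array
        barrier.wait()           # idealized assumption: sync is free and uniform

theta = np.zeros(100)                      # global model parameters, shared via RAM
data = np.random.randn(8000, 100)          # toy data, split into per-thread shards
shards = np.array_split(data, NUM_THREADS)
lock, barrier = threading.Lock(), threading.Barrier(NUM_THREADS)
threads = [threading.Thread(target=worker, args=(theta, s, lock, barrier)) for s in shards]
for th in threads:
    th.start()
for th in threads:
    th.join()

On a single machine this picture is defensible; the bar chart above shows why it breaks down across a cluster, where network waiting time can dominate compute time.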
8