Enterprise Applications of Data Science and Their Automation
© Cloudera, Inc. All rights reserved.
Data Science Is Transforming Traditional Industries
Connected Car
Smart Industry
Smart Cities & Ports
Environment Sensing
Usage Based Insurance
Predictive Maintenance
Aerospace & Aviation
Smart Healthcare
Advanced Analytics & Machine Learning
Data Science: It’s all about Curiosity & Passion
Curiosity
Passion
Jeff Hammerbacher,
Cloudera co-founder and Chief Scientist
The Data Science Hierarchy of Needs
Exploratory Data Science
Required Capabilities:
● Unified Platform:
○ Enables more workloads on a single platform, in an integrated data science and engineering environment.
● Performance:
○ Developers and data scientists have access to high-scale, high-performance query engines for distributed analytics.
● Ease of Use:
○ Built for data scientists rather than hackers.
“Cloudera’s approach was more aligned with our own philosophies, such as building simpler, more prescriptive libraries that broaden the audience for the platform.”
“Cloudera Enterprise expedites round-trips to access and compute data for data discovery, translating into significant reductions in R&D time. This will have a very meaningful scientific upside.”
Machine Learning for Enterprise
Required Capabilities:
● Expanded Data Access
○ Expand your empirical data
● Test and Train Faster
○ Iterative development
● Use Familiar Tools
○ Increase developer productivity with familiar APIs
● Integrated Batch and Streaming
○ Unified batch and streaming programming model
“Cloudera, using complex machine learning algorithms, analyzes large amounts of data in real time and allows personalization of game interaction with players through recommendations.”
“Machine learning and big data is like a marriage made in heaven. I mean it works really well with some other tooling that's already there on the stack. It used to take time. It was cumbersome.”
“CDSW + Spark is a perfect combination for machine learning on Hadoop. It simplified our work and saved costs in our data industry.”
The Development of Data Science
A large number of open-source machine learning frameworks
Data science teams want to use the latest open-source tools
IT and business don’t know how to support them with existing analytics investments
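Hadoop platform: avoiding data science “silos”
- Data provisioning and visual data-processing issues
- Data governance is an important foundation for data science
- Data science needs large amounts of computing resources, so distributed resource scheduling, data lakes, private cloud, and multi-tenancy become inevitable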
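Organizational structure for data science
● Self-service: data services, analytics services, modeling services
● Personalized data environments
● Business and technology departments are of equal value and need to collaborate closely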