Enterprise Applications of Data Science and Their Automation
© Cloudera, Inc. All rights reserved.
Data Science Is Transforming Traditional Industries
Connected Car
Smart Industry
Smart Cities & Ports
Environment Sensing
Usage Based Insurance
Predictive Maintenance
Aerospace & Aviation
Smart Healthcare
Advanced Analytics & Machine Learning
Data Science: It’s all about Curiosity & Passion
Curiosity
Passion
Jeff Hammerbacher,
Founder and Chief Scientist, Cloudera
Hierarchy of needs of Data Science
Exploratory Data Science
Required Capabilities:
● Unified Platform:
○ Supports more workloads on a single platform, in an
integrated data science and engineering
environment.
● Performance:
○ Developers and data scientists have access to
high-scale, high-performance query engines for
distributed analytics.
● Ease of Use:
○ Built for data scientists, not only for hackers.
“Cloudera’s approach was more aligned
with our own philosophies, such as
building simpler, more prescriptive
libraries that broaden the audience for
the platform.”
“Cloudera Enterprise expedites
round-trips to access and compute
data for data discovery, translating
into significant reductions in R&D
time. This will have a very
meaningful scientific upside.”
Machine Learning for Enterprise
Required Capabilities:
● Expanded Data Access
○ Expand your empirical data
● Test and Train Faster
○ Iterative development
● Use Familiar Tools
○ Increase developer productivity with familiar
APIs
● Integrated Batch and Streaming
○ Unified batch and streaming programming model
“Cloudera, using complex machine
learning algorithms, analyzes large
amounts of data in real time and allows
personalization of game interaction with
players through recommendations.”
“Machine learning and big data is like a
marriage made in heaven. I mean it
works really well with some other tooling
that's already there on the stack. It used
to take time. It was cumbersome.”
“CDSW + Spark is a perfect combination
for machine learning on Hadoop. It
simplified our work and saved costs for
our data industry.”
Developing Data Science
A large number of open source machine learning frameworks
Data science teams want to use
the latest open source tools
IT and business don’t know how to
support them with existing analytics
investments
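Hadoop platform: avoiding data science “silos”
- Data provisioning and visual data processing problems
- Data governance is an important foundation for data science
- Data science requires large amounts of compute resources, so
distributed resource scheduling, data lakes, private cloud,
and multi-tenancy become inevitable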
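Organizational structure for data science
● Self-service: data services, analytics services, modeling services
● Personalized data environments
● Business and technology departments are of equal value
and need to collaborate closely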