Introduction to Hive Basics
2018.04.19
陈彬
Contents
1. Hadoop & Hive Overview
2. Hive SQL Basics
3. Common Issues and Conventions
4. Hive SQL Optimization
Chapter One
1. Hadoop & Hive Overview
   • Hadoop and Hive
   • Software versions of the offline analytics platform
   • Hive access clients
2. Hive SQL Basics
3. Common Issues and Conventions
4. Hive SQL Optimization
Hadoop & Hive Overview | Hadoop and Hive
Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed
to scale up from single servers to thousands of machines, each offering local
computation and storage.
• Hadoop HDFS (distributed file system): provides high-throughput access to
application data.
• Hadoop YARN (resource management system): a framework for job scheduling and
cluster resource management.
• Hadoop MapReduce (parallel computing framework): a YARN-based system for
parallel processing of large data sets.
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
A MapReduce job usually splits the input data-set into independent chunks which
are processed by the map tasks in a completely parallel manner. The framework
sorts the outputs of the maps, which are then input to the reduce tasks. Typically
both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them, and re-executing the
failed tasks.
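How can a job that counts the words in a book be turned into a distributed job?
How can a grouped-statistics job be turned into a distributed job?
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)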
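The word-count question above can be sketched in plain Python as a single-process simulation of the map, combine, and reduce stages. This is only an illustration, not Hadoop code: the function names and the sample `pages` list are assumptions made for this sketch, and `shuffle_and_sort` stands in for the sorting and grouping that the framework performs between the map and reduce tasks.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate the counts produced by a single map task."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_and_sort(pairs):
    """Stand-in for the framework: sort map output and group it by key."""
    grouped = defaultdict(list)
    for word, count in sorted(pairs):
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)


def word_count(chunks):
    """Run the whole pipeline; each chunk plays the role of one input split."""
    combined = []
    for chunk in chunks:
        combined.extend(combine_phase(map_phase(chunk)))
    grouped = shuffle_and_sort(combined)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical input: each string stands for one page of the book.
pages = ["hive runs on hadoop", "hadoop stores data", "hive queries data"]
counts = word_count(pages)
```

Because each chunk is mapped and combined independently, those two stages could run on different nodes; only the reduce stage needs all values for a given key in one place.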
The Apache Hive data warehouse software facilitates reading, writing, and
managing large datasets residing in distributed storage using SQL. Structure can
be projected onto data already in storage. A command line tool and JDBC driver
are provided to connect users to Hive.
Hive carries out data processing by translating SQL statements into MapReduce jobs.
Hive is not designed for online transaction processing (OLTP) workloads. It is best
used for traditional data warehousing tasks.
Typical workloads: high I/O and batch runs.
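Conceptually, the way a SQL statement becomes a MapReduce job can be sketched as follows. This is a simplified illustration, not Hive's actual query planner: the `orders` table, its columns, and the query in the comment are hypothetical, and the Python functions only mimic the roles that the map and reduce stages would play for a GROUP BY aggregation.

```python
from collections import defaultdict

# Hypothetical query, used for illustration only:
#   SELECT category, SUM(amount) FROM orders GROUP BY category;

# Hypothetical rows of the assumed `orders` table.
ORDERS = [
    {"category": "books", "amount": 12},
    {"category": "games", "amount": 5},
    {"category": "books", "amount": 18},
    {"category": "music", "amount": 7},
    {"category": "games", "amount": 3},
]


def map_row(row):
    """Map side: emit the GROUP BY key with the column being aggregated."""
    return row["category"], row["amount"]


def reduce_group(category, amounts):
    """Reduce side: apply the aggregate function (SUM) to one group."""
    return category, sum(amounts)


def run_group_by(rows):
    """Simulate map, shuffle by key, and reduce in a single process."""
    shuffled = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)
    return dict(reduce_group(key, values) for key, values in sorted(shuffled.items()))


totals = run_group_by(ORDERS)
```

The GROUP BY column becomes the shuffle key and the aggregate function becomes the reduce logic, which is also why such queries suit batch processing better than low-latency transactional access.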
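Use cases
Data warehousing and analytical processing?
Scenarios:
• Real-time queries
• Online transaction processing (OLTP)
• Large-volume batch processing
• Ad hoc queries
• Multidimensional data analysis
• Data mining
Components:
• RDBMS
• NoSQL
• Hive
• Kylin
• Spark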
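Hadoop & Hive Overview | Software versions of the offline analytics platform
The offline analytics platform is built on Huawei FusionInsight. There are
currently two clusters, Shanghai (OFA) and Shenzhen (OFB): the Shanghai cluster
has 65 nodes and the Shenzhen cluster has 105 nodes. The software versions are
as follows: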
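Software        OFA version   OFB version   Description
RHEL            6.4           7.2           Operating system
FusionInsight   C60           C70           Huawei release
Hadoop          2.7.2         2.7.2         Distributed platform
Hive            1.2.1         1.2.1         SQL data warehouse tool
Spark           1.5.1         2.1.0/1.5.1   Distributed processing framework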