Introduction to Hive Basics (201804).pptx

Introduction to Hive Basics
2018.04.19
陈彬 (Chen Bin)
CONTENTS
1 Hadoop & Hive Overview
2 Hive SQL Basics
3 Common Problems and Conventions
4 Hive SQL Optimization
CHAPTER ONE
1 Hadoop & Hive Overview
• Hadoop and Hive
• Software versions of the offline analysis platform
• Hive access clients
2 Hive SQL Basics
3 Common Problems and Conventions
4 Hive SQL Optimization
1 Hadoop & Hive Overview | Hadoop and Hive
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Hadoop HDFS: A distributed file system that provides high-throughput access to application data. (Distributed file system)
• Hadoop YARN: A framework for job scheduling and cluster resource management. (Resource management system)
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. (Parallel computing framework)
1 Hadoop & Hive Overview | Hadoop and Hive
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
• How is a job that counts the words in a book converted into a distributed job?
• How is a job that computes per-category statistics converted into a distributed job?
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
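The word-count question can be made concrete with a small single-process sketch of the map -> combine -> reduce flow. This is plain Python, not Hadoop API code: the function names (`map_phase`, `combine_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, and each input string stands in for one independent chunk that a real cluster would hand to a separate map task.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit one (word, 1) pair per word in an input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate counts within a single mapper's output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def shuffle(all_pairs):
    """Shuffle/sort: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for word, count in all_pairs:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(chunks):
    """Run the whole pipeline over a list of independent input chunks."""
    combined = [combine_phase(map_phase(chunk)) for chunk in chunks]
    grouped = shuffle(chain.from_iterable(combined))
    return dict(reduce_phase(word, counts) for word, counts in grouped)


# Hypothetical input: three "pages" of a book, one chunk per map task.
pages = ["the quick brown fox", "the lazy dog", "the fox"]
print(word_count(pages))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Because each chunk is mapped and combined without looking at any other chunk, those two steps could run on different machines; only the shuffle step needs to bring pairs with the same key together.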
1 Hadoop & Hive Overview | Hadoop and Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Hive processes data by translating SQL statements into MapReduce jobs.
Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.
Typical workloads: high I/O and batch runs.
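To give a rough feel for what translating SQL into a MapReduce job means, the sketch below hand-executes a GROUP BY aggregation as a map step, a grouping step, and a reduce step. It is a conceptual illustration only, not the plan Hive actually generates: the `orders` table, its columns, and all function names are hypothetical.

```python
from collections import defaultdict

# Conceptual equivalent of the (hypothetical) query:
#   SELECT category, SUM(amount) FROM orders GROUP BY category;

# Hypothetical rows of an `orders` table: (order_id, category, amount)
ORDERS = [
    (1, "books", 30),
    (2, "toys", 15),
    (3, "books", 20),
    (4, "food", 5),
    (5, "toys", 10),
]


def map_row(row):
    """Map: project each row to (GROUP BY key, value to aggregate)."""
    _, category, amount = row
    return category, amount


def group_by_key(pairs):
    """Shuffle: collect every value that shares the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: apply the aggregate function (here SUM) to one group."""
    return key, sum(values)


def run_group_by_sum(rows):
    """Chain the three steps and return the result sorted by key."""
    grouped = group_by_key(map_row(row) for row in rows)
    return sorted(reduce_group(key, values) for key, values in grouped.items())


print(run_group_by_sum(ORDERS))
# [('books', 50), ('food', 5), ('toys', 25)]
```

The GROUP BY column plays the role of the MapReduce key, which is why this kind of query fits the batch-oriented model well and row-level transactional access does not.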
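1 Hadoop & Hive Overview | Hadoop and Hive
Usage scenarios
• Real-time queries
• Online transaction processing (OLTP)
• Data warehousing and analytical processing?
• Large-volume batch processing
• Ad hoc queries
• Multidimensional data analysis
• Data mining
Components: RDBMS, NoSQL, Hive, Kylin, Spark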
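1 Hadoop & Hive Overview | Software versions of the offline analysis platform
The offline analysis platform is built on Huawei FusionInsight. There are currently two clusters, Shanghai (OFA) and Shenzhen (OFB): the Shanghai cluster has 65 nodes and the Shenzhen cluster has 105 nodes. The software versions are as follows:

Software        OFA version   OFB version   Description
RHEL            6.4           7.2           Operating system
FusionInsight   C60           C70           Huawei release
Hadoop          2.7.2         2.7.2         Distributed platform
Hive            1.2.1         1.2.1         SQL data warehouse tool
Spark           1.5.1         2.1.0/1.5.1   Distributed processing framework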