Big Data 2.0.pdf

Foreword
Preface
Acknowledgements
Contents
About the Author
1 Introduction
1.1 The Big Data Phenomenon
1.2 Big Data and Cloud Computing
1.3 Big Data Storage Systems
1.4 Big Data Processing and Analytics Systems
1.5 Book Roadmap
2 General-Purpose Big Data Processing Systems
2.1 The Big Data Star: The Hadoop Framework
2.1.1 The Original Architecture
2.1.2 Enhancements of the MapReduce Framework
2.1.3 Hadoop's Ecosystem
2.2 Spark
2.3 Flink
2.4 Hyracks/ASTERIX
3 Large-Scale Processing Systems of Structured Data
3.1 Why SQL-On-Hadoop?
3.2 Hive
3.3 Impala
3.4 IBM Big SQL
3.5 SPARK SQL
3.6 HadoopDB
3.7 Presto
3.8 Tajo
3.9 Google Big Query
3.10 Phoenix
3.11 Polybase
4 Large-Scale Graph Processing Systems
4.1 The Challenges of Big Graphs
4.2 Does Hadoop Work Well for Big Graphs?
4.3 Pregel Family of Systems
4.3.1 The Original Architecture
4.3.2 Giraph: BSP + Hadoop for Graph Processing
4.3.3 Pregel Extensions
4.4 GraphLab Family of Systems
4.4.1 GraphLab
4.4.2 PowerGraph
4.4.3 GraphChi
4.5 Other Systems
4.6 Large-Scale RDF Processing Systems
5 Large-Scale Stream Processing Systems
5.1 The Big Data Streaming Problem
5.2 Hadoop for Big Streams?!
5.3 Storm
5.4 Infosphere Streams
5.5 Other Big Stream Processing Systems
5.6 Big Data Pipelining Frameworks
5.6.1 Pig Latin
5.6.2 Tez
5.6.3 Other Pipelining Systems
6 Conclusions and Outlook
References
SPRINGER BRIEFS IN COMPUTER SCIENCE
Sherif Sakr
Big Data 2.0 Processing Systems
A Survey
SpringerBriefs in Computer Science
More information about this series at http://www.springer.com/series/10028
Sherif Sakr
University of New South Wales
Sydney, NSW
Australia

ISSN 2191-5768
ISSN 2191-5776 (electronic)
SpringerBriefs in Computer Science
ISBN 978-3-319-38775-8
ISBN 978-3-319-38776-5 (eBook)
DOI 10.1007/978-3-319-38776-5
Library of Congress Control Number: 2016941097

© The Author(s) 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland
To my wife, Radwa, my daughter, Jana, and my son, Shehab, for their love, encouragement, and support.
Sherif Sakr
Foreword

Big Data has become a core topic in different industries and research disciplines as well as for society as a whole. This is because the ability to generate, collect, distribute, process, and analyze unprecedented amounts of diverse data has almost universal utility and helps to change fundamentally the way industries operate, how research can be done, and how people live and use modern technology. Different industries such as automotive, finance, healthcare, or manufacturing can dramatically benefit from improved and faster data analysis, for example, as illustrated by current industry trends such as “Industry 4.0” and “Internet-of-Things.” Data-driven research approaches utilizing Big Data technology and analysis have become increasingly commonplace, for example, in the life sciences, geosciences, or in astronomy. Users utilizing smartphones, social media, and Web resources spend increasing amounts of time online, generate and consume enormous amounts of data, and are the target for personalized services, recommendations, and advertisements.

Most of the possible developments related to Big Data are still in an early stage but there is great promise if the diverse technological and application-specific challenges in managing and using Big Data are successfully addressed. Some of the technical challenges have been associated with different “V” characteristics, in particular Volume, Velocity, Variety, and Veracity that are also discussed in this book. Other challenges relate to the protection of personal and sensitive data to ensure a high degree of privacy and the ability to turn the huge amount of data into useful insights or improved operation.

A key enabler for the Big Data movement is the increasingly powerful and relatively inexpensive computing platforms allowing fault-tolerant storage and processing of petabytes of data within large computing clusters typically equipped with thousands of processors and terabytes of main memory. The utilization of such infrastructures was pioneered by Internet giants such as Google and Amazon but has become generally possible by open-source system software such as the Hadoop ecosystem. Initially there have been only a few core Hadoop components, in particular its distributed file system HDFS and the MapReduce framework for the relatively easy development and execution of highly parallel applications to process massive amounts of data on cluster infrastructures.

The initial Hadoop has been highly successful but also reached its limits in different areas, for example, to support the processing of fast changing data such as datastreams or to process highly iterative algorithms, for example, for machine learning or graph processing. Furthermore, the Hadoop world has been largely decoupled from the widespread data management and analysis approaches based on relational databases and SQL. These aspects have led to a large number of additional components within the Hadoop ecosystem, both general-purpose processing frameworks such as Apache Spark and Flink as well as specific components, such as for data streams, graph data, or machine learning. Furthermore, there are now numerous approaches to combine Hadoop-like data processing with relational database processing (“SQL on Hadoop”).

The net effect of all these developments is that the current technological landscape for Big Data is not yet consolidated but there are many possible approaches within the Hadoop ecosystem and also within the product portfolio of different database vendors and other IT companies (Google, IBM, Microsoft, Oracle, etc.). The book Big Data 2.0 Processing Systems by Sherif Sakr is a valuable and up-to-date guide through this technological “jungle” and provides the reader with a comprehensible and concise overview of the main developments after the initial MapReduce-focused version of Hadoop. I am confident that this information is useful for many practitioners, scientists, and students interested in Big Data technology.

University of Leipzig, Germany
Erhard Rahm