SPRINGER BRIEFS IN COMPUTER SCIENCE
Sherif Sakr
Big Data 2.0 Processing Systems
A Survey
SpringerBriefs in Computer Science
More information about this series at http://www.springer.com/series/10028
Sherif Sakr
University of New South Wales
Sydney, NSW
Australia
ISSN 2191-5768
ISSN 2191-5776 (electronic)
SpringerBriefs in Computer Science
ISBN 978-3-319-38775-8
ISBN 978-3-319-38776-5 (eBook)
DOI 10.1007/978-3-319-38776-5
Library of Congress Control Number: 2016941097
© The Author(s) 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland
To my wife, Radwa,
my daughter, Jana,
and my son, Shehab
for their love, encouragement,
and support.
Sherif Sakr
Foreword
Big Data has become a core topic in different industries and research disciplines as
well as for society as a whole. This is because the ability to generate, collect,
distribute, process, and analyze unprecedented amounts of diverse data has almost
universal utility and helps to fundamentally change the way industries operate, how
research can be done, and how people live and use modern technology. Different
industries such as automotive, finance, healthcare, or manufacturing can dramatically
benefit from improved and faster data analysis, for example, as illustrated by current
industry trends such as “Industry 4.0” and “Internet-of-Things.” Data-driven research
approaches utilizing Big Data technology and analysis have become increasingly
commonplace, for example, in the life sciences, geosciences, or in astronomy. Users
utilizing smartphones, social media, and Web resources spend increasing amounts of
time online, generate and consume enormous amounts of data, and are the target for
personalized services, recommendations, and advertisements.
Most of the possible developments related to Big Data are still in an early stage,
but there is great promise if the diverse technological and application-specific
challenges in managing and using Big Data are successfully addressed. Some of the
technical challenges have been associated with different “V” characteristics, in
particular Volume, Velocity, Variety, and Veracity, which are also discussed in this
book. Other challenges relate to the protection of personal and sensitive data to
ensure a high degree of privacy and the ability to turn the huge amount of data into
useful insights or improved operation.
A key enabler for the Big Data movement is the availability of increasingly powerful and
relatively inexpensive computing platforms that allow fault-tolerant storage and
processing of petabytes of data within large computing clusters typically equipped
with thousands of processors and terabytes of main memory. The utilization of such
infrastructures was pioneered by Internet giants such as Google and Amazon but
has become generally possible through open-source system software such as the Hadoop
ecosystem. Initially there were only a few core Hadoop components, in particular
its distributed file system HDFS and the MapReduce framework for the
relatively easy development and execution of highly parallel applications to process
massive amounts of data on cluster infrastructures.
The initial Hadoop was highly successful but also reached its limits in
different areas, for example, in supporting the processing of fast-changing data such as
data streams or in processing highly iterative algorithms, for example, for machine
learning or graph processing. Furthermore, the Hadoop world has been largely
decoupled from the widespread data management and analysis approaches based on
relational databases and SQL. These aspects have led to a large number of additional
components within the Hadoop ecosystem, both general-purpose processing
frameworks such as Apache Spark and Flink and specific components, such
as those for data streams, graph data, or machine learning. Furthermore, there are now
numerous approaches to combine Hadoop-like data processing with relational
database processing (“SQL on Hadoop”).
The net effect of all these developments is that the current technological landscape
for Big Data is not yet consolidated; instead, there are many possible approaches
within the Hadoop ecosystem and also within the product portfolio of different
database vendors and other IT companies (Google, IBM, Microsoft, Oracle, etc.).
The book Big Data 2.0 Processing Systems by Sherif Sakr is a valuable and
up-to-date guide through this technological “jungle” and provides the reader with a
comprehensible and concise overview of the main developments after the initial
MapReduce-focused version of Hadoop. I am confident that this information is
useful for many practitioners, scientists, and students interested in Big Data
technology.
University of Leipzig, Germany
Erhard Rahm