The MIT Press Essential Knowledge Series
Auctions, Timothy P. Hubbard and Harry J. Paarsch
The Book, Amaranth Borsuk
Cloud Computing, Nayan Ruparelia
Computing: A Concise History, Paul E. Ceruzzi
The Conscious Mind, Zoltan L. Torey
Crowdsourcing, Daren C. Brabham
Data Science, John D. Kelleher and Brendan Tierney
Free Will, Mark Balaguer
The Future, Nick Montfort
Information and Society, Michael Buckland
Information and the Modern Corporation, James W. Cortada
Intellectual Property Strategy, John Palfrey
The Internet of Things, Samuel Greengard
Machine Learning: The New AI, Ethem Alpaydin
Machine Translation, Thierry Poibeau
Memes in Digital Culture, Limor Shifman
Metadata, Jeffrey Pomerantz
The Mind–Body Problem, Jonathan Westphal
MOOCs, Jonathan Haber
Neuroplasticity, Moheb Costandi
Open Access, Peter Suber
Paradox, Margaret Cuonzo
Post-Truth, Lee McIntyre
Robots, John Jordan
Self-Tracking, Gina Neff and Dawn Nafus
Sustainability, Kent E. Portney
Synesthesia, Richard E. Cytowic
The Technological Singularity, Murray Shanahan
Understanding Beliefs, Nils J. Nilsson
Waves, Frederic Raichlen
Data Science
John D. Kelleher and Brendan Tierney
The MIT Press
Cambridge, Massachusetts
London, England
© 2018 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any
electronic or mechanical means (including photocopying, recording, or
information storage and retrieval) without permission in writing from the
publisher.
This book was set in Chaparral Pro by Toppan Best-set Premedia Limited.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data
Names: Kelleher, John D., 1974- author. | Tierney, Brendan, 1970- author.
Title: Data science / John D. Kelleher and Brendan Tierney.
Description: Cambridge, MA : The MIT Press, [2018] | Series: The MIT Press
essential knowledge series | Includes bibliographical references and index.
Identifiers: LCCN 2017043665 | ISBN 9780262535434 (pbk. : alk. paper)
eISBN 9780262347013
Subjects: LCSH: Big data. | Machine learning. | Data mining. | Quantitative
research.
Classification: LCC QA76.9.B45 K45 2018 | DDC 005.7--dc23 LC record
available at https://lccn.loc.gov/2017043665
ePub Version 1.0
Table of Contents
Series page
Title page
Copyright page
Series Foreword
Preface
Acknowledgments
1 What Is Data Science?
2 What Are Data, and What Is a Data Set?
3 A Data Science Ecosystem
4 Machine Learning 101
5 Standard Data Science Tasks
6 Privacy and Ethics
7 Future Trends and Principles of Success
Glossary
Further Readings
References
Index
About Author
List of Tables
Table 1 A Data Set of Classic Books
Table 2 Diabetes Study Data Set
Table 3 A Data Set of Emails: Spam or Not Spam?
List of Illustrations
Figure 1 A skills-set desideratum for a data scientist.
Figure 2 The DIKW pyramid (adapted from Kitchin 2014a).
Figure 3 Data science pyramid (adapted from Han, Kamber, and
Pei 2011).
Figure 4 The CRISP-DM life cycle (based on figure 2 in Chapman,
Clinton, Kerber, et al. 1999).
Figure 5 The CRISP-DM stages and tasks (based on figure 3 in
Chapman, Clinton, Kerber, et al. 1999).
Figure 6 A typical small-data and big-data architecture for data
science (inspired by a figure from the Hortonworks newsletter,
April 23, 2013,
https://hortonworks.com/blog/hadoop-and-the-data-warehouse-when-to-use-which).
Figure 7 The traditional process for building predictive models and
scoring data.
Figure 8 Databases, data warehousing, and Hadoop working
together (inspired by a figure in the Gluent data platform white
paper, 2017,
https://gluent.com/wp-content/uploads/2017/09/Gluent-Overview.pdf).
Figure 9 Scatterplots of shoe size and height, weight and exercise,
and shoe size and exercise.
Figure 10 Scatterplots of the likelihood of diabetes with respect to
height, weight, and BMI.
Figure 11 (a) The best-fit regression line for the model “Diabetes =
−7.38431 + 0.55593 BMI.” (b) The dashed vertical lines illustrate
the residual for each instance.
Figure 12 Mapping the logistic and tanh functions as applied to the
input x.
Figure 13 A simple neural network.
Figure 14 A neural network that predicts a person’s fitness level.
Figure 15 A deep neural network.
Figure 16 A decision tree for determining whether an email is spam
or not.
Figure 17 Creating the root node in the tree.
Figure 18 Adding the second node to the tree.