Data Science for Business.pdf

Copyright
Table of Contents
Preface
Our Conceptual Approach to Data Science
To the Instructor
Other Skills and Concepts
Sections and Notation
Using Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Chapter 1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities
Example: Hurricane Frances
Example: Predicting Customer Churn
Data Science, Engineering, and Data-Driven Decision Making
Data Processing and “Big Data”
From Big Data 1.0 to Big Data 2.0
Data and Data Science Capability as a Strategic Asset
Data-Analytic Thinking
This Book
Data Mining and Data Science, Revisited
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
Summary
Chapter 2. Business Problems and Data Science Solutions
From Business Problems to Data Mining Tasks
Supervised Versus Unsupervised Methods
Data Mining and Its Results
The Data Mining Process
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Implications for Managing the Data Science Team
Other Analytics Techniques and Technologies
Statistics
Database Querying
Data Warehousing
Regression Analysis
Machine Learning and Data Mining
Answering Business Questions with These Techniques
Summary
Chapter 3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Models, Induction, and Prediction
Supervised Segmentation
Selecting Informative Attributes
Example: Attribute Selection with Information Gain
Supervised Segmentation with Tree-Structured Models
Visualizing Segmentations
Trees as Sets of Rules
Probability Estimation
Example: Addressing the Churn Problem with Tree Induction
Summary
Chapter 4. Fitting a Model to Data
Classification via Mathematical Functions
Linear Discriminant Functions
Optimizing an Objective Function
An Example of Mining a Linear Discriminant from Data
Linear Discriminant Functions for Scoring and Ranking Instances
Support Vector Machines, Briefly
Regression via Mathematical Functions
Class Probability Estimation and Logistic “Regression”
* Logistic Regression: Some Technical Details
Example: Logistic Regression versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
Summary
Chapter 5. Overfitting and Its Avoidance
Generalization
Overfitting
Overfitting Examined
Holdout Data and Fitting Graphs
Overfitting in Tree Induction
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
* Example: Why Is Overfitting Bad?
From Holdout Evaluation to Cross-Validation
The Churn Dataset Revisited
Learning Curves
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
A General Method for Avoiding Overfitting
* Avoiding Overfitting for Parameter Optimization
Summary
Chapter 6. Similarity, Neighbors, and Clusters
Similarity and Distance
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
Nearest Neighbors for Predictive Modeling
How Many Neighbors and How Much Influence?
Geometric Interpretation, Overfitting, and Complexity Control
Issues with Nearest-Neighbor Methods
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
* Other Distance Functions
* Combining Functions: Calculating Scores from Neighbors
Clustering
Example: Whiskey Analytics Revisited
Hierarchical Clustering
Nearest Neighbors Revisited: Clustering Around Centroids
Example: Clustering Business News Stories
Understanding the Results of Clustering
* Using Supervised Learning to Generate Cluster Descriptions
Stepping Back: Solving a Business Problem Versus Data Exploration
Summary
Chapter 7. Decision Analytic Thinking I: What Is a Good Model?
Evaluating Classifiers
Plain Accuracy and Its Problems
The Confusion Matrix
Problems with Unbalanced Classes
Problems with Unequal Costs and Benefits
Generalizing Beyond Classification
A Key Analytical Framework: Expected Value
Using Expected Value to Frame Classifier Use
Using Expected Value to Frame Classifier Evaluation
Evaluation, Baseline Performance, and Implications for Investments in Data
Summary
Chapter 8. Visualizing Model Performance
Ranking Instead of Classifying
Profit Curves
ROC Graphs and Curves
The Area Under the ROC Curve (AUC)
Cumulative Response and Lift Curves
Example: Performance Analytics for Churn Modeling
Summary
Chapter 9. Evidence and Probabilities
Example: Targeting Online Consumers With Advertisements
Combining Evidence Probabilistically
Joint Probability and Independence
Bayes’ Rule
Applying Bayes’ Rule to Data Science
Conditional Independence and Naive Bayes
Advantages and Disadvantages of Naive Bayes
A Model of Evidence “Lift”
Example: Evidence Lifts from Facebook “Likes”
Evidence in Action: Targeting Consumers with Ads
Summary
Chapter 10. Representing and Mining Text
Why Text Is Important
Why Text Is Difficult
Representation
Bag of Words
Term Frequency
Measuring Sparseness: Inverse Document Frequency
Combining Them: TFIDF
Example: Jazz Musicians
* The Relationship of IDF to Entropy
Beyond Bag of Words
N-gram Sequences
Named Entity Extraction
Topic Models
Example: Mining News Stories to Predict Stock Price Movement
The Task
The Data
Data Preprocessing
Results
Summary
Chapter 11. Decision Analytic Thinking II: Toward Analytical Engineering
Targeting the Best Prospects for a Charity Mailing
The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
A Brief Digression on Selection Bias
Our Churn Example Revisited with Even More Sophistication
The Expected Value Framework: Structuring a More Complicated Business Problem
Assessing the Influence of the Incentive
From an Expected Value Decomposition to a Data Science Solution
Summary
Chapter 12. Other Data Science Tasks and Techniques
Co-occurrences and Associations: Finding Items That Go Together
Measuring Surprise: Lift and Leverage
Example: Beer and Lottery Tickets
Associations Among Facebook Likes
Profiling: Finding Typical Behavior
Link Prediction and Social Recommendation
Data Reduction, Latent Information, and Movie Recommendation
Bias, Variance, and Ensemble Methods
Data-Driven Causal Explanation and a Viral Marketing Example
Summary
Chapter 13. Data Science and Business Strategy
Thinking Data-Analytically, Redux
Achieving Competitive Advantage with Data Science
Sustaining Competitive Advantage with Data Science
Formidable Historical Advantage
Unique Intellectual Property
Unique Intangible Collateral Assets
Superior Data Scientists
Superior Data Science Management
Attracting and Nurturing Data Scientists and Their Teams
Examine Data Science Case Studies
Be Ready to Accept Creative Ideas from Any Source
Be Ready to Evaluate Proposals for Data Science Projects
Example Data Mining Proposal
Flaws in the Big Red Proposal
A Firm’s Data Science Maturity
Chapter 14. Conclusion
The Fundamental Concepts of Data Science
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
Changing the Way We Think about Solutions to Business Problems
What Data Can’t Do: Humans in the Loop, Revisited
Privacy, Ethics, and Mining Data About Individuals
Is There More to Data Science?
Final Example: From Crowd-Sourcing to Cloud-Sourcing
Final Words
Appendix A. Proposal Review Guide
Business and Data Understanding
Data Preparation
Modeling
Evaluation and Deployment
Appendix B. Another Sample Proposal
Scenario and Proposal
Flaws in the GGC Proposal
Glossary
Bibliography
Index
About the Authors
www.it-ebooks.info
Praise

“A must-read resource for anyone who is serious about embracing the opportunity of big data.”
Craig Vaughan, Global Vice President at SAP

“This timely book says out loud what has finally become apparent: in the modern world, Data is Business, and you can no longer think business without thinking data. Read this book and you will understand the Science behind thinking data.”
Ron Bekkerman, Chief Data Officer at Carmel Ventures

“A great book for business managers who lead or interact with data scientists, who wish to better understand the principles and algorithms available without the technical details of single-disciplinary books.”
Ronny Kohavi, Partner Architect at Microsoft Online Services Division

“Provost and Fawcett have distilled their mastery of both the art and science of real-world data analysis into an unrivalled introduction to the field.”
Geoff Webb, Editor-in-Chief of Data Mining and Knowledge Discovery Journal

“I would love it if everyone I had to work with had read this book.”
Claudia Perlich, Chief Scientist of M6D (Media6Degrees) and Advertising Research Foundation Innovation Award Grand Winner (2013)
“A foundational piece in the fast developing world of Data Science. A must read for anyone interested in the Big Data revolution.”
Justin Gapper, Business Unit Analytics Manager at Teledyne Scientific and Imaging

“The authors, both renowned experts in data science before it had a name, have taken a complex topic and made it accessible to all levels, but mostly helpful to the budding data scientist. As far as I know, this is the first book of its kind, with a focus on data science concepts as applied to practical business problems. It is liberally sprinkled with compelling real-world examples outlining familiar, accessible problems in the business world: customer churn, targeted marketing, even whiskey analytics! The book is unique in that it does not give a cookbook of algorithms, rather it helps the reader understand the underlying concepts behind data science, and most importantly how to approach and be successful at problem solving. Whether you are looking for a good comprehensive overview of data science or are a budding data scientist in need of the basics, this is a must-read.”
Chris Volinsky, Director of Statistics Research at AT&T Labs and Winning Team Member for the $1 Million Netflix Challenge

“This book goes beyond data analytics 101. It’s the essential guide for those of us (all of us?) whose businesses are built on the ubiquity of data opportunities and the new mandate for data-driven decision-making.”
Tom Phillips, CEO of Media6Degrees and Former Head of Google Search and Analytics

“Intelligent use of data has become a force powering business to new levels of competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and managers alike must understand the options, design choices, and tradeoffs before them. With motivating examples, clear exposition, and a breadth of details covering not only the “hows” but the “whys”, Data Science for Business is the perfect primer for those wishing to become involved in the development and application of data-driven systems.”
Josh Attenberg, Data Science Lead at Etsy
“Data is the foundation of new waves of productivity growth, innovation, and richer customer insight. Only recently viewed broadly as a source of competitive advantage, dealing well with data is rapidly becoming table stakes to stay in the game. The authors’ deep applied experience makes this a must read: a window into your competitor’s strategy.”
Alan Murray, Serial Entrepreneur; Partner at Coriolis Ventures

“One of the best data mining books, which helped me think through various ideas on liquidity analysis in the FX business. The examples are excellent and help you take a deep dive into the subject! This one is going to be on my shelf for lifetime!”
Nidhi Kathuria, Vice President of FX at Royal Bank of Scotland
Data Science for Business
Foster Provost and Tom Fawcett
Data Science for Business
by Foster Provost and Tom Fawcett

Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest

July 2013: First Edition

Revision History for the First Edition:
2013-07-25: First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-36132-7
[LSI]