Bad.Data.Handbook.pdf

Copyright
Table of Contents
About the Authors
Preface
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Chapter 1. Setting the Pace: What Is Bad Data?
Chapter 2. Is It Just Me, or Does This Data Smell Funny?
Understand the Data Structure
Field Validation
Value Validation
Physical Interpretation of Simple Statistics
Visualization
Keyword PPC Example
Search Referral Example
Recommendation Analysis
Time Series Data
Conclusion
Chapter 3. Data Intended for Human Consumption, Not Machine Consumption
The Data
The Problem: Data Formatted for Human Consumption
The Arrangement of Data
Data Spread Across Multiple Files
The Solution: Writing Code
Reading Data from an Awkward Format
Reading Data Spread Across Several Files
Postscript
Other Formats
Summary
Chapter 4. Bad Data Lurking in Plain Text
Which Plain Text Encoding?
Guessing Text Encoding
Normalizing Text
Problem: Application-Specific Characters Leaking into Plain Text
Text Processing with Python
Exercises
Chapter 5. (Re)Organizing the Web’s Data
Can You Get That?
General Workflow Example
robots.txt
Identifying the Data Organization Pattern
Store Offline Version for Parsing
Scrape the Information Off the Page
The Real Difficulties
Download the Raw Content If Possible
Forms, Dialog Boxes, and New Windows
Flash
The Dark Side
Conclusion
Chapter 6. Detecting Liars and the Confused in Contradictory Online Reviews
Weotta
Getting Reviews
Sentiment Classification
Polarized Language
Corpus Creation
Training a Classifier
Validating the Classifier
Designing with Data
Lessons Learned
Summary
Resources
Chapter 7. Will the Bad Data Please Stand Up?
Example 1: Defect Reduction in Manufacturing
Example 2: Who’s Calling?
Example 3: When “Typical” Does Not Mean “Average”
Lessons Learned
Will This Be on the Test?
Chapter 8. Blood, Sweat, and Urine
A Very Nerdy Body Swap Comedy
How Chemists Make Up Numbers
All Your Database Are Belong to Us
Check, Please
Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository
Rehab for Chemists (and Other Spreadsheet Abusers)
tl;dr
Chapter 9. When Data and Reality Don’t Match
Whose Ticker Is It Anyway?
Splits, Dividends, and Rescaling
Bad Reality
Conclusion
Chapter 10. Subtle Sources of Bias and Error
Imputation Bias: General Issues
Reporting Errors: General Issues
Other Sources of Bias
Topcoding/Bottomcoding
Seam Bias
Proxy Reporting
Sample Selection
Conclusions
References
Chapter 11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
But First, Let’s Reflect on Graduate School …
Moving On to the Professional World
Moving into Government Work
Government Data Is Very Real
Service Call Data as an Applied Example
Moving Forward
Lessons Learned and Looking Ahead
Chapter 12. When Databases Attack: A Guide for When to Stick to Files
History
Building My Toolset
The Roadblock: My Datastore
Consider Files as Your Datastore
Files Are Simple!
Files Work with Everything
Files Can Contain Any Data Type
Data Corruption Is Local
They Have Great Tooling
There’s No Install Tax
File Concepts
Encoding
Text Files
Binary Data
Memory-Mapped Files
File Formats
Delimiters
A Web Framework Backed by Files
Motivation
Implementation
Reflections
Chapter 13. Crouching Table, Hidden Network
A Relational Cost Allocations Model
The Delicate Sound of a Combinatorial Explosion…
The Hidden Network Emerges
Storing the Graph
Navigating the Graph with Gremlin
Finding Value in Network Properties
Think in Terms of Multiple Data Models and Use the Right Tool for the Job
Acknowledgments
Chapter 14. Myths of Cloud Computing
Introduction to the Cloud
What Is “The Cloud”?
The Cloud and Big Data
Introducing Fred
At First Everything Is Great
They Put 100% of Their Infrastructure in the Cloud
As Things Grow, They Scale Easily at First
Then Things Start Having Trouble
They Need to Improve Performance
Higher IO Becomes Critical
A Major Regional Outage Causes Massive Downtime
Higher IO Comes with a Cost
Data Sizes Increase
Geo Redundancy Becomes a Priority
Horizontal Scale Isn’t as Easy as They Hoped
Costs Increase Dramatically
Fred’s Follies
Myth 1: Cloud Is a Great Solution for All Infrastructure Components
How This Myth Relates to Fred’s Story
Myth 2: Cloud Will Save Us Money
How This Myth Relates to Fred’s Story
Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
How This Myth Relates to Fred’s Story
Myth 4: Cloud Computing Makes Horizontal Scaling Easy
How This Myth Relates to Fred’s Story
Conclusion and Recommendations
Chapter 15. The Dark Side of Data Science
Avoid These Pitfalls
Know Nothing About Thy Data
Be Inconsistent in Cleaning and Organizing the Data
Assume Data Is Correct and Complete
Spillover of Time-Bound Data
Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks
Using a Production Environment for Ad-Hoc Analysis
The Ideal Data Science Environment
Thou Shalt Analyze for Analysis’ Sake Only
Thou Shalt Compartmentalize Learnings
Thou Shalt Expect Omnipotence from Data Scientists
Where Do Data Scientists Live Within the Organization?
Final Thoughts
Chapter 16. How to Feed and Care for Your Machine-Learning Experts
Define the Problem
Fake It Before You Make It
Create a Training Set
Pick the Features
Encode the Data
Split Into Training, Test, and Solution Sets
Describe the Problem
Respond to Questions
Integrate the Solutions
Conclusion
Chapter 17. Data Traceability
Why?
Personal Experience
Snapshotting
Saving the Source
Weighting Sources
Backing Out Data
Separating Phases (and Keeping them Pure)
Identifying the Root Cause
Finding Areas for Improvement
Immutability: Borrowing an Idea from Functional Programming
An Example
Crawlers
Change
Clustering
Popularity
Conclusion
Chapter 18. Social Media: Erasable Ink?
Social Media: Whose Data Is This Anyway?
Control
Commercial Resyndication
Expectations Around Communication and Expression
Technical Implications of New End User Expectations
What Does the Industry Do?
Validation API
Update Notification API
What Should End Users Do?
How Do We Work Together?
Chapter 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
Framework Introduction: The Four Cs of Data Quality Analysis
Complete
Coherent
Correct
aCcountable
Conclusion
Index
About the Author
Bad Data Handbook Q. Ethan McCallum
Bad Data Handbook
by Q. Ethan McCallum

Copyright © 2013 Q. McCallum. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Melanie Yarbrough
Indexer: Angela Howard
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

November 2012: First Edition

Revision History for the First Edition:
2012-11-05 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449321888 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Bad Data Handbook, the cover image of a short-legged goose, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32188-8
[LSI]