Bad.Data.Handbook.pdf

Published: 2022-06-14 | Posted by: admin | Category: Manuals | File size: 10.49M | Format: pdf

[Preview images: pages 1-8 of 264]

Table of Contents

About the Authors

Preface

Conventions Used in This Book

Using Code Examples

Safari® Books Online

How to Contact Us

Acknowledgments

Chapter 1. Setting the Pace: What Is Bad Data?

Chapter 2. Is It Just Me, or Does This Data Smell Funny?

Understand the Data Structure

Field Validation

Value Validation

Physical Interpretation of Simple Statistics

Visualization

Keyword PPC Example

Search Referral Example

Recommendation Analysis

Time Series Data

Conclusion

Chapter 3. Data Intended for Human Consumption, Not Machine Consumption

The Data

The Problem: Data Formatted for Human Consumption

The Arrangement of Data

Data Spread Across Multiple Files

The Solution: Writing Code

Reading Data from an Awkward Format

Reading Data Spread Across Several Files

Postscript

Other Formats

Summary

Chapter 4. Bad Data Lurking in Plain Text

Which Plain Text Encoding?

Guessing Text Encoding

Normalizing Text

Problem: Application-Specific Characters Leaking into Plain Text

Text Processing with Python

Exercises

Chapter 5. (Re)Organizing the Web’s Data

Can You Get That?

General Workflow Example

robots.txt

Identifying the Data Organization Pattern

Store Offline Version for Parsing

Scrape the Information Off the Page

The Real Difficulties

Download the Raw Content If Possible

Forms, Dialog Boxes, and New Windows

Flash

The Dark Side

Conclusion

Chapter 6. Detecting Liars and the Confused in Contradictory Online Reviews

Weotta

Getting Reviews

Sentiment Classification

Polarized Language

Corpus Creation

Training a Classifier

Validating the Classifier

Designing with Data

Lessons Learned

Summary

Resources

Chapter 7. Will the Bad Data Please Stand Up?

Example 1: Defect Reduction in Manufacturing

Example 2: Who’s Calling?

Example 3: When “Typical” Does Not Mean “Average”

Lessons Learned

Will This Be on the Test?

Chapter 8. Blood, Sweat, and Urine

A Very Nerdy Body Swap Comedy

How Chemists Make Up Numbers

All Your Database Are Belong to Us

Check, Please

Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository

Rehab for Chemists (and Other Spreadsheet Abusers)

tl;dr

Chapter 9. When Data and Reality Don’t Match

Whose Ticker Is It Anyway?

Splits, Dividends, and Rescaling

Bad Reality

Conclusion

Chapter 10. Subtle Sources of Bias and Error

Imputation Bias: General Issues

Reporting Errors: General Issues

Other Sources of Bias

Topcoding/Bottomcoding

Seam Bias

Proxy Reporting

Sample Selection

Conclusions

References

Chapter 11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?

But First, Let’s Reflect on Graduate School …

Moving On to the Professional World

Moving into Government Work

Government Data Is Very Real

Service Call Data as an Applied Example

Moving Forward

Lessons Learned and Looking Ahead

Chapter 12. When Databases Attack: A Guide for When to Stick to Files

History

Building My Toolset

The Roadblock: My Datastore

Consider Files as Your Datastore

Files Are Simple!

Files Work with Everything

Files Can Contain Any Data Type

Data Corruption Is Local

They Have Great Tooling

There’s No Install Tax

File Concepts

Encoding

Text Files

Binary Data

Memory-Mapped Files

File Formats

Delimiters

A Web Framework Backed by Files

Motivation

Implementation

Reflections

Chapter 13. Crouching Table, Hidden Network

A Relational Cost Allocations Model

The Delicate Sound of a Combinatorial Explosion…

The Hidden Network Emerges

Storing the Graph

Navigating the Graph with Gremlin

Finding Value in Network Properties

Think in Terms of Multiple Data Models and Use the Right Tool for the Job

Acknowledgments

Chapter 14. Myths of Cloud Computing

Introduction to the Cloud

What Is “The Cloud”?

The Cloud and Big Data

Introducing Fred

At First Everything Is Great

They Put 100% of Their Infrastructure in the Cloud

As Things Grow, They Scale Easily at First

Then Things Start Having Trouble

They Need to Improve Performance

Higher IO Becomes Critical

A Major Regional Outage Causes Massive Downtime

Higher IO Comes with a Cost

Data Sizes Increase

Geo Redundancy Becomes a Priority

Horizontal Scale Isn’t as Easy as They Hoped

Costs Increase Dramatically

Fred’s Follies

Myth 1: Cloud Is a Great Solution for All Infrastructure Components

How This Myth Relates to Fred’s Story

Myth 2: Cloud Will Save Us Money

How This Myth Relates to Fred’s Story

Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID

How This Myth Relates to Fred’s Story

Myth 4: Cloud Computing Makes Horizontal Scaling Easy

How This Myth Relates to Fred’s Story

Conclusion and Recommendations

Chapter 15. The Dark Side of Data Science

Avoid These Pitfalls

Know Nothing About Thy Data

Be Inconsistent in Cleaning and Organizing the Data

Assume Data Is Correct and Complete

Spillover of Time-Bound Data

Thou Shalt Provide Your Data Scientists with a Single Tool for All Tasks

Using a Production Environment for Ad-Hoc Analysis

The Ideal Data Science Environment

Thou Shalt Analyze for Analysis’ Sake Only

Thou Shalt Compartmentalize Learnings

Thou Shalt Expect Omnipotence from Data Scientists

Where Do Data Scientists Live Within the Organization?

Final Thoughts

Chapter 16. How to Feed and Care for Your Machine-Learning Experts

Define the Problem

Fake It Before You Make It

Create a Training Set

Pick the Features

Encode the Data

Split Into Training, Test, and Solution Sets

Describe the Problem

Respond to Questions

Integrate the Solutions

Conclusion

Chapter 17. Data Traceability

Why?

Personal Experience

Snapshotting

Saving the Source

Weighting Sources

Backing Out Data

Separating Phases (and Keeping them Pure)

Identifying the Root Cause

Finding Areas for Improvement

Immutability: Borrowing an Idea from Functional Programming

An Example

Crawlers

Change

Clustering

Popularity

Conclusion

Chapter 18. Social Media: Erasable Ink?

Social Media: Whose Data Is This Anyway?

Control

Commercial Resyndication

Expectations Around Communication and Expression

Technical Implications of New End User Expectations

What Does the Industry Do?

Validation API

Update Notification API

What Should End Users Do?

How Do We Work Together?

Chapter 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

Framework Introduction: The Four Cs of Data Quality Analysis

Complete

Coherent

Correct

aCcountable

Conclusion

Index

About the Author

Bad Data Handbook Q. Ethan McCallum

Bad Data Handbook
by Q. Ethan McCallum

Copyright © 2013 Q. McCallum. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Gillian McGarvey
Proofreader: Melanie Yarbrough
Indexer: Angela Howard
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

November 2012: First Edition

Revision History for the First Edition:
2012-11-05 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449321888 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Bad Data Handbook, the cover image of a short-legged goose, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-32188-8
[LSI]

