[数据之美].Beautiful.Data.文字版.pdf

Contents
Preface
How This Book Is Organized
Conventions Used in This Book
Using Code Examples
How to Contact Us
Safari® Books Online
Seeing Your Life in Data
Personal Environmental Impact Report (PEIR)
your.flowingdata (YFD)
Personal Data Collection
Working Data Collection into Routine
Asynchronous data collection
Data Storage
Data Processing
Data Visualization
PEIR
Mapping location-based data
Experimenting with visual cues
Mapping multivariate location traces
Choosing a color scheme
Making trips interactive
Displaying distributions
Sharing personal data
YFD
The Point
How to Participate
The Beautiful People: Keeping Users in Mind When Designing Data Collection Methods
Introduction: User Empathy Is the New Black
What Is UX?
The Benefits of Applying UX Best Practices to Data Collection
The Project: Surveying Customers About a New Luxury Product
Specific Challenges to Data Collection
Challenges of Accessibility
Challenges of Perception
Building trust
Length of survey
Accurate data collection
Motivation
Designing Our Solution
Design Philosophy
Designing the Form Layout
Web form typography and accessibility
Giving them some space
Accommodating different browsers and testing for compatibility
Interaction design considerations: Dynamic form length
Designing trust
Designing for accurate data collection
Motivation
Reporting the live data results
Results and Reflection
Embedded Image Data Processing on Mars
Abstract
Introduction
Some Background
To Pack or Not to Pack
The Three Tasks
Slotting the Images
Passing the Image: Communication Among the Three Tasks
Getting the Picture: Image Download and Processing
Image Compression
Downlink, or, It’s All Downhill from Here
Conclusion
Cloud Storage Design in a PNUTShell
Introduction
Updating Data
The Challenge
Our Approach
More on mastership
Supporting ordered data
Trading off consistency for availability
Complex Queries
The Challenge
Our Approach
Comparison with Other Systems
Google’s BigTable
Amazon’s Dynamo
Microsoft Azure SDS
Other Related Systems
Other Systems at Yahoo!
Conclusion
Acknowledgments
References
Information Platforms and the Rise of the Data Scientist
Libraries and Brains
Facebook Becomes Self-Aware
A Business Intelligence System
The Death and Rebirth of a Data Warehouse
Beyond the Data Warehouse
The Cheetah and the Elephant
The Unreasonable Effectiveness of Data
New Tools and Applied Research
MAD Skills and Cosmos
Information Platforms As Dataspaces
The Data Scientist
Conclusion
The Geographic Beauty of a Photographic Archive
Beauty in Data: Geograph
Visualization, Beauty, and Treemaps
What Is Beauty in Visual Data Exploration?
Making Treemaps Beautiful: A Geographic Perspective
A Geographic Perspective on Geograph Term Use
Representing the Term Hierarchy
Representing Absolute Location with Color
Representing Relative Location with Spatial Treemaps
Representing Location Displacement
Beauty in Discovery
Reflection and Conclusion
Acknowledgments
References
Data Finds Data
Introduction
The Benefits of Just-in-Time Discovery
Corruption at the Roulette Wheel
Enterprise Discoverability
Federated Search Ain’t All That
Directories: Priceless
Relevance: What Matters and to Whom?
Components and Special Considerations
The Existence of, and Availability of, Observations
The Ability to Extract and Classify Features from the Observations
The Ability to Efficiently Discover Related Historical Context
The Ability to Make Assertions (Same or Related) About New Observations
The Ability to Recognize When New Observations Reverse Earlier Assertions
The Ability to Accumulate and Persist This Asserted Context
The Ability to Recognize the Formation of Relevance/Insight
The Ability to Notify the Appropriate Entity of Such Insight
Privacy Considerations
Conclusion
Portable Data in Real Time
Introduction
The State of the Art
Transport
XMPP
BitTorrent
Proprietary/P2P
Formats
APIs
Polling
Rate limiting
Getting it right
Zero miles per gallon efficiency
Events
HTML 5 events
WAN Scale Events
Social Data Normalization
Business Value of Data
Public versus private
Conclusion: Mediation via Gnip
Surfacing the Deep Web
What Is the Deep Web?
Alternatives to Offering Deep-Web Access
Basics of HTML Form Processing
Queries and Query Templates
Selecting Input Combinations
Quality of query templates
Informativeness test
Searching for informative query templates
Predicting Input Values
Generic text inputs
Typed text inputs
Conclusion and Future Work
References
Building Radiohead’s House of Cards
How It All Started
The Data Capture Equipment
Velodyne Lidar
Geometric Informatics
The Advantages of Two Data Capture Systems
The Data
Capturing the Data, aka “The Shoot”
The Outdoor Lidar Shoot
The Indoor Lidar Shoot
The Indoor GeoVideo Shoot
Processing the Data
Post-Processing the Data
Launching the Video
Conclusion
Visualizing Urban Data
Introduction
Background
Cracking the Nut
Making It Public
Revisiting
Conclusion
The Design of Sense.us
Visualization and Social Data Analysis
Data
Visualization
Design Considerations
Foster personal relevance
Provide effective visual encodings
Make each display distinct
Support intuitive exploration
Be engaging and playful
Visualization Designs
Job Voyager
Birthplace Voyager
U.S. census state map and scatterplot
Population pyramid
Implementation details
Collaboration
View Sharing
Doubly Linked Discussion
Pointing via Graphical Annotation
Collecting and Linking Views
Awareness and Social Navigation
Unobtrusive Collaboration
Voyagers and Voyeurs
Hunting for Patterns
Making Sense of It All
Crowd Surfing
Conclusion
References
What Data Doesn’t Do
When Doesn’t Data Drive?
1. More Data Isn’t Always Better
2. More Data Isn’t Always Easy
3. Data Alone Doesn’t Explain
4. Data Isn’t Good for a Single Answer
5. Data Doesn’t Predict
6. Probability Isn’t Intuitive
7. Probabilities Aren’t Intuitive
8. The Real World Doesn’t Create Random Variables
9. Data Doesn’t Stand Alone
10. Data Isn’t Free from the Eye of the Beholder
Conclusion
References
Natural Language Corpus Data
Word Segmentation
Secret Codes
Spelling Correction
Other Tasks
Language Identification
Spam Detection and Other Classification Tasks
Author Identification (Stylometry)
Document Unshredding and DNA Sequencing
Machine Translation
Discussion and Conclusion
Acknowledgments
Life in Data: The Story of DNA
DNA As a Data Store
DNA Makes RNA Makes Proteins
Hacking Your DNA Data Store with Drugs
Cancer
Replication
Cracking the Code
DNA As Digital Storage
Evolution As an Algorithm
DNA As a Data Source
A Quantum Leap
“My God, It’s Full of Bases...”
Fighting the Data Deluge
The Sanger Institute’s Sequencing Platform
Project management
Flexible Data Capture
Instrument and Data Management
The Future of DNA
How to Become a Genetic Hacker
Next Next-Gen
The Era of Big Data
Acknowledgments
Beautifying Data in the Real World
The Problem with Real Data
Providing the Raw Data Back to the Notebook
Validating Crowdsourced Data
Representing the Data Online
Unique Identifiers for Chemical Entities
Open Data and Accessible Services Enable a Wide Range of Visualization and Analysis Options
Integrating Data with a Central Aggregation Service
Enabling Data Integration via Unique Identifiers and Self-Describing Data Formats
Closing the Loop: Visualizations to Suggest New Experiments
Building a Data Web from Open Data and Free Services
Acknowledgments
References
Superficial Data Analysis: Exploring Millions of Social Stereotypes
Introduction
Preprocessing the Data
Exploring the Data
Age, Attractiveness, and Gender
Looking at Tags
Which Words Are Gendered?
Clustering
Conclusion
Acknowledgments
References
Bay Area Blues: The Effect of the Housing Crisis
Introduction
How Did We Get the Data?
Geocoding
Data Checking
Analysis
The Influence of Inflation
The Rich Get Richer and the Poor Get Poorer
Geographic Differences
Census Information
Exploring San Francisco
Conclusion
References
Beautiful Political Data
Example 1: Redistricting and Partisan Bias
Example 2: Time Series of Estimates
Example 3: Age and Voting
Example 4: Public Opinion and Senate Voting on Supreme Court Nominees
Example 5: Localized Partisanship in Pennsylvania
Conclusion
References
Connecting Data
What Public Data Is There, Really?
The Possibilities of Connected Data
Within Companies
Impediments to Connecting Data
The Representation Problem
Shared Nouns and Shared Verbs
The Same Thing with Different Names
Different Things with the Same Name
Possible Solutions
Matching on Multiple Fields
Collective Reconciliation
Conclusion
Contributors
Index
Beautiful Data
Edited by Toby Segaran and Jeff Hammerbacher
Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo
Beautiful Data
Edited by Toby Segaran and Jeff Hammerbacher
Copyright © 2009 O’Reilly Media, Inc. All rights reserved.
Printed in Canada.
Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Editor: Julie Steele
Production Editor: Rachel Monaghan
Copyeditor: Genevieve d’Entremont
Proofreader: Rachel Monaghan
Indexer: Angela Howard
Cover Designer: Mark Paglietti
Interior Designer: Marcia Friedman
Illustrator: Robert Romano
Printing History: July 2009: First Edition.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Beautiful Data, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-0-596-15711-1 [F]
All royalties from this book will be donated to Creative Commons and the Sunlight Foundation.
CONTENTS
PREFACE
1 SEEING YOUR LIFE IN DATA by Nathan Yau
Personal Environmental Impact Report (PEIR)
your.flowingdata (YFD)
Personal Data Collection
Data Storage
Data Processing
Data Visualization
The Point
How to Participate
2 THE BEAUTIFUL PEOPLE: KEEPING USERS IN MIND WHEN DESIGNING DATA COLLECTION METHODS by Jonathan Follett and Matthew Holm
Introduction: User Empathy Is the New Black
The Project: Surveying Customers About a New Luxury Product
Specific Challenges to Data Collection
Designing Our Solution
Results and Reflection
3 EMBEDDED IMAGE DATA PROCESSING ON MARS by J. M. Hughes
Abstract
Introduction
Some Background
To Pack or Not to Pack
The Three Tasks
Slotting the Images
Passing the Image: Communication Among the Three Tasks
Getting the Picture: Image Download and Processing
Image Compression
Downlink, or, It’s All Downhill from Here
Conclusion
4 CLOUD STORAGE DESIGN IN A PNUTSHELL by Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava
Introduction
Updating Data
Complex Queries
Comparison with Other Systems
Conclusion
5 INFORMATION PLATFORMS AND THE RISE OF THE DATA SCIENTIST by Jeff Hammerbacher
Libraries and Brains
Facebook Becomes Self-Aware
A Business Intelligence System
The Death and Rebirth of a Data Warehouse
Beyond the Data Warehouse
The Cheetah and the Elephant
The Unreasonable Effectiveness of Data
New Tools and Applied Research
MAD Skills and Cosmos
Information Platforms As Dataspaces
The Data Scientist
Conclusion
6 THE GEOGRAPHIC BEAUTY OF A PHOTOGRAPHIC ARCHIVE by Jason Dykes and Jo Wood
Beauty in Data: Geograph
Visualization, Beauty, and Treemaps
A Geographic Perspective on Geograph Term Use
Beauty in Discovery
Reflection and Conclusion
7 DATA FINDS DATA by Jeff Jonas and Lisa Sokol
Introduction
The Benefits of Just-in-Time Discovery
Corruption at the Roulette Wheel
Enterprise Discoverability
Federated Search Ain’t All That
Directories: Priceless
Relevance: What Matters and to Whom?
Components and Special Considerations
Privacy Considerations
Conclusion
8 PORTABLE DATA IN REAL TIME by Jud Valeski
Introduction
The State of the Art
Social Data Normalization
Conclusion: Mediation via Gnip
9 SURFACING THE DEEP WEB by Alon Halevy and Jayant Madhavan
What Is the Deep Web?
Alternatives to Offering Deep-Web Access
Conclusion and Future Work
10 BUILDING RADIOHEAD’S HOUSE OF CARDS by Aaron Koblin with Valdean Klump
How It All Started
The Data Capture Equipment
The Advantages of Two Data Capture Systems
The Data
Capturing the Data, aka “The Shoot”
Processing the Data
Post-Processing the Data
Launching the Video
Conclusion
11 VISUALIZING URBAN DATA by Michal Migurski
Introduction
Background
Cracking the Nut
Making It Public
Revisiting
Conclusion
12 THE DESIGN OF SENSE.US by Jeffrey Heer
Visualization and Social Data Analysis
Data
Visualization
Collaboration
Voyagers and Voyeurs
Conclusion