Contents
Preface
How This Book Is Organized
Conventions Used in This Book
Using Code Examples
How to Contact Us
Safari® Books Online
Seeing Your Life in Data
Personal Environmental Impact Report (PEIR)
your.flowingdata (YFD)
Personal Data Collection
Working Data Collection into Routine
Asynchronous data collection
Data Storage
Data Processing
Data Visualization
PEIR
Mapping location-based data
Experimenting with visual cues
Mapping multivariate location traces
Choosing a color scheme
Making trips interactive
Displaying distributions
Sharing personal data
YFD
The Point
How to Participate
The Beautiful People: Keeping Users in Mind When Designing Data Collection Methods
Introduction: User Empathy Is the New Black
What Is UX?
The Benefits of Applying UX Best Practices to Data Collection
The Project: Surveying Customers About a New Luxury Product
Specific Challenges to Data Collection
Challenges of Accessibility
Challenges of Perception
Building trust
Length of survey
Accurate data collection
Motivation
Designing Our Solution
Design Philosophy
Designing the Form Layout
Web form typography and accessibility
Giving them some space
Accommodating different browsers and testing for compatibility
Interaction design considerations: Dynamic form length
Designing trust
Designing for accurate data collection
Motivation
Reporting the live data results
Results and Reflection
Embedded Image Data Processing on Mars
Abstract
Introduction
Some Background
To Pack or Not to Pack
The Three Tasks
Slotting the Images
Passing the Image: Communication Among the Three Tasks
Getting the Picture: Image Download and Processing
Image Compression
Downlink, or, It’s All Downhill from Here
Conclusion
Cloud Storage Design in a PNUTShell
Introduction
Updating Data
The Challenge
Our Approach
More on mastership
Supporting ordered data
Trading off consistency for availability
Complex Queries
The Challenge
Our Approach
Comparison with Other Systems
Google’s BigTable
Amazon’s Dynamo
Microsoft Azure SDS
Other Related Systems
Other Systems at Yahoo!
Conclusion
Acknowledgments
References
Information Platforms and the Rise of the Data Scientist
Libraries and Brains
Facebook Becomes Self-Aware
A Business Intelligence System
The Death and Rebirth of a Data Warehouse
Beyond the Data Warehouse
The Cheetah and the Elephant
The Unreasonable Effectiveness of Data
New Tools and Applied Research
MAD Skills and Cosmos
Information Platforms As Dataspaces
The Data Scientist
Conclusion
The Geographic Beauty of a Photographic Archive
Beauty in Data: Geograph
Visualization, Beauty, and Treemaps
What Is Beauty in Visual Data Exploration?
Making Treemaps Beautiful: A Geographic Perspective
A Geographic Perspective on Geograph Term Use
Representing the Term Hierarchy
Representing Absolute Location with Color
Representing Relative Location with Spatial Treemaps
Representing Location Displacement
Beauty in Discovery
Reflection and Conclusion
Acknowledgments
References
Data Finds Data
Introduction
The Benefits of Just-in-Time Discovery
Corruption at the Roulette Wheel
Enterprise Discoverability
Federated Search Ain’t All That
Directories: Priceless
Relevance: What Matters and to Whom?
Components and Special Considerations
The Existence of, and Availability of, Observations
The Ability to Extract and Classify Features from the Observations
The Ability to Efficiently Discover Related Historical Context
The Ability to Make Assertions (Same or Related) About New Observations
The Ability to Recognize When New Observations Reverse Earlier Assertions
The Ability to Accumulate and Persist This Asserted Context
The Ability to Recognize the Formation of Relevance/Insight
The Ability to Notify the Appropriate Entity of Such Insight
Privacy Considerations
Conclusion
Portable Data in Real Time
Introduction
The State of the Art
Transport
XMPP
BitTorrent
Proprietary/P2P
Formats
APIs
Polling
Rate limiting
Getting it right
Zero miles per gallon efficiency
Events
HTML 5 events
WAN Scale Events
Social Data Normalization
Business Value of Data
Public versus private
Conclusion: Mediation via Gnip
Surfacing the Deep Web
What Is the Deep Web?
Alternatives to Offering Deep-Web Access
Basics of HTML Form Processing
Queries and Query Templates
Selecting Input Combinations
Quality of query templates
Informativeness test
Searching for informative query templates
Predicting Input Values
Generic text inputs
Typed text inputs
Conclusion and Future Work
References
Building Radiohead’s House of Cards
How It All Started
The Data Capture Equipment
Velodyne Lidar
Geometric Informatics
The Advantages of Two Data Capture Systems
The Data
Capturing the Data, aka “The Shoot”
The Outdoor Lidar Shoot
The Indoor Lidar Shoot
The Indoor GeoVideo Shoot
Processing the Data
Post-Processing the Data
Launching the Video
Conclusion
Visualizing Urban Data
Introduction
Background
Cracking the Nut
Making It Public
Revisiting
Conclusion
The Design of Sense.us
Visualization and Social Data Analysis
Data
Visualization
Design Considerations
Foster personal relevance
Provide effective visual encodings
Make each display distinct
Support intuitive exploration
Be engaging and playful
Visualization Designs
Job Voyager
Birthplace Voyager
U.S. census state map and scatterplot
Population pyramid
Implementation details
Collaboration
View Sharing
Doubly Linked Discussion
Pointing via Graphical Annotation
Collecting and Linking Views
Awareness and Social Navigation
Unobtrusive Collaboration
Voyagers and Voyeurs
Hunting for Patterns
Making Sense of It All
Crowd Surfing
Conclusion
References
What Data Doesn’t Do
When Doesn’t Data Drive?
1. More Data Isn’t Always Better
2. More Data Isn’t Always Easy
3. Data Alone Doesn’t Explain
4. Data Isn’t Good for a Single Answer
5. Data Doesn’t Predict
6. Probability Isn’t Intuitive
7. Probabilities Aren’t Intuitive
8. The Real World Doesn’t Create Random Variables
9. Data Doesn’t Stand Alone
10. Data Isn’t Free from the Eye of the Beholder
Conclusion
References
Natural Language Corpus Data
Word Segmentation
Secret Codes
Spelling Correction
Other Tasks
Language Identification
Spam Detection and Other Classification Tasks
Author Identification (Stylometry)
Document Unshredding and DNA Sequencing
Machine Translation
Discussion and Conclusion
Acknowledgments
Life in Data: The Story of DNA
DNA As a Data Store
DNA Makes RNA Makes Proteins
Hacking Your DNA Data Store with Drugs
Cancer
Replication
Cracking the Code
DNA As Digital Storage
Evolution As an Algorithm
DNA As a Data Source
A Quantum Leap
“My God, It’s Full of Bases...”
Fighting the Data Deluge
The Sanger Institute’s Sequencing Platform
Project management
Flexible Data Capture
Instrument and Data Management
The Future of DNA
How to Become a Genetic Hacker
Next Next-Gen
The Era of Big Data
Acknowledgments
Beautifying Data in the Real World
The Problem with Real Data
Providing the Raw Data Back to the Notebook
Validating Crowdsourced Data
Representing the Data Online
Unique Identifiers for Chemical Entities
Open Data and Accessible Services Enable a Wide Range of Visualization and Analysis Options
Integrating Data with a Central Aggregation Service
Enabling Data Integration via Unique Identifiers and Self-Describing Data Formats
Closing the Loop: Visualizations to Suggest New Experiments
Building a Data Web from Open Data and Free Services
Acknowledgments
References
Superficial Data Analysis: Exploring Millions of Social Stereotypes
Introduction
Preprocessing the Data
Exploring the Data
Age, Attractiveness, and Gender
Looking at Tags
Which Words Are Gendered?
Clustering
Conclusion
Acknowledgments
References
Bay Area Blues: The Effect of the Housing Crisis
Introduction
How Did We Get the Data?
Geocoding
Data Checking
Analysis
The Influence of Inflation
The Rich Get Richer and the Poor Get Poorer
Geographic Differences
Census Information
Exploring San Francisco
Conclusion
References
Beautiful Political Data
Example 1: Redistricting and Partisan Bias
Example 2: Time Series of Estimates
Example 3: Age and Voting
Example 4: Public Opinion and Senate Voting on Supreme Court Nominees
Example 5: Localized Partisanship in Pennsylvania
Conclusion
References
Connecting Data
What Public Data Is There, Really?
The Possibilities of Connected Data
Within Companies
Impediments to Connecting Data
The Representation Problem
Shared Nouns and Shared Verbs
The Same Thing with Different Names
Different Things with the Same Name
Possible Solutions
Matching on Multiple Fields
Collective Reconciliation
Conclusion
Contributors
Index