Head First Data Analysis (English edition).pdf

486 pages in total
Author of Head First Data Analysis
Table of Contents (Summary)
Table of Contents (the real thing)
how to use this book: Intro
Who is this book for?
Who should probably back away from this book?
We know what you’re thinking
We know what your brain is thinking
Here’s what WE did:
Here’s what YOU can do to bend your brain into submission
Read Me
The technical review team
Acknowledgments
Safari® Books Online
1 introduction to data analysis: Break it down
Acme Cosmetics needs your help
The CEO wants data analysis to help increase sales
Data analysis is careful thinking about evidence
Define the problem
Your client will help you define your problem
Acme’s CEO has some feedback for you
Break the problem and data into smaller pieces
Now take another look at what you know
Evaluate the pieces
Analysis begins when you insert yourself
Make a recommendation
Your report is ready
The CEO likes your work
An article just came across the wire
You let the CEO’s beliefs take you down the wrong path
Your assumptions and beliefs about the world are your mental model
Your statistical model depends on your mental model
Mental models should always include what you don’t know
The CEO tells you what he doesn’t know
Acme just sent you a huge list of raw data
Time to drill further into the data
General American Wholesalers confirms your impression
Here’s what you did
Your analysis led your client to a brilliant decision
2 experiments: Test your theories
It’s a coffee recession!
The Starbuzz board meeting is in three months
The Starbuzz Survey
Always use the method of comparison
Comparisons are key for observational data
Could value perception be causing the revenue decline?
A typical customer’s thinking
Observational studies are full of confounders
How location might be confounding your results
Manage confounders by breaking the data into chunks
It’s worse than we thought!
You need an experiment to say which strategy will work best
The Starbuzz CEO is in a big hurry
Starbuzz drops its prices
One month later…
Control groups give you a baseline
Not getting fired 101
Let’s experiment again for real!
One month later…
Confounders also plague experiments
Avoid confounders by selecting groups carefully
Randomization selects similar groups
Your experiment is ready to go
The results are in
Starbuzz has an empirically tested sales strategy
3 optimization: Take it to the max
You’re now in the bath toy game
Constraints limit the variables you control
Decision variables are things you can control
You have an optimization problem
Find your objective with the objective function
Your objective function
Show product mixes with your other constraints
Plot multiple constraints on the same chart
Your good options are all in the feasible region
Your new constraint changed the feasible region
Your spreadsheet does optimization
Solver crunched your optimization problem in a snap
Profits fell through the floor
Your model only describes what you put into it
Calibrate your assumptions to your analytical objectives
Watch out for negatively linked variables
Your new plan is working like a charm
Your assumptions are based on an ever-changing reality
4 data visualization: Pictures make you smarter
New Army needs to optimize their website
The results are in, but the information designer is out
The last information designer submitted these three infographics
What data is behind the visualizations?
Show the data!
Here’s some unsolicited advice from the last designer
Too much data is never your problem
Making the data pretty isn’t your problem either
Data visualization is all about making the right comparisons
Your visualization is already more useful than the rejected ones
Use scatterplots to explore causes
The best visualizations are highly multivariate
Show more variables by looking at charts together
The visualization is great, but the web guru’s not satisfied yet
Good visual designs help you think about causes
The experiment designers weigh in
The experiment designers have some hypotheses of their own
The client is pleased with your work
Orders are coming in from everywhere!
5 hypothesis testing: Say it ain’t so
Gimme some skin…
When do we start making new phone skins?
PodPhone doesn’t want you to predict their next move
Here’s everything we know
ElectroSkinny’s analysis does fit the data
ElectroSkinny obtained this confidential strategy memo
Variables can be negatively or positively linked
Causes in the real world are networked, not linear
Hypothesize PodPhone’s options
You have what you need to run a hypothesis test
Falsification is the heart of hypothesis testing
Diagnosticity helps you find the hypothesis with the least disconfirmation
You can’t rule out all the hypotheses, but you can say which is strongest
You just got a picture message…
It’s a launch!
6 bayesian statistics: Get past first base
The doctor has disturbing news
Let’s take the accuracy analysis one claim at a time
How common is lizard flu really?
You’ve been counting false positives
All these terms describe conditional probabilities
You need to count false positives, true positives, false negatives, and true negatives
1 percent of people have lizard flu
Your chances of having lizard flu are still pretty low
Do complex probabilistic thinking with simple whole numbers
Bayes’ rule manages your base rates when you get new data
You can use Bayes’ rule over and over
Your second test result is negative
The new test has different accuracy statistics
New information can change your base rate
What a relief!
7 subjective probabilities: Numerical belief
Backwater Investments needs your help
Their analysts are at each other’s throats
Subjective probabilities describe expert beliefs
Subjective probabilities might show no real disagreement after all
The analysts responded with their subjective probabilities
The CEO doesn’t see what you’re up to
The CEO loves your work
The standard deviation measures how far points are from the average
You were totally blindsided by this news
Bayes’ rule is great for revising subjective probabilities
The CEO knows exactly what to do with this new information
Russian stock owners rejoice!
8 heuristics: Analyze like a human
LitterGitters submitted their report to the city council
The LitterGitters have really cleaned up this town
The LitterGitters have been measuring their campaign’s effectiveness
The mandate is to reduce the tonnage of litter
Tonnage is unfeasible to measure
Give people a hard question, and they’ll answer an easier one instead
Littering in Dataville is a complex system
You can’t build and implement a unified litter-measuring model
Heuristics are a middle ground between going with your gut and optimization
Use a fast and frugal tree
Is there a simpler way to assess LitterGitters’ success?
Stereotypes are heuristics
Your analysis is ready to present
Looks like your analysis impressed the city council members
9 histograms: The shape of numbers
Your annual review is coming up
Going for more cash could play out in a bunch of different ways
Here’s some data on raises
Histograms show frequencies of groups of numbers
Gaps between bars in a histogram mean gaps among the data points
Install and run R
Load data into R
R creates beautiful histograms
Make histograms from subsets of your data
Negotiation pays
What will negotiation mean for you?
10 regression: Prediction
What are you going to do with all this money?
An analysis that tells people what to ask for could be huge
Behold… the Raise Reckoner!
Inside the algorithm will be a method to predict raises
Scatterplots compare two variables
A line could tell your clients where to aim
Predict values in each strip with the graph of averages
The regression line predicts what raises people will receive
The line is useful if your data shows a linear correlation
You need an equation to make your predictions precise
Tell R to create a regression object
The regression equation goes hand in hand with your scatterplot
The regression equation is the Raise Reckoner algorithm
Your raise predictor didn’t work out as planned…
11 error: Err Well
Your clients are pretty ticked off
What did your raise prediction algorithm do?
The segments of customers
The guy who asked for 25% went outside the model
How to handle the client who wants a prediction outside the data range
The guy who got fired because of extrapolation has cooled off
You’ve only solved part of the problem
What does the data for the screwy outcomes look like?
Chance errors are deviations from what your model predicts
Error is good for you and your client
Specify error quantitatively
Quantify your residual distribution with Root Mean Squared error
Your model in R already knows the R.M.S. error
R’s summary of your linear model shows your R.M.S. error
Segmentation is all about managing error
Good regressions balance explanation and prediction
Your segmented models manage error better than the original model
Your clients are returning in droves
12 relational databases: Can you relate?
The Dataville Dispatch wants to analyze sales
Here’s the data they keep to track their operations
You need to know how the data tables relate to each other
A database is a collection of data with well‑specified relations to each other
Create a spreadsheet that goes across that path
Your summary ties article count and sales together
Looks like your scatterplot is going over really well
Copying and pasting all that data was a pain
Relational databases manage relations for you
Dataville Dispatch built an RDBMS with your relationship diagram
Dataville Dispatch extracted your data using the SQL language
Comparison possibilities are endless if your data is in an RDBMS
You’re on the cover
13 cleaning data: Impose order
Just got a client list from a defunct competitor
The dirty secret of data analysis
Head First Head Hunters wants the list for their sales team
Cleaning messy data is all about preparation
Once you’re organized, you can fix the data itself
Use the # sign as a delimiter
Excel split your data into columns using the delimiter
Use SUBSTITUTE to replace the caret character
You cleaned up all the first names
The last name pattern is too complex for SUBSTITUTE
Handle complex patterns with nested text formulas
R can use regular expressions to crunch complex data patterns
The sub command fixed your last names
Now you can ship the data to your client
Maybe you’re not quite done yet…
Sort your data to show duplicate values together
The data is probably from a relational database
Remove duplicate names
You created nice, clean, unique records
Head First Head Hunters is recruiting like gangbusters!
Leaving town...
appendix i: leftovers: The Top Ten Things (we didn't cover)
#1: Everything else in statistics
#2: Excel skills
#3: Edward Tufte and his principles of visualization
#4: PivotTables
#5: The R community
#6: Nonlinear and multiple regression
#7: Null-alternative hypothesis testing
#8: Randomness
#9: Google Docs
#10: Your expertise
appendix ii: install r: Start R up!
Get started with R
appendix iii: install excel analysis tools: The ToolPak
Install the data analysis tools in Excel
Index
Numbers
Symbols
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Download at Boykma.Com
Advance Praise for Head First Data Analysis

“It’s about time a straightforward and comprehensive guide to analyzing data was written that makes learning the concepts simple and fun. It will change the way you think and approach problems using proven techniques and free tools. Concepts are good in theory and even better in practicality.”
— Anthony Rose, President, Support Analytics

“Head First Data Analysis does a fantastic job of giving readers systematic methods to analyze real-world problems. From coffee, to rubber duckies, to asking for a raise, Head First Data Analysis shows the reader how to find and unlock the power of data in everyday life. Using everything from graphs and visual aides to computer programs like Excel and R, Head First Data Analysis gives readers at all levels accessible ways to understand how systematic data analysis can improve decision making both large and small.”
— Eric Heilman, Statistics teacher, Georgetown Preparatory School

“Buried under mountains of data? Let Michael Milton be your guide as you fill your toolbox with the analytical skills that give you an edge. In Head First Data Analysis, you’ll learn how to turn raw numbers into real knowledge. Put away your Ouija board and tarot cards; all you need to make good decisions is some software and a copy of this book.”
— Bill Mietelski, Software engineer
Praise for other Head First books

“Kathy and Bert’s Head First Java transforms the printed page into the closest thing to a GUI you’ve ever seen. In a wry, hip manner, the authors make learning Java an engaging ‘what’re they gonna do next?’ experience.”
—Warren Keuffel, Software Development Magazine

“Beyond the engaging style that drags you forward from know-nothing into exalted Java warrior status, Head First Java covers a huge amount of practical matters that other texts leave as the dreaded “exercise for the reader...” It’s clever, wry, hip and practical—there aren’t a lot of textbooks that can make that claim and live up to it while also teaching you about object serialization and network launch protocols.”
—Dr. Dan Russell, Director of User Sciences and Experience Research, IBM Almaden Research Center (and teacher of Artificial Intelligence at Stanford University)

“It’s fast, irreverent, fun, and engaging. Be careful—you might actually learn something!”
—Ken Arnold, former Senior Engineer at Sun Microsystems; Coauthor (with James Gosling, creator of Java), The Java Programming Language

“I feel like a thousand pounds of books have just been lifted off of my head.”
—Ward Cunningham, inventor of the Wiki and founder of the Hillside Group

“Just the right tone for the geeked-out, casual-cool guru coder in all of us. The right reference for practical development strategies—gets my brain going without having to slog through a bunch of tired stale professor-speak.”
—Travis Kalanick, Founder of Scour and Red Swoosh, Member of the MIT TR100

“There are books you buy, books you keep, books you keep on your desk, and thanks to O’Reilly and the Head First crew, there is the ultimate category, Head First books. They’re the ones that are dog-eared, mangled, and carried everywhere. Head First SQL is at the top of my stack. Heck, even the PDF I have for review is tattered and torn.”
— Bill Sawyer, ATG Curriculum Manager, Oracle

“This book’s admirable clarity, humor and substantial doses of clever make it the sort of book that helps even non-programmers think well about problem-solving.”
— Cory Doctorow, co-editor of BoingBoing; Author, Down and Out in the Magic Kingdom and Someone Comes to Town, Someone Leaves Town
“I received the book yesterday and started to read it...and I couldn’t stop. This is definitely très ‘cool.’ It is fun, but they cover a lot of ground and they are right to the point. I’m really impressed.”
— Erich Gamma, IBM Distinguished Engineer, and co-author of Design Patterns

“One of the funniest and smartest books on software design I’ve ever read.”
— Aaron LaBerge, VP Technology, ESPN.com

“What used to be a long trial and error learning process has now been reduced neatly into an engaging paperback.”
— Mike Davidson, CEO, Newsvine, Inc.

“Elegant design is at the core of every chapter here, each concept conveyed with equal doses of pragmatism and wit.”
— Ken Goldstein, Executive Vice President, Disney Online

“I ♥ Head First HTML with CSS & XHTML—it teaches you everything you need to learn in a ‘fun coated’ format.”
— Sally Applin, UI Designer and Artist

“Usually when reading through a book or article on design patterns, I’d have to occasionally stick myself in the eye with something just to make sure I was paying attention. Not with this book. Odd as it may sound, this book makes learning about design patterns fun.
“While other books on design patterns are saying ‘Buehler… Buehler… Buehler…’ this book is on the float belting out ‘Shake it up, baby!’”
— Eric Wuehler

“I literally love this book. In fact, I kissed this book in front of my wife.”
— Satish Kumar
Other related books from O’Reilly
Analyzing Business Data with Excel
Excel Scientific and Engineering Cookbook
Access Data Analysis Cookbook

Other books in O’Reilly’s Head First series
Head First Java
Head First Object-Oriented Analysis and Design (OOA&D)
Head First HTML with CSS and XHTML
Head First Design Patterns
Head First Servlets and JSP
Head First EJB
Head First PMP
Head First SQL
Head First Software Development
Head First JavaScript
Head First Ajax
Head First Physics
Head First Statistics
Head First Rails
Head First PHP & MySQL
Head First Algebra
Head First Web Design
Head First Networking
Head First Data Analysis

Wouldn’t it be dreamy if there was a book on data analysis that wasn’t just a glorified printout of Microsoft Excel help files? But it’s probably just a fantasy...

Michael Milton

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo
Head First Data Analysis
by Michael Milton

Copyright © 2009 Michael Milton. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly Media books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Series Creators: Kathy Sierra, Bert Bates
Series Editor: Brett D. McLaughlin
Editor: Brian Sawyer
Cover Designers: Karen Montgomery
Production Editor: Scott DeLugan
Proofreader: Nancy Reinhardt
Indexer: Jay Harward
Page Viewers: Mandarin, the fam, and Preston

Printing History:
July 2009: First Edition.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Head First series designations, Head First Data Analysis and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and the authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

No data was harmed in the making of this book.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-0-596-15393-9
[M]