
An Introduction to Statistics with Python.pdf

I Python and Statistics
Why Statistics?
Python
Getting Started
Python Links
Free Python Books
Installation and Updates
PyPI - the Python Package Index
github
Conventions
IPython
First Session with the IPython Qt Console
Personalizing IPython
IPython Notebook
IPython Tips
Developing Python Programs
First Python Script
Functions, Modules, and Packages
Python Tips
Python Data Structures
Indexing and Slicing
Plots in Python
Interactive plots
Graphical Output in Python: Functional and Object-oriented Approach
Pandas
Data Handling
Grouping
Statsmodels
Seaborn
General Routines
Exercises
Data Input
Data from Textfiles
Visual Inspection
Reading ASCII-data into Python
Regular Expressions
Input from MS Excel
Input from Matlab and other formats
Matlab
II Basic Principles and Hypothesis Tests
Basic Principles
Datatypes
Categorical
Numerical
Data Display
Univariate Data
Bivariate and Multivariate Plots
Populations and Samples
Degrees of Freedom
Study Design
Terminology
Overview
Personal Tips
Types of Studies
Design of Experiments
Structure of Experiments
Clinical Investigation Plan
Exercises
Distributions of one Variable
Discrete Distributions
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Programs: Discrete Distribution Functions
Continuous Distribution
Characterizing a Distribution
Distribution Center
Quantifying Variability
Parameters Describing the Form of a Distribution
Normal Distribution
Central Limit Theorem
Application Example
Other Continuous Distributions
Exercises
Statistical Data Analysis
Typical Analysis Procedure
Data Screening
Normality Check
Transformation
Hypothesis tests
An Example
Generalization
The interpretation of the p-value, and the "p-value fallacy"
Types of Error
Sample Size
Sensitivity and Specificity
ROC Curve
Common Statistical Tests for Comparing Groups
Examples
Exercises
Tests of Means of Continuous Data
Distribution of a Sample Mean
One sample t-test for a mean value
Wilcoxon signed rank sum test
Comparison of Two Groups
Paired T-Test
T-Test Between Independent Groups
Non-parametric Comparison of Two Groups: Mann-Whitney Test
Statistical Hypothesis Tests vs Statistical Modeling
Comparison of More Groups
Analysis of Variance - ANOVA
Multiple Comparisons
Kruskal-Wallis test
Exercises
Tests on Categorical Data
One Proportion
Explanation
Example
Frequency Tables
One-way Chi-square Test
Chi-square Contingency Test
Fisher's Exact Test
McNemar's Test
Cochran's Q Test
Analysis Programs
Exercises
Relations Between Several Variables
Two-way ANOVA
Three-way ANOVA
Analysis of Survival Times
Survival Distributions
Survival Probabilities
Censorship
Kaplan-Meier survival curve
Comparing Survival Curves in Two Groups
III Statistical Modelling
Advanced Statistical Analysis
statsmodels
PyMC: Bayesian Statistics and Markov Chain Monte Carlo Modeling
scikit-learn
Generalized Linear Models
Statistical Models
Linear Correlation
Correlation Coefficient
Rank Correlation
General linear regression model
Example 1: Simple Linear Regression
Example 2: Quadratic Fit
Coefficient of determination
Model Language
Design Matrix
Example: Program Effectiveness
Linear Regression Analysis with Python
Example 1: Line Fit with Confidence Intervals
Example 2: Noisy Quadratic Polynomial
Example 3: Tobacco and Alcohol in the UK
Model Results
Definitions for Regression with Intercept
The R2 Value
R̄2 - The adjusted R2 Value
Model Coefficients and Their Interpretation
Analysis of Residuals
Comparison
Example 3: Regression Using Sklearn
Conclusion
Assumptions
Interpretation
Bootstrapping
Exercises
Multivariate Data Analysis
Visualization of Multivariate Correlations
Scatterplot Matrix
Correlation Matrix
Multilinear Regression
Tests on Discrete Data
Comparing Groups of Ranked Data
Logistic Regression
Example: The Challenger Disaster
Generalized Linear Models
Exponential Family of Distributions
Linear Predictor and Link Function
Ordinal Logistic Regression
Optimization
Code
Bayesian Statistics
Bayesian vs. Frequentist Interpretation
Bayesian Example
The Bayesian Approach in the Age of Computers
Example: The Challenger Disaster
Appendix
Python Programs
Glossary
Acronyms
Index Topics
Python Programs
An Introduction to Statistics with Python
Author: Thomas Haslwanter
Email: thomas.haslwanter@fh-linz.at
Version: 8.0, June 30, 2015
© Copyright Thomas Haslwanter, May 2015.
Contents

I Python and Statistics
1 Why Statistics?
2 Python
2.1 Getting Started
2.1.1 Python Links
2.1.2 Free Python Books
2.1.3 Installation and Updates
2.1.4 PyPI - the Python Package Index
2.1.5 github
2.2 Conventions
2.3 IPython
2.3.1 First Session with the IPython Qt Console
2.3.2 Personalizing IPython
2.3.3 IPython Notebook
2.3.4 IPython Tips
2.4 Developing Python Programs
2.4.1 First Python Script
2.4.2 Functions, Modules, and Packages
2.4.3 Python Tips
2.5 Python Data Structures
2.5.1 Indexing and Slicing
2.6 Plots in Python
2.6.1 Interactive plots
2.6.2 Graphical Output in Python: Functional and Object-oriented Approach
2.7 Pandas
2.7.1 Data Handling
2.7.2 Grouping
2.8 Statsmodels
2.9 Seaborn
2.9.1 General Routines
2.10 Exercises
3 Data Input
3.1 Data from Textfiles
3.1.1 Visual Inspection
3.1.2 Reading ASCII-data into Python
3.1.3 Regular Expressions
3.2 Input from MS Excel
3.3 Input from Matlab and other formats
3.3.1 Matlab

II Basic Principles and Hypothesis Tests
4 Basic Principles
4.1 Datatypes
4.1.1 Categorical
4.1.2 Numerical
4.2 Data Display
4.2.1 Univariate Data
4.2.2 Bivariate and Multivariate Plots
4.3 Populations and Samples
4.4 Degrees of Freedom
4.5 Study Design
4.5.1 Terminology
4.5.2 Overview
4.5.3 Personal Tips
4.5.4 Types of Studies
4.5.5 Design of Experiments
4.5.6 Structure of Experiments
4.5.7 Clinical Investigation Plan
4.6 Exercises
5 Distributions of one Variable
5.1 Discrete Distributions
5.1.1 Bernoulli Distribution
5.1.2 Binomial Distribution
5.1.3 Poisson Distribution
5.1.4 Programs: Discrete Distribution Functions
5.2 Continuous Distribution
5.3 Characterizing a Distribution
5.3.1 Distribution Center
5.3.2 Quantifying Variability
5.3.3 Parameters Describing the Form of a Distribution
5.4 Normal Distribution
5.4.1 Central Limit Theorem
5.4.2 Application Example
5.5 Other Continuous Distributions
5.6 Exercises
6 Statistical Data Analysis
6.1 Typical Analysis Procedure
6.1.1 Data Screening
6.1.2 Normality Check
6.1.3 Transformation
6.2 Hypothesis tests
6.2.1 An Example
6.2.2 Generalization
6.2.3 The interpretation of the p-value, and the "p-value fallacy"
6.2.4 Types of Error
6.2.5 Sample Size
6.3 Sensitivity and Specificity
6.4 ROC Curve
6.5 Common Statistical Tests for Comparing Groups
6.5.1 Examples
6.6 Exercises
7 Tests of Means of Continuous Data
7.1 Distribution of a Sample Mean
7.1.1 One sample t-test for a mean value
7.1.2 Wilcoxon signed rank sum test
7.2 Comparison of Two Groups
7.2.1 Paired T-Test
7.2.2 T-Test Between Independent Groups
7.2.3 Non-parametric Comparison of Two Groups: Mann-Whitney Test
7.2.4 Statistical Hypothesis Tests vs Statistical Modeling
7.3 Comparison of More Groups
7.3.1 Analysis of Variance - ANOVA
7.3.2 Multiple Comparisons
7.3.3 Kruskal-Wallis test
7.4 Exercises
8 Tests on Categorical Data
8.1 One Proportion
8.1.1 Explanation
8.1.2 Example
8.2 Frequency Tables
8.2.1 One-way Chi-square Test
8.2.2 Chi-square Contingency Test
8.2.3 Fisher's Exact Test
8.2.4 McNemar's Test
8.2.5 Cochran's Q Test
8.3 Analysis Programs
8.4 Exercises
9 Relations Between Several Variables
9.1 Two-way ANOVA
9.2 Three-way ANOVA
10 Analysis of Survival Times
10.1 Survival Distributions
10.2 Survival Probabilities
10.2.1 Censorship
10.2.2 Kaplan-Meier survival curve
10.3 Comparing Survival Curves in Two Groups

III Statistical Modelling
11 Advanced Statistical Analysis
11.1 statsmodels
11.2 PyMC: Bayesian Statistics and Markov Chain Monte Carlo Modeling
11.3 scikit-learn
11.4 Generalized Linear Models
12 Statistical Models
12.1 Linear Correlation
12.1.1 Correlation Coefficient
12.1.2 Rank Correlation
12.2 General linear regression model
12.2.1 Example 1: Simple Linear Regression
12.2.2 Example 2: Quadratic Fit
12.2.3 Coefficient of determination
12.3 Model Language
12.3.1 Design Matrix
12.3.2 Example: Program Effectiveness
12.4 Linear Regression Analysis with Python
12.4.1 Example 1: Line Fit with Confidence Intervals
12.4.2 Example 2: Noisy Quadratic Polynomial
12.4.3 Example 3: Tobacco and Alcohol in the UK
12.5 Model Results
12.5.1 Definitions for Regression with Intercept
12.5.2 The R2 Value
12.5.3 R̄2 - The adjusted R2 Value
12.5.4 Model Coefficients and Their Interpretation
12.5.5 Analysis of Residuals
12.5.6 Comparison
12.5.7 Example 3: Regression Using Sklearn
12.5.8 Conclusion
12.6 Assumptions
12.7 Interpretation
12.8 Bootstrapping
12.9 Exercises
13 Multivariate Data Analysis
13.1 Visualization of Multivariate Correlations
13.1.1 Scatterplot Matrix
13.1.2 Correlation Matrix
13.2 Multilinear Regression
14 Tests on Discrete Data
14.1 Comparing Groups of Ranked Data
14.2 Logistic Regression
14.2.1 Example: The Challenger Disaster
14.3 Generalized Linear Models
14.3.1 Exponential Family of Distributions
14.3.2 Linear Predictor and Link Function
14.4 Ordinal Logistic Regression
14.4.1 Optimization
14.4.2 Code
15 Bayesian Statistics
15.1 Bayesian vs. Frequentist Interpretation
15.1.1 Bayesian Example
15.2 The Bayesian Approach in the Age of Computers
15.3 Example: The Challenger Disaster

A Appendix
A.1 Python Programs
Glossary
Acronyms
Index Topics
Python Programs