An-Introduction-to-Statistics-with-Python-With-Applications-in-t....pdf

Preface
For Whom This Book Is
Additional Material
Acknowledgments
Contents
Acronyms
Part I Python and Statistics
1 Why Statistics?
2 Python
2.1 Getting Started
2.1.1 Conventions
2.1.2 Distributions and Packages
a) Python Packages for Statistics
b) PyPI: The Python Package Index
2.1.3 Installation of Python
a) Under Windows
b) Under Linux
c) Under Mac OS X
2.1.4 Installation of R and rpy2
a) Under Windows
b) Under Linux
2.1.5 Personalizing IPython/Jupyter
a) In Windows
b) In Linux
c) In Mac OS X
2.1.6 Python Resources
2.1.7 First Python Programs
a) Hello World
b) SquareMe
2.2 Python Data Structures
2.2.1 Python Datatypes
2.2.2 Indexing and Slicing
2.2.3 Vectors and Arrays
2.3 IPython/Jupyter: An Interactive Programming Environment
2.3.1 First Session with the Qt Console
2.3.2 Notebook and rpy2
a) The Notebook
b) rpy2
2.3.3 IPython Tips
2.4 Developing Python Programs
2.4.1 Converting Interactive Commands into a Python Program
2.4.2 Functions, Modules, and Packages
a) Functions
b) Modules
2.4.3 Python Tips
2.4.4 Code Versioning
2.5 Pandas: Data Structures for Statistics
2.5.1 Data Handling
a) Common Procedures
b) Notes on Data Selection
2.5.2 Grouping
2.6 Statsmodels: Tools for Statistical Modeling
2.7 Seaborn: Data Visualization
2.8 General Routines
2.9 Exercises
3 Data Input
3.1 Input from Text Files
3.1.1 Visual Inspection
3.1.2 Reading ASCII-Data into Python
a) Simple Text-Files
b) More Complex Text-Files
c) Regular Expressions
3.2 Input from MS Excel
3.3 Input from Other Formats
3.3.1 Matlab
4 Display of Statistical Data
4.1 Datatypes
4.1.1 Categorical
a) Boolean
b) Nominal
c) Ordinal
4.1.2 Numerical
a) Numerical Continuous
b) Numerical Discrete
4.2 Plotting in Python
4.2.1 Functional and Object-Oriented Approaches to Plotting
4.2.2 Interactive Plots
4.3 Displaying Statistical Datasets
4.3.1 Univariate Data
a) Scatter Plots
b) Histograms
c) Kernel-Density-Estimation (KDE) Plots
d) Cumulative Frequencies
e) Error-Bars
f) Box Plots
g) Grouped Bar Charts
h) Pie Charts
i) Programs: Data Display
4.3.2 Bivariate and Multivariate Plots
a) Bivariate Scatter Plots
b) 3D Plots
4.4 Exercises
Part II Distributions and Hypothesis Tests
5 Background
5.1 Populations and Samples
5.2 Probability Distributions
5.2.1 Discrete Distributions
5.2.2 Continuous Distributions
5.2.3 Expected Value and Variance
a) Expected Value
b) Variance
5.3 Degrees of Freedom
5.4 Study Design
5.4.1 Terminology
5.4.2 Overview
5.4.3 Types of Studies
a) Observational or Experimental
b) Prospective or Retrospective
c) Longitudinal or Cross-Sectional
d) Case–Control and Cohort studies
e) Randomized Controlled Trial
f) Crossover Studies
5.4.4 Design of Experiments
a) Sample Selection
b) Sample Size
c) Bias
d) Randomization
e) Blinding
f) Factorial Design
5.4.5 Personal Advice
1) Preliminary Investigations and Murphy's Law
2) Calibration Runs
3) Documentation
4) Data Storage
5.4.6 Clinical Investigation Plan
6 Distributions of One Variable
6.1 Characterizing a Distribution
6.1.1 Distribution Center
a) Mean
b) Median
c) Mode
d) Geometric Mean
6.1.2 Quantifying Variability
a) Range
b) Percentiles
c) Standard Deviation and Variance
d) Standard Error
e) Confidence Intervals
6.1.3 Parameters Describing the Form of a Distribution
a) Location
b) Scale
c) Shape Parameters
6.1.4 Important Presentations of Probability Densities
6.2 Discrete Distributions
6.2.1 Bernoulli Distribution
6.2.2 Binomial Distribution
b) Example: Binomial Test
6.2.3 Poisson Distribution
6.3 Normal Distribution
6.3.1 Examples of Normal Distributions
6.3.2 Central Limit Theorem
6.3.3 Distributions and Hypothesis Tests
6.4 Continuous Distributions Derived from the Normal Distribution
6.4.1 t-Distribution
6.4.2 Chi-Square Distribution
a) Definition
b) Application Example
6.4.3 F-Distribution
a) Definition
b) Application Example
6.5 Other Continuous Distributions
6.5.1 Lognormal Distribution
6.5.2 Weibull Distribution
6.5.3 Exponential Distribution
6.5.4 Uniform Distribution
6.6 Exercises
7 Hypothesis Tests
7.1 Typical Analysis Procedure
7.1.1 Data Screening and Outliers
7.1.2 Normality Check
a) Probability-Plots
b) Tests for Normality
7.1.3 Transformation
7.2 Hypothesis Concept, Errors, p-Value, and Sample Size
7.2.1 An Example
7.2.2 Generalization and Applications
a) Generalization
b) Additional Examples
7.2.3 The Interpretation of the p-Value
7.2.4 Types of Error
a) Type I Errors
b) Type II Errors and Test Power
c) Pitfalls in the Interpretation of p-Values
7.2.5 Sample Size
a) Examples
b) Python Solution
c) Programs: Sample Size
7.3 Sensitivity and Specificity
7.3.1 Related Calculations
7.4 Receiver-Operating-Characteristic (ROC) Curve
8 Tests of Means of Numerical Data
8.1 Distribution of a Sample Mean
8.1.1 One Sample t-Test for a Mean Value
a) Example
8.1.2 Wilcoxon Signed Rank Sum Test
8.2 Comparison of Two Groups
8.2.1 Paired t-Test
8.2.2 t-Test between Independent Groups
8.2.3 Nonparametric Comparison of Two Groups: Mann–Whitney Test
8.2.4 Statistical Hypothesis Tests vs Statistical Modeling
a) Classical t-Test
b) Statistical Modeling
8.3 Comparison of Multiple Groups
8.3.1 Analysis of Variance (ANOVA)
a) Principle
b) Example: One-Way ANOVA
8.3.2 Multiple Comparisons
a) Tukey's Test
b) Bonferroni Correction
c) Holm Correction
8.3.3 Kruskal–Wallis Test
8.3.4 Two-Way ANOVA
8.3.5 Three-Way ANOVA
8.4 Summary: Selecting the Right Test for Comparing Groups
8.4.1 Typical Tests
8.4.2 Hypothetical Examples
8.5 Exercises
9 Tests on Categorical Data
9.1 One Proportion
9.1.1 Confidence Intervals
9.1.2 Explanation
9.1.3 Example
9.2 Frequency Tables
9.2.1 One-Way Chi-Square Test
9.2.2 Chi-Square Contingency Test
a) Assumptions
b) Degrees of Freedom
c) Example 1
d) Example 2
e) Comments
9.2.3 Fisher's Exact Test
a) Example: "A Lady Tasting Tea"
9.2.4 McNemar's Test
a) Example
9.2.5 Cochran's Q Test
a) Example
9.3 Exercises
10 Analysis of Survival Times
10.1 Survival Distributions
10.2 Survival Probabilities
10.2.1 Censorship
10.2.2 Kaplan–Meier Survival Curve
10.3 Comparing Survival Curves in Two Groups
Part III Statistical Modeling
11 Linear Regression Models
11.1 Linear Correlation
11.1.1 Correlation Coefficient
11.1.2 Rank Correlation
11.2 General Linear Regression Model
11.2.1 Example 1: Simple Linear Regression
11.2.2 Example 2: Quadratic Fit
11.2.3 Coefficient of Determination
a) Relation to Unexplained Variance
b) "Good" Fits
11.3 Patsy: The Formula Language
11.3.1 Design Matrix
a) Definition
b) Examples
11.4 Linear Regression Analysis with Python
11.4.1 Example 1: Line Fit with Confidence Intervals
11.4.2 Example 2: Noisy Quadratic Polynomial
11.5 Model Results of Linear Regression Models
11.5.1 Example: Tobacco and Alcohol in the UK
11.5.2 Definitions for Regression with Intercept
11.5.3 The R2 Value
11.5.4 The Adjusted R2 Value
a) The F-Test
b) Log-Likelihood Function
c) Information Content of Statistical Models: AIC and BIC
11.5.5 Model Coefficients and Their Interpretation
a) Coefficients
b) Standard Error
c) t-Statistic
d) Confidence Interval
11.5.6 Analysis of Residuals
a) Skewness and Kurtosis
b) Omnibus Test
c) Durbin–Watson
d) Jarque–Bera Test
e) Condition Number
11.5.7 Outliers
11.5.8 Regression Using Sklearn
11.5.9 Conclusion
11.6 Assumptions of Linear Regression Models
11.7 Interpreting the Results of Linear Regression Models
11.8 Bootstrapping
11.9 Exercises
12 Multivariate Data Analysis
12.1 Visualizing Multivariate Correlations
12.1.1 Scatterplot Matrix
12.1.2 Correlation Matrix
12.2 Multilinear Regression
13 Tests on Discrete Data
13.1 Comparing Groups of Ranked Data
13.2 Logistic Regression
13.2.1 Example: The Challenger Disaster
13.3 Generalized Linear Models
13.3.1 Exponential Family of Distributions
13.3.2 Linear Predictor and Link Function
13.4 Ordinal Logistic Regression
13.4.1 Problem Definition
13.4.2 Optimization
13.4.3 Code
13.4.4 Performance
14 Bayesian Statistics
14.1 Bayesian vs. Frequentist Interpretation
14.1.1 Bayesian Example
14.2 The Bayesian Approach in the Age of Computers
14.3 Example: Analysis of the Challenger Disaster with a Markov-Chain–Monte-Carlo Simulation
14.4 Summing Up
Solutions
Problems of Chap.2
Problems of Chap.4
Problems of Chap.6
Problems of Chap.8
Problems of Chap.9
Problems of Chap.11
Glossary
References
Index
Statistics and Computing

Thomas Haslwanter
An Introduction to Statistics with Python
With Applications in the Life Sciences

Series editor: W.K. Härdle
More information about this series at http://www.springer.com/series/3022

Thomas Haslwanter
School of Applied Health and Social Sciences
University of Applied Sciences Upper Austria
Linz, Austria

Series Editor:
W.K. Härdle
C.A.S.E. Centre for Applied Statistics and Economics
School of Business and Economics
Humboldt-Universität zu Berlin
Unter den Linden 6
10099 Berlin
Germany

The Python code samples accompanying the book are available at www.quantlet.de. All Python programs and data sets can be found on GitHub: https://github.com/thomas-haslwanter/statsintro_python.git. Links to all material are available at http://www.springer.com/de/book/9783319283159. The Python solution codes in the appendix are published under the Creative Commons Attribution-ShareAlike 4.0 International License.

ISSN 1431-8784
ISSN 2197-1706 (electronic)
Statistics and Computing
ISBN 978-3-319-28315-9
ISBN 978-3-319-28316-6 (eBook)
DOI 10.1007/978-3-319-28316-6
Library of Congress Control Number: 2016939946

© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

To my two, three, and four-legged household companions: my wife Jean, Felix, and his sister Jessica.

Preface

In the data analysis for my own research work, I was often slowed down by two things: (1) I did not know enough statistics, and (2) the books available would provide a theoretical background, but no real practical help. The book you are holding in your hands (or on your tablet or laptop) is intended to be the book that will solve this very problem. It is designed to provide enough basic understanding so that you know what you are doing, and it should equip you with the tools you need.

I believe that the Python solutions provided in this book for the most basic statistical problems address at least 90% of the problems that most physicists, biologists, and medical doctors encounter in their work. So if you are the typical graduate student working on a degree, or a medical researcher analyzing the latest experiments, chances are that you will find the tools you require here, explanation and source code included.

This is the reason I have focused on statistical basics and hypothesis tests in this book and refer only briefly to other statistical approaches. I am well aware that most of the tests presented in this book can also be carried out using statistical modeling. But in many cases, this is not the methodology used in many life science journals. Advanced statistical analysis goes beyond the scope of this book and, to be frank, exceeds my own knowledge of statistics.

My motivation for providing the solutions in Python is based on two considerations. One is that I would like them to be available to everyone. While commercial solutions like Matlab, SPSS, Minitab, etc., offer powerful tools, most can only use them legally in an academic setting. In contrast, Python is completely free ("as in free beer" is often heard in the Python community).

The second reason is that Python is the most beautiful coding language that I have yet encountered; and around 2010 Python and its documentation matured to the point where one can use it without being a serious coder. Together, this book, Python, and the tools that the Python ecosystem offers today provide a beautiful, free package that covers all the statistics that most researchers will need in their lifetime.