Python for Data Analysis(2nd Edition 2017-09).pdf

Copyright
Table of Contents
Preface
Section 1. New for the Second Edition
Section 2. Conventions Used in This Book
Section 3. Using Code Examples
Section 4. O’Reilly Safari
Section 5. How to Contact Us
Section 6. Acknowledgments
In Memoriam: John D. Hunter (1968–2012)
Acknowledgments for the Second Edition (2017)
Acknowledgments for the First Edition (2012)
Chapter 1. Preliminaries
1.1 What Is This Book About?
What Kinds of Data?
1.2 Why Python for Data Analysis?
Python as Glue
Solving the “Two-Language” Problem
Why Not Python?
1.3 Essential Python Libraries
NumPy
pandas
matplotlib
IPython and Jupyter
SciPy
scikit-learn
statsmodels
1.4 Installation and Setup
Windows
Apple (OS X, macOS)
GNU/Linux
Installing or Updating Python Packages
Python 2 and Python 3
Integrated Development Environments (IDEs) and Text Editors
1.5 Community and Conferences
1.6 Navigating This Book
Code Examples
Data for Examples
Import Conventions
Jargon
Chapter 2. Python Language Basics, IPython, and Jupyter Notebooks
2.1 The Python Interpreter
2.2 IPython Basics
Running the IPython Shell
Running the Jupyter Notebook
Tab Completion
Introspection
The %run Command
Executing Code from the Clipboard
Terminal Keyboard Shortcuts
About Magic Commands
Matplotlib Integration
2.3 Python Language Basics
Language Semantics
Scalar Types
Control Flow
Chapter 3. Built-in Data Structures, Functions, and Files
3.1 Data Structures and Sequences
Tuple
List
Built-in Sequence Functions
dict
set
List, Set, and Dict Comprehensions
3.2 Functions
Namespaces, Scope, and Local Functions
Returning Multiple Values
Functions Are Objects
Anonymous (Lambda) Functions
Currying: Partial Argument Application
Generators
Errors and Exception Handling
3.3 Files and the Operating System
Bytes and Unicode with Files
3.4 Conclusion
Chapter 4. NumPy Basics: Arrays and Vectorized Computation
4.1 The NumPy ndarray: A Multidimensional Array Object
Creating ndarrays
Data Types for ndarrays
Arithmetic with NumPy Arrays
Basic Indexing and Slicing
Boolean Indexing
Fancy Indexing
Transposing Arrays and Swapping Axes
4.2 Universal Functions: Fast Element-Wise Array Functions
4.3 Array-Oriented Programming with Arrays
Expressing Conditional Logic as Array Operations
Mathematical and Statistical Methods
Methods for Boolean Arrays
Sorting
Unique and Other Set Logic
4.4 File Input and Output with Arrays
4.5 Linear Algebra
4.6 Pseudorandom Number Generation
4.7 Example: Random Walks
Simulating Many Random Walks at Once
4.8 Conclusion
Chapter 5. Getting Started with pandas
5.1 Introduction to pandas Data Structures
Series
DataFrame
Index Objects
5.2 Essential Functionality
Reindexing
Dropping Entries from an Axis
Indexing, Selection, and Filtering
Integer Indexes
Arithmetic and Data Alignment
Function Application and Mapping
Sorting and Ranking
Axis Indexes with Duplicate Labels
5.3 Summarizing and Computing Descriptive Statistics
Correlation and Covariance
Unique Values, Value Counts, and Membership
5.4 Conclusion
Chapter 6. Data Loading, Storage, and File Formats
6.1 Reading and Writing Data in Text Format
Reading Text Files in Pieces
Writing Data to Text Format
Working with Delimited Formats
JSON Data
XML and HTML: Web Scraping
6.2 Binary Data Formats
Using HDF5 Format
Reading Microsoft Excel Files
6.3 Interacting with Web APIs
6.4 Interacting with Databases
6.5 Conclusion
Chapter 7. Data Cleaning and Preparation
7.1 Handling Missing Data
Filtering Out Missing Data
Filling In Missing Data
7.2 Data Transformation
Removing Duplicates
Transforming Data Using a Function or Mapping
Replacing Values
Renaming Axis Indexes
Discretization and Binning
Detecting and Filtering Outliers
Permutation and Random Sampling
Computing Indicator/Dummy Variables
7.3 String Manipulation
String Object Methods
Regular Expressions
Vectorized String Functions in pandas
7.4 Conclusion
Chapter 8. Data Wrangling: Join, Combine, and Reshape
8.1 Hierarchical Indexing
Reordering and Sorting Levels
Summary Statistics by Level
Indexing with a DataFrame’s columns
8.2 Combining and Merging Datasets
Database-Style DataFrame Joins
Merging on Index
Concatenating Along an Axis
Combining Data with Overlap
8.3 Reshaping and Pivoting
Reshaping with Hierarchical Indexing
Pivoting “Long” to “Wide” Format
Pivoting “Wide” to “Long” Format
8.4 Conclusion
Chapter 9. Plotting and Visualization
9.1 A Brief matplotlib API Primer
Figures and Subplots
Colors, Markers, and Line Styles
Ticks, Labels, and Legends
Annotations and Drawing on a Subplot
Saving Plots to File
matplotlib Configuration
9.2 Plotting with pandas and seaborn
Line Plots
Bar Plots
Histograms and Density Plots
Scatter or Point Plots
Facet Grids and Categorical Data
9.3 Other Python Visualization Tools
9.4 Conclusion
Chapter 10. Data Aggregation and Group Operations
10.1 GroupBy Mechanics
Iterating Over Groups
Selecting a Column or Subset of Columns
Grouping with Dicts and Series
Grouping with Functions
Grouping by Index Levels
10.2 Data Aggregation
Column-Wise and Multiple Function Application
Returning Aggregated Data Without Row Indexes
10.3 Apply: General split-apply-combine
Suppressing the Group Keys
Quantile and Bucket Analysis
Example: Filling Missing Values with Group-Specific Values
Example: Random Sampling and Permutation
Example: Group Weighted Average and Correlation
Example: Group-Wise Linear Regression
10.4 Pivot Tables and Cross-Tabulation
Cross-Tabulations: Crosstab
10.5 Conclusion
Chapter 11. Time Series
11.1 Date and Time Data Types and Tools
Converting Between String and Datetime
11.2 Time Series Basics
Indexing, Selection, Subsetting
Time Series with Duplicate Indices
11.3 Date Ranges, Frequencies, and Shifting
Generating Date Ranges
Frequencies and Date Offsets
Shifting (Leading and Lagging) Data
11.4 Time Zone Handling
Time Zone Localization and Conversion
Operations with Time Zone–Aware Timestamp Objects
Operations Between Different Time Zones
11.5 Periods and Period Arithmetic
Period Frequency Conversion
Quarterly Period Frequencies
Converting Timestamps to Periods (and Back)
Creating a PeriodIndex from Arrays
11.6 Resampling and Frequency Conversion
Downsampling
Upsampling and Interpolation
Resampling with Periods
11.7 Moving Window Functions
Exponentially Weighted Functions
Binary Moving Window Functions
User-Defined Moving Window Functions
11.8 Conclusion
Chapter 12. Advanced pandas
12.1 Categorical Data
Background and Motivation
Categorical Type in pandas
Computations with Categoricals
Categorical Methods
12.2 Advanced GroupBy Use
Group Transforms and “Unwrapped” GroupBys
Grouped Time Resampling
12.3 Techniques for Method Chaining
The pipe Method
12.4 Conclusion
Chapter 13. Introduction to Modeling Libraries in Python
13.1 Interfacing Between pandas and Model Code
13.2 Creating Model Descriptions with Patsy
Data Transformations in Patsy Formulas
Categorical Data and Patsy
13.3 Introduction to statsmodels
Estimating Linear Models
Estimating Time Series Processes
13.4 Introduction to scikit-learn
13.5 Continuing Your Education
Chapter 14. Data Analysis Examples
14.1 1.USA.gov Data from Bitly
Counting Time Zones in Pure Python
Counting Time Zones with pandas
14.2 MovieLens 1M Dataset
Measuring Rating Disagreement
14.3 US Baby Names 1880–2010
Analyzing Naming Trends
14.4 USDA Food Database
14.5 2012 Federal Election Commission Database
Donation Statistics by Occupation and Employer
Bucketing Donation Amounts
Donation Statistics by State
14.6 Conclusion
Appendix A. Advanced NumPy
A.1 ndarray Object Internals
NumPy dtype Hierarchy
A.2 Advanced Array Manipulation
Reshaping Arrays
C Versus Fortran Order
Concatenating and Splitting Arrays
Repeating Elements: tile and repeat
Fancy Indexing Equivalents: take and put
A.3 Broadcasting
Broadcasting Over Other Axes
Setting Array Values by Broadcasting
A.4 Advanced ufunc Usage
ufunc Instance Methods
Writing New ufuncs in Python
A.5 Structured and Record Arrays
Nested dtypes and Multidimensional Fields
Why Use Structured Arrays?
A.6 More About Sorting
Indirect Sorts: argsort and lexsort
Alternative Sort Algorithms
Partially Sorting Arrays
numpy.searchsorted: Finding Elements in a Sorted Array
A.7 Writing Fast NumPy Functions with Numba
Creating Custom numpy.ufunc Objects with Numba
A.8 Advanced Array Input and Output
Memory-Mapped Files
HDF5 and Other Array Storage Options
A.9 Performance Tips
The Importance of Contiguous Memory
Appendix B. More on the IPython System
B.1 Using the Command History
Searching and Reusing the Command History
Input and Output Variables
B.2 Interacting with the Operating System
Shell Commands and Aliases
Directory Bookmark System
B.3 Software Development Tools
Interactive Debugger
Timing Code: %time and %timeit
Basic Profiling: %prun and %run -p
Profiling a Function Line by Line
B.4 Tips for Productive Code Development Using IPython
Reloading Module Dependencies
Code Design Tips
B.5 Advanced IPython Features
Making Your Own Classes IPython-Friendly
Profiles and Configuration
B.6 Conclusion
Index
About the Author
Colophon
Python for Data Analysis
Data Wrangling with Pandas, NumPy, and IPython
Second Edition
Wes McKinney
Beijing, Boston, Farnham, Sebastopol, Tokyo
Python for Data Analysis
by Wes McKinney

Copyright © 2018 William McKinney. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Kristen Brown
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Monaghan
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

October 2012: First Edition
October 2017: Second Edition

Revision History for the Second Edition
2017-09-25: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491957660 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python for Data Analysis, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95766-0
[LSI]