python 数学建模.pdf

Title Page
Copyright Page
Table of Contents
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part I Getting Started with Python for Data Science
Chapter 1 Discovering the Match between Data Science and Python
Defining the Sexiest Job of the 21st Century
Considering the emergence of data science
Outlining the core competencies of a data scientist
Linking data science and big data
Understanding the role of programming
Creating the Data Science Pipeline
Preparing the data
Performing exploratory data analysis
Learning from data
Visualizing
Obtaining insights and data products
Understanding Python’s Role in Data Science
Considering the shifting profile of data scientists
Working with a multipurpose, simple, and efficient language
Learning to Use Python Fast
Loading data
Training a model
Viewing a result
Chapter 2 Introducing Python’s Capabilities and Wonders
Why Python?
Grasping Python’s core philosophy
Discovering present and future development goals
Working with Python
Getting a taste of the language
Understanding the need for indentation
Working at the command line or in the IDE
Performing Rapid Prototyping and Experimentation
Considering Speed of Execution
Visualizing Power
Using the Python Ecosystem for Data Science
Accessing scientific tools using SciPy
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Implementing machine learning using Scikit‐learn
Plotting the data using matplotlib
Parsing HTML documents using Beautiful Soup
Chapter 3 Setting Up Python for Data Science
Considering the Off‐the‐Shelf Cross‐Platform Scientific Distributions
Getting Continuum Analytics Anaconda
Getting Enthought Canopy Express
Getting pythonxy
Getting WinPython
Installing Anaconda on Windows
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Downloading the Datasets and Example Code
Using IPython Notebook
Defining the code repository
Understanding the datasets used in this book
Chapter 4 Reviewing Basic Python
Working with Numbers and Logic
Performing variable assignments
Doing arithmetic
Comparing data using Boolean expressions
Creating and Using Strings
Interacting with Dates
Creating and Using Functions
Creating reusable functions
Calling functions in a variety of ways
Using Conditional and Loop Statements
Making decisions using the if statement
Choosing between multiple options using nested decisions
Performing repetitive tasks using for
Using the while statement
Storing Data Using Sets, Lists, and Tuples
Performing operations on sets
Working with lists
Creating and using Tuples
Defining Useful Iterators
Indexing Data Using Dictionaries
Part II Getting Your Hands Dirty with Data
Chapter 5 Working with Real Data
Uploading, Streaming, and Sampling Data
Uploading small amounts of data into memory
Streaming large amounts of data into memory
Sampling data
Accessing Data in Structured Flat‐File Form
Reading from a text file
Reading CSV delimited format
Reading Excel and other Microsoft Office files
Sending Data in Unstructured File Form
Managing Data from Relational Databases
Interacting with Data from NoSQL Databases
Accessing Data from the Web
Chapter 6 Conditioning Your Data
Juggling between NumPy and pandas
Knowing when to use NumPy
Knowing when to use pandas
Validating Your Data
Figuring out what’s in your data
Removing duplicates
Creating a data map and data plan
Manipulating Categorical Variables
Creating categorical variables
Renaming levels
Combining levels
Dealing with Dates in Your Data
Formatting date and time values
Using the right time transformation
Dealing with Missing Data
Finding the missing data
Encoding missingness
Imputing missing data
Slicing and Dicing: Filtering and Selecting Data
Slicing rows
Slicing columns
Dicing
Concatenating and Transforming
Adding new cases and variables
Removing data
Sorting and shuffling
Aggregating Data at Any Level
Chapter 7 Shaping Data
Working with HTML Pages
Parsing XML and HTML
Using XPath for data extraction
Working with Raw Text
Dealing with Unicode
Stemming and removing stop words
Introducing regular expressions
Using the Bag of Words Model and Beyond
Understanding the bag of words model
Working with n‐grams
Implementing TF‐IDF transformations
Working with Graph Data
Understanding the adjacency matrix
Using NetworkX basics
Chapter 8 Putting What You Know in Action
Contextualizing Problems and Data
Evaluating a data science problem
Researching solutions
Formulating a hypothesis
Preparing your data
Considering the Art of Feature Creation
Defining feature creation
Combining variables
Understanding binning and discretization
Using indicator variables
Transforming distributions
Performing Operations on Arrays
Using vectorization
Performing simple arithmetic on vectors and matrices
Performing matrix vector multiplication
Performing matrix multiplication
Part III Visualizing the Invisible
Chapter 9 Getting a Crash Course in MatPlotLib
Starting with a Graph
Defining the plot
Drawing multiple lines and plots
Saving your work
Setting the Axis, Ticks, Grids
Getting the axes
Formatting the axes
Adding grids
Defining the Line Appearance
Working with line styles
Using colors
Adding markers
Using Labels, Annotations, and Legends
Adding labels
Annotating the chart
Creating a legend
Chapter 10 Visualizing the Data
Choosing the Right Graph
Showing parts of a whole with pie charts
Creating comparisons with bar charts
Showing distributions using histograms
Depicting groups using box plots
Seeing data patterns using scatterplots
Creating Advanced Scatterplots
Depicting groups
Showing correlations
Plotting Time Series
Representing time on axes
Plotting trends over time
Plotting Geographical Data
Visualizing Graphs
Developing undirected graphs
Developing directed graphs
Chapter 11 Understanding the Tools
Using the IPython Console
Interacting with screen text
Changing the window appearance
Getting Python help
Getting IPython help
Using magic functions
Discovering objects
Using IPython Notebook
Working with styles
Restarting the kernel
Restoring a checkpoint
Performing Multimedia and Graphic Integration
Embedding plots and other images
Loading examples from online sites
Obtaining online graphics and multimedia
Part IV Wrangling Data
Chapter 12 Stretching Python’s Capabilities
Playing with Scikit‐learn
Understanding classes in Scikit‐learn
Defining applications for data science
Performing the Hashing Trick
Using hash functions
Demonstrating the hashing trick
Working with deterministic selection
Considering Timing and Performance
Benchmarking with timeit
Working with the memory profiler
Running in Parallel
Performing multicore parallelism
Demonstrating multiprocessing
Chapter 13 Exploring Data Analysis
The EDA Approach
Defining Descriptive Statistics for Numeric Data
Measuring central tendency
Measuring variance and range
Working with percentiles
Defining measures of normality
Counting for Categorical Data
Understanding frequencies
Creating contingency tables
Creating Applied Visualization for EDA
Inspecting boxplots
Performing t‐tests after boxplots
Observing parallel coordinates
Graphing distributions
Plotting scatterplots
Understanding Correlation
Using covariance and correlation
Using nonparametric correlation
Considering chi‐square for tables
Modifying Data Distributions
Using the normal distribution
Creating a Z‐score standardization
Transforming other notable distributions
Chapter 14 Reducing Dimensionality
Understanding SVD
Looking for dimensionality reduction
Using SVD to measure the invisible
Performing Factor and Principal Component Analysis
Considering the psychometric model
Looking for hidden factors
Using components, not factors
Achieving dimensionality reduction
Understanding Some Applications
Recognizing faces with PCA
Extracting Topics with NMF
Recommending movies
Chapter 15 Clustering
Clustering with K‐means
Understanding centroid‐based algorithms
Creating an example with image data
Looking for optimal solutions
Clustering big data
Performing Hierarchical Clustering
Moving Beyond the Round-Shaped Clusters: DBScan
Chapter 16 Detecting Outliers in Data
Considering Detection of Outliers
Finding more things that can go wrong
Understanding anomalies and novel data
Examining a Simple Univariate Method
Leveraging on the Gaussian distribution
Making assumptions and checking out
Developing a Multivariate Approach
Using principal component analysis
Using cluster analysis
Automating outliers detection with SVM
Part V Learning from Data
Chapter 17 Exploring Four Simple and Effective Algorithms
Guessing the Number: Linear Regression
Defining the family of linear models
Using more variables
Understanding limitations and problems
Moving to Logistic Regression
Applying logistic regression
Considering when classes are more
Making Things as Simple as Naïve Bayes
Finding out that Naïve Bayes isn’t so naïve
Predicting text classifications
Learning Lazily with Nearest Neighbors
Predicting after observing neighbors
Choosing your k parameter wisely
Chapter 18 Performing Cross‐Validation, Selection, and Optimization
Pondering the Problem of Fitting a Model
Understanding bias and variance
Defining a strategy for picking models
Dividing between training and test sets
Cross‐Validating
Using cross‐validation on k folds
Sampling stratifications for complex data
Selecting Variables Like a Pro
Selecting by univariate measures
Using a greedy search
Pumping Up Your Hyperparameters
Implementing a grid search
Trying a randomized search
Chapter 19 Increasing Complexity with Linear and Nonlinear Tricks
Using Nonlinear Transformations
Doing variable transformations
Creating interactions between variables
Regularizing Linear Models
Relying on Ridge regression (L2)
Using the Lasso (L1)
Leveraging regularization
Combining L1 & L2: Elasticnet
Fighting with Big Data Chunk by Chunk
Determining when there is too much data
Implementing Stochastic Gradient Descent
Understanding Support Vector Machines
Relying on a computational method
Fixing many new parameters
Classifying with SVC
Going nonlinear is easy
Performing regression with SVR
Creating a stochastic solution with SVM
Chapter 20 Understanding the Power of the Many
Starting with a Plain Decision Tree
Understanding a decision tree
Creating classification and regression trees
Making Machine Learning Accessible
Working with a Random Forest classifier
Working with a Random Forest regressor
Optimizing a Random Forest
Boosting Predictions
Knowing that many weak predictors win
Creating a gradient boosting classifier
Creating a gradient boosting regressor
Using GBM hyper‐parameters
Part VI The Part of Tens
Chapter 21 Ten Essential Data Science Resource Collections
Gaining Insights with Data Science Weekly
Obtaining a Resource List at U Climb Higher
Getting a Good Start with KDnuggets
Accessing the Huge List of Resources on Data Science Central
Obtaining the Facts of Open Source Data Science from Masters
Locating Free Learning Resources with Quora
Receiving Help with Advanced Topics at Conductrics
Learning New Tricks from the Aspirational Data Scientist
Finding Data Intelligence and Analytics Resources at AnalyticBridge
Zeroing In on Developer Resources with Jonathan Bower
Chapter 22 Ten Data Challenges You Should Take
Meeting the Data Science London + Scikit‐learn Challenge
Predicting Survival on the Titanic
Finding a Kaggle Competition that Suits Your Needs
Honing Your Overfit Strategies
Trudging Through the MovieLens Dataset
Getting Rid of Spam Emails
Working with Handwritten Information
Working with Pictures
Analyzing Amazon.com Reviews
Interacting with a Huge Graph
Index
EULA
Python® for Data Science by Luca Massaron and John Paul Mueller
Python® for Data Science For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030‐5774, www.wiley.com
Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey
Media and software compilation copyright © 2015 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Python is a registered trademark of Python Software Foundation Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877‐762‐2974, outside the U.S. at 317‐572‐3993, or fax 317‐572‐4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print‐on‐demand. Some material included with standard print versions of this book may not be included in e‐books or in print‐on‐demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2013956848
ISBN: 978‐1‐118‐84418‐2 ISBN 978-1-118-84398-7 (ebk); ISBN ePDF 978-1-118-84414-4 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1