Treading on Python Series
Python Tools for Data Munging, Data Analysis, and Visualization
Learning Pandas
Matt Harrison
Technical Editor: Copyright © 2016
While every precaution has been taken in the preparation of this book, the
publisher and author assume no responsibility for errors or omissions, or for
damages resulting from the use of the information contained herein.
Table of Contents
From the Author
Introduction
Installation
Data Structures
Series
Series CRUD
Series Indexing
Series Methods
Series Plotting
Another Series Example
DataFrames
Data Frame Example
Data Frame Methods
Data Frame Statistics
Grouping, Pivoting, and Reshaping
Dealing With Missing Data
Joining Data Frames
Avalanche Analysis and Plotting
Summary
About the Author
Also Available
One more thing
From the Author
Python is easy to learn. You can learn the basics in a day and be productive
with it. With only an understanding of Python, moving to pandas can be difficult
or confusing. This book is meant to aid you in mastering pandas.
I have taught Python and pandas to many people over the years, in large
corporate environments, at small startups, and at Python and data science
conferences. I have seen what hangs people up and confuses them. With the
correct background, an attitude of acceptance, and a deep breath, much of this
confusion evaporates.
Having said this, pandas is an excellent tool. Many are using it around the
world to great success. I hope you do as well.
Cheers!
Matt
Introduction
I have been using Python in some professional capacity since the turn of the
century. One of the trends that I have seen in that time is the uptake of Python
for various aspects of "data science": gathering data, cleaning data, analysis,
machine learning, and visualization. The pandas library has seen much uptake in
this area.
pandas 1 is a data analysis library for Python that has exploded in popularity
over the past few years. The website describes it thusly:
“pandas is an open source, BSD-licensed library providing high-
performance, easy-to-use data structures and data analysis tools for the
Python programming language.”
-pandas.pydata.org
My description of pandas is: pandas is an in-memory NoSQL database that has
SQL-like constructs, basic statistical and analytic support, as well as graphing
capability. Because it is built on top of Cython, it has less memory overhead and
runs quicker than pure Python code. Many people are using pandas to replace
Excel, perform ETL, process tabular data, load CSV or JSON files, and more.
Though it grew out of the financial sector (for analysis of time series data),
it is now a general-purpose data manipulation library.
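To make the "SQL-like constructs" description a bit more concrete, here is a minimal sketch. The CSV text, the column names (name, dept, salary), and the variable names are all made up for illustration, and the SQL in the comments is only a rough analogy:

```python
import io

import pandas as pd

# Hypothetical CSV data, inlined so the example is self-contained
csv_text = """name,dept,salary
Ann,eng,120
Bob,eng,100
Cid,ops,80
Dee,ops,90
"""

# Load tabular data into an in-memory DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Roughly: SELECT * FROM df WHERE salary > 85
well_paid = df[df["salary"] > 85]

# Roughly: SELECT dept, AVG(salary) FROM df GROUP BY dept
avg_by_dept = df.groupby("dept")["salary"].mean()

# Basic statistical support: summary statistics for one column
salary_summary = df["salary"].describe()
```

Each of these operations is covered properly in the chapters that follow; the point here is only that filtering, grouping, and summarizing all happen on data held in memory.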
Because pandas has some lineage back to NumPy, it adopts some NumPy-isms
that normal Python programmers may not be aware of or familiar with.
Certainly, one could go out and use Cython to perform fast typed data analysis
with a Python-like dialect, but with pandas, you don't need to. This work is done
for you. If you are using pandas and the vectorized operations, you are getting
close to C-level speeds, but writing Python.
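As a small illustration of what "vectorized operations" means, the sketch below converts a column of temperatures in one expression and then does the same thing with an explicit Python loop. The data and variable names are invented for this example, and no timing claims are made here:

```python
import pandas as pd

# Hypothetical data: temperatures in Celsius
celsius = pd.Series([0.0, 21.5, 37.0, 100.0], name="celsius")

# Vectorized: one expression applied to the whole Series at once;
# the element-by-element work happens in compiled code
fahrenheit = celsius * 9 / 5 + 32

# The same calculation written as an explicit Python-level loop
looped = pd.Series([c * 9 / 5 + 32 for c in celsius])
```

Both versions produce the same values. The vectorized form is shorter, and on large data it is typically the faster of the two because the loop runs outside the Python interpreter.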
Who this book is for
This guide is intended to introduce pandas to Python programmers. It covers
many (but not all) aspects, as well as some gotchas or details that may be
counter-intuitive or even non-pythonic to longtime users of Python.
This book assumes basic knowledge of Python. The author has written
Treading on Python Vol 1 2 that provides all the background necessary.