Why and How to Use Pandas with Large Data
But Not Big Data…
Admond Lee
Nov 4, 2018 · 5 min read
Pandas has been one of the most popular data science tools in the Python
programming language for data wrangling and analysis.
Data is unavoidably messy in the real world. And Pandas is a serious game
changer when it comes to cleaning, transforming, manipulating and
analyzing data. In simple terms, Pandas helps to clean up the mess.
My Story of NumPy & Pandas
When I first started out learning Python, I was naturally introduced to
NumPy (Numerical Python). It is the fundamental package for scientific
computing with Python and provides an abundance of useful features
for operations on n-dimensional arrays and matrices.
In addition, the library provides vectorization of mathematical
operations on the NumPy array type, which significantly speeds up
computation and execution.
NumPy is cool.
But there was still a need for higher-level data analysis tools. And this
is where Pandas came to my rescue.
Fundamentally, the functionality of Pandas is built on top of NumPy, and
both libraries belong to the SciPy stack. This means that Pandas relies
heavily on NumPy arrays to implement its objects for manipulation and
computation, but wraps them in a more convenient interface.
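As a quick illustration of both points, here is a minimal sketch (with made-up data) showing a vectorized NumPy operation and the NumPy array that sits underneath a pandas column:

import numpy as np
import pandas as pd

# vectorized arithmetic on a NumPy array: no explicit Python loop needed
arr = np.arange(1_000_000, dtype=np.float64)
squared = arr ** 2  # applied element-wise in compiled code

# a pandas column is backed by a NumPy array under the hood
df = pd.DataFrame({'x': squared})
print(type(df['x'].values))  # <class 'numpy.ndarray'>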
In practice, NumPy and Pandas are still used interchangeably. The
higher-level features and convenient usage are what determine my
preference for Pandas.
Why use Pandas with Large Data — Not BIG Data?
There is a stark difference between large data and big data. With the
hype around big data, it is easy for us to consider everything as “big
data” and just go with the flow.
A famous joke by Prof. Dan Ariely: “Big data is like teenage sex: everyone
talks about it, nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it.”
The words large and big are themselves relative, and in my humble
opinion, large data means data sets of less than 100GB.
Pandas is very efficient with small data (usually from 100MB up to
1GB) and performance is rarely a concern.
However, if you work in data science or big data, chances are you'll
encounter a common problem sooner or later when using Pandas with large
data sets: low performance and long runtimes that ultimately come down
to insufficient local memory.
Indeed, Pandas has its own limitations when it comes to big data, due to
its algorithms and local memory constraints. Big data is therefore
typically stored in computing clusters for higher scalability and fault
tolerance, and it is often accessed through the big data ecosystem
(AWS EC2, Hadoop, etc.) using Spark and many other tools.
Ultimately, one way to use Pandas with large data on a local machine
(with its memory constraints) is to reduce the memory usage of the data.
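A useful first step, and the baseline for everything below, is simply to measure how much memory a dataframe occupies; here is a minimal sketch with a toy dataframe standing in for the real data:

import pandas as pd

# toy dataframe standing in for the real data set
df = pd.DataFrame({'col_1': range(1_000_000), 'col_2': ['some text'] * 1_000_000})

df.info(memory_usage='deep')  # per-column dtypes plus the total footprint
total_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Total memory usage: {total_mb:.1f} MB")  # object (string) columns dominate here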
How to use Pandas with Large Data?
So the question is: how do you reduce the memory usage of data using
Pandas?
The following explanation is based on my experience with an anonymous
large data set (40–50 GB) that required me to reduce its memory usage to
fit into local memory for analysis (even before reading the data set into
a dataframe).
1. Read CSV file data in chunk size
To be honest, I was baffled when I hit an error and couldn't read the
data from the CSV file, only to realize that, with 16GB of RAM, the
memory of my local machine was too small for the data.
Here comes the good news and the beauty of Pandas: I realized that
pandas.read_csv has a parameter called chunksize!
The parameter essentially specifies the number of rows to be read into a
dataframe at any one time so that it fits into local memory. Since the
data consists of more than 70 million rows, I specified a chunksize of
1 million rows at a time, which broke the large data set into many
smaller pieces.
import pandas as pd

# read the large csv file with the specified chunksize (1 million rows)
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

Read CSV file data in chunk size
The operation above resulted in a TextFileReader object for iteration.
Strictly speaking, df_chunk is not a dataframe but an object for further
operations in the next step.
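If you want to peek at the data before committing to the full loop, the reader object also exposes get_chunk(), which pulls the next chunk as a regular dataframe. A small sketch; note that peeking consumes that chunk, so re-create the reader before the real iteration:

# df_chunk is an iterator over dataframes, not a dataframe itself
first_chunk = df_chunk.get_chunk()  # next chunk, returned as a regular dataframe
print(first_chunk.shape)            # (1000000, number_of_columns) for a full chunk
print(first_chunk.dtypes)           # inspect column types before processing

# peeking consumed the first chunk, so rebuild the reader before the full loop
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)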
Once I had the object ready, the basic workflow was to perform an
operation on each chunk and concatenate them at the end to form a single
dataframe (as shown below). By iterating over the chunks, I performed
data filtering/preprocessing with a function, chunk_preprocessing, before
appending each chunk to a list. Finally, I concatenated the list into a
final dataframe that fits into local memory.
chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in df_chunk:
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

# concatenate the list of chunks into a single dataframe
df = pd.concat(chunk_list)

Workflow to perform operations on each chunk
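The chunk_preprocessing function is whatever per-chunk filtering or cleaning your data needs; it isn't shown in this post, so here is a purely hypothetical sketch with made-up column names and conditions:

def chunk_preprocessing(chunk):
    """Hypothetical per-chunk filtering/preprocessing (illustrative only)."""
    # drop rows with missing values in a key column
    chunk = chunk.dropna(subset=['col_1'])
    # keep only the rows of interest, e.g. where some column exceeds a threshold
    chunk = chunk[chunk['col_2'] > 0]
    return chunk

The only requirement is that it takes a chunk dataframe and returns a (smaller) dataframe to append to the list.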
2. Filter out unimportant columns to save memory
Great. At this stage, I already had a dataframe to do all sorts of analysis
required.
To save more time for data manipulation and computation, I further
filtered out some unimportant columns to save more memory.
# Filter out unimportant columns
df = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']]

Filter out unimportant columns
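To see how much this actually saves, you can compare the footprint before and after the filter; a minimal sketch, where df is the concatenated dataframe from step 1 and the column names are placeholders:

# measure the footprint before dropping columns
before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2

# keep only the columns needed for the analysis (same filter as above)
df = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']]

after_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Column filter: {before_mb:.1f} MB -> {after_mb:.1f} MB")

If you already know the unwanted columns before reading the file at all, the usecols parameter of pandas.read_csv (which also works together with chunksize) lets you skip them at read time instead of dropping them afterwards.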
3. Change dtypes for columns
The simplest way to convert a pandas column of data to a different type
is to use astype().
I can say that changing data types in Pandas is extremely helpful for
saving memory, especially if you have large data for intense analysis or
computation (for example, feeding data into your machine learning model
for training).
By reducing the bits required to store the data, I reduced the overall
memory usage of the data by up to 50%!
Give it a try, and I believe you'll find it useful as well! Let me know
how it goes.
# Change the dtypes (int64 -> int32)
df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']] = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']].astype('int32')

# Change the dtypes (float64 -> float32)
df[['col_6', 'col_7']] = df[['col_6', 'col_7']].astype('float32')

Change data types to save memory
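If you'd rather not hard-code the target types, pandas can also pick the smallest numeric type that fits the values via pd.to_numeric with the downcast argument; a sketch using the same placeholder column names:

# downcast integer columns to the smallest integer type that can hold the values
for col in ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']:
    df[col] = pd.to_numeric(df[col], downcast='integer')

# downcast float columns to float32 (note: this halves the precision)
for col in ['col_6', 'col_7']:
    df[col] = pd.to_numeric(df[col], downcast='float')

df.info(memory_usage='deep')  # confirm the new dtypes and the reduced footprint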
Final Thoughts
There you have it. Thank you for reading.
I hope that sharing my experience of using Pandas with large data helps
you explore another useful side of Pandas for dealing with large data:
reducing memory usage and, ultimately, improving computational
efficiency.
Typically, Pandas has most of the features we need for data wrangling
and analysis. I strongly encourage you to check them out, as they'll
come in handy next time.
Also, if you're serious about learning how to do data analysis in Python,
then this book is for you: Python for Data Analysis. With complete
instructions for manipulating, processing, cleaning, and crunching
datasets in Python using Pandas, the book gives a comprehensive,
step-by-step guide to using Pandas effectively in your analysis.
Hope this helps!
.
.
.
As always, if you have any questions or comments feel free to leave your
feedback below or you can always reach me on LinkedIn. Till then, see
you in the next post!
About the Author
Admond Lee is a Big Data Engineer at work, Data Scientist in action. He
has been helping start-up founders and various companies tackle their
problems using data with deep data science and industry expertise. You
can connect with him on LinkedIn, Medium, Twitter, and Facebook.