Why and How to Use Pandas with Large Data
But Not Big Data…
Admond Lee
Nov 4, 2018 · 5 min read
Pandas has been one of the most popular data science tools in the Python
programming language for data wrangling and analysis.
Data is unavoidably messy in the real world. And Pandas is a serious game
changer when it comes to cleaning, transforming, manipulating and
analyzing data. In simple terms, Pandas helps to clean up the mess.
My Story of NumPy & Pandas
When I first started out learning Python, I was naturally introduced to
NumPy (Numerical Python). It is the fundamental package for scientific
computing with Python and provides an abundance of useful features
for operations on n-dimensional arrays and matrices.
In addition, the library provides vectorization of mathematical
operations on the NumPy array type, which significantly speeds up
computation and execution.
NumPy is cool.
But there was still a need for higher-level data analysis tools. And this
is where Pandas came to my rescue.
Fundamentally, the functionality of Pandas is built on top of NumPy, and
both libraries belong to the SciPy stack. This means that Pandas relies
heavily on NumPy arrays to implement its objects for manipulation and
computation, but wraps them in a more convenient interface.
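As a quick illustration of both points, here is a minimal sketch (with made-up data) showing a vectorized NumPy operation and the NumPy array that sits underneath a pandas column:

import numpy as np
import pandas as pd

# vectorized arithmetic on a NumPy array: no explicit Python loop needed
arr = np.arange(1_000_000, dtype=np.float64)
squared = arr ** 2  # applied element-wise in compiled code

# a pandas column is backed by a NumPy array under the hood
df = pd.DataFrame({'x': squared})
print(type(df['x'].values))  # <class 'numpy.ndarray'>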
In practice, NumPy and Pandas are still used interchangeably. The
higher-level features and convenient usage are what determine my
preference for Pandas.
Why use Pandas with Large Data — Not BIG Data?
There is a stark difference between large data and big data. With the
hype around big data, it is easy for us to consider everything as “big
data” and just go with the flow.
A famous joke by Prof. Dan Ariely: “Big data is like teenage sex: everyone
talks about it, nobody really knows how to do it, everyone thinks everyone
else is doing it, so everyone claims they are doing it.”
The words large and big are themselves relative, and in my humble
opinion, large data means data sets of less than 100GB.
Pandas is very efficient with small data (usually from 100MB up to
1GB) and performance is rarely a concern.
However, if you work in data science or big data, chances are you'll
encounter a common problem sooner or later when using Pandas with large
data sets: low performance and long runtimes that ultimately come down
to insufficient local memory.
Indeed, Pandas has its own limitations when it comes to big data, due to
its algorithms and local memory constraints. Big data is therefore
typically stored in computing clusters for higher scalability and fault
tolerance, and it is often accessed through the big data ecosystem
(AWS EC2, Hadoop, etc.) using Spark and many other tools.
Ultimately, one way to use Pandas with large data on a local machine
(with its memory constraints) is to reduce the memory usage of the data.
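A useful first step, and the baseline for everything below, is simply to measure how much memory a dataframe occupies; here is a minimal sketch with a toy dataframe standing in for the real data:

import pandas as pd

# toy dataframe standing in for the real data set
df = pd.DataFrame({'col_1': range(1_000_000), 'col_2': ['some text'] * 1_000_000})

df.info(memory_usage='deep')  # per-column dtypes plus the total footprint
total_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Total memory usage: {total_mb:.1f} MB")  # object (string) columns dominate here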
How to use Pandas with Large Data?
So the question is: how do you reduce the memory usage of data using
Pandas?
The following explanation is based on my experience with an anonymous
large data set (40–50 GB) that required me to reduce its memory usage to
fit into local memory for analysis (even before reading the data set into
a dataframe).
1. Read CSV file data in chunk size
To be honest, I was baffled when I hit an error and couldn't read the
data from the CSV file, only to realize that, with 16GB of RAM, the
memory of my local machine was too small for the data.
Here comes the good news and the beauty of Pandas: I realized that
pandas.read_csv has a parameter called chunksize!
The parameter essentially specifies the number of rows to be read into a
dataframe at any one time so that it fits into local memory. Since the
data consists of more than 70 million rows, I specified a chunksize of
1 million rows at a time, which broke the large data set into many
smaller pieces.
import pandas as pd

# read the large csv file with the specified chunksize (1 million rows)
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

Read CSV file data in chunk size
The operation above resulted in a TextFileReader object for iteration.
Strictly speaking, df_chunk is not a dataframe but an object for further
operations in the next step.
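If you want to peek at the data before committing to the full loop, the reader object also exposes get_chunk(), which pulls the next chunk as a regular dataframe. A small sketch; note that peeking consumes that chunk, so re-create the reader before the real iteration:

# df_chunk is an iterator over dataframes, not a dataframe itself
first_chunk = df_chunk.get_chunk()  # next chunk, returned as a regular dataframe
print(first_chunk.shape)            # (1000000, number_of_columns) for a full chunk
print(first_chunk.dtypes)           # inspect column types before processing

# peeking consumed the first chunk, so rebuild the reader before the full loop
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)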
Once I had the object ready, the basic workflow was to perform an
operation on each chunk and concatenate them at the end to form a single
dataframe (as shown below). By iterating over the chunks, I performed
data filtering/preprocessing with a function, chunk_preprocessing, before
appending each chunk to a list. Finally, I concatenated the list into a
final dataframe that fits into local memory.
chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in df_chunk:
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)

    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

# concatenate the list of chunks into a single dataframe
df = pd.concat(chunk_list)

Workflow to perform operations on each chunk
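The chunk_preprocessing function is whatever per-chunk filtering or cleaning your data needs; it isn't shown in this post, so here is a purely hypothetical sketch with made-up column names and conditions:

def chunk_preprocessing(chunk):
    """Hypothetical per-chunk filtering/preprocessing (illustrative only)."""
    # drop rows with missing values in a key column
    chunk = chunk.dropna(subset=['col_1'])
    # keep only the rows of interest, e.g. where some column exceeds a threshold
    chunk = chunk[chunk['col_2'] > 0]
    return chunk

The only requirement is that it takes a chunk dataframe and returns a (smaller) dataframe to append to the list.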
2. Filter out unimportant columns to save memory
Great. At this stage, I already had a dataframe to do all sorts of analysis
required.
To save more time for data manipulation and computation, I further
filtered out some unimportant columns to save more memory.
# Filter out unimportant columns
df = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']]

Filter out unimportant columns
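To see how much this actually saves, you can compare the footprint before and after the filter; a minimal sketch, where df is the concatenated dataframe from step 1 and the column names are placeholders:

# measure the footprint before dropping columns
before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2

# keep only the columns needed for the analysis (same filter as above)
df = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7']]

after_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"Column filter: {before_mb:.1f} MB -> {after_mb:.1f} MB")

If you already know the unwanted columns before reading the file at all, the usecols parameter of pandas.read_csv (which also works together with chunksize) lets you skip them at read time instead of dropping them afterwards.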
3. Change dtypes for columns
The simplest way to convert a pandas column of data to a different type
is to use astype().
I can say that changing data types in Pandas is extremely helpful for
saving memory, especially if you have large data for intense analysis or
computation (for example, feeding data into your machine learning model
for training).
By reducing the bits required to store the data, I reduced the overall
memory usage of the data by up to 50%!
Give it a try, and I believe you'll find it useful as well! Let me know
how it goes.
# Change the dtypes (int64 -> int32)
df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']] = df[['col_1', 'col_2', 'col_3', 'col_4', 'col_5']].astype('int32')

# Change the dtypes (float64 -> float32)
df[['col_6', 'col_7']] = df[['col_6', 'col_7']].astype('float32')

Change data types to save memory
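If you'd rather not hard-code the target types, pandas can also pick the smallest numeric type that fits the values via pd.to_numeric with the downcast argument; a sketch using the same placeholder column names:

# downcast integer columns to the smallest integer type that can hold the values
for col in ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']:
    df[col] = pd.to_numeric(df[col], downcast='integer')

# downcast float columns to float32 (note: this halves the precision)
for col in ['col_6', 'col_7']:
    df[col] = pd.to_numeric(df[col], downcast='float')

df.info(memory_usage='deep')  # confirm the new dtypes and the reduced footprint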
Final Thoughts
There you have it. Thank you for reading.
I hope that sharing my experience of using Pandas with large data helps
you explore another useful side of Pandas for dealing with large data:
reducing memory usage and, ultimately, improving computational
efficiency.
Typically, Pandas has most of the features we need for data wrangling
and analysis. I strongly encourage you to check them out, as they'll
come in handy next time.
Also, if you're serious about learning how to do data analysis in Python,
then this book is for you: Python for Data Analysis. With complete
instructions for manipulating, processing, cleaning, and crunching
datasets in Python using Pandas, the book gives a comprehensive,
step-by-step guide to using Pandas effectively in your analysis.
Hope this helps!
.
.
.
As always, if you have any questions or comments feel free to leave your
feedback below or you can always reach me on LinkedIn. Till then, see
you in the next post!
About the Author
Admond Lee is a Big Data Engineer at work, Data Scientist in action. He
has been helping start-up founders and various companies tackle their
problems using data with deep data science and industry expertise. You
can connect with him on LinkedIn, Medium, Twitter, and Facebook.