Parsing large data files into working chunks

Contributed by Cory Schlesener, B.S.

When analyzing data files in the gigabyte range, system memory and the memory limits of programs or functions become an issue. Creating a smaller data set for analysis can bring the working data down to a manageable size. This downsizing should preferably remove redundancy or features not relevant to the analysis, rather than fracture the data into chunks to be analyzed separately. There is, however, the problem of manipulating the data in the first place if the file is too big to load into memory. For tabular data, the best workaround is to load only the specific row(s) to be manipulated. This can be done in Python with the pandas library (built on NumPy). With pandas, files such as comma-separated values (CSV) files can be read in chunks of rows, specifying which row(s) of the file to read/load, and each chunk can then be handled as a list or data frame for manipulation. Further Python code can process each chunk in a standardized way, removing columns or replacing multiple columns / long data strings with condensed representative values. The manipulated data can then be written out to a new file. Repeating this process over all of the original file's rows generates a new file of curated, reduced data that is more manageable for manipulation or analysis as a whole. A sketch of this chunked read-reduce-write loop is shown below.
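As an illustration, the following minimal sketch reads a large CSV in chunks with pandas.read_csv(chunksize=...), keeps only a few columns, and appends each reduced chunk to a new file. The file names, column names, and chunk size here are placeholder assumptions, not values from the original post.

```python
import pandas as pd

input_path = "large_data.csv"          # hypothetical large input file
output_path = "reduced_data.csv"       # curated, smaller output file
keep_columns = ["sample_id", "value"]  # hypothetical columns to retain

first_chunk = True
# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk of rows is held in memory at a time.
# (skiprows/nrows can instead be used to load a specific row range.)
for chunk in pd.read_csv(input_path, chunksize=100_000):
    # drop columns that are not relevant to the analysis
    reduced = chunk[keep_columns]
    # append each reduced chunk to the output file,
    # writing the header only once
    reduced.to_csv(output_path,
                   mode="w" if first_chunk else "a",
                   header=first_chunk,
                   index=False)
    first_chunk = False
```

The same loop is where condensed representative values could be computed (for example, collapsing several columns into one derived column) before writing each chunk out.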

Reference:
Documentation on pandas.read_csv() function [look for chunksize and skiprows options]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
