Contributed by Cory Schlesener, B.S.
When analyzing large data files in the gigabyte range, system memory and program/function memory limits become an issue. Creating a smaller data set for analysis can bring the working data down to a manageable size. This downsizing should preferably remove redundancy or features not relevant to the analysis, rather than fracture the data into chunks to be analyzed separately. There is, however, the problem of manipulating the data in the first place if the file is too big to load into memory. For tabular data, the best workaround is to load only the specific row(s) to be manipulated. This can be achieved with Python code using the pandas library (built on NumPy). With pandas, files (e.g., comma-separated values (CSV) files) can be read in chunks of rows, specifying which row(s) in the file to read/load, and the chunks can then be formed into a list for manipulation. With further Python code, the list can be manipulated in a standardized format, removing unneeded columns/data or replacing multiple columns of long data. A minimal sketch of this chunked-reading approach follows.
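The sketch below illustrates the idea under stated assumptions: the file name (large_data.csv), the column names, the output file, and the chunk size are hypothetical placeholders, not part of the original text; the pandas calls themselves (read_csv with the chunksize and skiprows options, concat, to_csv) are standard library features.

import pandas as pd

CHUNK_ROWS = 100_000  # rows held in memory at a time (hypothetical size)

# Read the large CSV in chunks, keep only the columns needed for analysis,
# and collect the reduced chunks in a list.
reduced_chunks = []
for chunk in pd.read_csv("large_data.csv", chunksize=CHUNK_ROWS):
    reduced_chunks.append(chunk[["sample_id", "measurement"]])  # hypothetical column names

# Concatenate the reduced chunks into one manageable DataFrame and save it.
reduced_df = pd.concat(reduced_chunks, ignore_index=True)
reduced_df.to_csv("reduced_data.csv", index=False)

# Alternatively, skiprows can target specific rows: here, skip file rows
# 1 through 999 (keeping the header at row 0) and load the next 100 rows only.
subset_df = pd.read_csv("large_data.csv", skiprows=range(1, 1000), nrows=100)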
Reference:
Documentation on the pandas.read_csv() function (see the chunksize and skiprows options):
https://pandas.pydata.org/