Working with large datasets in pandas? Here are some tips to boost performance.
While working on a machine learning or data visualization project, I believe Python + pandas + Jupyter is the best set of tools for quickly validating a POC or a hypothesis. However, performance sometimes degrades with large datasets, so I would like to share some of the tips I use to speed up the process.
Data Set
I have chosen an existing Kaggle dataset, NYC Parking Tickets 2014, to demonstrate what I am trying to convey.
- CSV file size: 1.74 GB
- No of Rows: 9M+
- No of Columns: 51
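To reproduce these numbers yourself, here is a minimal sketch, assuming the CSV has been downloaded from Kaggle into the ../input/ folder used throughout this post:
import pandas as pd

# Load the full CSV and check its footprint in RAM
data_tickets = pd.read_csv('../input/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv')
print(data_tickets.shape)                 # (rows, columns)
data_tickets.info(memory_usage='deep')    # RAM usage, counting the actual strings in object columns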
1. Selective Data: Save RAM = More Performance
With data of less than 500 MB it usually doesn't matter, as by default we read a CSV file as a whole, but there is an option called nrows where we can specify the number of rows to import. For example, if we want to read only the first 10000 rows of a CSV file, the syntax is as follows. Another option, skiprows, allows you to skip rows. These options let you test your code on a smaller chunk of data multiple times rather than waiting for all the data to load.
# nrows limits the import to 10000 rows;
# passing a range to skiprows skips the first 1000 data rows while keeping the header row
data_tickets = pd.read_csv('../input/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv',
                           nrows=10000, skiprows=range(1, 1001))
These two options are helpful if you have just started working on the data and want to try a few things quickly.
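If instead you want a small sample spread across the whole file rather than just the first rows, skiprows also accepts a callable. A rough sketch that keeps roughly every 100th row (the sampling step of 100 is an arbitrary choice for illustration):
# Row 0 is the header; skip every data row whose index is not a multiple of 100
data_sample = pd.read_csv(
    '../input/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv',
    skiprows=lambda i: i > 0 and i % 100 != 0
)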
2. Downcast: Use Only as Much as You Need
Downcasting, in simple terms, means converting a column from a high-storage datatype to a lower-storage datatype that still satisfies its values, e.g. int64 to int8. Let's say we have a column which is currently stored as float64 but whose values fit in int16 or even int8; then we should always downcast it. Let's see how we do it.
data_tickets.dtypes
# output
Summons Number                         int64
Plate ID                              object
Registration State                    object
Plate Type                            object
Issue Date                            object
...
Unregistered Vehicle?                  int64
Vehicle Year                           int64
Meter Number                          object
Feet From Curb                         int64
Violation Post Code                  float64
Violation Description                float64
No Standing or Stopping Violation    float64
Hydrant Violation                    float64
Double Parking Violation             float64
Latitude                             float64
Longitude                            float64
Community Board                      float64
Community Council                    float64
Census Tract                         float64
BIN                                  float64
BBL                                  float64
NTA                                  float64
dtype: object
In the above example, Vehicle Year is stored as int64. A year cannot be negative, and since the data is from 2014 it will not contain values greater than 2014, so replacing int64 with uint16 is safe.
The CSV file is 1.74 GB, but the memory used by the pandas object is 3.5 GB+. Using data_tickets.dtypes we found that 24 columns use 64-bit types and can easily be converted to 16-bit ones (uint16 or float16).
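Before downcasting, it is worth double-checking which columns are 64-bit and that the values really fit in the smaller type; a quick check could look like this:
# Columns currently stored as 64-bit types (downcast candidates)
print(data_tickets.select_dtypes(include=['int64', 'float64']).columns.tolist())

# Confirm the value range fits in uint16 (0 to 65535)
print(data_tickets['Vehicle Year'].min(), data_tickets['Vehicle Year'].max())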
Let's do some math.
int64 = 64 bit, int16 = 16 bit
9M records of int64 ≈ 72 MB
9M records of int16 ≈ 18 MB
1 column saves ≈ 54 MB
24 columns save ≈ 1.3 GB
There are two ways to do this.
# 1. converting all int columns at once
data_tickets_ints = data_tickets.select_dtypes(include=['int'])
data_tickets_uints = data_tickets_ints.apply(pd.to_numeric, downcast='unsigned')

# 2. downcasting one column at a time
data_tickets['Vehicle Year'] = data_tickets['Vehicle Year'].astype('uint16')
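To confirm the savings, you can compare memory usage before and after the conversion; a quick sanity check:
# Memory used by the int columns before and after downcasting, in MB
before_mb = data_tickets_ints.memory_usage(deep=True).sum() / 1024 ** 2
after_mb = data_tickets_uints.memory_usage(deep=True).sum() / 1024 ** 2
print("int columns: {:.2f} MB -> {:.2f} MB".format(before_mb, after_mb))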
3. Object Optimization Using Categorical
Categorical is a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples: Vehicle Body Type, Vehicle Make, Issuing Agency, Registration State, Plate Type, Street Name, and more. So in simple terms, we are replacing strings of varying lengths with small fixed-size integer codes that point into a table of unique values.
for dtype in ['float', 'int', 'object']:
    selected_dtype = data_tickets.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Average memory usage for float columns: 65.57 MB
Average memory usage for int columns: 62.49 MB
Average memory usage for object columns: 476.72 MB
Now let's see how to implement Categorical.
vehicle_make = data_tickets['Vehicle Make']
print(vehicle_make.head())
# 0     AUDI
# 1     FORD
# 2    CHEVR
# 3     FORD
# 4      GMC
# Name: Vehicle Make, dtype: object  [533.27 MB]

vehicle_make_cat = vehicle_make.astype('category')
print(vehicle_make_cat.head())
# 0     AUDI
# 1     FORD
# 2    CHEVR
# 3     FORD
# 4      GMC
# Name: Vehicle Make, dtype: category  [18 MB]
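The same conversion can be applied in bulk to every object column that has few distinct values relative to the number of rows. A rough sketch (the 50% uniqueness threshold is my own rule of thumb, not a pandas rule):
# Convert object columns to category when fewer than half of their values are unique
for col in data_tickets.select_dtypes(include=['object']).columns:
    if data_tickets[col].nunique() / len(data_tickets) < 0.5:
        data_tickets[col] = data_tickets[col].astype('category')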
4. Using del and gc.collect()
Datasets loaded into RAM are not freed on their own, and the intermediate copies we create while transforming the original dataset keep consuming memory as well. Therefore, we should build the habit of deleting DataFrames once they are no longer used.
import gc
Put this at the beginning of your project; then, each time you want to free up space, delete the DataFrame you no longer need and call gc.collect():
del data_tickets
gc.collect()
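The same applies to the intermediate DataFrames created in the earlier steps once they have served their purpose:
# The intermediate frames from the downcasting step can be released too
del data_tickets_ints, data_tickets_uints
gc.collect()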