Parallel Processing with Pandas

Vasista Reddy
3 min read · Jul 9, 2019


  • Does your pre-processing/cleaning task on a dataset with pandas take too long?
  • Are most of your cores sitting idle while only a single core does the work?
Single Core

The multiprocessing module lets us process a pandas dataset in parallel across all cores. With it you can reach 100% core utilization and cut processing time dramatically.

Multi-Core

Explanation with an Example

The dataset is taken from Kaggle. It contains Wikipedia movie plots, 34,886 records in total. For a classified news dataset from any source, you can contact our sales team.

Wikipedia Movie Plot Dataset

Columns of the dataset are [‘Release Year’, ‘Title’, ‘Origin/Ethnicity’, ‘Director’, ‘Cast’, ‘Genre’, ‘Wiki Page’, ‘Plot’]

Pre-Processing Task

Clean the “Plot” column of the dataset. Pre-processing steps: removing stop words, removing special characters, and collapsing extra spaces.

import re
import string

from nltk.corpus import stopwords

# translation table that strips all punctuation characters
t = str.maketrans(dict.fromkeys(string.punctuation))

def clean_text(text):
    # Remove stop words (note: set() also drops duplicates and word order)
    stops = set(stopwords.words("english"))
    text = " ".join(list(set(text.lower().split()) - stops))
    # Remove special characters
    text = text.translate(t)
    # Remove extra spaces
    text = re.sub(' +', ' ', text)
    return text

Save the above script as “clean.py”. Keeping clean_text in its own module also makes it importable by the worker processes that Pool.map spawns. (The NLTK stopwords corpus must be downloaded once, e.g. `python -m nltk.downloader stopwords`.)

On this dataset, cleaning takes 12.24s sequentially and 2.44s with parallel processing. Here is the script to reproduce the test.

import pandas as pd
import multiprocessing as mp
import time
from clean import clean_text
df = pd.read_csv("wikipedia-movie-plots/wiki_movie_plots_deduped.csv") # file loading
print("Columns of the dataset", list(df.columns))
print("Total records of the dataset", len(df))
# Before Parallel Processing
df1 = df.copy()
t1 = time.time()
df1['Plot'] = df1['Plot'].apply(clean_text)
t2 = time.time()
print("time consuming before Parallel Processing to process the Dataset {0:.2f}s".format(round(t2-t1, 2)))
# After Parallel Processing
p = mp.Pool(mp.cpu_count()) # data-parallel worker pool, one process per core
df2 = df.copy()
t3 = time.time()
df2['Plot'] = p.map(clean_text, df2['Plot'])
t4 = time.time()
p.close() # release the worker processes
p.join()
print("time consuming after Parallel Processing to process the Dataset {0:.2f}s".format(round(t4-t3, 2)))
"""
Output
-------------------------------------------------------------------------------
Columns of the dataset ['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot']
Total records of the dataset 34886
time consuming before Parallel Processing to process the Dataset 12.24s
time consuming after Parallel Processing to process the Dataset 2.44s
"""

Alternative Option:

pandarallel - multiprocessing on pandas DataFrames with a near drop-in API. Memory efficient, with proper CPU utilization. On Windows it works only inside WSL (Windows Subsystem for Linux).
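A hedged sketch of the pandarallel route (assuming the package is installed; its `parallel_apply` replaces `apply`), with a sequential fallback and a toy cleaning function when it is not:

```python
import pandas as pd

def clean(text):
    # toy stand-in for a real cleaning function
    return text.strip().lower()

df = pd.DataFrame({"Plot": ["  A Hero Rises  ", "  THE SEQUEL  "]})

try:
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False)  # spawns one worker per core
    df["Plot"] = df["Plot"].parallel_apply(clean)
except ImportError:
    df["Plot"] = df["Plot"].apply(clean)  # sequential fallback

print(df["Plot"].tolist())  # ['a hero rises', 'the sequel']
```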

numpy.array_split can help you split a huge dataset into batches, so that each batch can be handed to a separate worker.
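For example (a sketch on a toy column; in practice each batch would go to a worker via Pool.map), numpy.array_split yields roughly equal batches even when the length is not evenly divisible:

```python
import numpy as np

plots = np.array([f"plot {i}" for i in range(10)])

# 10 rows into 4 batches -> sizes 3, 3, 2, 2
batches = np.array_split(plots, 4)

# each batch could be mapped to a separate worker process;
# here they are processed sequentially for illustration
cleaned = [s.upper() for batch in batches for s in batch]
print(len(batches), [len(b) for b in batches])  # 4 [3, 3, 2, 2]
```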

If you like the concept, please don’t forget to endorse my skills on Linkedin. Thanks!
