Parallel Processing with Pandas

Vasista Reddy
3 min read · Jul 9, 2019


  • Does your pre-processing/cleaning task on a dataset with pandas take too long?
  • Are most of your cores sitting idle while only a single core does the work?
Single Core

The multiprocessing module lets us process a pandas dataset in parallel across all cores. With it you can reach 100% core utilization and cut processing time dramatically.

Multi-Core

Explanation with an Example

The dataset is taken from Kaggle. It contains Wikipedia movie plots, 34,886 records in total. For a classified news dataset from any source, you can contact our sales team.

Wikipedia Movie Plot Dataset

Columns of the dataset are [‘Release Year’, ‘Title’, ‘Origin/Ethnicity’, ‘Director’, ‘Cast’, ‘Genre’, ‘Wiki Page’, ‘Plot’]

Pre-Processing Task

Clean the “Plot” column of the dataset. Pre-processing steps: removing stop words, removing special characters, and collapsing extra spaces.

import re
import string

from nltk.corpus import stopwords

# translation table that strips all punctuation characters
t = str.maketrans(dict.fromkeys(string.punctuation))

def clean_text(text):
    # Remove stop words (note: set() also drops duplicates and word order)
    stops = set(stopwords.words("english"))
    text = " ".join(list(set(text.lower().split()) - stops))
    # Remove special characters
    text = text.translate(t)
    # Remove extra spaces
    text = re.sub(' +', ' ', text)
    return text

Save the above script as “clean.py”. Keeping clean_text in its own module also makes it importable by the worker processes that Pool.map spawns. (The NLTK stopwords corpus must be downloaded once, e.g. `python -m nltk.downloader stopwords`.)

On this dataset, cleaning takes 12.24s sequentially and 2.44s with parallel processing. Here is the script to reproduce the test.

import pandas as pd
import multiprocessing as mp
import time
from clean import clean_text
df = pd.read_csv("wikipedia-movie-plots/wiki_movie_plots_deduped.csv") # file loading
print("Columns of the dataset", list(df.columns))
print("Total records of the dataset", len(df))
# Before Parallel Processing
df1 = df.copy()
t1 = time.time()
df1['Plot'] = df1['Plot'].apply(clean_text)
t2 = time.time()
print("time consuming before Parallel Processing to process the Dataset {0:.2f}s".format(round(t2-t1, 2)))
# After Parallel Processing
p = mp.Pool(mp.cpu_count()) # data-parallel worker pool, one process per core
df2 = df.copy()
t3 = time.time()
df2['Plot'] = p.map(clean_text, df2['Plot'])
t4 = time.time()
p.close() # release the worker processes
p.join()
print("time consuming after Parallel Processing to process the Dataset {0:.2f}s".format(round(t4-t3, 2)))
"""
Output
-------------------------------------------------------------------------------
Columns of the dataset ['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast', 'Genre', 'Wiki Page', 'Plot']
Total records of the dataset 34886
time consuming before Parallel Processing to process the Dataset 12.24s
time consuming after Parallel Processing to process the Dataset 2.44s
"""

Alternative Option:

pandarallel - multiprocessing on pandas DataFrames with a near drop-in API. Memory efficient, with proper CPU utilization. On Windows it works only inside WSL (Windows Subsystem for Linux).
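A hedged sketch of the pandarallel route (assuming the package is installed; its `parallel_apply` replaces `apply`), with a sequential fallback and a toy cleaning function when it is not:

```python
import pandas as pd

def clean(text):
    # toy stand-in for a real cleaning function
    return text.strip().lower()

df = pd.DataFrame({"Plot": ["  A Hero Rises  ", "  THE SEQUEL  "]})

try:
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False)  # spawns one worker per core
    df["Plot"] = df["Plot"].parallel_apply(clean)
except ImportError:
    df["Plot"] = df["Plot"].apply(clean)  # sequential fallback

print(df["Plot"].tolist())  # ['a hero rises', 'the sequel']
```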

numpy.array_split can help you split a huge dataset into batches, so that each batch can be handed to a separate worker.
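For example (a sketch on a toy column; in practice each batch would go to a worker via Pool.map), numpy.array_split yields roughly equal batches even when the length is not evenly divisible:

```python
import numpy as np

plots = np.array([f"plot {i}" for i in range(10)])

# 10 rows into 4 batches -> sizes 3, 3, 2, 2
batches = np.array_split(plots, 4)

# each batch could be mapped to a separate worker process;
# here they are processed sequentially for illustration
cleaned = [s.upper() for batch in batches for s in batch]
print(len(batches), [len(b) for b in batches])  # 4 [3, 3, 2, 2]
```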

If you like the concept, please don’t forget to endorse my skills on Linkedin. Thanks!
