Deep Learning Vs Machine Learning with Text Classification

Vasista Reddy
8 min read · Mar 21, 2023


Introduction

Outline

Deep Learning, Machine Learning, and Artificial Intelligence are among the most popular buzzwords in technology today.

Artificial Intelligence is the branch of computer science concerned with building machines that can think, act, and behave like humans.

Machine Learning is a subset of Artificial Intelligence and one way to implement it. It is a statistical approach in which each instance in a data set is described by a set of features or attributes. Feature extraction is key in Machine Learning.

Deep Learning is the next evolution and a subset of Machine Learning. It is a method of statistical learning that extracts features or attributes from raw data. Deep Learning uses a network of algorithms called artificial neural networks, which imitate the neural networks of the human brain. Deep Learning passes the data through a network of layers (Input, Hidden & Output) to extract features and learn from the data.

In Machine Learning and Deep Learning, models fall into different categories such as supervised, unsupervised, and reinforcement learning. In this tutorial, we will discuss supervised learning, where an output label is associated with each instance in the data set.

Supervised Learning Model Flow Chart

Data Set

Data is the new oil

The above quote says it all. Technologies like deep learning and machine learning, and the networks of algorithms behind them, grew in popularity but are data-hungry: the more data you feed them, the better the accuracy you get.

For our tutorial, the data set has been taken from Kaggle. I have chosen 10 of the 40 labels in the data set, with 2K values per label. The data set is categorized with the following 10 labels: Business, Crime, Entertainment, Food & Drink, Politics, Religion, Science, Sports, Technology, and Travel.
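A rough sketch of how such a subset could be built with pandas is below; the file name and column names are assumptions and should be adapted to the actual Kaggle export.

import pandas as pd

# hypothetical file and column names; adjust to the actual Kaggle export
raw = pd.read_csv("kaggle_news_dataset.csv")  # assumed columns: "content", "label"

chosen_labels = ["Business", "Crime", "Entertainment", "Food & Drink", "Politics",
                 "Religion", "Science", "Sports", "Technology", "Travel"]

# keep the 10 chosen labels and sample 2,000 rows per label (about 20K rows total)
subset = raw[raw["label"].isin(chosen_labels)]
df = (subset.groupby("label", group_keys=False)
            .apply(lambda g: g.sample(n=2000, random_state=42))
            .reset_index(drop=True))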

Dataset Overview

We split the data set into three samples, Train, Validate, and Test, with 80%, 10%, and 10% of the data respectively. Train and Validate are used for training and validating the model, whereas the Test sample is used to evaluate the model with the F1-score; it is data the model never encountered during training. We then compare the F1-scores of the Machine Learning and Deep Learning models at different data sizes and epoch counts.
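A minimal sketch of that 80/10/10 split with scikit-learn, assuming df is the dataframe built above with "content" and "label" columns:

from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% evenly into validate and test,
# stratifying so every label keeps its proportion in each sample
train_df, rest_df = train_test_split(df, test_size=0.20, stratify=df["label"], random_state=42)
valid_df, test_df = train_test_split(rest_df, test_size=0.50, stratify=rest_df["label"], random_state=42)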

The data sets are uploaded here.

Data-Cleaning

Pre-processing of the data has an impact on the output, i.e., the accuracy and performance of the model. Some of the data-cleaning steps are as follows:

  • Removing Stop Words. (NLTK)
  • Performing Stemming on the text. (NLTK)
  • Removing special characters and extra spaces, i.e., keeping only alphanumeric characters in the text.
# NLTK python module for stemming and stopwords removal
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import string, re

stemmer = SnowballStemmer('english') # stemmer
t = str.maketrans(dict.fromkeys(string.punctuation)) # special char removal

def clean_text(text):
    ## Remove Punctuation
    text = text.translate(t)
    text = text.split()
    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [stemmer.stem(w) for w in text if w not in stops]
    text = " ".join(text)
    text = re.sub(' +', ' ', text) # extra consecutive space removal
    return text
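The cleaner is then applied to every document before vectorization. A short usage sketch, assuming the "content" column used later in the article (the stop-word list needs to be downloaded once per environment):

import nltk
nltk.download("stopwords") # one-time download of the NLTK stop-word list

df["content"] = df["content"].astype(str).apply(clean_text)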

Goal

Our goal is to find out which of the two approaches, deep learning or machine learning, performs the classification task better. Text classification is an example of a supervised machine learning and natural language processing task.

The goal of text classification is to automatically classify text documents into one or more predefined categories.

Dive into Deep Learning

The fundamental network architectures of neural networks are

  1. Convolutional Neural Networks (CNN)
  2. Recurrent Neural Networks (RNN)
  3. Recursive/Hierarchical Neural Networks (HAN)
  4. Unsupervised Pre-trained Networks

For this tutorial, we focus only on CNNs. CNNs show promising results compared to the other architectures, especially for the classification of images and text, and the other architectures are also more time-consuming to train.

A CNN is a class of deep, feed-forward artificial neural network that uses a variation of multilayer perceptrons designed to require minimal preprocessing. Instead of connecting each input neuron to each output neuron in the next layer, as in a fully connected network, we apply convolutions over the input layer to compute the output.

CNN WorkFlow

CNNs are generally used in computer vision; however, they have recently been applied to various NLP tasks such as text classification and sentiment analysis, and the results are promising. Unlike image pixels, in our case a character or word is the input to the network. The text of a sentence is represented as a matrix in which each row is a vector that represents a word. Typically, these vectors are word embeddings such as GloVe (Global Vectors for Word Representation) or word2vec, or TF-IDF weighted vectors, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional GloVe embedding, we would have a 10×100 matrix as our input.
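A hedged sketch of building such a matrix and the embedding layer used by the CNN code below, assuming a locally downloaded glove.6B.100d.txt file and the train_df split from above (depending on the TensorFlow/Keras version, these utilities may live under tensorflow.keras rather than keras):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

# map each word to an integer index and pad every document to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df["content"])
sequences = tokenizer.texts_to_sequences(train_df["content"])
x_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# load the GloVe word -> vector lookup (file path is an assumption)
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# each row of this matrix is the 100-dimensional vector for one vocabulary word
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

embedding_layer = Embedding(len(tokenizer.word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)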

In a CNN, we can make the convolutions detect patterns such as n-grams by varying the kernel size and concatenating the convolution outputs.

For example, take the sentence "I like this movie very much". With convolution kernel sizes of [3, 4, 5], the patterns detected from the sentence include "I like this", "like this movie", "I like this movie", "I like this movie very", and so on, which are very useful in the next layers of the network. Let's have a look at the following diagram.

Source: Google (altered)
from keras.layers import Input, Conv1D, MaxPooling1D, Concatenate, Dropout, Flatten, Dense

convs = []
filter_sizes = [3, 4, 5]
MAX_SEQUENCE_LENGTH = 1000

# embedding_layer is the GloVe-initialised Embedding layer built from the training vocabulary
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# one convolution + pooling branch per kernel size, concatenated afterwards
for filter_size in filter_sizes:
    l_conv = Conv1D(filters=128, kernel_size=filter_size, activation='relu')(embedded_sequences)
    l_pool = MaxPooling1D(5)(l_conv)
    convs.append(l_pool)

l_merge = Concatenate(axis=1)(convs)
l_cov1 = Conv1D(filters=128, kernel_size=5, activation='relu')(l_merge)
l_cov1 = Dropout(0.2)(l_cov1)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(filters=128, kernel_size=5, activation='relu')(l_pool1)
l_cov2 = Dropout(0.2)(l_cov2)
l_pool2 = MaxPooling1D(30)(l_cov2)
l_flat = Flatten()(l_pool2)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(10, activation='softmax')(l_dense) # 10 labels = 10 output nodes

The above convolutional architecture uses 128 filters for each of the kernel sizes 3, 4, and 5, with max-pooling sizes of 5 and 30.

Model Network Architecture

The Output shape of the final layer (dense_2) changes with the number of categories in the data set. Since our data set has 10 categories, the final layer shape is 10.
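The article does not show the training call itself; a minimal, hedged sketch of compiling and fitting the graph defined above could look like the following. The optimizer, batch size, and epoch count are assumptions, and x_train/y_train and x_val/y_val stand for the padded sequences and one-hot encoded labels of the Train and Validate samples.

from keras.models import Model
from keras.utils import to_categorical

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# y_train/y_val are one-hot label matrices, e.g. to_categorical(train_labels, num_classes=10)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=10, batch_size=128)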

Dropout randomly drops neurons during training to prevent the over-fitting problem in neural networks. It is a regularization approach that helps reduce interdependent learning among the neurons. In classical Machine Learning, we instead prevent over-fitting by adding a penalty term to the loss function.

Batch normalization is another method to regularize a convolutional network.

How the inclusion of Dropout and/or Batch Normalization affects the performance of the network model is a topic for another discussion.
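As a hedged illustration (not part of the architecture above), Batch Normalization would typically slot in right after a convolution, before the dropout and pooling:

from keras.layers import BatchNormalization

# one convolution block with batch normalization added after the convolution
l_cov1 = Conv1D(filters=128, kernel_size=5, activation='relu')(l_merge)
l_cov1 = BatchNormalization()(l_cov1)
l_cov1 = Dropout(0.2)(l_cov1)
l_pool1 = MaxPooling1D(5)(l_cov1)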

The complete code of “text classification using Deep Learning-CNN” is here.

Working with the Machine Learning algorithms

In deep learning there are different networks and we chose the CNN; in machine learning, the counterpart of the network is the algorithm. Those algorithms include:

  • Naive Bayes.
  • Decision Trees.
  • Logistic Regression (Linear Model).
  • Support Vector Machines (SVM).
  • Random Forest.
  • K-Means Clustering.
  • K-Nearest Neighbour.
  • Gaussian Mixture Model.
  • Hidden Markov Model, et cetera.

Among these ML algorithms, we will discuss how the Naive Bayes, Logistic Regression, and SVM classifier models perform on the data-set feature vectors.

We transform the data into feature vectors with the following methods:

  • Count Vectors.
  • TF-IDF Word Vectors.
  • TF-IDF N-Gram Vectors.
from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

'''Assume df is the dataset with columns "content" and "label"'''

# split the data into training and validation
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['content'], df['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the fitted encoder; do not refit on validation data
target_names = list(encoder.classes_)  # output labels for report generation

# count vectorization
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['content'])
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

# word level tf-idf vectorization
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['content'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# ngram level tf-idf vectorization
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(df['content'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)

We use the Naive Bayes, Logistic Regression, and SVM algorithms to train classifier models on these feature vectors, which are then used for prediction.

from sklearn import naive_bayes, metrics
from sklearn.metrics import classification_report

def report_generation(classifier, train_data, valid_data, train_y, valid_y):
    # fit the classifier on the training vectors and report metrics on the validation vectors
    classifier.fit(train_data, train_y)
    predictions = classifier.predict(valid_data)
    print("Accuracy :", metrics.accuracy_score(predictions, valid_y))
    report = classification_report(valid_y, predictions, output_dict=True, target_names=target_names)
    return report

# Naive Bayes
classifier = naive_bayes.MultinomialNB()
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)
print("NB Count Vectorizer Report :", report['weighted avg'])

# Results
Accuracy : 0.7948773138183384
NB Count Vectorizer Report : {'precision': 0.7967624232183244, 'recall': 0.7948773138183384, 'f1-score': 0.7941004411892886, 'support': 4646}

Similarly, Logistic Regression and SVM classifiers are passed to get the classification report.

from sklearn import linear_model, svm

# Logistic Regression
classifier = linear_model.LogisticRegression()
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)
print("LogisticRegression Count Vectorizer Report :", report['weighted avg'])

report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)
print("LogisticRegression TFIDF-Word Report :", report['weighted avg'])
report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)
print("LogisticRegression TFIDF-NGram Report :", report['weighted avg'])
# Support Vector Machines
classifier = svm.SVC(gamma="scale")
report = report_generation(classifier, xtrain_count, xvalid_count, train_y, valid_y)
print("SVM Count Vectorizer Report :", report['weighted avg'])
# Results
Accuracy : 0.9668
SVM Count Vectorizer Report : {'precision': 0.9687847838287942, 'recall': 0.9668, 'f1-score': 0.9672306318670637, 'support': 2500}
report = report_generation(classifier, xtrain_tfidf, xvalid_tfidf, train_y, valid_y)
print("SVM TFIDF-Word Report :", report['weighted avg'])
# Results
Accuracy : 0.9804
SVM TFIDF-Word Report : {'precision': 0.980766234757573, 'recall': 0.9804, 'f1-score': 0.9804795388691244, 'support': 2500}

report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)
print("SVM TFIDF-NGram Report :", report['weighted avg'])
# Results
Accuracy : 0.9304
SVM TFIDF-NGram Report : {'precision': 0.9324797933370057, 'recall': 0.9304, 'f1-score': 0.9306949638900389, 'support': 2500}
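To get the test-sample F1-score mentioned at the start, the fitted vectorizer, label encoder, and classifier can be applied to the held-out split. A minimal sketch, assuming the hypothetical test_df hold-out dataframe from the 80/10/10 split above and refitting the best ML setup (SVM on word-level TF-IDF):

from sklearn import svm
from sklearn.metrics import f1_score

# refit on the training vectors, then score the untouched test split
final_clf = svm.SVC(gamma="scale")
final_clf.fit(xtrain_tfidf, train_y)

xtest_tfidf = tfidf_vect.transform(test_df["content"])
test_y = encoder.transform(test_df["label"])
print("Test weighted F1 :", f1_score(test_y, final_clf.predict(xtest_tfidf), average="weighted"))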

The complete code of “text classification using Machine Learning-Naive Bayes, Logistic Regression and SVM” is here.

Results

Observations

  • CNN outperformed ML algorithms such as Naive Bayes, SVM, and Logistic Regression in all three cases. Neural networks showed tremendous accuracy on the NLP text-classification task.
  • Among the ML algorithms, Naive Bayes with Count Vectorization has the better accuracy.
  • Increasing the number of epochs increased the CNN's accuracy on the respective data set.
  • More epochs generally mean better model accuracy, given a sufficient sample size.


Written by Vasista Reddy

Works at Cognizant. Ex-Turbolab-ian and loves trekking…. Reach_Me_Out_on_Linkedin: https://www.linkedin.com/in/vasista-reddy-100a852b/
