Preparing Text Data with scikit-learn — Feature Extraction
In this tutorial, we will discuss preparing text data so that a machine learning algorithm can draw features from it for efficient predictive modeling.
In machine words, "just turn the text data into vectors." I can't understand your beep language.
“Torture the data, and it will confess to anything.” — Ronald Coase
There are a few methods to transform text data into vectors:
1. Bag of Words Model — Find the unique words, i.e., the vocabulary, from the list of documents. Check each document against the vocabulary: mark '1' if a word is present, else '0'. This makes every document vector the same length as the vocabulary. The same vocabulary can then be reused to vectorize new documents (a sketch follows the output below).
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE", "I hate this phone"]words = list(set([word for doc in docs for word in doc.lower().split()]))vectors = []
for doc in docs:
vectors.append([1 if word in doc.lower().split() else 0 for word in words])
print("vocabulary: ", words)
print("vectors: ", vectors)
--------------------------------------------------------------------
vocabulary: ['am', 'hate', 'i', 'in', 'love', 'phone', 'superb,', 'this']
vectors: [[1, 0, 1, 1, 1, 1, 1, 1], [0, 1, 1, 0, 0, 1, 0, 1]]
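As noted above, the same vocabulary can be reused to vectorize a new document; any word that is not in the vocabulary is simply ignored. A minimal sketch, assuming the previous snippet has already run (the new sentence is just a made-up example):

# encode an unseen document against the vocabulary built above
new_doc = "I love this superb phone"
new_vector = [1 if word in new_doc.lower().split() else 0 for word in words]
print("new vector: ", new_vector)

Note that "superb" here does not match the vocabulary entry 'superb,' because this naive split keeps punctuation attached to the word. CountVectorizer, used in the next method, handles tokenization and punctuation for us.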
2. Word Counts with CountVectorizer (scikit-learn) — Tokenize the collection of documents, form a vocabulary from it, and use this vocabulary to encode new documents. We can use the CountVectorizer class of the scikit-learn library. By default it removes punctuation and lowercases the documents.
from sklearn.feature_extraction.text import CountVectorizer

# list of documents
docs = ['SUPERB, I AM IN LOVE IN THIS PHONE', 'I hate this phone']

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(docs)
print('vocabulary: ', vectorizer.vocabulary_)

# encode the documents
vector = vectorizer.transform(docs)

# summarize encoded vectors
print('shape: ', vector.shape)
print('vectors: ', vector.toarray())
--------------------------------------------------------------------
vocabulary: {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
shape: (2, 7)
vectors: [[1 0 2 1 1 1 1] [0 1 0 0 1 0 1]]
The transform returns a sparse matrix; toarray() converts it to a dense array for printing. For each document, the encoding records how many times each vocabulary word occurs. The word "in" appears twice in the first document, so a '2' shows up in the first vector. (Note that the default tokenizer only keeps tokens of two or more characters, which is why "I" is missing from this vocabulary.)
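The fitted vectorizer can likewise encode documents it has never seen; words outside the learned vocabulary are simply dropped. A quick sketch (the sentence below is just an illustration):

# encode an unseen document with the vocabulary learned above
new_vector = vectorizer.transform(['I love love this camera'])
print(new_vector.toarray())
# 'love' is counted twice, 'this' once, and 'camera' is ignored:
# [[0 0 0 2 0 0 1]]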
3. Word Frequencies with TfidfVectorizer (scikit-learn) — Word counts are pretty basic. In the first document, the word "in" is repeated, yet we can't draw any meaning from it. Stop words can occur many times in a document, and word counts simply reward frequent occurrence. With raw counts we lose the interesting words and mostly give priority to stop words and other words that carry little meaning.
TF-IDF is a popular method; the acronym stands for "Term Frequency and Inverse Document Frequency". TF-IDF produces word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
There are a few types of weighting schemes for tf-idf in general. Let's see how scikit-learn calculates tf*idf.
From the scikit-learn documentation: "The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf". You can check it out on this page.
from sklearn.feature_extraction.text import TfidfVectorizer

# list of documents
docs = ["SUPERB, I AM IN LOVE IN THIS PHONE",
        "I hate this phone"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(docs)

# summarize
print('vocabulary: ', vectorizer.vocabulary_)
print('idfs: ', vectorizer.idf_)

# encode the first document
vector = vectorizer.transform([docs[0]])

# summarize encoded vector
print('vectors: ', vector.toarray())
--------------------------------------------------------------
vocabulary: {'superb': 5, 'am': 0, 'in': 2, 'love': 3, 'this': 6, 'phone': 4, 'hate': 1}
idfs: [1.40546511 1.40546511 1.40546511 1.40546511 1. 1.40546511 1. ]
vectors: [[0.35327777 0. 0.70655553 0.35327777 0.25136004 0.35327777 0.25136004]]
Analysis — the idf per term is calculated as:

idf(t) = ln((1 + nd) / (1 + df(d, t))) + 1

where nd is the number of documents and df(d, t) is the number of documents in which the term t appears. Remember: this is the natural log (base e).
Example: Let's take the word "phone". It appears in both documents, so ln((1 + 2) / (1 + 2)) + 1 == ln(3/3) + 1 == 1. idfs[4] is 1, which you can check in the snippet above.
Let's take another word, "love". It appears in only one document, so ln((1 + 2) / (1 + 1)) + 1 == ln(3/2) + 1 == 1.40546511. idfs[3] is 1.40546511.
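We can verify these values directly with a couple of lines of numpy (just a sanity check of the formula above):

import numpy as np

n_docs = 2  # total number of documents
# smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
idf_phone = np.log((1 + n_docs) / (1 + 2)) + 1   # "phone" appears in 2 documents
idf_love = np.log((1 + n_docs) / (1 + 1)) + 1    # "love" appears in 1 document
print(idf_phone, idf_love)   # prints 1.0 and about 1.40546511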
The final step is vector normalization; by default, scikit-learn applies 'l2' normalization to each document vector, so the sum of squares of the values in each document vector is always 1.
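This is easy to confirm on the encoded vector from the snippet above:

import numpy as np

dense = vector.toarray()            # tf-idf encoding of the first document
print(np.sum(np.square(dense)))     # ~1.0, thanks to l2 normalization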
Tf-idf is the best vectorization method among these three because it weights the words within each document. The IDF value for the word "this" is lower since it appears in both documents. So, unlike word counts, which give high values to stop words like "in" and "this", tf-idf lowers a word's weight when it appears in many documents, which is exactly what stop words tend to do.
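To tie the pieces together, here is a small sketch that reproduces scikit-learn's output for the first document by hand: take the raw term counts, multiply by the idfs printed above, and l2-normalize.

import numpy as np

# term counts for the first document, in vocabulary index order
# {'am': 0, 'hate': 1, 'in': 2, 'love': 3, 'phone': 4, 'superb': 5, 'this': 6}
tf = np.array([1, 0, 2, 1, 1, 1, 1], dtype=float)
idf = np.array([1.40546511, 1.40546511, 1.40546511, 1.40546511, 1.0, 1.40546511, 1.0])

tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf)      # l2 normalization
print(tfidf)
# approximately [0.35327777 0. 0.70655553 0.35327777 0.25136004 0.35327777 0.25136004]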
Please check this tutorial for using tf-idf in SVM-based sentiment analysis.
Thanks for reading. If you like the concept, please don't forget to endorse my skills on LinkedIn.