The Basic Text Vectorization Techniques

Tungon Dugi
7 min read · Jan 23, 2024


Image by Bellinon from Pixabay

Text vectorization is a fundamental process in Natural Language Processing (NLP) that involves converting text data into numerical vectors. Since machine learning algorithms and models typically work with numerical data, text vectorization is crucial for enabling these algorithms to process and analyze textual information effectively.

In simple terms, text vectorization transforms each piece of text, such as a document, sentence, or word, into a numerical representation. This numerical representation captures the semantic meaning and relationships between words, allowing machines to understand and learn patterns from the text.

There are many text vectorization techniques available, but here I will cover only the most basic ones.

Bag of Words (BoW):

Bag of Words (BoW) is one of the simplest vectorization methods for text data.

It simplifies text analysis by disregarding word order. Though lacking context, BoW is efficient for tasks like document classification.

The resulting vector generally has the length of the entire vocabulary, i.e., the number of unique words in the corpus.

The BoW technique essentially involves two main steps:
1. Tokenizing the words,
2. Creating a dictionary.
While there are additional pre-processing steps, these two form the core of implementing BoW.

Example:
Document 1: Tungon and Tado are sick
Document 2: Tungon went to a doctor
Document 3: Tado went to buy medicine

Tokenizing the words:

Document 1: [Tungon, and, Tado, are, sick]
Document 2: [Tungon, went, to, a, doctor]
Document 3: [Tado, went, to, buy, medicine]

Creating a dictionary:
a. Listing all the unique words in the corpus:

Tungon, and, Tado, are, sick, went, to, doctor, buy, medicine

Here the word "a" is ignored because it is a stop word and carries little significance.

b. Generating Sparse Matrix or Document Vector Table:

Frequency of words in their respective documents:

            Tungon  and  Tado  are  sick  went  to  doctor  buy  medicine
Document 1     1     1     1    1     1     0    0     0     0      0
Document 2     1     0     0    0     0     1    1     1     0      0
Document 3     0     0     1    0     0     1    1     0     1      1

c. Generating Dictionary:

Frequency of words in the entire corpus:

Tungon: 2, and: 1, Tado: 2, are: 1, sick: 1, went: 2, to: 2, doctor: 1, buy: 1, medicine: 1
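
To make these two core steps concrete in code, here is a minimal pure-Python sketch of the same idea (the variable names and the tiny stop-word list are my own, used only for illustration):

# Minimal sketch of the two BoW steps: tokenizing and creating the dictionary
from collections import Counter

documents = [
    "Tungon and Tado are sick",
    "Tungon went to a doctor",
    "Tado went to buy medicine",
]
stop_words = {"a"}  # assumed stop-word list for this small example

# Step 1: tokenize each document into lowercase words, dropping stop words
tokenized = [[w.lower() for w in doc.split() if w.lower() not in stop_words]
             for doc in documents]

# Step 2: create the dictionary (frequency of each word in the entire corpus)
dictionary = Counter(word for doc in tokenized for word in doc)
print(dictionary)
# Counter({'tungon': 2, 'tado': 2, 'went': 2, 'to': 2, 'and': 1, ...})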

Drawbacks:

  1. The BoW technique does not preserve word order (see the short sketch below).
  2. It does not allow us to draw useful inferences for downstream NLP tasks.
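
To see the first drawback in action, the short sketch below (my own example sentences, not from the corpus above) shows that two sentences with opposite meanings produce identical BoW vectors:

# Word order is lost: opposite meanings, identical BoW vectors
from sklearn.feature_extraction.text import CountVectorizer

pair = ["the dog chased the cat", "the cat chased the dog"]

vec = CountVectorizer()
bow = vec.fit_transform(pair).toarray()

print(vec.get_feature_names_out())  # ['cat' 'chased' 'dog' 'the']
print(bow)                          # both rows are [1 1 1 2]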

Implementation in Python

a. Generating Sparse Matrix

# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Define a corpus of documents
corpus = ['Tungon and Tado are sick', 'Tungon went to a doctor', 'Tado went to buy medicine']

# Create a CountVectorizer instance
count_vectorizer = CountVectorizer()

# Fit and transform the corpus using CountVectorizer
x = count_vectorizer.fit_transform(corpus)

# Get feature names (unique words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Convert the result to a dense array
result = x.toarray()

# Display the feature names (columns) and the word frequencies in each document
print(feature_names)
print(result)
Output:

['and' 'are' 'buy' 'doctor' 'medicine' 'sick' 'tado' 'to' 'tungon' 'went']
[[1 1 0 0 0 1 1 0 1 0]
 [0 0 0 1 0 0 0 1 1 1]
 [0 0 1 0 1 0 1 1 0 1]]

b. Generating Word-Frequency Table

# Combine the documents in the corpus into a single text
text = corpus[0] + ". " + corpus[1] + ". " + corpus[2]

# Define a function for creating a bag-of-words representation using scikit-learn
def bag_of_words_sklearn(text):
    # Create a CountVectorizer instance
    vectorizer = CountVectorizer()

    # Fit and transform the text using CountVectorizer
    X = vectorizer.fit_transform([text])

    # Get feature names (unique words) from CountVectorizer
    feature_names = vectorizer.get_feature_names_out()

    # Convert the sparse matrix to a dictionary of word counts
    word_counts = dict(zip(feature_names, X.toarray()[0]))

    return word_counts

# Call the function with the combined text
result_sklearn = bag_of_words_sklearn(text)

# Create a DataFrame to display the word frequencies
df = pd.DataFrame(list(result_sklearn.items()), columns=["Words", "Frequency"])

# Print the DataFrame
print(df)
Output: 

      Words  Frequency
0       and          1
1       are          1
2       buy          1
3    doctor          1
4  medicine          1
5      sick          1
6      tado          2
7        to          2
8    tungon          2
9      went          2

TF-IDF Vectorization:

The Bag of Words (BoW) method is straightforward but treats all words equally: it cannot distinguish very common words from rare but informative ones. TF-IDF (Term Frequency-Inverse Document Frequency) solves this by giving more weight to words that are important in a document but not too common across all documents.

TF-IDF stands for term frequency-inverse document frequency. It gives a measure of a word's importance based on how frequently it occurs both in a document and across the corpus.

Understanding TF and IDF individually first will help us grasp the technique better.

Term Frequency (TF):

Term frequency is basically a measure of how often a specific word appears in a piece of writing, like an article or a document.

To calculate it, you just divide the number of times that word shows up by the total number of words in the document. It helps us understand how important or common a particular word is in a given text. So, the higher the term frequency, the more frequently that word appears in the document.

It is the ratio of the number of times a word (x) occurs in a particular document (y) to the total number of words in that document.

TF(“term”) = No. of times “term” appears in a document / Total no. of terms in the document

Example:
Document 1: She loves to play with the balls.

Find TF(“balls”)?

i.e., TF(“balls”) = no. of times “balls” appeared / total no. of terms in the document.
= 1/7
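
A quick sketch of this calculation (the whitespace tokenization below is a simplifying assumption; punctuation is ignored):

# Term frequency of "balls" in Document 1
document = "She loves to play with the balls"
tokens = document.lower().split()

tf_balls = tokens.count("balls") / len(tokens)
print(tf_balls)  # 1/7 ≈ 0.1428...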

Inverse Document Frequency (IDF):

It measures the importance of a word in the corpus, i.e., how common or rare that word is across all the documents in the corpus.

It is the logarithm of the ratio of the total number of documents to the number of documents containing a particular word.

IDF(“term”) = log(Total no. of documents / No. of documents with “term” in it)

If a word appears in many documents, the denominator increases, reducing the IDF value. The IDF is thus used to separate the common words from the rare words in a corpus.

Example:

In any corpus, few words like ‘is’ or ‘and’ are very common, and most likely, they will be present in almost every document.

Let’s say the word ‘is’ is present in all the documents in a corpus of 5000000 documents. The IDF for that would be:

IDF(“is”) = log(5000000/5000000) = log 1 = 0
Thus, the most common words will have the least importance.

TF-IDF(“term”) = TF(“term”) * IDF(“term”)

TF-IDF Formula:

W(x,y) = TF(x,y) * log(N/df(x))
where,
- W(x,y) = TF-IDF weight of word ‘x’ in document ‘y’
- TF(x,y) = frequency of word ‘x’ in document ‘y’
- N = total no. of documents
- df(x) = no. of documents containing ‘x’
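
To tie the pieces together, here is a minimal sketch that applies W(x,y) = TF(x,y) * log(N/df(x)) directly to the earlier three-document corpus (my own illustration of the raw formula; note that scikit-learn's TfidfVectorizer, used below, adds smoothing and normalization, so its numbers will differ):

# Raw TF-IDF from the formula above, computed by hand
import math

docs = [
    ["tungon", "and", "tado", "are", "sick"],
    ["tungon", "went", "to", "doctor"],
    ["tado", "went", "to", "buy", "medicine"],
]
N = len(docs)

def tf(term, doc):
    # frequency of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term):
    # log of (total documents / documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("tungon", docs[0]))  # in 2 of 3 docs -> lower weight (~0.081)
print(tf_idf("sick", docs[0]))    # in only 1 doc -> higher weight (~0.220)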

Python Implementation:

Consider documents:
Doc1: He is young.
Doc2: He loves us.
Doc3: He needs it.
Doc4: He stood up.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Define the corpus of documents
corpus = ['He is young.', 'He loves us.', 'He needs it.', 'He stood up.']

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Transform the input text data to a TF-IDF matrix
vectors = vectorizer.fit_transform(corpus)

# Get feature names (words) used in the TF-IDF matrix
feature_names = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a dense matrix and then to a list
matrix = vectors.todense()
list_dense = matrix.tolist()

# Create a Pandas DataFrame using the list of dense matrix values and feature names
df = pd.DataFrame(list_dense, columns=feature_names)

# Print the resulting DataFrame
print(df)

Output:

Important points of TF-IDF:

  1. A document-term matrix is generated, and each column represents an individual unique word, just like in count vectorization.
  2. In TF-IDF, the matrix cells don’t show how often a word appears (like in TF), but instead, they hold a weight. This weight reflects how significant a word is within a specific text message or document.
  3. This approach relies on the frequency method, yet it differs from count vectorization. It considers not only how often a word appears in a single document but also takes into account its occurrence across the entire collection of documents (corpus).
  4. TF-IDF assigns higher importance to less frequent events and lowers the importance of common events. This means it penalizes words that appear often in a document, like “the” or “is,” but gives more significance to words that are less common or rare.
  5. The product of TF (Term Frequency) and IDF (Inverse Document Frequency) for a word reveals how frequently the token appears in a specific document and how distinctive it is across the entire collection of documents (corpus).
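
Point 4 can be checked directly on the four-document example above: after fitting, TfidfVectorizer exposes the learned IDF weights through its idf_ attribute (a short sketch that continues from the earlier snippet and reuses its vectorizer object):

# Inspect the learned IDF weights from the fitted TfidfVectorizer
idf_weights = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
for word, weight in sorted(idf_weights.items(), key=lambda item: item[1]):
    print(f"{word}: {weight:.3f}")
# 'he' (present in all four documents) gets the lowest IDF,
# while words appearing in only one document get the highest.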

Conclusion:

The techniques discussed, such as TF-IDF and Bag of Words, grapple with the challenge of vector sparsity. This issue arises due to the vast number of unique words in a corpus, resulting in sparse matrices that can hinder meaningful analysis. Additionally, these methods lack the capacity to effectively capture complex word relationships, limiting their ability to model intricate semantic structures. Consequently, when confronted with long text sequences, they may fall short in providing a comprehensive representation.

Researchers are exploring more advanced approaches, like neural networks and deep learning, to address these limitations and enhance the understanding of intricate patterns within textual data.
