TF-IDF In NLP
Purpose

TF-IDF (Term Frequency–Inverse Document Frequency) scores how important a word is to a document within a collection. Words that occur often in one document but rarely across the collection receive high weights; words that appear in almost every document receive low weights.
Term Frequency (TF)

Term frequency measures how often a term t appears in a document d, normalized by the document's length:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)
Inverse Document Frequency (IDF)

IDF measures how rare a term is across a collection of N documents:

IDF(t) = log(N / df(t))

where df(t) is the number of documents containing t. The example below uses base-10 logarithms.
TF-IDF Weight

The TF-IDF weight of a term in a document is the product of the two quantities:

TF-IDF(t, d) = TF(t, d) × IDF(t)
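As a sketch, these formulas can be implemented directly in plain Python (the helper names below are my own, not from the article; base-10 log matches the worked example that follows):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of term / total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs_tokens):
    # Inverse document frequency: log10(N / number of docs containing term)
    n_containing = sum(1 for d in docs_tokens if term in d)
    return math.log10(len(docs_tokens) / n_containing)

def tf_idf(term, doc_tokens, docs_tokens):
    # TF-IDF weight = TF x IDF
    return tf(term, doc_tokens) * idf(term, docs_tokens)

docs = [
    "NLP is fun and exciting",
    "NLP is a branch of AI",
    "AI and ML are related",
]
tokens = [d.lower().split() for d in docs]

# "exciting" appears once in the 5-token first document and in 1 of 3 documents:
print(round(tf_idf("exciting", tokens[0], tokens), 3))  # 0.095
```

So a rare word like "exciting" gets a noticeably higher weight than a word like "nlp" that appears in two of the three documents.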
Example
D1: "NLP is fun and exciting"
D2: "NLP is a branch of AI"
D3: "AI and ML are related"
| Word | Document Frequency (documents containing the term) | IDF (log base 10) | Comment |
|---|---|---|---|
| NLP | 2 | log(3/2) = 0.18 | lower |
| AI | 2 | log(3/2) = 0.18 | lower |
| ML | 1 | log(3/1) = 0.48 | higher |
| exciting | 1 | log(3/1) = 0.48 | higher |
| fun | 1 | log(3/1) = 0.48 | higher |
| branch | 1 | log(3/1) = 0.48 | higher |
| related | 1 | log(3/1) = 0.48 | higher |
| is, and | 2 | log(3/2) = 0.18 | lower |
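The IDF column above can be verified with a short script (a quick check, using the same base-10 log as the table):

```python
import math

docs = [
    "NLP is fun and exciting",
    "NLP is a branch of AI",
    "AI and ML are related",
]
# Sets of lowercase tokens, one per document, for membership tests
tokenized = [set(d.lower().split()) for d in docs]
n = len(docs)

for word in ["nlp", "ai", "ml", "exciting"]:
    # df = number of documents containing the word
    df = sum(1 for d in tokenized if word in d)
    print(word, df, round(math.log10(n / df), 2))
# nlp 2 0.18
# ai 2 0.18
# ml 1 0.48
# exciting 1 0.48
```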
Example (scikit-learn)
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "NLP is fun and exciting",
    "NLP is a branch of AI",
    "AI and ML are related",
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents into a TF-IDF matrix
X = vectorizer.fit_transform(docs)

# Get feature names (vocabulary)
features = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for readability
tfidf_matrix = X.toarray()

# Print results as a DataFrame
df = pd.DataFrame(tfidf_matrix, columns=features)
print(df)
```
Output

The script prints a DataFrame with one row per document and one column per vocabulary word (the default tokenizer drops single-character tokens such as "a"), where each cell is that word's TF-IDF score in that document. Note that scikit-learn's default IDF uses natural log with smoothing, so the exact values differ from the hand-computed table above, but the ranking is the same: words unique to one document, such as "ml" and "exciting", score higher than shared words such as "nlp" and "ai".
Posted By: Karan Gupta
Posted On: Tuesday, September 16, 2025