
Bag Of Words In NLP






What Is A Bag Of Words?







Sparse Matrix







How Does A Bag Of Words Work?






Sentence 1: "I like NLP"
Sentence 2: "I like machine learning"










"I like NLP" → [1, 1, 1, 0, 0]  
"I like machine learning" → [1, 1, 0, 1, 1]
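
As a minimal sketch of how the two vectors above can be derived, the snippet below builds the vocabulary in first-seen order and counts each vocabulary word per sentence. The first-seen ordering and the whitespace split are assumptions made here so the columns line up with the vectors shown; the full script in this article sorts the vocabulary and removes stopwords instead.

```python
# Minimal sketch: reproduce the two example vectors by hand.
sentences = ["I like NLP", "I like machine learning"]

# Vocabulary in first-seen order (an assumption, chosen so the columns
# match the example vectors; a sorted vocabulary would reorder them).
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# One count per vocabulary word, per sentence.
vectors = [
    [sentence.split().count(word) for word in vocab]
    for sentence in sentences
]

print("Vocabulary:", vocab)
for sentence, vector in zip(sentences, vectors):
    print(f'"{sentence}" -> {vector}')
```

Each position in a vector corresponds to one vocabulary word, so a 0 means that word does not occur in the sentence.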




import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download required resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases
nltk.download('stopwords')

# Example corpus (documents)
documents = [
    "I like NLP",
    "I like machine learning",
    "NLP is fun and powerful"
]

# Step 1: Tokenize and clean text
stop_words = set(stopwords.words('english'))
cleaned_docs = []

for doc in documents:
    tokens = word_tokenize(doc.lower())  # tokenize & lowercase
    tokens = [t for t in tokens if t not in string.punctuation]  # remove punctuation
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    cleaned_docs.append(tokens)

print("Cleaned tokens:", cleaned_docs)

# Step 2: Build vocabulary
vocab = sorted(set(word for doc in cleaned_docs for word in doc))
print("\nVocabulary:", vocab)

# Step 3: Create Bag of Words vectors
def bow_vector(tokens, vocab):
    return [tokens.count(word) for word in vocab]

bow_matrix = [bow_vector(doc, vocab) for doc in cleaned_docs]

print("\nBag of Words Matrix:")
for i, vec in enumerate(bow_matrix):
    print(f"Doc {i+1}: {vec}")



Output


[Image: output of the Bag of Words script]
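
Most entries in a Bag of Words matrix are zero, which is why it is usually described as a sparse matrix. The sketch below shows one possible way to store only the non-zero counts, using plain dictionaries. The `cleaned_docs` values are hypothetical pre-cleaned tokens, assuming "i", "is" and "and" were removed as stopwords, and `sparse_rows` and `to_dense` are illustrative names rather than part of NLTK.

```python
from collections import Counter

# Hypothetical pre-cleaned tokens (assumes "i", "is" and "and"
# were already removed as stopwords).
cleaned_docs = [
    ["like", "nlp"],
    ["like", "machine", "learning"],
    ["nlp", "fun", "powerful"],
]

vocab = sorted({word for doc in cleaned_docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

# Sparse row: {column_index: count}, storing only non-zero entries.
sparse_rows = [
    {index[word]: count for word, count in Counter(doc).items()}
    for doc in cleaned_docs
]

def to_dense(row, width):
    """Expand one sparse row back into a full count vector."""
    return [row.get(col, 0) for col in range(width)]

stored = sum(len(row) for row in sparse_rows)
total = len(cleaned_docs) * len(vocab)
print(f"Non-zero entries stored: {stored} of {total}")
for row in sparse_rows:
    print(row, "->", to_dense(row, len(vocab)))
```

With a realistic vocabulary of thousands of words, the gap between stored and total entries grows much larger, which is the motivation for sparse storage.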



How Is Bag Of Words Different From Count Vectorization?





Posted By  -  Karan Gupta
 
Posted On  -  Thursday, September 11, 2025
 
Updated On  -  Thursday, December 18, 2025
