Bag Of Words In NLP






What Is A Bag Of Words?




  1. It ignores grammar: sentence structure plays no role in the representation.
  2. It ignores word order: the same words in any order give the same bag.
  3. It keeps only the number of times each word occurs.
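The three properties above can be checked with a minimal sketch. The helper name `bag_of_words` and the two example sentences are illustrative assumptions, not part of the original article, and the tokenization is a plain lowercase whitespace split rather than a full tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: lowercase, split on whitespace, count each word."""
    return Counter(text.lower().split())

# Same words, different order (and a different meaning)
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a["the"])     # 2  (occurrence counts are kept)
print(bag_a == bag_b)   # True  (word order is ignored)
```

Both sentences map to the same bag even though they mean different things, which is exactly the information a Bag of Words representation gives up.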



How Does Bag Of Words Work?






Sentence 1: "I like NLP"
Sentence 2: "I like machine learning"









Vocabulary: [I, like, NLP, machine, learning]

"I like NLP" → [1, 1, 1, 0, 0]
"I like machine learning" → [1, 1, 0, 1, 1]

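The two vectors above can be reproduced with a few lines of plain Python. This is a sketch under one assumption: the vocabulary is ordered by first appearance (I, like, NLP, machine, learning), which is the only ordering consistent with the vectors shown. The variable names are illustrative.

```python
sentences = ["I like NLP", "I like machine learning"]

# Tokenize with a simple whitespace split (no lowercasing, no stop-word removal)
tokenized = [s.split() for s in sentences]

# Build the vocabulary in order of first appearance (assumed ordering)
vocab = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocab:
            vocab.append(word)

# One count per vocabulary word, per sentence
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)  # ['I', 'like', 'NLP', 'machine', 'learning']
for sentence, vector in zip(sentences, vectors):
    print(f'"{sentence}" -> {vector}')
```

Each position in a vector corresponds to one vocabulary word, so a different vocabulary ordering (for example, alphabetical) would permute the columns without changing the information.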



import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Download required resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in newer NLTK releases
nltk.download('stopwords')

# Example corpus (documents)
documents = [
    "I like NLP",
    "I like machine learning",
    "NLP is fun and powerful"
]

# Step 1: Tokenize and clean text
stop_words = set(stopwords.words('english'))
cleaned_docs = []

for doc in documents:
    tokens = word_tokenize(doc.lower())  # tokenize & lowercase
    tokens = [t for t in tokens if t not in string.punctuation]  # remove punctuation
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    cleaned_docs.append(tokens)

print("Cleaned tokens:", cleaned_docs)

# Step 2: Build vocabulary
vocab = sorted(set(word for doc in cleaned_docs for word in doc))
print("\nVocabulary:", vocab)

# Step 3: Create Bag of Words vectors
def bow_vector(tokens, vocab):
    return [tokens.count(word) for word in vocab]

bow_matrix = [bow_vector(doc, vocab) for doc in cleaned_docs]

print("\nBag of Words Matrix:")
for i, vec in enumerate(bow_matrix):
    print(f"Doc {i+1}: {vec}")



Output


[Figure: output of the Bag of Words program above]



How Is Bag Of Words Different From Count Vectorization?





Posted By  -  Karan Gupta
 
Posted On  -  Thursday, September 11, 2025
