Bag Of Words In NLP
What Is A Bag Of Words?
- It ignores grammar, which means that it does not care about the sentence structure.
- It ignores word order.
- It keeps a count of how many times each word occurs.
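The three properties above can be illustrated with a minimal sketch that uses only Python's standard library. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for illustration; they are not part of any library.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text` (hypothetical helper; splits on whitespace)."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a == b)    # word order is ignored, so the two bags compare equal
print(a["the"])  # counts are kept: how many times "the" occurs
```

The two sentences mean different things, yet they produce identical bags. That is the trade-off the model makes: it keeps the counts and discards structure and order.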
How Does Bag Of Words Work?
Sentence 1: "I like NLP"
Sentence 2: "I like machine learning"
Vocabulary (in order of first appearance): ["I", "like", "NLP", "machine", "learning"]
"I like NLP" → [1, 1, 1, 0, 0]
"I like machine learning" → [1, 1, 0, 1, 1]
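The vectors above can be reproduced with a short sketch. It assumes whitespace tokenization and a vocabulary ordered by first appearance, which is the ordering consistent with the vectors shown. `build_vocab` and `vectorize` are hypothetical helper names.

```python
def build_vocab(sentences):
    """Collect unique words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def vectorize(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in vocab]

sentences = ["I like NLP", "I like machine learning"]
vocab = build_vocab(sentences)
vectors = [vectorize(s, vocab) for s in sentences]

print(vocab)
for s, v in zip(sentences, vectors):
    print(f'"{s}" → {v}')
```

Each position in a vector corresponds to one vocabulary word, so every sentence maps to a vector of the same length regardless of how many words it contains.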
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource
nltk.download('stopwords')

# Example corpus (documents)
documents = [
    "I like NLP",
    "I like machine learning",
    "NLP is fun and powerful",
]

# Step 1: Tokenize and clean text
stop_words = set(stopwords.words('english'))
cleaned_docs = []
for doc in documents:
    tokens = word_tokenize(doc.lower())  # lowercase, then tokenize
    tokens = [t for t in tokens if t not in string.punctuation]  # remove punctuation
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    cleaned_docs.append(tokens)
print("Cleaned tokens:", cleaned_docs)

# Step 2: Build vocabulary (sorted alphabetically so column order is stable)
vocab = sorted(set(word for doc in cleaned_docs for word in doc))
print("\nVocabulary:", vocab)

# Step 3: Create Bag of Words vectors
def bow_vector(tokens, vocab):
    # One count per vocabulary word, in vocabulary order
    return [tokens.count(word) for word in vocab]

bow_matrix = [bow_vector(doc, vocab) for doc in cleaned_docs]
print("\nBag of Words Matrix:")
for i, vec in enumerate(bow_matrix):
    print(f"Doc {i+1}: {vec}")
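Output
Cleaned tokens: [['like', 'nlp'], ['like', 'machine', 'learning'], ['nlp', 'fun', 'powerful']]

Vocabulary: ['fun', 'learning', 'like', 'machine', 'nlp', 'powerful']

Bag of Words Matrix:
Doc 1: [0, 0, 1, 0, 1, 0]
Doc 2: [0, 1, 1, 1, 0, 0]
Doc 3: [1, 0, 0, 0, 1, 1]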
How Is Bag Of Words Different From Count Vectorization?
Posted By: Karan Gupta
Posted On: Thursday, September 11, 2025