
Predict If The Text Is Spam Or Not In NLP






NLP Pipeline




[Figure: flow chart of the NLP pipeline — text → tokens → TF-IDF features → model → prediction]





Text




texts = [
    "Win cash now!!! Click this link",
    "Congratulations, you have won a prize",
    "Hey, are we meeting today?",
    "Please call me when you are free",
    "Limited offer! Buy now and get 50% off",
    "Let's have lunch tomorrow"
]

labels = ["spam","spam","ham","ham","spam","ham"]



Tokens




# Tokenizer (assumes nlp = spacy.load("en_core_web_sm") has been loaded; see Complete Code)
def spacy_tokenizer(text):
    doc = nlp(text)
    tokens = []
    for token in doc:
        if not token.is_stop:            # skip stop words ("the", "are", ...)
            tokens.append(token.lemma_)  # keep the lemma, e.g. "won" -> "win"
    return tokens
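spaCy handles stop-word detection and lemmatization internally. To illustrate the idea without loading a spaCy model, here is a simplified stand-in with a hardcoded stop-word list (illustrative only — unlike token.lemma_, it does not lemmatize):

```python
# Simplified illustration of stop-word filtering (no spaCy model needed).
# spaCy additionally lemmatizes ("won" -> "win"); this sketch only drops stop words.
STOP_WORDS = {"a", "are", "have", "you", "we", "this", "and", "get"}

def simple_tokenizer(text):
    tokens = []
    for token in text.lower().split():
        word = token.strip("!?,.")       # crude punctuation stripping
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

print(simple_tokenizer("Congratulations, you have won a prize"))
# -> ['congratulations', 'won', 'prize']
```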



TF-IDF: Fit and Transform




# Initialize the vectorizer.
# fit: learns the vocabulary from the training texts (builds a dictionary of
#      unique terms and computes each term's TF-IDF importance weight).
# transform: converts the texts into a numeric feature matrix.
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer, lowercase=True)
X = vectorizer.fit_transform(texts)
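As a quick check of what fit_transform produces, a default TfidfVectorizer (no spaCy tokenizer, so this snippet runs stand-alone) can be inspected like this — the matrix shape is (documents, unique terms), and vocabulary_ is the learned dictionary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Win cash now!!! Click this link",
    "Hey, are we meeting today?",
]
vec = TfidfVectorizer()            # default word tokenizer
X = vec.fit_transform(corpus)      # learn the vocabulary, then build the matrix

print(X.shape)                     # (2, 11): 2 documents, 11 unique terms
print(sorted(vec.vocabulary_))     # the learned dictionary of terms
```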



Model




# Initialize the model (a linear support vector machine classifier)
model = LinearSVC()
model.fit(X, labels)
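LinearSVC learns a linear decision boundary over the TF-IDF features. A sketch of how to inspect the fitted model (using the default tokenizer here so it runs without spaCy): classes_ lists the labels in sorted order, and decision_function returns a signed score — positive means the second class:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "Win cash now!!! Click this link",
    "Congratulations, you have won a prize",
    "Hey, are we meeting today?",
    "Please call me when you are free",
    "Limited offer! Buy now and get 50% off",
    "Let's have lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

model = LinearSVC()
model.fit(X, labels)

print(model.classes_)   # ['ham' 'spam'] — labels in sorted order
score = model.decision_function(vec.transform(["Win a prize now"]))
print(score)            # positive -> 'spam', negative -> 'ham'
```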



Prediction




# predict
new_message = ["Congratulations! You win free cash now"]
X_new = vectorizer.transform(new_message)
prediction = model.predict(X_new)
print("Prediction:", prediction[0])



Output




[Figure: program output — the predicted label for the new message]



Complete Code


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

texts = [
    "Win cash now!!! Click this link",
    "Congratulations, you have won a prize",
    "Hey, are we meeting today?",
    "Please call me when you are free",
    "Limited offer! Buy now and get 50% off",
    "Let's have lunch tomorrow"
]

labels = ["spam","spam","ham","ham","spam","ham"]

# Tokenizer: drop stop words and keep each remaining token's lemma
def spacy_tokenizer(text):
    doc = nlp(text)
    tokens = []
    for token in doc:
        if not token.is_stop:            # skip stop words ("the", "are", ...)
            tokens.append(token.lemma_)  # keep the lemma, e.g. "won" -> "win"
    return tokens


# Initialize the vectorizer.
# fit: learns the vocabulary from the training texts (builds a dictionary of
#      unique terms and computes each term's TF-IDF importance weight).
# transform: converts the texts into a numeric feature matrix.
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer, lowercase=True)
X = vectorizer.fit_transform(texts)


# Initialize the model (a linear support vector machine classifier)
model = LinearSVC()
model.fit(X, labels)

# predict
new_message = ["Congratulations! You win free cash now"]
X_new = vectorizer.transform(new_message)
prediction = model.predict(X_new)
print("Prediction:", prediction[0])
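The vectorizer and model must always be used together: new text has to pass through the same fitted vectorizer before predict. One common way to enforce that pairing (a restructuring suggestion, not part of the article's code) is scikit-learn's Pipeline, shown here with the default tokenizer so it runs stand-alone:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win cash now!!! Click this link",
    "Congratulations, you have won a prize",
    "Hey, are we meeting today?",
    "Please call me when you are free",
    "Limited offer! Buy now and get 50% off",
    "Let's have lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# One object handles fit_transform + fit during training,
# and transform + predict at inference time.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),  # pass tokenizer=spacy_tokenizer to match the article
    ("svm", LinearSVC()),
])
pipe.fit(texts, labels)

print(pipe.predict(["Congratulations! You win free cash now"]))
```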





Posted By  -  Karan Gupta
 
Posted On  -  Wednesday, January 7, 2026
