
Splitting The Data Into A Training And A Testing Set In Sklearn

This article describes how to split data into a training set and a testing set in scikit-learn (sklearn).

Why Do We Need Data Splitting?

When you train a model, it learns patterns from the training data. If you evaluate the model only on the data it was trained on, it may appear to perform very well, yet that performance may not generalize to new, unseen data. Holding back part of the data as a testing set gives an estimate of how the model behaves on data it has never seen.
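The gap between training and testing performance can be illustrated with a small sketch. The synthetic noisy sine data, the unrestricted DecisionTreeRegressor, and the 25% test size below are assumptions chosen for illustration and are not part of the original example. A fully grown tree tends to memorize its training data, so its training score usually looks far better than its test score.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A tree with no depth limit can fit the training data almost perfectly
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2; compare the data the model saw with the data it did not
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("R^2 on training data:", train_score)
print("R^2 on testing data: ", test_score)
```

Only the second number says anything about how the model is likely to behave on new data.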

Example

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Your data
X = np.array([150, 160, 170]).reshape(-1, 1)
y = np.array([50, 56, 63])

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate on the held-out test data
print("Predicted:", y_pred)
print("Actual:   ", y_test)
print("MSE:      ", mean_squared_error(y_test, y_pred))

X_train, X_test, y_train, And y_test

Variable Name	Dependent/Independent	Purpose
X_train	Input feature (Independent variable)	Portion of the inputs used to train the model.
X_test	Input feature (Independent variable)	Portion of the inputs used to test the model.
y_train	Output (Dependent variable)	Target values paired with X_train, used to train the model.
y_test	Output (Dependent variable)	Target values paired with X_test, used to check the model's predictions.
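As a quick check of what each of the four variables holds, the sketch below reuses the same three data points and split settings as the example above and prints the shape of each piece. The lookup dictionary named expected is a hypothetical helper added here only to confirm that each input stays paired with its own target after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same data and split settings as the example above
X = np.array([150, 160, 170]).reshape(-1, 1)
y = np.array([50, 56, 63])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# With 3 samples and test_size=0.33, one sample goes to the test set
print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)

# Hypothetical helper: original input -> target pairs, to check alignment
expected = {150: 50, 160: 56, 170: 63}
pairs_aligned = all(
    expected[int(x_row[0])] == int(target)
    for x_row, target in zip(
        np.vstack([X_train, X_test]), np.concatenate([y_train, y_test])
    )
)
print("Inputs still paired with their targets:", pairs_aligned)
```

The split shuffles rows before dividing them, but it applies the same shuffle to X and y, so every input remains matched to its target.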

Output

[Figure: how the data is split between the training set and the testing set]

Posted By -	Karan Gupta

Posted On -	Tuesday, May 27, 2025
