How to split data into training and testing sets in Python
Learn how to split data into training and testing sets in Python. Explore methods, tips, real-world applications, and common error fixes.

To build a reliable machine learning model, you must split your data into training and testing sets. This step ensures you can accurately evaluate your model's performance on unseen data.
In this article, we'll cover several techniques to split your data using popular Python libraries. We'll also provide practical tips, explore real-world applications, and offer debugging advice for a smooth implementation.
Basic split using train_test_split
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(50, 2)  # Sample features
y = np.arange(50)  # Sample labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}, Testing set shape: {X_test.shape}")
```

Output:

```
Training set shape: (40, 2), Testing set shape: (10, 2)
```
The train_test_split function from scikit-learn is a straightforward way to partition your data. It shuffles the dataset randomly before splitting it, which helps prevent any bias from the original data order. Two key parameters control this process:
- `test_size=0.2`: This argument specifies that 20% of your data will be set aside for the test set. The remaining 80% is used for training.
- `random_state=42`: Setting this ensures your split is reproducible. Every time you run the code, you'll get the exact same training and testing sets, which is crucial for consistent model evaluation.
Core splitting techniques
While train_test_split is a versatile tool, certain situations demand more specialized techniques for manual control, balancing classes, or handling sequential data.
Manual splitting with numpy
```python
import numpy as np

data = np.arange(100).reshape(50, 2)
labels = np.arange(50)

train_ratio = 0.8
train_size = int(len(data) * train_ratio)
train_data, test_data = data[:train_size], data[train_size:]
train_labels, test_labels = labels[:train_size], labels[train_size:]
print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")
```

Output:

```
Train size: 40, Test size: 10
```
Manually splitting with numpy gives you direct control over how your data is divided. You simply calculate a split index based on a train_ratio and use Python's slicing syntax—like data[:train_size] and data[train_size:]—to create the training and testing sets in a memory-efficient way.
- A crucial point to remember is that this method doesn't shuffle the data. It performs a sequential split, which might be ideal for time-series data but could introduce bias if your dataset has an inherent order.
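If shuffling is appropriate for your data, you can randomize a manual split by permuting the indices before slicing. A minimal sketch using NumPy's seeded `default_rng` (the 80/20 ratio and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility
data = np.arange(100).reshape(50, 2)
labels = np.arange(50)

# Shuffle index positions once, then apply the same order to features and labels
indices = rng.permutation(len(data))
train_size = int(len(data) * 0.8)
train_idx, test_idx = indices[:train_size], indices[train_size:]

train_data, test_data = data[train_idx], data[test_idx]
train_labels, test_labels = labels[train_idx], labels[test_idx]
print(f"Train size: {len(train_data)}, Test size: {len(test_data)}")
```

Permuting indices (rather than shuffling the arrays in place) keeps each feature row paired with its label.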
Stratified splitting for balanced classes
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 2)
y = np.array([0] * 80 + [1] * 20)  # Imbalanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Original class ratio: {np.bincount(y) / len(y)}")
print(f"Training class ratio: {np.bincount(y_train) / len(y_train)}")
```

Output:

```
Original class ratio: [0.8 0.2]
Training class ratio: [0.8 0.2]
```
When your dataset has imbalanced classes—for example, 80% of one category and 20% of another—a standard random split can be misleading. You risk creating a test set that doesn't fairly represent the minority class, which skews your model's performance evaluation.
- Stratified splitting solves this problem. By adding the parameter `stratify=y`, you tell the function to preserve the original percentage of each class in both the training and testing sets.
- This ensures your model is evaluated on a truly representative data sample.
Time-based splitting for sequential data
```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2023-01-01', periods=100, freq='D')
data = np.random.randn(100, 2)
time_series = pd.DataFrame(data, index=dates, columns=['A', 'B'])

cutoff_date = '2023-03-15'
train_df = time_series.loc[:cutoff_date]  # label slicing is inclusive of the cutoff date
test_df = time_series.loc[time_series.index > cutoff_date]  # strictly after the cutoff, so no overlap
print(f"Training: {train_df.shape} rows, Testing: {test_df.shape} rows")
```

Output:

```
Training: (74, 2) rows, Testing: (26, 2) rows
```
For sequential data, like stock prices or sensor readings, random shuffling isn't an option because it destroys the chronological order. You need to train your model on past data to predict future events. This approach mimics how you'd use the model in the real world.
- The code defines a `cutoff_date` to split the dataset: everything up to that date is used for training, and everything after it is reserved for testing.
- The selection uses pandas' `.loc` accessor, which picks rows by index label, in this case the dates. Note that label-based slicing in pandas includes its endpoint, so take care that the cutoff row lands in only one set.
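If your series has no natural cutoff date, you can split by position instead. A minimal sketch using `iloc` with an assumed 80/20 ratio:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=100, freq='D')
time_series = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# Positional split: first 80% of rows for training, the final 20% for testing
split_point = int(len(time_series) * 0.8)
train_df = time_series.iloc[:split_point]
test_df = time_series.iloc[split_point:]
print(f"Training: {train_df.shape}, Testing: {test_df.shape}")
```

Because `iloc` slices by position, chronological order is preserved and the two sets can never overlap.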
Advanced splitting methods
When a simple train-test split doesn't cut it, advanced methods provide more robust ways to validate your model and manage complex data dependencies. These techniques are perfect for vibe coding machine learning experiments.
Cross-validation with KFold
```python
from sklearn.model_selection import KFold
import numpy as np

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold+1}: {len(train_idx)} train samples, {len(test_idx)} test samples")
    if fold == 0:  # Show only first fold details
        break
```

Output:

```
Fold 1: 80 train samples, 20 test samples
```
Cross-validation gives you a more reliable estimate of your model's performance than a single train and test split. The KFold function automates this by splitting the data into a specified number of parts, or "folds," then training and testing the model multiple times. This process ensures your results aren't just a fluke from one specific data partition, making it an essential technique for AI coding workflows.
- The `KFold` object is set up with `n_splits=5`, creating five distinct folds from the data.
- The loop iterates through each fold, using one for testing and the remaining four for training, giving you a more comprehensive performance metric.
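In practice, you'll usually hand the `KFold` object to an evaluation helper rather than looping manually. A sketch using `cross_val_score` with a `LogisticRegression` stand-in (any scikit-learn estimator would work here; the random data is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = rng.integers(0, 2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# cross_val_score fits and scores a fresh copy of the model once per fold
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(f"Accuracy per fold: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Reporting the mean (and spread) across folds gives a more stable estimate than any single fold's score.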
Multi-stage splitting for train/validation/test
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)

# First split out test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split remaining data into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
```

Output:

```
Train: 600, Validation: 200, Test: 200
```
A simple train/test split isn't enough when you need to tune your model's hyperparameters. Using the test set for tuning can cause your model to indirectly learn from it, leading to an overly optimistic performance estimate. This is where a three-way split—train, validation, and test—comes in. The validation set is for tuning, while the test set is reserved for the final, unbiased evaluation.
- This is achieved with two sequential calls to `train_test_split`. First, you split off a test set from the entire dataset.
- Next, you split the remaining data again to create your training and validation sets. Using `test_size=0.25` on the remaining 80% of data results in a final 60% train, 20% validation, and 20% test split.
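If your classes are imbalanced, the same two-step pattern works with stratification. A sketch that passes `stratify` to both calls (the 80/20 class ratio is illustrative):

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 5)
y = np.array([0] * 800 + [1] * 200)  # imbalanced labels

# Stratify both splits so all three sets keep the original 80/20 class ratio
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: {len(labels)} samples, positive ratio {labels.mean():.2f}")
```

Note that the second call stratifies on `y_temp`, the labels remaining after the test set was removed.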
Using GroupKFold for dependent samples
```python
from sklearn.model_selection import GroupKFold
import numpy as np

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, 20)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7])

gkf = GroupKFold(n_splits=3)
for i, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    print(f"Fold {i+1} - Train groups: {np.unique(groups[train_idx])}")
    print(f"Fold {i+1} - Test groups: {np.unique(groups[test_idx])}")
    if i == 0:  # Show only first fold
        break
```

Output:

```
Fold 1 - Train groups: [1 2 4 6 7]
Fold 1 - Test groups: [3 5]
```
Sometimes your data isn't independent. You might have multiple data points from the same user or sensor, for example. GroupKFold is designed for these situations, preventing your model from being tested on data it has indirectly seen during training.
- It works by ensuring all samples from a specific group, defined by the `groups` array, are kept together.
- This means an entire group will land in either the training set or the test set, but never be split across both, giving you a more accurate performance evaluation.
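When you need just one group-aware train/test split rather than full cross-validation, scikit-learn's `GroupShuffleSplit` is a lighter alternative. A sketch reusing the same hypothetical groups (`test_size` here is the fraction of *groups* held out):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, 20)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7])

# One split that holds out roughly a quarter of the groups for testing
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups))
print(f"Train groups: {np.unique(groups[train_idx])}")
print(f"Test groups: {np.unique(groups[test_idx])}")
```

As with `GroupKFold`, no group ever appears on both sides of the split.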
Move faster with Replit
Learning individual techniques is one thing, but building a complete application is another. Replit is an AI-powered development platform designed to bridge that gap. It comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. With Agent 4, you can take an idea to a working product—it handles the code, databases, APIs, and deployment, all from a simple description.
Instead of piecing together techniques, you can build tools that apply them in a real-world context:
- A performance dashboard that uses stratified splitting to fairly evaluate a fraud detection model on an imbalanced transaction dataset.
- A backtesting tool for a stock trading algorithm that uses time-based splitting to train on historical price data and test on recent data.
- A churn prediction model for a subscription service that uses `GroupKFold` to ensure all data from a single user stays in either the training or test set, preventing data leakage.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with the right tools, splitting data can lead to subtle errors that compromise your model's integrity and reproducibility.
Fixing data leakage when preprocessing time series data
Data leakage is a common pitfall, especially with time-series data. It happens when information from your test set inadvertently bleeds into your training set, giving your model an unrealistic performance boost. This often occurs during preprocessing steps like scaling or normalization if you apply them to the entire dataset before splitting.
- The problem is that fitting a scaler on all your data allows it to learn statistical properties—like the mean and standard deviation—from the future data you've reserved for testing.
- To fix this, you must always split your data first. Then, fit your preprocessor exclusively on the training data and use that same fitted preprocessor to transform both the training and test sets.
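One way to enforce this split-then-fit discipline automatically is scikit-learn's `Pipeline`, which refits the scaler on only the training portion of each split. A sketch (the logistic-regression model and random data are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, 100)

# Inside cross_val_score, the scaler is fit on each fold's training data only,
# then applied to that fold's held-out data, so no test statistics leak in
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```

Bundling preprocessing and model together means the leak-free ordering is handled for you on every split.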
Troubleshooting incorrect usage of stratify with continuous variables
The stratify parameter is a powerful tool for handling imbalanced classes, but it's designed to work with categorical labels, not continuous data. If you try to pass a continuous variable to stratify, you'll get an error because the function can't maintain proportions for an infinite number of unique values.
- The solution is to discretize the continuous variable first. You can convert the numerical data into a set of bins or categories, such as "low," "medium," and "high."
- Once you have these new categorical labels, you can pass them to the `stratify` parameter to ensure your train and test sets have a similar distribution based on those bins.
Resolving random_state issues in reproducible splits
Reproducibility is crucial for reliable machine learning experiments. If you can't get the same result twice, you can't confidently measure the impact of your changes. The random_state parameter in functions like train_test_split is the key to achieving this consistency.
- When you omit `random_state`, the function uses a different random seed for each run, resulting in a different data split every time. This makes it impossible to debug issues or fairly compare model performance.
- Always set `random_state` to a fixed integer. This ensures that the shuffling and splitting process is identical every time you run the code, making your results stable and reproducible.
Fixing data leakage when preprocessing time series data
A frequent mistake with time-series data is applying preprocessing steps, like scaling, to the entire dataset before splitting. This contaminates your training data with information from the future, leading to an overly optimistic evaluation of your model's performance.
The following code demonstrates this error. Notice how the StandardScaler is fitted on the full dataset before it's divided into training and testing sets.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

dates = pd.date_range('2023-01-01', periods=100)
df = pd.DataFrame({'value': np.cumsum(np.random.randn(100))}, index=dates)

# Incorrect: scaling before splitting
scaler = StandardScaler()
df['scaled_value'] = scaler.fit_transform(df[['value']])
train_df, test_df = df.iloc[:80], df.iloc[80:]
```
By using scaler.fit_transform on the entire dataframe, the scaler learns from data that should be reserved for testing. This gives the model an unfair preview of the future. The corrected implementation is shown below.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

dates = pd.date_range('2023-01-01', periods=100)
df = pd.DataFrame({'value': np.cumsum(np.random.randn(100))}, index=dates)

# Correct: split first, then scale
# (.copy() avoids pandas chained-assignment warnings when adding columns)
train_df, test_df = df.iloc[:80].copy(), df.iloc[80:].copy()
scaler = StandardScaler()
train_df['scaled_value'] = scaler.fit_transform(train_df[['value']])
test_df['scaled_value'] = scaler.transform(test_df[['value']])
```
The solution is to split your data *before* any preprocessing. This prevents the model from learning from future data it shouldn't see yet.
- First, fit your scaler on the training data using `fit_transform()`. This teaches the scaler the statistical properties of only the training set.
- Then, apply that same fitted scaler to transform the test data using just `transform()`. This correctly applies the learned scaling without leaking information.
Troubleshooting incorrect usage of stratify with continuous variables
The stratify parameter is designed for categorical labels, ensuring class proportions are maintained. It doesn't work with continuous data, like prices or measurements, because there are too many unique values to balance. Attempting this will trigger an error, as shown below.
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 3)
y = np.random.randn(100)  # Continuous target

# Error: can't stratify with continuous values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
```
The error happens because stratify=y is passed a continuous variable. The function can't create proportional splits from unique floating-point values and needs discrete classes instead. The following code demonstrates the correct implementation.
```python
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

X = np.random.randn(100, 3)
y = np.random.randn(100)  # Continuous target

# Create categorical bins for stratification
y_binned = pd.qcut(y, q=5, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y_binned, test_size=0.3, random_state=42
)
```
The solution is to convert the continuous variable into discrete categories before splitting. This is useful when your target variable is skewed and you want to maintain its distribution in both the train and test sets.
- The code uses pandas' `pd.qcut` function to group the continuous `y` values into five bins.
- This creates a new categorical series, `y_binned`, which you can then pass to the `stratify` parameter to perform a balanced split.
Resolving random_state issues in reproducible splits
When the random_state parameter is omitted, functions like train_test_split produce a different result on every run. This inconsistency makes debugging and model comparison unreliable. The following code demonstrates this by running the same split twice and comparing the outputs.
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, 100)

# Without fixed random_state, results differ each run
split1 = train_test_split(X, y, test_size=0.3)
split2 = train_test_split(X, y, test_size=0.3)
print(f"Same splits? {np.array_equal(split1[0], split2[0])}")
```
The np.array_equal check confirms the two splits are different. Since no random_state was set, the function produces a new, random partition on each run. The following code shows how to ensure consistency.
```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, 100)

# Fix random_state for reproducibility
split1 = train_test_split(X, y, test_size=0.3, random_state=42)
split2 = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Same splits? {np.array_equal(split1[0], split2[0])}")
```
The `np.array_equal` check now confirms the splits are identical. By setting `random_state=42` in both calls to `train_test_split`, you're seeding the random number generator with a fixed value. This makes the shuffling and splitting process deterministic.
- Always use a fixed `random_state` during development and experimentation. It ensures your results are reproducible, which is essential for debugging and making fair comparisons between different model versions or hyperparameters.
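Keep in mind that `random_state` only fixes the split itself. If the data is generated randomly, you must seed that step too, or two "identical" runs will start from different arrays. A sketch using a seeded NumPy generator:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seed data generation AND the split for end-to-end reproducibility
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, 100)
split1 = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)  # same seed regenerates identical data
X2 = rng.normal(size=(100, 4))
y2 = rng.integers(0, 2, 100)
split2 = train_test_split(X2, y2, test_size=0.3, random_state=42)

print(f"Same splits? {np.array_equal(split1[0], split2[0])}")
```

With both seeds fixed, every array in the pipeline is byte-for-byte reproducible across runs.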
Real-world applications
With the common pitfalls addressed, these splitting techniques can be applied to solve critical problems in medicine and finance.
Handling imbalanced data in medical diagnosis with train_test_split
In medical diagnostics, where positive cases can be rare, using train_test_split with the stratify parameter is crucial to ensure your test set accurately represents the real-world class imbalance.
```python
from sklearn.model_selection import train_test_split
import numpy as np

# Simulate imbalanced medical dataset (5% positive cases)
np.random.seed(42)
X = np.random.randn(1000, 5)  # 5 features
y = np.zeros(1000, dtype=int)
y[:50] = 1  # Only 5% positive cases

# Regular split vs stratified split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.3)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, stratify=y)
print(f"Regular - Train: {sum(y_train1)/len(y_train1):.1%}, Test: {sum(y_test1)/len(y_test1):.1%}")
print(f"Stratified - Train: {sum(y_train2)/len(y_train2):.1%}, Test: {sum(y_test2)/len(y_test2):.1%}")
```
This code simulates a medical dataset where only 5% of cases are positive. It then performs two types of splits to demonstrate a key difference.
- The first split uses a standard `train_test_split` call, which randomly divides the data without considering class balance.
- The second split adds the `stratify=y` parameter. This tells the function to preserve the original 5% class distribution in both the training and testing sets.
This comparison highlights why stratification is essential for imbalanced datasets, ensuring your model's evaluation is based on a fair representation of all classes.
Using TimeSeriesSplit for financial forecasting validation
For financial data, TimeSeriesSplit implements a walk-forward validation, where the model is sequentially trained on expanding windows of past data to predict future outcomes.
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Simulate daily stock prices for 100 days
np.random.seed(42)
stock_prices = 100 + np.cumsum(np.random.normal(0.1, 1, 100))

# Create a walk-forward validation with 3 folds
tscv = TimeSeriesSplit(n_splits=3)

# Show the training and testing periods
for fold, (train_idx, test_idx) in enumerate(tscv.split(stock_prices)):
    train_size = len(train_idx)
    test_size = len(test_idx)
    print(f"Fold {fold+1}: Train on days 1-{train_size}, predict days {train_size+1}-{train_size+test_size}")
```
This code shows how TimeSeriesSplit creates sequential folds for validation. After simulating stock price data, it initializes the splitter to create three distinct training and testing periods.
- The key is that these splits aren't random. Each fold uses all data up to a certain point for training and the immediately following segment for testing.
- This process ensures you're always predicting the "future" relative to your training data, which prevents data leakage from later time periods and provides a more realistic backtest.
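If adjacent observations can leak information across the boundary (for example, overlapping return windows), `TimeSeriesSplit` also accepts a `gap` parameter that leaves a buffer between each training window and its test window. A sketch (the 5-day gap is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = np.arange(100)  # stand-in for 100 days of prices

# gap=5 drops the 5 samples just before each test window from training
tscv = TimeSeriesSplit(n_splits=3, gap=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(prices)):
    print(f"Fold {fold+1}: train ends day {train_idx[-1]+1}, "
          f"test covers days {test_idx[0]+1}-{test_idx[-1]+1}")
```

The buffer mimics an embargo period, so no training sample sits immediately next to a test sample.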
Get started with Replit
Turn these techniques into a working tool. Describe what you want to build, like "a dashboard that backtests a trading strategy using TimeSeriesSplit" or "an app that prepares imbalanced medical data with stratified splitting."
Replit Agent will write the code, test for errors, and deploy your application from a simple description. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.