How to do cross validation in Python

Master cross-validation in Python. Explore various methods, tips, real-world uses, and how to debug common errors.

Published on: Tue, Apr 21, 2026
Updated on: Tue, Apr 21, 2026
The Replit Team

Cross-validation is a crucial technique to evaluate your machine learning models in Python. It ensures your model performs well on new, unseen data, which helps you avoid the common pitfall of overfitting.

You'll explore several key cross-validation techniques, along with practical tips for implementation. You will also find real-world applications and debugging advice to help you apply these methods effectively.

Using basic KFold cross-validation

from sklearn.model_selection import KFold
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

--OUTPUT--

# No output - this code splits data into train/test sets

The KFold object is configured to manage the data splitting. It’s set up with a few key parameters:

  • n_splits=5: This tells KFold to divide the dataset into five distinct sections, or folds.
  • shuffle=True: Shuffling the data first ensures that each fold is a random sample, preventing any bias from the data's original ordering.
  • random_state=42: This makes the shuffling reproducible, so you'll get the exact same splits every time you run the code.

The for loop then iterates through these folds. In each cycle, one fold becomes the test set while the remaining four are combined for training. This process gives you a more robust performance measure by testing the model on different subsets of your data.
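The loop above only produces the index splits; the natural next step is to train and score a model inside it. Here is a minimal sketch, assuming the same linear SVC configuration used elsewhere in this guide:

```python
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, then score on the held-out fold
    clf = SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {np.mean(fold_scores):.2f}")
```

Collecting one score per fold like this is exactly what `cross_val_score`, covered next, automates for you.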

Basic cross-validation techniques

While the manual KFold loop is effective, scikit-learn also provides more direct and specialized methods for evaluating your models.

Using cross_val_score for simplified evaluation

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")

--OUTPUT--

Cross-validation scores: [0.96666667 1. 0.96666667 0.96666667 1. ]
Mean accuracy: 0.98

The cross_val_score function streamlines the entire cross-validation process into a single call. It’s a convenient alternative to a manual loop, as it handles the data splitting, model training, and scoring internally.

  • You simply pass your model (clf), features (X), and target (y) to the function.
  • The cv=5 argument specifies that you want to use five-fold cross-validation.
  • The function returns an array of scores, one for each of the five test runs.

The resulting array shows the model's accuracy on each fold. By calculating the mean of these scores, you get a more reliable estimate of your model's performance on unseen data.
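If you need more than a single accuracy number, scikit-learn's related cross_validate function accepts a list of scoring metrics and reports each one per fold. A short sketch:

```python
from sklearn.model_selection import cross_validate
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
clf = SVC(kernel='linear', C=1)

# Request two metrics at once; f1_macro averages F1 across the three classes
results = cross_validate(clf, X, y, cv=5, scoring=['accuracy', 'f1_macro'])
print(f"Accuracy per fold: {results['test_accuracy']}")
print(f"F1 (macro) per fold: {results['test_f1_macro']}")
```

The returned dictionary also includes fit_time and score_time arrays, which are handy when comparing the cost of different models.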

Using StratifiedKFold for imbalanced classification

from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"Stratified CV scores: {np.array(scores)}")

--OUTPUT--

Stratified CV scores: [0.96666667 1. 0.96666667 0.96666667 1. ]

When you're dealing with imbalanced datasets, StratifiedKFold is a smarter choice than a standard KFold. It ensures that each fold's class distribution mirrors the original dataset's proportions. This prevents a fold from accidentally ending up with too few samples of a minority class, which would skew your evaluation.

  • The main difference in the code is that the split method now requires both the features X and the target labels y. This is how it stratifies the data correctly.
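You can confirm the stratification yourself by counting the classes in each test fold. Since Iris has 50 samples per class, each of five stratified folds should hold exactly 10 samples of each class. A quick check:

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # np.bincount tallies how many samples of each class landed in the fold
    print(f"Fold {i + 1} test-class counts: {np.bincount(y[test_idx])}")
```

Every line prints `[10 10 10]`, confirming that each fold mirrors the dataset's class proportions.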

Implementing LeaveOneOut cross-validation

from sklearn.model_selection import LeaveOneOut
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data[40:60], iris.target[40:60]  # Two-class subset for brevity
loo = LeaveOneOut()
scores = []
for train_idx, test_idx in loo.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"LOO CV accuracy: {np.mean(scores):.2f}")

--OUTPUT--

LOO CV accuracy: 1.00

LeaveOneOut (LOO) is the most exhaustive cross-validation method. It treats each data point as a separate test set. The model is trained on all other data points and then evaluated on that single point. This cycle repeats until every data point has been used as a test set once.

  • The number of folds is equal to the number of samples in your dataset.
  • The code uses a small 20-sample subset of the data because LOO is computationally expensive: it trains a new model for every single sample. The subset is chosen to span more than one class, since SVC cannot be fit on a single class.
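The manual loop isn't required here either: cross_val_score accepts any splitter object as its cv argument, including LeaveOneOut. A sketch using a small two-class slice of Iris (chosen so the SVC always sees both classes):

```python
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# A 20-sample, two-class slice keeps the run fast
X, y = iris.data[40:60], iris.target[40:60]

# Passing a splitter object as cv replaces the manual LOO loop
scores = cross_val_score(SVC(), X, y, cv=LeaveOneOut())
print(f"Number of folds: {len(scores)}")
print(f"LOO accuracy: {scores.mean():.2f}")
```

Because LOO creates one fold per sample, the scores array here has 20 entries, each either 0 or 1.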

Advanced cross-validation techniques

When basic methods aren't enough, you can turn to advanced techniques for temporal data, custom validation sets, and more robust hyperparameter tuning.

Using TimeSeriesSplit for temporal data

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Simulated time series data
X = np.array([[i] for i in range(10)])
y = np.array([i * 2 + np.random.randn() for i in range(10)])

tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i+1}: Train: {train_idx}, Test: {test_idx}")

--OUTPUT--

Split 1: Train: [0 1 2 3], Test: [4 5]
Split 2: Train: [0 1 2 3 4 5], Test: [6 7]
Split 3: Train: [0 1 2 3 4 5 6 7], Test: [8 9]

When you're working with time-dependent data, like stock prices or sensor readings, shuffling the data for cross-validation is a big mistake. It would let your model peek into the future, leading to unrealistic performance scores. TimeSeriesSplit solves this by preserving the chronological order of your data.

  • It creates folds where the training set always comes before the test set.
  • The training set grows with each split, which simulates a real-world scenario of retraining your model as new data becomes available.
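TimeSeriesSplit also accepts test_size and gap parameters, useful when you want fixed-width test windows or a buffer that keeps samples immediately after the training cutoff out of the test set (for example, to avoid leakage from lagged features). A small sketch:

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.arange(12).reshape(-1, 1)

# gap=1 leaves one sample unused between train and test;
# test_size=2 fixes every test window at two samples
tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i + 1}: Train: {train_idx}, Test: {test_idx}")
```

Note the index skipped between each training and test range: that is the gap doing its job.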

Implementing PredefinedSplit for custom validation

from sklearn.model_selection import PredefinedSplit
import numpy as np

# Create data
X = np.random.rand(10, 2)
y = np.random.randint(0, 2, 10)
# Define which samples belong to which split (-1 for train)
test_fold = np.array([-1, -1, -1, -1, 0, 0, 0, 1, 1, 1])

ps = PredefinedSplit(test_fold)
for train_idx, test_idx in ps.split():
    print(f"Train: {train_idx}, Test: {test_idx}")

--OUTPUT--

Train: [0 1 2 3 7 8 9], Test: [4 5 6]
Train: [0 1 2 3 4 5 6], Test: [7 8 9]

PredefinedSplit gives you complete control over your cross-validation folds, which is useful when you have a fixed validation set you must use. Instead of letting scikit-learn create random splits, you define them yourself, dictating exactly which samples go into each train and test split.

  • The test_fold array is the key; each number in it corresponds to a data sample.
  • A value of -1 tells the splitter to always keep that sample in the training set.
  • Other integers, like 0 and 1, group the remaining samples into specific test folds.
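In practice you often start from a boolean mask marking a fixed validation set rather than a hand-written fold array. The sketch below converts such a mask into the test_fold format (the is_validation name is illustrative, not a scikit-learn convention):

```python
from sklearn.model_selection import PredefinedSplit
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((10, 2))

# Suppose the last three samples form your fixed validation set
is_validation = np.array([False] * 7 + [True] * 3)

# -1 keeps a sample in training for every split; 0 marks the one test fold
test_fold = np.where(is_validation, 0, -1)
ps = PredefinedSplit(test_fold)
print(f"Number of splits: {ps.get_n_splits()}")
for train_idx, test_idx in ps.split():
    print(f"Train: {train_idx}, Test: {test_idx}")
```

With a single fold label, PredefinedSplit produces exactly one train/test split, which is how you plug a fixed holdout set into tools like GridSearchCV.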

Performing nested cross-validation with GridSearchCV

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Set up parameter grid and inner CV for tuning
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
inner_cv = GridSearchCV(SVC(), param_grid, cv=3)

# Outer CV to evaluate tuned model
scores = cross_val_score(inner_cv, X, y, cv=5)
print(f"Nested CV scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")

--OUTPUT--

Nested CV scores: [0.96666667 0.96666667 0.93333333 0.96666667 1. ]
Mean accuracy: 0.97

Nested cross-validation provides a more honest assessment of your model's performance when you're also tuning hyperparameters. It uses a loop inside another loop to prevent the model evaluation from being influenced by the hyperparameter search, which can lead to overly optimistic scores.

  • The inner loop, handled by GridSearchCV, finds the best combination of C and kernel using three-fold cross-validation.
  • The outer loop, driven by cross_val_score, evaluates the entire tuning process with five separate folds, providing an unbiased performance estimate.
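Nested cross-validation reports an unbiased score but not a single set of "best" hyperparameters, because each outer fold may tune differently. If you also want the settings to use going forward, a common follow-up is to fit the GridSearchCV once on all the data and inspect its results:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)  # one final fit on all data to pick the deployed settings

print(f"Best parameters: {search.best_params_}")
print(f"Best inner-CV score: {search.best_score_:.2f}")
```

Keep reporting the nested score as your performance estimate; best_score_ from this final fit is optimistically biased because it was used to select the parameters.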

Move faster with Replit

Replit is an AI-powered development platform where you can go from learning techniques to building full applications without friction. It comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly.

With Agent 4, you can move beyond piecing together methods like cross_val_score and start creating complete tools. It takes your idea for a product and builds it, handling the code, databases, APIs, and deployment directly from your description. For example, you could ask it to build:

  • A model evaluation dashboard that runs KFold, StratifiedKFold, and TimeSeriesSplit on your data and visualizes the accuracy scores for each.
  • A custom data splitter that uses PredefinedSplit to partition a dataset based on a user-uploaded configuration file.
  • An automated hyperparameter tuning utility that uses nested cross-validation to find the best model settings and report the unbiased performance.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Cross-validation is powerful, but common mistakes can easily skew your results, so it's crucial to know what to watch out for.

Forgetting to shuffle data in KFold

If your dataset has any inherent order, such as being sorted by class or time, failing to shuffle can create biased folds. For instance, one test fold might end up with samples from only a single category, making that fold's evaluation meaningless. Since KFold doesn't shuffle by default, you must set shuffle=True to ensure your splits are random and representative.

Not stratifying when dealing with imbalanced data

When one class has far fewer samples than others, a random split might create test folds with no examples of the minority class, making your evaluation unreliable. Using StratifiedKFold instead of KFold is essential in these cases. It ensures each fold maintains the same class proportions as the original dataset, giving you a more accurate performance picture.

Data leakage during preprocessing

Data leakage is a subtle error where information from the test set bleeds into the training process, leading to inflated performance scores. This often happens if you apply preprocessing steps, like scaling data, to the entire dataset before splitting. The correct approach is to perform these transformations inside the cross-validation loop, fitting them only on the training data for each fold.

Forgetting to shuffle data in KFold

When a dataset is sorted by class, forgetting to shuffle it can completely break your cross-validation. Since KFold doesn't shuffle automatically, you risk creating test folds that contain only a single class, leading to unreliable evaluations. The following code demonstrates this problem.

from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5) # Missing shuffle parameter
scores = []
for train_idx, test_idx in kf.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"CV scores: {np.array(scores)}")

Because the Iris dataset is sorted by class, the non-shuffled KFold produces test folds dominated by one or two classes, so the class balance between training and test sets is badly skewed and some fold scores can drop sharply. See how to correct this in the following example.

from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"CV scores: {np.array(scores)}")

The fix is simple yet crucial. Adding shuffle=True to the KFold constructor ensures the data is randomized before splitting, which prevents folds from containing samples of only a single class. It's also wise to include random_state=42 to make the shuffle reproducible, so your results are consistent every time. This small change leads to much more reliable cross-validation scores, especially when your dataset has a natural order.

Not stratifying when dealing with imbalanced data

When your data is imbalanced, a standard KFold can produce misleading results. Even with shuffling, some folds might randomly end up without any minority class samples, making a true performance evaluation impossible. The following code demonstrates this problem in action.

from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
from sklearn.svm import SVC
import numpy as np

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"CV scores: {np.array(scores)}")

The code creates a dataset with a 90/10 class imbalance. Since KFold splits the data randomly, one fold can end up with only majority class samples, resulting in a misleadingly perfect score. The following example demonstrates the proper approach.

from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification
from sklearn.svm import SVC
import numpy as np

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"CV scores: {np.array(scores)}")

The fix is to swap KFold for StratifiedKFold. This splitter ensures each fold mirrors the class proportions of your entire dataset, which is essential for imbalanced data. Notice that the split method now takes both the features X and the labels y—this is how it maintains the correct class balance. This approach prevents skewed evaluations and gives you a far more accurate picture of your model's performance on minority classes.

Data leakage during preprocessing

Data leakage is a subtle but serious error where information from your test set contaminates the training data, giving you falsely optimistic results. This commonly occurs when you apply preprocessing steps, like scaling, to the entire dataset before the cross-validation split.

The following code demonstrates how scaling data with StandardScaler before the loop leads to this problem.

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Scaling outside CV loop
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X_scaled):
    clf = SVC().fit(X_scaled[train_idx], y[train_idx])
    scores.append(clf.score(X_scaled[test_idx], y[test_idx]))

The StandardScaler is fitted on the entire dataset before the split, allowing it to learn from the test data. This leakage results in overly optimistic scores. The following example shows how to correctly integrate this step into your workflow.

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.svm import SVC
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])
    clf = SVC().fit(X_train, y[train_idx])
    scores.append(clf.score(X_test, y[test_idx]))

The fix is to move the StandardScaler inside the cross-validation loop. For each fold, you fit the scaler and transform the training data (X_train) in one step with fit_transform. Then, you apply that same fitted scaler to transform the test data (X_test) using just transform. This crucial separation ensures that the model trains without any knowledge of the test set's data distribution, preventing leakage and providing an honest performance evaluation.
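An alternative to managing the scaler by hand is to wrap it and the model in a scikit-learn Pipeline; cross_val_score then re-fits the scaler on each fold's training portion automatically, so leakage cannot occur. A sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# The pipeline fits the scaler on each fold's training data only,
# then applies the same transform to that fold's test data
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

This pattern scales to any number of preprocessing steps, which is why pipelines are the usual recommendation over manual per-fold transforms.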

Real-world applications

Now that you've seen the common pitfalls, you can apply these techniques to real-world challenges like comparing models and classifying text.

Comparing multiple models with cross_val_score()

You can use cross_val_score() to quickly run a head-to-head comparison and see which model performs best on your dataset.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rf_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
gb_scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
print(f"Random Forest: {rf_scores.mean():.4f}")
print(f"Gradient Boosting: {gb_scores.mean():.4f}")

This code evaluates two different machine learning models, a RandomForestClassifier and a GradientBoostingClassifier, on the same dataset. It uses the cross_val_score function to automate the five-fold cross-validation process for each one.

  • The function returns an array of accuracy scores, one for each of the five test runs.
  • By calculating the mean of these scores, you get a reliable performance estimate for each model on the given data.
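One subtlety: with an integer cv, scikit-learn uses unshuffled stratified folds for classifiers. If you want shuffled folds that are guaranteed identical for both models, pass one shared splitter object to both calls. A sketch that also prints per-fold differences:

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# One splitter object means both models see identical train/test folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
gb_scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=cv)

# Per-fold differences show which model wins fold by fold
for i, (rf, gb) in enumerate(zip(rf_scores, gb_scores)):
    print(f"Fold {i + 1}: RF={rf:.3f}, GB={gb:.3f}, diff={rf - gb:+.3f}")
```

Comparing on matched folds makes the per-fold score differences meaningful, since each difference reflects the models rather than the split.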

Using a Pipeline for text classification with cross-validation

A Pipeline lets you chain together text vectorization and a classifier, ensuring that the entire workflow is correctly evaluated with cross-validation.

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
scores = cross_val_score(text_clf, newsgroups.data, newsgroups.target, cv=5)
print(f"Text classification accuracy: {scores.mean():.4f}")

This code evaluates a text classification model using a Pipeline to streamline the process. The pipeline bundles two key steps together, making the evaluation cleaner and more reliable.

  • First, TfidfVectorizer converts the raw text from the 20newsgroups dataset into numerical data.
  • Then, a LinearSVC classifier is trained on these numbers.

The cross_val_score function automatically runs this entire two-step process five times, calculating the model's average accuracy across different data splits.

Get started with Replit

Turn your knowledge into a real application with Replit Agent. Describe what you want, like "a dashboard that compares model scores using cross_val_score" or "a tool that visualizes TimeSeriesSplit folds."

It writes the code, tests for errors, and deploys your app from a simple description. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
