How to find the accuracy of a model in Python
Learn how to find the accuracy of a model in Python. Explore different methods, tips, real-world applications, and common error debugging.

Finding a model's accuracy in Python is a crucial step in machine learning. It helps you evaluate performance and validate results before you deploy your model to production.
In this article, we'll cover key techniques to measure accuracy. You'll get practical tips, see real-world applications, and learn how to debug common issues to improve your model's reliability.
Using accuracy_score from scikit-learn
from sklearn.metrics import accuracy_score
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1]
accuracy = accuracy_score(y_true, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
--OUTPUT--
Model accuracy: 0.80
The accuracy_score function is a straightforward tool for validation. It operates by comparing two key lists:
- y_true: This list contains the ground truth, or the correct labels for your data.
- y_pred: This list holds the labels your model predicted.
The function calculates the proportion of correct predictions. Here, the model got 4 out of 5 right, which is why the output is 0.80, or 80% accuracy.
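Under the hood, accuracy is nothing more than the fraction of positions where the two lists match. A quick NumPy sketch confirms this agrees with accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1]

# Accuracy is the mean of the elementwise matches between the two lists
manual = np.mean(np.array(y_true) == np.array(y_pred))
print(manual)  # 0.8
print(accuracy_score(y_true, y_pred) == manual)  # True
```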
Basic accuracy measurement techniques
Although accuracy_score offers a solid starting point, you'll often need more detailed methods to truly understand your model's performance.
Calculating accuracy from a confusion matrix
from sklearn.metrics import confusion_matrix
import numpy as np
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred)
accuracy = np.sum(np.diag(cm)) / np.sum(cm)
print(f"Confusion matrix:\n{cm}")
print(f"Accuracy: {accuracy:.2f}")
--OUTPUT--
Confusion matrix:
[[3 1]
[1 3]]
Accuracy: 0.75
A confusion matrix offers a more detailed view than a simple accuracy score. The confusion_matrix function creates a grid that visualizes correct and incorrect predictions. To calculate accuracy from this matrix, you divide the number of correct predictions by the total number of predictions.
- The diagonal elements of the matrix, found with np.diag(cm), represent all correct predictions.
- Summing these and dividing by the total number of elements, calculated with np.sum(cm), gives you the overall accuracy, which here is 75%.
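For binary problems, there is a handy shortcut: ravel() flattens the matrix into its four cells, which you can unpack by name. A small sketch using the same labels as above:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# For a binary problem, ravel() unpacks the matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(tn, fp, fn, tp)               # 3 1 1 3
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
```

Naming the cells this way makes follow-up metrics like precision (tp / (tp + fp)) easy to compute by hand.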
Using cross-validation to measure accuracy
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, random_state=42)
model = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.2f}")
--OUTPUT--
Cross-validation accuracy scores: [0.85 0.9 0.85 0.9 0.95]
Mean accuracy: 0.89
Cross-validation provides a more reliable accuracy estimate than a single train-test split. The cross_val_score function automates this process for you. In the example, cv=5 tells the function to split the dataset into five sections, or "folds."
- The model trains on four folds and is tested on the fifth one.
- This process repeats five times, ensuring each fold gets a turn as the test set.
This results in five distinct accuracy scores. By calculating their average with cv_scores.mean(), you get a more stable and trustworthy measure of how your model will perform on new data.
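To see the mechanics, here is a hand-rolled sketch of the same loop using KFold. Note one caveat: for classifiers, passing cv=5 to cross_val_score actually uses stratified folds by default, so the exact numbers from this plain-KFold version can differ slightly.

```python
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, random_state=42)
fold_scores = []
# Each iteration trains on four folds and tests on the held-out fifth
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = RandomForestClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(fold_scores)
```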
Using classification_report for detailed metrics
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = SVC().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
--OUTPUT--
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.93      0.93      0.93        14
           2       0.93      0.93      0.93        15

    accuracy                           0.96        45
   macro avg       0.95      0.96      0.95        45
weighted avg       0.96      0.96      0.96        45
The classification_report function provides a comprehensive performance summary. It’s particularly useful for seeing how your model handles each class individually—something a simple accuracy score can’t show you. The report breaks down performance using several key metrics:
- precision: Of all the times the model predicted a certain class, how often was it correct?
- recall: Of all the actual instances of a class, how many did the model successfully identify?
- f1-score: A single metric that balances both precision and recall.
This detailed view helps you diagnose if your model is biased toward or against specific classes.
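If you want these numbers programmatically rather than as printed text, classification_report accepts output_dict=True and returns the same report as a nested dict, which is convenient for logging or automated threshold checks. A short sketch on the same iris split:

```python
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
y_pred = SVC().fit(X_train, y_train).predict(X_test)

# output_dict=True returns the report as a nested dict instead of a string
report = classification_report(y_test, y_pred, output_dict=True)
print(report["accuracy"])
print(report["macro avg"]["f1-score"])
```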
Advanced accuracy evaluation methods
While the basic tools offer a solid baseline, you'll need more advanced methods to tackle complex scenarios like imbalanced datasets and custom requirements.
Handling imbalanced datasets with weighted accuracy
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import accuracy_score
# Imbalanced dataset example: 8 samples of class 0, only 2 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
standard_acc = accuracy_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)
print(f"Standard accuracy: {standard_acc:.2f}")
print(f"Balanced accuracy: {balanced_acc:.2f}")
--OUTPUT--
Standard accuracy: 0.80
Balanced accuracy: 0.50
When your dataset is imbalanced, like this example with eight instances of class 0 and only two of class 1, standard accuracy can be deceptive. The model here predicts the majority class every time, yet its accuracy_score of 0.80 might seem acceptable. That number hides a critical weakness because it doesn't distinguish between the majority and minority classes.
This is where balanced_accuracy_score comes in. It provides a more honest evaluation by calculating the average of the recall scores for each class, giving them equal weight. The resulting score of 0.50 reveals that the model never identifies the underrepresented class, giving you a truer picture of its effectiveness.
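You can verify what balanced accuracy computes with per-class recalls. The sketch below uses hypothetical toy labels for a model that only ever predicts the majority class:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical toy labels: the model only ever predicts the majority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Balanced accuracy is the unweighted mean of the per-class recalls
per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall)                         # [1. 0.]
print(per_class_recall.mean())                  # 0.5
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```

Class 0 gets a perfect recall of 1.0 and class 1 gets 0.0, so the equal-weight average lands at 0.5 regardless of how many samples each class has.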
Using stratified K-fold cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Stratified CV scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f}")
--OUTPUT--
Stratified CV scores: [0.96491228 0.95614035 0.94736842 0.98245614 0.96491228]
Mean accuracy: 0.9632
When your dataset has an uneven class distribution, standard cross-validation can be misleading. StratifiedKFold is a smarter approach that ensures each data split maintains the original dataset's class proportions, giving you a more reliable evaluation.
- It works by preserving the percentage of samples for each class within every fold. This prevents a split from accidentally containing too many examples of one class, which would skew the results.
By passing StratifiedKFold to the cv parameter in cross_val_score, you ensure each of the five folds is a representative sample, leading to a more accurate performance estimate.
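You can check this guarantee directly by inspecting the class balance of each test fold. A short sketch on the same breast cancer dataset:

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every test fold should mirror the dataset's overall class balance
print(f"Overall positive rate: {y.mean():.3f}")
for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    print(f"Fold {i}: test positive rate = {y[test_idx].mean():.3f}")
```

Each fold's positive rate comes out within a fraction of a percent of the dataset-wide rate, which is exactly the property that keeps the per-fold scores comparable.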
Creating custom accuracy metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
def custom_accuracy(y_true, y_pred):
    correct = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return correct / len(y_true)
custom_scorer = make_scorer(custom_accuracy)
X, y = make_classification(n_samples=200, random_state=42)
model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
print(f"Custom accuracy scores: {scores}")
print(f"Average custom accuracy: {scores.mean():.2f}")
--OUTPUT--
Custom accuracy scores: [0.85 0.8 0.8 0.9 0.875]
Average custom accuracy: 0.85
Sometimes you need a performance metric that scikit-learn doesn't offer out of the box. You can create your own by defining a Python function, like custom_accuracy in the example, which takes the true and predicted labels and returns a score.
To integrate your function with scikit-learn's evaluation tools, you just need to wrap it.
- The make_scorer utility converts your custom function into a usable scorer object.
- You can then pass this new scorer to the scoring parameter in functions like cross_val_score to evaluate your model based on your unique logic.
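make_scorer also handles metrics where lower is better. As a sketch, here is a hypothetical error_rate metric wrapped with greater_is_better=False, which tells scikit-learn to negate the score so that higher still means better:

```python
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def error_rate(y_true, y_pred):
    """Fraction of misclassified samples (lower is better)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# greater_is_better=False marks this as a loss; scikit-learn negates it
# so cross_val_score still treats higher scores as better
error_scorer = make_scorer(error_rate, greater_is_better=False)

X, y = make_classification(n_samples=200, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y,
                         cv=5, scoring=error_scorer)
print(scores)  # negated error rates, all <= 0
```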
Move faster with Replit
Replit is an AI-powered development platform where you can skip setup and start coding instantly. All Python dependencies are pre-installed, so you can focus on building instead of managing environments.
While mastering individual techniques is important, Agent 4 helps you move from piecing together functions to building complete applications. Instead of just calculating accuracy, you can describe the app you want to build, and the Agent will take it from concept to a working product.
- A model performance dashboard that uses metrics from classification_report to visualize precision and recall for each class, helping you spot biases.
- A fraud detection evaluator that automatically compares standard accuracy with balanced_accuracy_score to ensure your model works well on imbalanced data.
- A model selection tool that runs cross_val_score on several algorithms and presents the mean accuracy scores in a simple table to help you choose the best one.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with the right tools, you can run into a few common pitfalls when measuring model accuracy.
Fixing incorrect parameter order in accuracy_score
A simple but frequent mistake is mixing up the parameters in accuracy_score. The function expects the true labels first (y_true) followed by the predicted labels (y_pred). Because accuracy only counts matches, swapping them happens not to change the number, but the same slip silently corrupts asymmetric metrics such as precision and recall, so it's worth catching early.
Handling missing values before calculating accuracy
Most scikit-learn functions, including those for accuracy, can't handle missing data like NaN values. If you try to calculate accuracy on a dataset with gaps, you'll get a ValueError. Always clean your data by either removing or filling in missing values before you evaluate your model.
Avoiding accuracy misinterpretation with imbalanced data
Relying on a single accuracy score can be deceptive, especially with imbalanced data. Here’s how to avoid common misinterpretations:
- A high accuracy score might just mean your model is good at predicting the majority class, while failing on the minority one.
- Always check the class distribution in your dataset. If it's skewed, use metrics like balanced_accuracy_score or a classification_report for a more complete picture.
- These tools give you a class-by-class breakdown, revealing weaknesses that a simple accuracy score would otherwise hide.
Fixing incorrect parameter order in accuracy_score
One of the most frequent yet subtle errors is swapping the y_true and y_pred arguments in accuracy_score. Accuracy itself is symmetric, so this particular swap won't change the printed number, but the habit silently corrupts evaluations built on asymmetric metrics like precision and recall. The following code shows the mistake in action.
from sklearn.metrics import accuracy_score
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
# Incorrect parameter order
accuracy = accuracy_score(y_pred, y_true)
print(f"Model accuracy: {accuracy:.2f}")
By passing y_pred first, the function treats your model's predictions as the ground truth. Here the score happens to come out the same because accuracy is symmetric, but with any asymmetric metric the result would be meaningless. See how to pass the parameters in the correct order.
from sklearn.metrics import accuracy_score
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
# Correct parameter order: y_true first, then y_pred
accuracy = accuracy_score(y_true, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
The fix is simple: always pass the true labels, y_true, as the first argument to accuracy_score, followed by your model's predictions, y_pred. Swapping them won't raise an error, and for plain accuracy the value even comes out the same, which is exactly why the mistake survives quick testing or refactoring. The moment you switch to precision, recall, or any other asymmetric metric, the swapped order silently produces the wrong number, so it's a good habit to double-check the parameter order every time you evaluate a model.
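A quick sketch makes the danger concrete: accuracy is unchanged by the swap, but precision with swapped arguments actually computes recall, a different number entirely.

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# Accuracy happens to be symmetric, so the swap goes unnoticed here...
print(accuracy_score(y_true, y_pred))   # 0.4
print(accuracy_score(y_pred, y_true))   # 0.4
# ...but precision is not: with swapped arguments it computes recall instead
print(precision_score(y_true, y_pred))  # 0.5
print(precision_score(y_pred, y_true))  # 0.3333...
```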
Handling missing values before calculating accuracy
Your accuracy calculation will fail if your data contains missing values, often represented as NaN. Scikit-learn's metrics functions aren't designed to handle these gaps and will raise a ValueError. The following code demonstrates what happens when you try.
import numpy as np
from sklearn.metrics import accuracy_score
y_true = [0, 1, 0, 1, np.nan, 1]
y_pred = [0, 0, 0, 1, 1, 1]
# This will raise an error
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
The code fails because accuracy_score can't handle the np.nan value inside the y_true list. It's a common roadblock when your data isn't clean. The following example shows how to fix this before calculating accuracy.
import numpy as np
from sklearn.metrics import accuracy_score
y_true = [0, 1, 0, 1, np.nan, 1]
y_pred = [0, 0, 0, 1, 1, 1]
# Filter out NaN values
mask = ~np.isnan(y_true)
accuracy = accuracy_score(np.array(y_true)[mask], np.array(y_pred)[mask])
print(f"Accuracy: {accuracy:.2f}")
To fix this, you must remove the NaN values before evaluation. The solution creates a boolean mask using ~np.isnan(y_true) to identify all valid entries. This mask is then applied to both y_true and y_pred, ensuring you only compare the non-missing values. This keeps your data aligned and gives you a correct accuracy score. You'll often run into this issue when working with real-world datasets, which frequently contain missing information.
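If your labels live in a pandas DataFrame rather than raw arrays, dropna() achieves the same cleanup in one step while keeping both columns aligned. A sketch with the same toy labels:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Putting both label columns in one DataFrame keeps them aligned
df = pd.DataFrame({"y_true": [0, 1, 0, 1, np.nan, 1],
                   "y_pred": [0, 0, 0, 1, 1, 1]})

# dropna() removes any row with a missing value in either column
clean = df.dropna()
accuracy = accuracy_score(clean["y_true"], clean["y_pred"])
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.80
```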
Avoiding accuracy misinterpretation with imbalanced data
A high accuracy score can be deceptive with imbalanced data. A model might achieve 90% accuracy by only predicting the majority class, while completely failing on the minority one. This creates a false sense of success, hiding critical performance issues.
The following code demonstrates this problem. It uses a DummyClassifier that always predicts the most frequent class, yet still achieves a high accuracy score.
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
# Create imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = dummy.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")
Because the DummyClassifier only predicts the most frequent class, it achieves a high accuracy score by default. This hides its complete failure on the minority class. The following code demonstrates how to get a more reliable assessment.
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = dummy.predict(X)
print(f"Standard accuracy: {accuracy_score(y, y_pred):.2f}")
print(f"Balanced accuracy: {balanced_accuracy_score(y, y_pred):.2f}")
To get a true sense of performance, use balanced_accuracy_score. While standard accuracy is a deceptive 0.90, the balanced score is 0.50—no better than a coin flip. This is because it averages the recall for each class, giving them equal weight, which correctly shows that the model fails on the minority class. Always use it when your dataset is skewed, like in fraud detection, where missing rare cases is costly.
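To see exactly where the dummy model fails, you can break recall out per class. This sketch reuses the same imbalanced setup and shows perfect recall on the majority class and zero on the minority:

```python
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score

# Same imbalanced setup: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
y_pred = DummyClassifier(strategy='most_frequent').fit(X, y).predict(X)

# Per-class recall exposes the failure that one accuracy number hides
print(recall_score(y, y_pred, average=None))  # [1. 0.]
```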
Real-world applications
With these potential errors in mind, you can apply accuracy evaluation techniques to solve real-world problems like customer churn and medical diagnosis.
Evaluating a customer churn prediction model with accuracy_score
In a business scenario like predicting customer churn, you can use accuracy_score to get a quick read on your model's performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
# Simple customer churn dataset
data = {'usage_minutes': [105, 231, 19, 142, 325, 88, 302, 250],
'contract_length': [1, 24, 12, 1, 24, 1, 36, 12],
'churn': [1, 0, 0, 1, 0, 1, 0, 0]} # 1=churned
df = pd.DataFrame(data)
X = df[['usage_minutes', 'contract_length']]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Churn prediction accuracy: {accuracy_score(y_test, y_pred):.2f}")
This example demonstrates a complete, small-scale machine learning workflow. It begins by creating a pandas DataFrame to hold customer data, using features like usage_minutes and contract_length to predict whether a customer will churn.
- The data is split using train_test_split, a crucial step that ensures the model is evaluated on information it hasn't seen during training.
- A RandomForestClassifier is then trained on the training portion of the data.
- Finally, the model makes predictions on the test set, and accuracy_score calculates the percentage of correct predictions.
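One caveat with a dataset this small: a 25% test split leaves only two test samples, so the accuracy can only be 0.0, 0.5, or 1.0. As a sketch, leave-one-out cross-validation gives every row a turn as the test set and produces a steadier read:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

data = {'usage_minutes': [105, 231, 19, 142, 325, 88, 302, 250],
        'contract_length': [1, 24, 12, 1, 24, 1, 36, 12],
        'churn': [1, 0, 0, 1, 0, 1, 0, 0]}
df = pd.DataFrame(data)
X, y = df[['usage_minutes', 'contract_length']], df['churn']

# Each of the 8 rows takes a turn as a one-sample test set
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=LeaveOneOut())
print(f"LOO mean accuracy: {scores.mean():.2f}")
```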
Comparing multiple models' accuracy for medical diagnosis
In high-stakes fields like medical diagnosis, it's critical to compare several models to ensure you select the most reliable one for the task.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
models = {"Logistic Regression": LogisticRegression(max_iter=1000),
"Random Forest": RandomForestClassifier(random_state=42),
"SVM": SVC()}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: Accuracy={accuracy_score(y_test, y_pred):.4f}, "
          f"Balanced Accuracy={balanced_accuracy_score(y_test, y_pred):.4f}")
This code automates the comparison of three different classification algorithms. It first loads the breast cancer dataset and splits it for training and testing. Then, it iterates through a dictionary containing the models:
- LogisticRegression
- RandomForestClassifier
- SVC
For each model, the code trains it, makes predictions, and prints both the standard accuracy_score and the balanced_accuracy_score. This provides a direct, side-by-side evaluation, making it easy to identify which model performs best on the test data.
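For a higher-stakes comparison, you may prefer cross-validated means over a single split. This sketch ranks the same three candidates by mean 5-fold accuracy and picks the best:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {"Logistic Regression": LogisticRegression(max_iter=1000),
          "Random Forest": RandomForestClassifier(random_state=42),
          "SVM": SVC()}

# Rank candidates by mean cross-validated accuracy instead of one split
means = {name: cross_val_score(model, X, y, cv=5).mean()
         for name, model in models.items()}
best = max(means, key=means.get)
print(f"Best model: {best} ({means[best]:.4f})")
```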
Get started with Replit
Turn what you've learned into a real tool with Replit Agent. Try prompts like: “Build a dashboard that compares models using cross_val_score” or “Create a tool that checks for imbalance with balanced_accuracy_score.”
The Agent writes the code, tests for errors, and helps you deploy your app. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.