How to make a decision tree in Python
Build decision trees in Python. Our guide covers methods, tips, real-world uses, and debugging common errors.

Decision trees are a core machine learning concept. They help you model complex decisions and predict outcomes. Python, with its rich libraries, provides an excellent environment to build and train them.
In this article, we'll walk through the essential techniques to build your own decision tree. You'll get practical tips, see real-world applications, and learn how to debug your models effectively.
Basic decision tree with scikit-learn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.2f}")

Output:
Accuracy: 0.95
This code uses scikit-learn to build and test a simple decision tree on the classic Iris dataset. The process involves a few key steps:
- Data Splitting: We use train_test_split to divide the data. This is crucial for measuring the model's predictive power on data it hasn't seen before, which is how you detect overfitting.
- Training: The fit method trains the DecisionTreeClassifier on the training portion of the data, teaching it to associate flower measurements with specific species.
- Reproducibility: Setting random_state=42 makes the data split and model training deterministic, so you'll get the same 95% accuracy every time you run the code.
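Overall accuracy can hide per-class problems. As a complementary sketch (not part of the example above), scikit-learn's classification_report breaks the score down by species:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same split and model as above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Per-class precision, recall, and F1, not just one overall number
print(classification_report(y_test, clf.predict(X_test), target_names=iris.target_names))
```

A class the model consistently misclassifies shows up immediately in its precision and recall rows, even when overall accuracy looks healthy.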
Fundamental decision tree techniques
Now that you have a working model, you can take it further by visualizing it with export_graphviz, validating it with cross-validation, and tuning it with GridSearchCV.
Visualizing the decision tree with export_graphviz
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)
dot_data = export_graphviz(clf, feature_names=iris.feature_names,
                           class_names=iris.target_names, filled=True)
graph = graphviz.Source(dot_data)

Output:
[Graphviz visualization of the decision tree structure]
Visualizing your decision tree is a great way to understand its internal logic. The export_graphviz function converts your trained model into a DOT format—a graph description language—which the graphviz library then renders as an image. Notice we set max_depth=3 to keep the tree simple and readable. In a notebook, the graph object displays inline; in a plain script, call graph.render() to save the image to a file (this requires the Graphviz system binaries, not just the Python package).
- The feature_names and class_names arguments label the nodes, clarifying the decision criteria and outcomes at each step.
- Setting filled=True colors the nodes to visually represent the majority class for each split.
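If you don't have Graphviz installed, scikit-learn's export_text offers a plain-text alternative. A minimal sketch with the same model:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

# Text-only tree: prints the split thresholds and leaf classes as an
# indented rule list, with no extra rendering dependencies
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules show the same splits the Graphviz diagram would, which makes export_text handy for logging a model's structure or diffing two trained trees.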
Using cross-validation for more reliable evaluation
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Average accuracy: {cv_scores.mean():.2f}")

Output:
Cross-validation scores: [0.96666667 0.96666667 0.9 0.93333333 1. ]
Average accuracy: 0.95
While a single train-test split is useful, cross-validation provides a more robust evaluation of your model's performance. The cross_val_score function handles this for you automatically.
- It splits the data into a specified number of "folds" (here, five, because we set cv=5).
- The model is then trained and tested five times. In each run, a different fold serves as the test set while the remaining four are used for training.
- By averaging the scores from all five runs with cv_scores.mean(), you get a more stable and trustworthy measure of accuracy.
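To demystify what cross_val_score does internally, here is a rough manual sketch. For classifiers, scikit-learn defaults to stratified folds, so each fold preserves the class balance:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)

# Roughly what cross_val_score does for a classifier: build stratified
# folds, train on four of them, score on the held-out fifth
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(iris.data, iris.target):
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    scores.append(clf.score(iris.data[test_idx], iris.target[test_idx]))
print([round(s, 3) for s in scores])
```

Writing the loop yourself is rarely necessary, but it clarifies why the five scores differ: each one comes from a different train/test partition of the same data.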
Optimizing with GridSearchCV for hyperparameter tuning
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
iris = load_iris()
param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(iris.data, iris.target)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.2f}")

Output:
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best accuracy: 0.97
Fine-tuning your model's hyperparameters is key to improving its performance. GridSearchCV automates this process by testing every combination of parameters you provide in a param_grid. It uses cross-validation to find which settings work best.
- The param_grid dictionary defines the search space. Here, you're testing different values for max_depth and min_samples_split.
- After running fit, you can access the optimal settings with best_params_, which in this case led to 97% accuracy.
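Beyond best_params_, the fitted GridSearchCV object exposes two other useful attributes. This sketch (reusing the same grid) shows best_estimator_, a tree already refit on the full dataset with the winning settings, and cv_results_, which records the score of every combination tried:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
param_grid = {'max_depth': [3, 5, 10], 'min_samples_split': [2, 5, 10]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(iris.data, iris.target)

# best_estimator_ is ready to use for predictions immediately
best_tree = grid_search.best_estimator_
print(best_tree.get_params()['max_depth'])

# cv_results_ holds the mean cross-validated score for all 9 combinations
for params, score in zip(grid_search.cv_results_['params'],
                         grid_search.cv_results_['mean_test_score']):
    print(params, round(score, 3))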
Advanced decision tree implementations
With your model tuned, you can now push its performance further by analyzing feature importance, handling imbalanced data, and using more powerful ensemble methods.
Analyzing feature importance for better understanding
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np
iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(iris.data, iris.target)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for i in range(len(iris.feature_names)):
    print(f"{iris.feature_names[indices[i]]}: {importances[indices[i]]:.4f}")

Output:
petal width (cm): 0.5425
petal length (cm): 0.4575
sepal length (cm): 0.0000
sepal width (cm): 0.0000
Understanding which features your model values most is crucial for interpretation. After training, the DecisionTreeClassifier stores this information in its feature_importances_ attribute. Each feature gets a score reflecting its impact on the model's decisions. The code then sorts these scores to rank the features from most to least influential.
- For the Iris dataset, the model relies entirely on petal width (cm) and petal length (cm). The sepal measurements have an importance of zero, meaning they weren't used in any decision splits.
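As a complementary check (a sketch using scikit-learn's permutation_importance, which is not part of the code above), you can shuffle each feature in turn and measure how much the model's score drops. Features the tree never uses produce no drop at all:

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Shuffle each feature 10 times and record the average accuracy drop;
# the sepal features should sit near zero, matching feature_importances_
result = permutation_importance(clf, iris.data, iris.target,
                                n_repeats=10, random_state=42)
for name, mean in zip(iris.feature_names, result.importances_mean):
    print(f"{name}: {mean:.4f}")
```

Permutation importance is model-agnostic, so the same check works unchanged on a Random Forest or any other estimator.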
Handling imbalanced data with SMOTE
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
import numpy as np
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)
print(f"Original class distribution: {np.bincount(y)}")
print(f"Balanced class distribution: {np.bincount(y_balanced)}")

Output:
Original class distribution: [900 100]
Balanced class distribution: [900 900]
Decision trees can struggle with imbalanced data, where one class vastly outnumbers another. This biases the model toward the majority class. SMOTE (Synthetic Minority Over-sampling Technique) addresses this by creating new, synthetic samples for the minority class instead of just duplicating them.
- The code uses make_classification to generate a dataset with a 900-to-100 class imbalance.
- Applying smote.fit_resample balances the dataset by oversampling the minority class, resulting in an even 900-to-900 split, which helps your model learn the patterns of both classes.
Boosting performance with RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(f"Decision Tree accuracy: {dt.score(X_test, y_test):.2f}")
print(f"Random Forest accuracy: {rf.score(X_test, y_test):.2f}")

Output:
Decision Tree accuracy: 0.95
Random Forest accuracy: 0.97
A single decision tree is good, but a Random Forest is often better. It’s an ensemble method, meaning it builds many decision trees—100 in this case, set by n_estimators=100—and aggregates their predictions. This collective approach makes the model more robust and less prone to errors from any single tree.
- The code directly compares a single DecisionTreeClassifier with a RandomForestClassifier on the same data.
- By averaging the results from all its trees, the Random Forest reduces overfitting and improves predictive power, boosting accuracy from 95% to 97%.
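Random Forests also come with a built-in validation trick worth knowing. The sketch below (a variation on the example above) enables out-of-bag scoring: each tree is evaluated on the samples its bootstrap draw never included, yielding a cross-validation-like estimate without a separate test set:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# oob_score=True scores each tree on the samples left out of its
# bootstrap sample, so no held-out test set is needed
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(iris.data, iris.target)
print(f"Out-of-bag accuracy: {rf.oob_score_:.2f}")
```

The out-of-bag estimate is a convenient sanity check during development, though a proper held-out test set remains the standard for final evaluation.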
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. You don't need to worry about managing environments or installations.
The techniques in this article are powerful building blocks. With Agent 4, you can move from piecing them together to building complete applications. It takes your description and handles the code, databases, APIs, and deployment. For example, you could describe:
- A customer churn predictor that uses feature_importances_ to identify the most significant reasons customers leave.
- A fraud detection system that leverages SMOTE to effectively train a model on rare but critical fraudulent activities.
- An automated model tuner that runs GridSearchCV to find the optimal max_depth and min_samples_split for your classifier.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Building decision trees often involves a few common hurdles, but they're straightforward to overcome once you know what to look for.
- Fixing overfitting with max_depth: Overfitting happens when your model learns the training data too well, including its noise, and then fails to generalize to new data. You can prevent this by tuning the max_depth hyperparameter, which limits the tree's complexity and forces it to capture only the most significant patterns.
- Handling categorical features with OneHotEncoder: Decision trees in scikit-learn require numerical inputs, so they can't process text labels directly. Use OneHotEncoder to convert categorical data into a binary format (columns of 0s and 1s) that the model can understand.
- Troubleshooting prediction shape mismatch with reshape: You'll often see errors when predicting a single sample because its array shape is wrong. The model expects a 2D array, but a single sample is often a 1D array. Use the reshape(1, -1) method to adjust the data's dimensions to the format your model expects.
Fixing overfitting with proper max_depth parameter
Overfitting is a classic trap where a model memorizes training data instead of learning general patterns. This results in high training accuracy but poor performance on unseen test data. The code below demonstrates this problem with an unconstrained decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
# No limit on tree depth - will likely overfit
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(f"Training accuracy: {clf.score(X_train, y_train):.2f}")
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
Because the DecisionTreeClassifier is unconstrained, it achieves perfect training accuracy but lower test accuracy. This gap is a clear sign of overfitting. The code below shows how a simple adjustment to the model's parameters closes this performance gap.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
# Set max_depth to prevent overfitting
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
print(f"Training accuracy: {clf.score(X_train, y_train):.2f}")
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
By setting max_depth=3, you limit the tree's complexity, forcing the model to focus on broader patterns instead of memorizing training data. The result is that training and test accuracies become much more aligned, showing the model now generalizes better. Keep an eye out for a large gap between training and test scores—it’s a classic sign that you need to rein in your model's complexity with a parameter like max_depth.
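To see the effect directly, you can sweep several max_depth values on the same split and watch how the training and test scores compare. This quick diagnostic sketch is an addition to the example above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Compare training and test accuracy at several depths; a large gap
# between the two columns signals overfitting
for depth in [None, 5, 3, 2]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(f"max_depth={depth}: train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")
```

A small sweep like this is often enough to pick a sensible depth before committing to a full GridSearchCV run.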
Handling categorical features with OneHotEncoder
Decision trees in scikit-learn work with numbers, not text. This means you can't directly use categorical features like 'red' or 'blue' in your model. Attempting to fit the model with this data will raise an error, as you'll see below.
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
# Create dataset with categorical features
data = pd.DataFrame({
    'feature1': [1.2, 0.5, 3.1, 2.0],
    'feature2': ['red', 'blue', 'red', 'green'],
    'target': [0, 1, 0, 1]
})
X = data[['feature1', 'feature2']]
y = data['target']
# This will fail because 'feature2' is categorical
clf = DecisionTreeClassifier()
clf.fit(X, y)
The fit method can't process text values like 'red' and 'blue' in the feature2 column, which triggers an error. The following code demonstrates how to properly prepare this data for the model before training.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
# Create dataset with categorical features
data = pd.DataFrame({
    'feature1': [1.2, 0.5, 3.1, 2.0],
    'feature2': ['red', 'blue', 'red', 'green'],
    'target': [0, 1, 0, 1]
})
# Properly encode categorical features
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(), [1])],
    remainder='passthrough'
)
X_encoded = preprocessor.fit_transform(data[['feature1', 'feature2']])
y = data['target']
clf = DecisionTreeClassifier()
clf.fit(X_encoded, y)
To fix this, you'll need to convert the text data into a numerical format. The OneHotEncoder handles this by creating new binary columns for each category. You can use a ColumnTransformer to apply this encoding only to the categorical feature, while remainder='passthrough' keeps the numerical columns as they are. Your model can then be trained on this newly encoded data without any errors. This is a common step whenever your dataset contains non-numeric features.
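For quick experiments, pandas offers a lighter-weight alternative. This sketch uses pd.get_dummies (not part of the example above) to expand the categorical column directly in the DataFrame:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'feature1': [1.2, 0.5, 3.1, 2.0],
    'feature2': ['red', 'blue', 'red', 'green'],
    'target': [0, 1, 0, 1]
})

# get_dummies replaces 'feature2' with red/blue/green indicator columns
X_encoded = pd.get_dummies(data[['feature1', 'feature2']], columns=['feature2'])
clf = DecisionTreeClassifier(random_state=42).fit(X_encoded, data['target'])
print(list(X_encoded.columns))
```

For production pipelines, the OneHotEncoder approach is still preferable: a fitted encoder remembers the categories it saw at training time, whereas get_dummies produces whatever columns happen to appear in each batch of data.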
Troubleshooting prediction shape mismatch with reshape
A common error you'll encounter is a shape mismatch when making a single prediction. scikit-learn models expect a 2D array of samples, but a single data point is often a 1D array. This mismatch will cause the predict method to fail.
The code below shows exactly what happens when you pass a single sample with the wrong dimensions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(iris.data, iris.target)
# Trying to predict with incorrect shape
new_sample = [5.1, 3.5, 1.4, 0.2]
prediction = clf.predict(new_sample) # This will fail
print(f"Prediction: {prediction}")
The predict method is designed to process a batch of samples. Passing a single sample directly, as with new_sample, creates a structural mismatch and triggers an error. The following code shows how to format the input correctly.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np
iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)
clf.fit(iris.data, iris.target)
# Reshape data for prediction - each sample needs to be 2D
new_sample = np.array([5.1, 3.5, 1.4, 0.2]).reshape(1, -1)
prediction = clf.predict(new_sample)
print(f"Prediction: {prediction}")
The fix is simple: you need to reshape your single sample into a 2D array. Using NumPy's reshape(1, -1) method wraps your single data point in another array, matching the structure the predict method expects. This tells the model you're passing one sample with an inferred number of features. Keep an eye out for this error whenever you're predicting a single observation after training on a full dataset. It's a common and easy-to-fix mismatch.
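An equivalent fix, sketched below, skips NumPy entirely: a nested list is already two-dimensional (one row, four features), so it satisfies the same requirement:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# [[...]] is one sample wrapped in an outer list, i.e. a 2D structure
prediction = clf.predict([[5.1, 3.5, 1.4, 0.2]])
print(iris.target_names[prediction[0]])
```

Both forms work; reshape(1, -1) is the more general tool when your sample already lives in a 1D NumPy array.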
Real-world applications
With the technical challenges solved, you can now apply these models to predict practical outcomes like customer churn and loan defaults.
Using decision trees for customer churn prediction
You can train a decision tree on customer data, like account tenure and monthly charges, to predict which users are likely to churn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Sample customer data (1=churned, 0=retained)
customers = pd.DataFrame({
    'tenure_months': [2, 7, 5, 1, 8, 2, 4],
    'monthly_charge': [65, 45, 85, 95, 35, 75, 50],
    'churn': [1, 0, 1, 1, 0, 1, 0]
})
model = DecisionTreeClassifier(max_depth=2)
model.fit(customers[['tenure_months', 'monthly_charge']], customers['churn'])
print(f"New customer churn prediction: {model.predict([[3, 60]])}")
This code builds a model to forecast customer churn using a sample dataset created with a pandas DataFrame. The model learns from features like tenure_months and monthly_charge to predict whether a customer will stay or leave.
- A DecisionTreeClassifier is initialized with max_depth=2 to keep the model simple and interpretable.
- The fit method trains the model on the historical data, teaching it the patterns associated with churn.
- Finally, predict is used to forecast the outcome for a new, unseen customer.
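In practice, a churn probability is often more actionable than a hard 0/1 label. This sketch reuses the same toy data (adding a random_state for reproducibility) and calls predict_proba instead; passing the new customer as a DataFrame also avoids the feature-name warning recent scikit-learn versions emit when a model trained on a DataFrame receives a bare list:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

customers = pd.DataFrame({
    'tenure_months': [2, 7, 5, 1, 8, 2, 4],
    'monthly_charge': [65, 45, 85, 95, 35, 75, 50],
    'churn': [1, 0, 1, 1, 0, 1, 0]
})
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(customers[['tenure_months', 'monthly_charge']], customers['churn'])

# predict_proba returns [P(retain), P(churn)] for each row, which lets
# you rank customers by risk instead of just flagging them
new_customer = pd.DataFrame({'tenure_months': [3], 'monthly_charge': [60]})
proba = model.predict_proba(new_customer)[0]
print(f"Churn probability: {proba[1]:.2f}")
```

Ranking customers by this probability lets a retention team focus outreach on the highest-risk accounts first.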
Predicting loan defaults with feature_importances_
In finance, you can use feature_importances_ to identify which factors in a loan application are most predictive of a default.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# Generate synthetic loan data (age, income, credit_score, default)
np.random.seed(42)
X = np.column_stack([
    np.random.normal(35, 10, 1000),        # age
    np.random.normal(60000, 20000, 1000),  # income
    np.random.normal(700, 100, 1000)       # credit score
])
y = ((X[:, 0] < 25) | (X[:, 1] < 40000) | (X[:, 2] < 600)).astype(int)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)
for name, importance in zip(['Age', 'Income', 'Credit Score'], model.feature_importances_):
    print(f"{name}: {importance:.2f}")
This code generates synthetic loan data using NumPy. It creates features like age, income, and credit score with np.random.normal. The target variable, default, is determined by a clear set of rules using logical operators like | (or).
- A DecisionTreeClassifier is trained on this data, with max_depth=3 to control its complexity.
- The model learns the relationships between the features and the default outcome you defined.
- Finally, the code inspects the trained model to see how it weighed each feature when making decisions.
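Because the default rule is known here, you can sanity-check what the tree learned. This sketch (an addition to the example above) prints the tree as text; the learned thresholds should land close to the rules used to label the data, roughly age 25, income 40000, and credit score 600:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same synthetic loan data as above
np.random.seed(42)
X = np.column_stack([
    np.random.normal(35, 10, 1000),        # age
    np.random.normal(60000, 20000, 1000),  # income
    np.random.normal(700, 100, 1000)       # credit score
])
y = ((X[:, 0] < 25) | (X[:, 1] < 40000) | (X[:, 2] < 600)).astype(int)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The printed split thresholds should approximate the labeling rules
print(export_text(model, feature_names=['age', 'income', 'credit_score']))
```

Recovering known thresholds like this is a useful habit: when a tree's splits disagree wildly with domain knowledge, it usually points to a data or labeling problem.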
Get started with Replit
Now, turn what you've learned into a real tool. Give Replit Agent a prompt like, “build a churn predictor that shows feature_importances_,” or “create a loan risk calculator that uses SMOTE to balance the data.”
It will write the code, test for errors, and deploy your application directly from your browser. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.