How to do exploratory data analysis in Python

Learn how to do exploratory data analysis in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

Published on: Mon, Apr 6, 2026
Updated on: Wed, Apr 8, 2026
The Replit Team

Exploratory Data Analysis (EDA) is a critical first step for any data project. Python's robust libraries offer an ideal environment to uncover initial insights and understand the structure of your dataset.

In this article, you'll learn essential EDA techniques and practical tips for Python. You'll also review real-world applications and get debugging advice to help you confidently master data exploration for your next project.

Basic data inspection with pandas

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print(df.head())
print(f"Shape: {df.shape}")

--OUTPUT--

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Shape: (150, 5)

The first step in any EDA process is getting a feel for your data's structure. We're using the popular pandas library to load the Iris dataset into a DataFrame, a powerful, table-like data structure in Python. The df.head() method provides an initial glimpse by displaying the first five rows. This quick check helps you verify that the data loaded correctly and lets you see the column names and sample values at a glance.

Next, df.shape reveals the dimensions of your dataset. The output, (150, 5), tells us we have 150 rows and 5 columns. Knowing the size of your data is crucial for understanding its scale and planning your next steps in the analysis.
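Alongside head() and shape, df.info() is a useful third check: it reports each column's name, non-null count, and data type in one call. Here's a minimal sketch using a small inline DataFrame as a stand-in for the iris data, so it runs without re-downloading the CSV:

```python
import pandas as pd

# Small inline sample standing in for the iris DataFrame
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "species": ["setosa", "setosa", "setosa"],
})

# .info() prints column names, non-null counts, and dtypes in one call
df.info()

# .dtypes exposes the same type information as a Series you can inspect programmatically
print(df.dtypes)
```

Checking dtypes early catches columns that were read as strings when you expected numbers, before they cause errors later in the analysis.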

Descriptive statistics and visualization

With the data's basic structure in hand, you can now calculate summary statistics and create visualizations to better understand its underlying patterns and relationships.

Calculating summary statistics with .describe()

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
summary = df.describe()
print(summary)

--OUTPUT--

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

The df.describe() function is a powerful shortcut for generating a statistical summary of your numerical data. It provides a quick overview of the central tendency and dispersion of each column, which is essential for understanding your dataset's characteristics at a glance.

This single command reveals several key metrics:

  • count: The number of non-null values.
  • mean: The average value.
  • std: The standard deviation, which measures how spread out the data is.
  • min/max: The minimum and maximum values in the column.
  • 25%, 50%, 75%: The quartiles, which show how the data is distributed. The 50% mark is the median.
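The quartiles that describe() reports can also be computed directly with .quantile(), which is handy when you only need specific percentiles. A minimal sketch on an inline Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# The 0.50 quantile is the median; 0.25 and 0.75 bound the middle half of the data
print(s.quantile([0.25, 0.50, 0.75]))
print(s.median())  # same value as the 0.50 quantile
```

You can pass any fraction between 0 and 1, so the same method also gives you, say, the 5th and 95th percentiles for a quick sense of a column's tails.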

Visualizing distributions with histograms

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='sepal_length', hue='species', kde=True)
plt.title('Distribution of Sepal Length by Species')
plt.show()

--OUTPUT--

[Histogram showing distribution of sepal length by species]

While numbers are helpful, a histogram gives you a far more intuitive feel for your data's distribution. We use the seaborn library here, which builds on matplotlib to create more attractive plots. The key is the sns.histplot function, which does the heavy lifting.

  • The hue='species' argument is especially powerful. It automatically separates the data, creating a distinct, color-coded histogram for each species.
  • Setting kde=True adds a smooth line over the bars, helping you visualize the underlying shape of each distribution.

Exploring relationships with scatter plots

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', hue='species')
plt.title('Sepal Length vs Sepal Width')
plt.show()

--OUTPUT--

[Scatter plot showing relationship between sepal length and width by species]

Scatter plots are perfect for visualizing the relationship between two numerical variables. The sns.scatterplot function plots one variable on the x-axis and another on the y-axis—in this case, we're comparing sepal_length and sepal_width.

  • The hue='species' parameter is the key here. It colors each point according to its species.
  • This lets you instantly see if there are clusters or patterns specific to each group. You can observe how the relationship between sepal dimensions differs for setosa versus virginica, for example.
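To put a number on what the scatter plot shows, you can compute the Pearson correlation between the two variables separately for each species. This sketch uses a small inline sample as a stand-in for the full iris DataFrame so it runs without the download:

```python
import pandas as pd

# Small inline sample standing in for the full iris DataFrame
df = pd.DataFrame({
    "species": ["setosa"] * 3 + ["virginica"] * 3,
    "sepal_length": [5.1, 4.9, 4.7, 6.3, 5.8, 7.1],
    "sepal_width":  [3.5, 3.0, 3.2, 3.3, 2.7, 3.0],
})

# Pearson correlation between the two columns, computed per species
for name, group in df.groupby("species"):
    r = group["sepal_length"].corr(group["sepal_width"])
    print(f"{name}: r = {r:.2f}")
```

On the real dataset, this per-group view explains why a relationship that looks weak overall can be strong within each species.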

Advanced analysis techniques

Building on basic plots, you can now uncover deeper patterns with correlation heatmaps, identify outliers using box plots, and simplify dimensions with PCA.

Analyzing correlations with heatmaps

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
corr = df.select_dtypes('number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

--OUTPUT--

[Heatmap showing correlation between numeric variables]

A heatmap is a great tool for visualizing a correlation matrix, which shows how numerical variables relate to each other. The code first creates this matrix with df.select_dtypes('number').corr(). This step ensures you're only comparing numbers.

  • The annot=True argument is key—it prints the correlation value directly onto the heatmap.
  • The cmap='coolwarm' parameter assigns warm colors to positive correlations and cool colors to negative ones, making patterns instantly recognizable.
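A heatmap is built for scanning; if you want the same information ranked numerically, you can unstack the correlation matrix and sort by absolute value. A minimal sketch on inline data with deliberately perfect correlations:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with a
    "c": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with a
})

corr = df.corr()

# Flatten the matrix to pairs, drop the trivial self-correlations,
# and sort by strength regardless of sign
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.abs().sort_values(ascending=False).head())
```

This is useful on wider datasets, where a heatmap with dozens of columns becomes hard to read but a sorted list of the top pairs stays legible.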

Identifying outliers with box plots

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='species', y='sepal_length')
plt.title('Boxplot of Sepal Length by Species')
plt.show()

--OUTPUT--

[Box plot showing distribution and potential outliers of sepal length by species]

Box plots offer a concise visual summary of a variable's distribution, making them ideal for spotting outliers. The sns.boxplot function is perfect for this, letting you quickly compare distributions across different categories. By setting x='species' and y='sepal_length', you generate a separate plot for each species, making comparisons intuitive.

  • The box: This represents the interquartile range (IQR), or the middle 50% of the data.
  • The whiskers: These lines typically extend to show the rest of the data's expected range.
  • Outliers: Any data points that fall outside the whiskers are flagged as potential outliers.
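The whisker rule most box plots apply is the 1.5×IQR convention, and you can reproduce it numerically to get the actual outlier rows rather than just dots on a chart. A sketch on an inline Series with one obvious outlier:

```python
import pandas as pd

s = pd.Series([4.4, 4.8, 5.0, 5.1, 5.4, 5.7, 5.8, 9.9])  # 9.9 is an obvious outlier

# 1.5 * IQR rule: anything beyond these fences is flagged as a potential outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags only 9.9
```

Having the flagged values as a Series lets you inspect or drop them programmatically, which a plot alone can't do.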

Performing dimension reduction with PCA

import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
X = df.drop('species', axis=1)
pca = PCA(n_components=2)
components = pca.fit_transform(X)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=components[:,0], y=components[:,1], hue=df['species'])
plt.title('PCA of Iris Dataset')
plt.show()

--OUTPUT--

[PCA scatter plot showing the first two principal components colored by species]

Principal Component Analysis (PCA) is a powerful technique for simplifying complex datasets. It reduces the number of variables by combining them into a smaller set of "principal components" that still capture most of the original data's variance. This is especially useful for visualizing high-dimensional data.

  • We use PCA(n_components=2) to tell scikit-learn we want to condense our four numerical features into just two.
  • The fit_transform method then performs the calculation, creating the new components.
  • The final scatter plot shows that the species are still clearly separated, which confirms PCA successfully preserved the key patterns in the data.

Move faster with Replit

Replit is an AI-powered development platform where you can skip setup and start coding Python instantly. It comes with all the necessary dependencies pre-installed, so you don't have to worry about managing environments or installing libraries.

Instead of piecing together techniques like creating heatmaps or running PCA, you can use Agent 4 to build complete applications. Describe the tool you want to create, and Agent will write the code, manage databases, and handle APIs to produce a working product. For example, you could build:

  • An automated EDA dashboard that ingests a CSV file and generates a full report with summary statistics, correlation heatmaps, and distribution plots.
  • A data-cleaning utility that identifies and visualizes outliers in your dataset using box plots for each numerical feature.
  • A dimensionality reduction tool that applies PCA to a high-dimensional dataset and creates a scatter plot of the first two principal components to reveal hidden clusters.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Navigating EDA in Python is generally smooth, but you might hit a few common snags that can temporarily halt your analysis.

  • Fixing KeyError when accessing non-existent columns: This error pops up when you try to select a column that doesn't exist, often due to a simple typo. Before you spend too much time debugging, print df.columns to see a list of all available column names and check your spelling.
  • Handling missing values with .dropna() before visualization: pandas represents missing data as NaN, and plotting functions handle it inconsistently: some raise errors, while others silently drop the affected points. A quick fix is to use the .dropna() method to remove those rows before plotting.
  • Resolving data type issues with .astype() in calculations: Sometimes, a column that looks numerical is actually stored as text, which will cause a TypeError if you try to use it in calculations. You can fix this by converting the column to a numeric type using the .astype() method.

Fixing KeyError when accessing non-existent columns

A KeyError is one of the most frequent issues you'll encounter. It's Python's way of telling you that the column label you're trying to use doesn't exist in your DataFrame, often because of a simple typo in the name.

For example, trying to filter by a misspelled column name like 'petal_size' instead of 'petal_length' will trigger this error. The code below demonstrates what happens when you make this common mistake.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# This will raise a KeyError
filtered_df = df[df['petal_size'] > 1.5]

The code fails because the column 'petal_size' doesn't exist in the DataFrame, which triggers the KeyError. This is a frequent issue often caused by a simple typo. The following code demonstrates the correct approach.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
column_name = 'petal_size'
if column_name in df.columns:
    filtered_df = df[df[column_name] > 1.5]
else:
    print(f"Column '{column_name}' not found. Available columns: {df.columns.tolist()}")

To avoid a KeyError, you can proactively check if a column exists before using it. The solution uses an if statement to see if the desired name is in df.columns. If the check fails, the else block prints a helpful message with a list of all valid column names. This defensive check is especially useful when dealing with unfamiliar datasets or when column names might change, saving you from unexpected crashes during analysis.
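Another frequent cause of KeyError is column names that import with stray whitespace or inconsistent casing. One common remedy is normalizing all column names right after loading; a sketch on an inline DataFrame with deliberately messy headers:

```python
import pandas as pd

# Headers with stray spaces and mixed case are a common KeyError source
df = pd.DataFrame({" Sepal Length ": [5.1], "SPECIES": ["setosa"]})

# Strip whitespace, lowercase, and replace internal spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())  # ['sepal_length', 'species']
```

Doing this once at load time means every later cell can rely on a single, predictable naming convention.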

Handling missing values with .dropna() before visualization

Missing values are a frequent hurdle in data analysis. When your DataFrame contains NaN values, plotting functions behave inconsistently: some raise errors, while others, like sns.scatterplot, silently drop the affected rows so your chart no longer reflects the full dataset. The code below demonstrates the problem.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df.loc[10:15, 'sepal_length'] = None # Introduce missing values
sns.scatterplot(data=df, x='sepal_length', y='sepal_width')
plt.show()

The code uses df.loc to inject None values into the sepal_length column. When sns.scatterplot encounters these incomplete rows, it silently skips them, so the resulting plot misrepresents the data. The following code demonstrates a simple way to handle this issue.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df.loc[10:15, 'sepal_length'] = None # Introduce missing values
clean_df = df.dropna(subset=['sepal_length'])
sns.scatterplot(data=clean_df, x='sepal_length', y='sepal_width')
plt.show()

The simplest fix is to remove rows with missing data before plotting. The df.dropna() method creates a new, clean DataFrame by dropping any rows where the specified columns are empty. By using subset=['sepal_length'], you target only the problematic column rather than discarding rows with gaps elsewhere. This ensures the visualization function receives only valid data points. It's a crucial step whenever you're preparing data for plotting, since plotting functions handle NaN values inconsistently at best.
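Dropping rows discards data, which matters on small datasets. An alternative is to fill the gaps instead, for example with the column median. A minimal sketch on an inline column standing in for the modified iris data:

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, None, 4.7, None, 5.0]})

# Replace missing values with the column median instead of dropping the rows
median = df["sepal_length"].median()
df["sepal_length"] = df["sepal_length"].fillna(median)
print(df["sepal_length"].tolist())  # [5.1, 5.0, 4.7, 5.0, 5.0]
```

The median is a common choice because, unlike the mean, it isn't pulled around by outliers; whether filling or dropping is appropriate depends on why the values are missing.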

Resolving data type issues with .astype() in calculations

A TypeError often occurs when a column that looks numeric is actually stored as text, preventing mathematical operations. Python can't perform calculations on strings, even if they look like numbers. The code below shows what happens when you try.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df['text_number'] = df['sepal_width'].astype(str)
# This fails because text_number is a string
result = df['sepal_length'] + df['text_number']

The code converts sepal_width to a string, creating the text_number column. The + operator can't add this string to the numeric sepal_length column, which triggers the error. The following code shows how to fix this.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df['text_number'] = df['sepal_width'].astype(str)
# Convert to proper type before calculation
result = df['sepal_length'] + df['text_number'].astype(float)

The fix is to explicitly convert the text column to a numeric type before any calculations. The .astype(float) method changes the column's data type on the fly, allowing the addition to proceed without a TypeError. It's a common issue when importing data, as columns that look numeric might be read as strings. You can always check your data types with df.dtypes if a calculation unexpectedly fails, ensuring your data is ready for analysis.
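One caveat: .astype(float) raises a ValueError if even one entry can't be parsed as a number. For messy real-world columns, pd.to_numeric with errors="coerce" converts what it can and marks the rest as NaN. A sketch on an inline Series containing an unparseable entry:

```python
import pandas as pd

s = pd.Series(["3.5", "2.8", "n/a", "4.1"])

# astype(float) would raise ValueError on "n/a";
# errors="coerce" turns unparseable entries into NaN instead
numeric = pd.to_numeric(s, errors="coerce")
print(numeric.tolist())      # [3.5, 2.8, nan, 4.1]
print(numeric.isna().sum())  # 1
```

You can then handle the resulting NaNs with .dropna() or .fillna(), reusing the missing-value techniques from the previous section.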

Real-world applications

Beyond debugging, these EDA techniques enable practical applications, from using groupby() for business insights to building a simple RandomForestClassifier model.

Aggregating data with groupby() for business insights

The groupby() function is a workhorse for business analysis, letting you segment your data into meaningful groups and calculate summary statistics for each one.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
sales_analysis = df.groupby('species').agg({
    'sepal_length': ['mean', 'count'],
    'petal_length': ['mean', 'max']
})
print(sales_analysis)

This code uses groupby('species') to split the DataFrame into separate groups for each flower species. The .agg() method then applies a dictionary of custom calculations to each group:

  • For sepal_length, it calculates both the average (mean) and total number (count).
  • For petal_length, it finds the average and the largest value (max).

The result is a new, compact DataFrame that summarizes these key metrics, making it easy to compare the different species at a glance.
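Because .agg() was given multiple functions per column, the result carries a two-level column index, which can be awkward to select from. A common follow-up is flattening those labels into single names; a sketch on a small inline DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3],
})

summary = df.groupby("species").agg({"sepal_length": ["mean", "count"]})

# .agg with multiple functions produces a MultiIndex on the columns;
# join the two levels into single names like "sepal_length_mean"
summary.columns = ["_".join(col) for col in summary.columns]
print(summary)
```

With flat names, downstream code can use plain lookups like summary["sepal_length_mean"] instead of tuple-based MultiIndex selection.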

Building a simple RandomForestClassifier model with scikit-learn

You can take your analysis a step further by building a simple predictive model, such as a RandomForestClassifier, to automatically classify new data points based on the features you've explored.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
X = df.drop('species', axis=1)
y = df['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Model accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")

This code builds a model to predict iris species based on its features. It first separates the dataset into features (X) and the target variable to predict (y). The train_test_split function then divides the data, reserving a portion for testing the model on information it hasn't seen before.

  • A RandomForestClassifier is created and trained on the training data using the .fit() method.
  • Finally, accuracy_score calculates the model's performance by comparing its predictions on the test data against the actual species labels.
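A trained random forest also ties back to EDA through its feature_importances_ attribute, which scores how much each feature contributed to the model's decisions. This sketch uses small synthetic data instead of the iris download: the first feature fully determines the label and the second is pure noise, so the importances should clearly favor the first:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Two features: the first determines the label, the second is pure noise
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Importances sum to 1; the informative feature should dominate
print(model.feature_importances_)
```

Comparing importances against what your histograms and scatter plots suggested is a quick sanity check that the model learned the patterns you explored.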

Get started with Replit

Turn your new skills into a real tool with Replit Agent. Describe what you want: “Build a dashboard that generates a correlation heatmap from a CSV” or “Create a utility to find outliers with box plots.”

Replit Agent will write the code, test for errors, and deploy your application for you. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
