How to get summary statistics in Python

Learn how to get summary statistics in Python. Explore different methods, tips, real-world applications, and common error debugging.

Published on: Tue, Apr 21, 2026
Updated on: Wed, Apr 22, 2026
The Replit Team

Summary statistics are crucial for data analysis in Python. They offer a quick look at your data's main features. With powerful libraries, you can calculate these metrics with just a few lines of code.

In this article, we'll cover key techniques to generate summary statistics. You'll find practical tips, see real-world applications, and get advice to debug common issues to help you master data summarization.

Using built-in Python functions

numbers = [4, 2, 7, 1, 9, 5]
mean = sum(numbers) / len(numbers)
minimum = min(numbers)
maximum = max(numbers)
print(f"Mean: {mean}, Min: {minimum}, Max: {maximum}")

--OUTPUT--
Mean: 4.666666666666667, Min: 1, Max: 9

Python's built-in functions offer a direct path for basic statistical calculations, letting you analyze data without importing external libraries. The code calculates the mean by combining sum() and len(), a common pattern for a quick average. This method is great for an initial look at your dataset, as you can instantly find key metrics like:

  • The data's range using min() and max().
  • Its central tendency with the calculated mean.

It's a simple yet effective first step before diving into more complex analyses with specialized libraries.
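
If you need a few more basics without any third-party installs, Python's built-in statistics module (part of the standard library) covers them; a quick sketch on the same list:

```python
import statistics

numbers = [4, 2, 7, 1, 9, 5]
print(f"Mean: {statistics.mean(numbers)}")      # same result as sum() / len()
print(f"Median: {statistics.median(numbers)}")  # middle value: 4.5
print(f"Sample std dev: {statistics.stdev(numbers)}")
```

Note that statistics.stdev() computes the sample standard deviation (dividing by n - 1); use statistics.pstdev() for the population version.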

Basic summary statistics with libraries

For more robust analysis, Python’s core data science libraries—numpy, pandas, and scipy—offer a major step up from the built-in functions.

Using numpy for basic statistics

import numpy as np
numbers = np.array([4, 2, 7, 1, 9, 5])
print(f"Mean: {np.mean(numbers)}")
print(f"Median: {np.median(numbers)}")
print(f"Std Dev: {np.std(numbers)}")

--OUTPUT--
Mean: 4.666666666666667
Median: 4.5
Std Dev: 2.748737083745107

NumPy is a powerhouse for numerical computing in Python. It introduces the array object, a data structure optimized for fast mathematical operations—making calculations much more efficient than with standard Python lists. The code leverages this by first converting the list to a NumPy array.

From there, you can access a suite of statistical functions:

  • Mean: Calculated with np.mean().
  • Median: The middle value, found using np.median().
  • Standard Deviation: A measure of data spread, computed with np.std().
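
One detail to watch: np.std() returns the population standard deviation by default (dividing by n), while pandas divides by n - 1. NumPy's ddof parameter switches between the two conventions:

```python
import numpy as np

numbers = np.array([4, 2, 7, 1, 9, 5])
print(np.std(numbers))          # population std dev (divides by n)
print(np.std(numbers, ddof=1))  # sample std dev (divides by n - 1), matches pandas
```

This is why NumPy and pandas can report different standard deviations for the same data.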

Using pandas for descriptive statistics

import pandas as pd
data = pd.Series([4, 2, 7, 1, 9, 5])
stats = data.describe()
print(stats)

--OUTPUT--
count    6.000000
mean     4.666667
std      3.011091
min      1.000000
25%      2.500000
50%      4.500000
75%      6.500000
max      9.000000
dtype: float64

Pandas is a go-to for data manipulation, and its describe() method is a perfect example of its power. You start by creating a Series, a one-dimensional array that's more feature-rich than a standard list. Calling describe() on this Series instantly generates a comprehensive statistical summary.

  • Count, mean, and std: The basic stats you'd expect.
  • Min & Max: The range of your data.
  • Quartiles (25%, 50%, 75%): Values that divide your data into four equal parts, giving you a clear picture of its distribution.
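
describe() also takes a percentiles argument when the default quartiles aren't the cut points you care about; a minimal sketch using the same Series:

```python
import pandas as pd

data = pd.Series([4, 2, 7, 1, 9, 5])
# Request the 10th and 90th percentiles (the median is always included)
print(data.describe(percentiles=[0.1, 0.9]))
```

The output keeps the usual count, mean, std, min, and max rows and swaps in the percentiles you asked for.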

Using scipy for additional statistical measures

from scipy import stats
data = [4, 2, 7, 1, 9, 5]
print(f"Skewness: {stats.skew(data):.4f}")
print(f"Kurtosis: {stats.kurtosis(data):.4f}")
print(f"Mode: {stats.mode(data).mode}")

--OUTPUT--
Skewness: 0.2051
Kurtosis: -1.2080
Mode: 1

When you need to dig deeper into your data's shape, SciPy's stats module is the tool for the job. It offers specialized functions that go beyond central tendency and spread, giving you a more nuanced view of the distribution.

  • Skewness: Calculated with stats.skew(), this tells you if your data distribution is lopsided.
  • Kurtosis: Found with stats.kurtosis(), this measures the "tailedness" of the distribution, which can help identify outliers.
  • Mode: The most frequent value, returned by stats.mode(). In SciPy 1.11 and later, the result's .mode attribute is a plain scalar; older releases returned an array, which required .mode[0] to extract the value.

Advanced techniques and optimizations

Building on these foundational library functions, you can tackle more complex analyses by creating custom statistics, summarizing grouped data, and optimizing for performance.

Creating custom statistical functions

def calculate_quartiles(data):
    data = sorted(data)
    n = len(data)
    q1 = data[n // 4]
    # Average the two middle values for the median when the length is even
    if n % 2 == 0:
        q2 = (data[n // 2 - 1] + data[n // 2]) / 2
    else:
        q2 = data[n // 2]
    q3 = data[3 * n // 4]
    return q1, q2, q3

data = [4, 2, 7, 1, 9, 5]
q1, q2, q3 = calculate_quartiles(data)
print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")

--OUTPUT--
Q1: 2, Q2: 4.5, Q3: 7

Sometimes, library functions don't offer the exact calculation you need. That's when writing your own custom statistical functions comes in handy. The calculate_quartiles function is a great example—it gives you direct control over how quartiles are determined, which can be useful for specific analysis requirements.

  • The function first sorts the input data.
  • It then computes each quartile's position from the data's length using integer division (//), averaging the two middle values for the median when the length is even.
  • Finally, it returns the values at those positions in the sorted list.
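
Before relying on a custom implementation, it's worth knowing that quartile conventions differ: np.percentile interpolates between neighboring values instead of picking a single element, so the two approaches can disagree on the same data:

```python
import numpy as np

data = [4, 2, 7, 1, 9, 5]
# NumPy's default linear interpolation gives Q1 = 2.5 and Q3 = 6.5,
# while the index-picking approach above yields 2 and 7
print(np.percentile(data, [25, 50, 75]))
```

Neither answer is wrong; they just follow different definitions, which is exactly why a custom function can be useful when you need one specific convention.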

Summarizing grouped data with pandas

import pandas as pd
data = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B', 'A'],
    'value': [4, 2, 7, 1, 9, 5]
})
grouped_stats = data.groupby('group')['value'].agg(['mean', 'min', 'max'])
print(grouped_stats)

--OUTPUT--
           mean  min  max
group
A      3.666667    2    5
B      5.666667    1    9

Pandas excels at analyzing data in segments. The groupby() method is your tool for this, letting you split a DataFrame based on a column's values—in this case, groups 'A' and 'B'. This is incredibly useful for comparing statistics across different categories within your dataset.

  • Once grouped, you can apply multiple calculations at once using the agg() method.
  • You simply pass a list of functions, like 'mean', 'min', and 'max', to run on the specified column for each group.
  • The result is a new DataFrame that neatly summarizes the statistics, making comparisons straightforward.
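
When you want descriptive column names in the result, pandas also supports named aggregation, where each output column is declared as a (column, function) pair:

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B', 'A'],
    'value': [4, 2, 7, 1, 9, 5]
})
# Each keyword argument becomes a column in the output
summary = data.groupby('group').agg(
    avg_value=('value', 'mean'),
    value_range=('value', lambda v: v.max() - v.min()),
)
print(summary)
```

This keeps the summary self-documenting, which matters once results feed into reports or dashboards.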

Using parallel processing for large datasets

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_stats(chunk):
    # Convert NumPy scalars to plain Python numbers for clean printing
    return float(np.mean(chunk)), int(np.min(chunk)), int(np.max(chunk))

if __name__ == "__main__":  # guard required for process pools on Windows and macOS
    data = np.random.randint(1, 100, 1000000)
    chunks = np.array_split(data, 4)

    with ProcessPoolExecutor() as executor:
        results = list(executor.map(chunk_stats, chunks))

    print(f"Results from the first chunk: {results[0]}")

--OUTPUT--
Results from the first chunk: (50.12345, 1, 99)

When you're working with massive datasets, calculations can become a bottleneck. Parallel processing offers a solution by splitting the work across multiple CPU cores. This code uses concurrent.futures.ProcessPoolExecutor to speed things up by breaking the problem down into manageable pieces. Because the input is randomly generated, the exact numbers in the output will vary from run to run.

  • The large dataset is first divided into smaller chunks with np.array_split.
  • The executor then uses map to apply the chunk_stats function to each chunk simultaneously.
  • This approach significantly cuts down processing time for large-scale analysis by running calculations in parallel.

Move faster with Replit

Replit is an AI-powered development platform where all Python dependencies come pre-installed, so you can skip setup and start coding instantly. There's no need to configure environments or manage packages.

While mastering individual techniques is a great start, Agent 4 helps you go from piecing together code to building complete applications. Instead of just writing code, the Agent can handle databases, APIs, and deployment directly from your description. You can use it to build tools like:

  • A performance dashboard that ingests raw data and automatically calculates key metrics like mean(), median(), and std() for different product groups.
  • A data validation utility that flags datasets with high skewness or kurtosis, helping you spot potential outliers before analysis.
  • A report generator that takes a series of transactions and outputs a clean summary with total count, min/max values, and quartile distributions using describe().

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

When calculating summary statistics, you'll likely encounter a few common issues, from handling empty inputs to dealing with messy data.

Handling empty lists with min() and max()

A frequent error is applying min() or max() to an empty list, which triggers a ValueError because there are no values to compare. The fix is to always check if your list contains data before you attempt to find its minimum or maximum values, preventing the program from crashing.

Fixing NaN values in statistical calculations

Statistical calculations can be derailed by NaN (Not a Number) values, which often represent missing or undefined data. Most functions will return NaN if any input is NaN, making your summary useless. To fix this, you need to clean your data first:

  • Filter them out using methods like pandas' dropna().
  • Replace them with a meaningful value, such as the column's mean or median, using a function like fillna().
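
A minimal sketch of both cleanup strategies on a small Series:

```python
import numpy as np
import pandas as pd

data = pd.Series([4, 2, np.nan, 1, 9, 5])
print(data.dropna().mean())               # mean of the five present values: 4.2
print(data.fillna(data.median()).mean())  # NaN replaced by the median (4.0) first
```

Dropping rows is simplest when you have data to spare; filling keeps the dataset's size intact, which matters for aligned or time-indexed data.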

Correcting outlier detection with the IQR method

Relying solely on min() and max() can be misleading if your dataset has extreme outliers. A more robust approach is using the Interquartile Range (IQR), which is the difference between the 75th and 25th percentiles. This range represents the middle 50% of your data, so it's less affected by unusually high or low values. You can define outliers as any data points that fall significantly outside this range, giving you a more accurate picture of your data's typical spread.

Handling empty lists with min() and max()

A common pitfall is applying functions like min() or max() to an empty list. This action triggers a ValueError because there's nothing to compare. The code below demonstrates what happens when you run these functions without first checking the list.

numbers = []
mean = sum(numbers) / len(numbers)
minimum = min(numbers)
maximum = max(numbers)
print(f"Mean: {mean}, Min: {minimum}, Max: {maximum}")

The code immediately fails when calculating the mean. Since the list is empty, len(numbers) is zero, and dividing by zero triggers a ZeroDivisionError. The corrected snippet below shows how to prevent this crash.

numbers = []
if numbers:
    mean = sum(numbers) / len(numbers)
    minimum = min(numbers)
    maximum = max(numbers)
    print(f"Mean: {mean}, Min: {minimum}, Max: {maximum}")
else:
    print("Cannot calculate statistics on an empty list")

The fix is a simple conditional check. The expression if numbers: only runs the code inside it if the list isn't empty, as empty lists evaluate to False in Python. This guard clause prevents the ZeroDivisionError from the mean calculation and the ValueError from min() and max(). It's a crucial step anytime you're working with data that might come from a filter or an external source, which could result in an empty list.
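
As an alternative to the guard clause, min() and max() accept a default keyword argument (available since Python 3.4) that's returned when the iterable is empty:

```python
numbers = []
print(min(numbers, default=None))  # prints None instead of raising ValueError
print(max(numbers, default=0))     # prints 0
```

This covers min() and max(), but the mean still needs an explicit length check, since sum(numbers) / len(numbers) divides by zero either way.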

Fixing NaN values in statistical calculations

Missing data, often represented as NaN (Not a Number), can silently corrupt your analysis. When libraries like NumPy encounter a NaN value, most statistical functions return NaN as well, rendering your results useless. The following code demonstrates this problem in action.

import numpy as np
data = np.array([4, 2, np.nan, 1, 9, 5])
mean = np.mean(data)
std = np.std(data)
print(f"Mean: {mean}, Std Dev: {std}")

The np.nan value in the array propagates through the calculations, so both np.mean() and np.std() return NaN. The following snippet shows how to get around this issue.

import numpy as np
data = np.array([4, 2, np.nan, 1, 9, 5])
mean = np.nanmean(data)
std = np.nanstd(data)
print(f"Mean: {mean}, Std Dev: {std}")

NumPy offers a straightforward fix. Instead of manually filtering your data, you can use special functions like np.nanmean() and np.nanstd(). These versions perform the same calculations but automatically ignore any NaN values they find.

This approach is crucial when your data comes from files or APIs, where missing values are common. It ensures your statistical summaries aren't derailed by a few empty cells and remain accurate.

Correcting outlier detection with the IQR method

The Interquartile Range (IQR) is a solid tool for spotting outliers, but its effectiveness hinges on how you define the boundaries. Setting the threshold too wide is a common mistake that can cause you to miss unusual data points entirely. The code below shows what happens when the multiplier for the iqr is too large, making the detection ineffective.

import numpy as np
data = [12.1, 12.3, 12.0, 12.5, 13.1, 12.2, 12.4]
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
print(f"Outliers: {outliers}")

The 3 * iqr multiplier makes the outlier detection range too forgiving. Here the IQR is 0.3, so the upper boundary sits at roughly 13.35, the unusual value 13.1 slips under it, and the outliers list comes back empty. The corrected code below tightens this range.

import numpy as np
data = [12.1, 12.3, 12.0, 12.5, 13.1, 12.2, 12.4]
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(f"Outliers: {outliers}")

The corrected code uses the standard multiplier of 1.5 for the iqr. This tightens the detection range, making it more sensitive to extreme values: the upper boundary drops to roughly 12.9, so 13.1 is now flagged. By using 1.5 * iqr, the code correctly identifies data points that fall unusually far from the central 50% of the data. It's the common convention for outlier detection.

Real-world applications

Beyond troubleshooting data, these statistical concepts are key to solving real-world problems like detecting outliers and assessing investment risk.

Detecting outliers with z-score method

The z-score method identifies outliers by calculating how many standard deviations each data point is from the dataset's mean.

import numpy as np

measurements = [12.1, 12.3, 12.0, 12.5, 19.8, 12.2, 12.4]
mean = np.mean(measurements)
std = np.std(measurements)
z_scores = [(x - mean) / std for x in measurements]
outliers = [measurements[i] for i, z in enumerate(z_scores) if abs(z) > 2]
print(f"Z-scores: {[round(z, 2) for z in z_scores]}")
print(f"Outliers: {outliers}")

This code uses NumPy to pinpoint outliers in a dataset. It first calculates the mean and standard deviation (std) for the list of measurements. From there, it takes a two-step approach using list comprehensions:

  • It computes the z-score for each data point by applying the formula (x - mean) / std.
  • It then filters the original list, keeping only the numbers where the absolute z-score is greater than 2—a common threshold for identifying outliers.

This process effectively flags values like 19.8 that are statistically unusual compared to the rest of the data.
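
The same computation ships prepackaged as scipy.stats.zscore, which applies the formula across a whole array at once:

```python
import numpy as np
from scipy import stats

measurements = np.array([12.1, 12.3, 12.0, 12.5, 19.8, 12.2, 12.4])
z_scores = stats.zscore(measurements)  # (x - mean) / std, vectorized
outliers = measurements[np.abs(z_scores) > 2]
print(outliers)                        # [19.8]
```

Its default ddof=0 matches np.std(), so the results agree with the manual version above.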

Assessing investment risk with covariance

Covariance measures how the returns of two investments move in relation to each other, providing a key metric for assessing portfolio risk.

import numpy as np

# Monthly returns for two investments (%)
stock_returns = [2.1, -0.8, 1.4, -1.2]
bond_returns = [0.4, 0.5, 0.3, 0.6]

stock_risk = np.std(stock_returns)
covariance = np.cov(stock_returns, bond_returns)[0, 1]
print(f"Stock volatility: {stock_risk:.2f}%")
print(f"Covariance with bonds: {covariance:.4f}")

This snippet uses NumPy for a quick financial analysis on two lists: stock_returns and bond_returns. It calculates two key metrics to understand their relationship and individual behavior.

  • The code first finds the stock's volatility by calculating its standard deviation with np.std().
  • It then uses np.cov() to generate a covariance matrix. The index [0, 1] is important here—it pulls the specific value that shows how the two investments move together, which is different from their individual variances.
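
Because covariance depends on the units of the inputs, its raw magnitude is hard to interpret on its own. Normalizing by both standard deviations gives the correlation coefficient, which np.corrcoef computes directly:

```python
import numpy as np

stock_returns = [2.1, -0.8, 1.4, -1.2]
bond_returns = [0.4, 0.5, 0.3, 0.6]
# Correlation is covariance scaled into the range [-1, 1]
correlation = np.corrcoef(stock_returns, bond_returns)[0, 1]
print(f"Correlation: {correlation:.4f}")
```

For these four months the correlation comes out around -0.85, suggesting the two assets tended to move in opposite directions.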

Get started with Replit

Turn these techniques into a real tool. Tell Replit Agent: "Build a dashboard that uses describe() on uploaded data" or "Create a utility that flags outliers using z-scores."

The Agent writes the code, tests for errors, and deploys your app from your description. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
