How to create synthetic data in Python

Learn how to create synthetic data in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

Published on: Tue, Mar 17, 2026
Updated on: Tue, Mar 24, 2026
The Replit Team

Synthetic data is a powerful tool for machine learning and software testing. Python provides robust libraries that let you generate realistic, artificial datasets when real-world information is not an option.

In this article, you’ll explore key techniques and practical tips for data generation. You'll also find real-world applications and debugging advice to help you refine your approach for any project.

Creating simple random data with NumPy

import numpy as np

# Generate 5 random integers between 0 and 100
random_integers = np.random.randint(0, 100, size=5)
print(random_integers)

Output:
[42 71 13 97 56]

NumPy is your go-to for numerical tasks in Python, and it’s perfect for generating simple random data. The function np.random.randint() creates an array of integers within a range you define. It’s a straightforward way to get a baseline dataset for testing.

Here, it generates five random integers from 0 up to, but not including, 100. The size parameter lets you easily control how much data you create. The resulting NumPy array is a fundamental building block for many machine learning and testing scenarios.

Basic synthetic data generation techniques

While random integers are a good start, you can create more realistic datasets by modeling distributions, relationships, and time-dependent patterns.

Using normal distribution for realistic values

import numpy as np

# Generate 1000 samples from a normal distribution
# with mean=70 and standard deviation=5
heights = np.random.normal(70, 5, 1000)
print(f"Mean: {heights.mean():.2f}, Std: {heights.std():.2f}")
print(f"First 5 heights: {heights[:5]}")

Output:
Mean: 70.11, Std: 4.95
First 5 heights: [69.17832308 77.11650218 70.79055741 68.43590158 71.66359423]

Real-world data often follows a normal distribution, where most values cluster around an average. NumPy's np.random.normal() function is perfect for simulating this. It creates more realistic datasets for things like human heights or test scores.

  • The mean is set to 70, representing the central point of the data.
  • A standard deviation of 5 determines how spread out the values are.
  • The code generates 1000 individual data points.
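One caveat worth noting: a normal distribution is unbounded, so a handful of samples can land at implausible extremes (a negative height, for instance). A common fix is to clip the samples to a realistic band; the sketch below uses np.clip with illustrative bounds of 55 and 85 inches.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

# Simulate adult heights (inches); the normal tail can produce
# implausible extremes, so clip to a plausible band
heights = np.random.normal(70, 5, 1000)
clipped = np.clip(heights, 55, 85)

print(f"Raw range: {heights.min():.1f} - {heights.max():.1f}")
print(f"Clipped range: {clipped.min():.1f} - {clipped.max():.1f}")
```

Clipping slightly distorts the distribution at the edges; if that matters for your use case, resampling out-of-range values is an alternative.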

Creating correlated features

import numpy as np

# Generate two correlated variables
n = 1000
x = np.random.normal(0, 1, n)
# y is built from x plus independent noise; with unit-variance x and noise,
# corr(x, y) = 0.8 / sqrt(0.8**2 + 0.6**2) = 0.8
y = 0.8 * x + 0.6 * np.random.normal(0, 1, n)
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation between x and y: {correlation:.4f}")

Output:
Correlation between x and y: 0.8021

In real-world data, features are often related. This code creates two correlated variables, where the value of y depends on x. The formula y = 0.8 * x + ... establishes a strong positive relationship, while the second term adds a bit of random noise to keep it from being a perfect one-to-one match.

  • This technique mimics natural relationships, like height and weight, where one influences the other.
  • Finally, np.corrcoef() calculates the correlation, confirming how closely the two variables move together.
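The noise-mixing trick above handles two variables; to generate several correlated features at once, np.random.multivariate_normal lets you specify a full covariance matrix directly. A minimal sketch (the 0.8 covariance value is an illustrative choice):

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

# Unit variances on the diagonal, 0.8 covariance off-diagonal,
# so the target correlation between the two columns is 0.8
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
samples = np.random.multivariate_normal(mean=[0, 0], cov=cov, size=1000)
x, y = samples[:, 0], samples[:, 1]
print(f"Correlation: {np.corrcoef(x, y)[0, 1]:.3f}")
```

This scales naturally: a 5x5 covariance matrix produces five mutually correlated features in one call.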

Generating synthetic time series data

import numpy as np
import pandas as pd

# Create a time series with trend and seasonality
dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(0, 5, 100)
seasonality = 2 * np.sin(np.linspace(0, 12 * np.pi, 100))
noise = np.random.normal(0, 0.5, 100)
ts_data = trend + seasonality + noise
time_series = pd.Series(ts_data, index=dates)
print(time_series.head())

Output:
2023-01-01    0.068244
2023-01-02    0.288208
2023-01-03    0.857800
2023-01-04    1.402486
2023-01-05    1.722618
Freq: D, dtype: float64

This code generates time series data by combining three key elements—a common way to simulate data like stock prices or daily sales figures. The final output is a Pandas Series, which pairs each data point with a specific date from a date_range.

  • Trend: The np.linspace() function creates a steady, linear progression over time.
  • Seasonality: A sine wave from np.sin() adds a repeating, cyclical pattern.
  • Noise: Random values from np.random.normal() make the data less predictable and more realistic.
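As a sanity check on data like this, you can smooth the series with a rolling mean whose window spans one seasonal cycle; averaging over a full cycle cancels the sine wave and most of the noise, leaving roughly the linear trend. A sketch (the window of 17 days matches the roughly 17-day cycle the code above creates, since 100 days contain six full cycles):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the run is repeatable

# Rebuild the same trend + seasonality + noise series
dates = pd.date_range('2023-01-01', periods=100, freq='D')
trend = np.linspace(0, 5, 100)
seasonality = 2 * np.sin(np.linspace(0, 12 * np.pi, 100))
noise = np.random.normal(0, 0.5, 100)
series = pd.Series(trend + seasonality + noise, index=dates)

# A centered rolling window the length of one cycle averages out
# the seasonal swings, leaving the underlying trend
smoothed = series.rolling(window=17, center=True).mean()
print(smoothed.dropna().head())
```

If the smoothed curve climbs steadily from near 0 toward 5, your generated components behave as intended.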

Advanced synthetic data methods

While NumPy provides the building blocks, you can create more sophisticated datasets using scikit-learn’s generators, pandas DataFrames, and other specialized libraries.

Using scikit-learn's built-in dataset generators

from sklearn.datasets import make_classification

# Generate a synthetic classification dataset
X, y = make_classification(
   n_samples=1000, n_features=4, n_informative=2,
   n_redundant=0, random_state=42
)
print(f"Data shape: {X.shape}, Target shape: {y.shape}")
print(f"First sample: {X[0]}, Class: {y[0]}")

Output:
Data shape: (1000, 4), Target shape: (1000,)
First sample: [ 1.3218364  -0.57534663  0.61667161 -1.99496865], Class: 1

Scikit-learn’s make_classification function is a powerful shortcut for creating datasets to test classification models. It gives you fine-grained control over the data’s structure, letting you simulate realistic scenarios where some features are more useful than others.

  • The n_informative parameter is key—it specifies how many features actually influence the classification outcome.
  • n_samples and n_features define the dataset's dimensions.
  • Using random_state ensures you get the same "random" data every time, which is crucial for reproducible experiments.
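scikit-learn ships sibling generators for other task types as well. For example, make_blobs produces labeled Gaussian clusters, which is handy for testing clustering code. A minimal sketch (the sample count and cluster settings are illustrative):

```python
from sklearn.datasets import make_blobs

# Generate 3 well-separated 2D clusters for testing clustering code
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=1.0, random_state=42)
print(f"Data shape: {X.shape}, labels: {sorted(set(y))}")
```

make_regression (continuous targets) and make_moons (non-linearly separable classes) follow the same pattern.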

Creating structured datasets with pandas

import pandas as pd
import numpy as np

# Generate a synthetic customer dataset
n = 5
df = pd.DataFrame({
   'customer_id': range(1001, 1001 + n),
   'age': np.random.randint(18, 80, n),
   'income': np.random.normal(50000, 15000, n),
   'is_active': np.random.choice([True, False], n, p=[0.8, 0.2])
})
print(df)

Output:
   customer_id  age      income  is_active
0         1001   61  55127.8635       True
1         1002   32  63461.0136       True
2         1003   27  44667.7749       True
3         1004   32  50292.0200      False
4         1005   49  36598.9539       True

Pandas DataFrames are perfect for creating structured, table-like datasets that mix different data types. This code builds a customer table by assigning generated data to named columns, creating a realistic, multi-faceted dataset for testing.

  • It combines familiar NumPy functions for columns like age and income.
  • The is_active column uses np.random.choice with a p argument to create a weighted distribution—in this case, making 80% of the customers active.
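Real customer tables are rarely complete, so it can also be useful to inject missing values on purpose when testing data-cleaning code. This sketch (the 10% rate and the choice of the income column are arbitrary) blanks out a random subset of rows with a boolean mask:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the run is repeatable

n = 100
df = pd.DataFrame({
    'customer_id': range(1001, 1001 + n),
    'age': np.random.randint(18, 80, n),
    'income': np.random.normal(50000, 15000, n),
})

# Knock out roughly 10% of income values at random to mimic messy data
mask = np.random.rand(n) < 0.1
df.loc[mask, 'income'] = np.nan
print(f"Missing incomes: {df['income'].isna().sum()} of {n}")
```

This gives you a controlled way to verify that downstream code handles NaNs before it meets real, messy data.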

Using specialized synthetic data libraries

# Using SDV (Synthetic Data Vault) for tabular data
# (sdv.tabular is the SDV < 1.0 API; newer releases moved this class
# to sdv.single_table.GaussianCopulaSynthesizer)
import pandas as pd
from sdv.tabular import GaussianCopula

# Create and fit model on sample data
model = GaussianCopula()
data = pd.DataFrame({
   'age': [25, 32, 45, 63, 58],
   'income': [50000, 75000, 63000, 82000, 45000],
   'education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
})
model.fit(data)

# Generate synthetic samples
synthetic_data = model.sample(3)
print(synthetic_data)

Output:
   age      income education
0   51  64173.4732   Bachelor
1   38  72865.3462     Master
2   61  53296.8744   Bachelor

For complex datasets, specialized libraries like SDV (Synthetic Data Vault) are a game-changer. Instead of just generating random values, they learn the statistical patterns from a sample dataset. This allows you to create new, artificial data that preserves the original's structure and correlations, including relationships between mixed data types.

  • The code first fits a GaussianCopula model to a small sample DataFrame using the model.fit() method. This step learns the relationships between columns like age, income, and education.
  • Once trained, you can call model.sample() to generate new, entirely synthetic rows that follow the learned patterns.

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

The synthetic data techniques from this article can be turned into production tools. For example, Replit Agent can build:

  • A sales forecasting dashboard that generates and visualizes time series data with customizable trends and seasonality.
  • A user simulation tool that produces realistic customer datasets using weighted probabilities from np.random.choice.
  • A machine learning testbed that generates classification datasets with make_classification to benchmark model performance.

Start building your own tools by describing them to Replit Agent. It writes the code, tests it, and fixes issues automatically, all in your browser.

Common errors and challenges

Generating synthetic data can be tricky, but most errors are simple to fix once you know what to look for.

  • Fixing the seed for reproducible results: When you need your "random" data to be the same every time, you'll want to set a seed with np.random.seed(). Without it, your results will change on each execution, making it impossible to reproduce bugs or compare model performance accurately. Using the same seed number guarantees you'll get the same sequence of random values.
  • Avoiding incorrect parameter order: A frequent slip-up with np.random.normal() is mixing up the mean (loc) and standard deviation (scale). Accidentally swapping them generates data with completely different statistical properties. To prevent this, use named arguments like np.random.normal(loc=70, scale=5) to make your code clearer.
  • Handling shape issues: Shape mismatches are a common headache when using np.random.randint(). If you want an array but get a single integer, you likely forgot the size parameter. Always double-check that the output array's shape matches what the rest of your code expects to avoid hard-to-trace errors.

Fixing the seed for reproducible results with np.random.seed()

For reliable testing and debugging, your "random" data must be consistent across runs. Without a fixed starting point, or "seed," you can't reproduce results, making it hard to verify fixes. The code below shows what happens without setting a seed.

import numpy as np

# Generate random numbers that change each run
random_data = np.random.rand(3)
print("First run:", random_data)
random_data = np.random.rand(3)
print("Second run:", random_data)

Each call to np.random.rand(3) produces a new, unpredictable array because the generator's starting point isn't fixed. This makes consistent testing impossible. The following code shows how to get the same results every time.

import numpy as np

# Set seed for reproducibility
np.random.seed(42)
random_data = np.random.rand(3)
print("First run:", random_data)
np.random.seed(42)
random_data = np.random.rand(3)
print("Second run:", random_data)

By calling np.random.seed(42) before generating data, you fix the starting point for NumPy's random number generator. This guarantees you get the same sequence of "random" numbers every time you run the code. It's crucial for:

  • Debugging code that relies on random inputs.
  • Creating reproducible machine learning experiments.
  • Ensuring others can replicate your results exactly.

The number itself doesn't matter—as long as you use the same one, your results will be consistent.
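Note that np.random.seed() controls NumPy's legacy global generator. NumPy's documentation now recommends the Generator API instead, where the seed lives in an explicit rng object rather than in global state. A sketch of the same reproducibility check in that style:

```python
import numpy as np

# A Generator carries its own state instead of mutating NumPy's
# global seed, so unrelated code can't disturb your sequence
rng = np.random.default_rng(42)
first = rng.random(3)

# Re-seeding a fresh Generator with the same value replays the sequence
rng = np.random.default_rng(42)
second = rng.random(3)

print("First run: ", first)
print("Second run:", second)
```

Passing an rng object into your own functions also makes their randomness explicit and testable.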

Avoiding incorrect parameter order in np.random.normal()

A common mistake with np.random.normal() is swapping the mean and standard deviation parameters. This simple error can silently corrupt your dataset, leading to skewed results. The code below shows what happens when the parameters are accidentally mixed up.

import numpy as np

# Incorrect parameter order (size, mean, std)
samples = np.random.normal(100, 70, 10)
print(f"Mean: {samples.mean():.2f}, Std: {samples.std():.2f}")

Since np.random.normal() reads arguments positionally, it sets the mean to 100 and the standard deviation to 70. This creates an extremely wide data distribution, which isn't the goal. Check the code below for a more robust approach.

import numpy as np

# Correct parameter order (mean, std, size)
samples = np.random.normal(70, 10, 100)
print(f"Mean: {samples.mean():.2f}, Std: {samples.std():.2f}")

By providing the arguments in the correct order—mean, standard deviation, and size—you get the intended result. The function np.random.normal(70, 10, 100) correctly generates 100 samples centered around a mean of 70 with a standard deviation of 10. Always double-check a function's documentation when you're unsure about positional arguments, as it's an easy mistake to make when generating statistical data. This ensures your dataset's properties are what you expect.

Handling shape issues with np.random.randint()

A frequent pitfall with np.random.randint() is a TypeError from incorrect syntax. This happens when you define a multi-dimensional array's shape using separate arguments instead of a single tuple for the size parameter. The code below demonstrates this common mistake.

import numpy as np

# Trying to create a 3x3 matrix but using wrong syntax
matrix = np.random.randint(0, 10, 3, 3)
print(matrix)

This code triggers a TypeError because np.random.randint() interprets its fourth positional argument as a dtype, and 3 is not a valid data type, so the extra 3 never becomes a dimension. To create a matrix, you need to pass the shape as a tuple. See the correct implementation below.

import numpy as np

# Correct way to specify shape for a 3x3 matrix
matrix = np.random.randint(0, 10, size=(3, 3))
print(matrix)

The correct approach is to pass the desired shape as a tuple to the size parameter. By using size=(3, 3), you're explicitly telling NumPy to create a 3x3 matrix. This avoids the TypeError that occurs when the function misinterprets the extra numbers as invalid arguments. It's a common mistake when you're building multi-dimensional arrays for tasks like image processing or setting up machine learning inputs.

Real-world applications

Beyond the code, these data generation methods solve tangible problems, from simulating financial markets to creating image datasets for machine learning.

Simulating financial market data with random walks

A random walk model is a powerful method for simulating the unpredictable movements of financial assets, treating each price change as a random step from the previous one.

import numpy as np

# Simulate stock price using random walk
initial_price = 100
days = 252  # Trading days in a year
daily_returns = np.random.normal(0.0005, 0.01, days)
price_series = initial_price * np.cumprod(1 + daily_returns)
print(f"Price journey: ${initial_price:.2f} → ${price_series[-1]:.2f}")
print(f"Range: ${price_series.min():.2f} - ${price_series.max():.2f}")

This code models a stock's price path over a year. It begins with an initial_price and then simulates daily price changes based on random fluctuations.

  • The np.random.normal() function generates an array of daily returns. These returns are centered around a slight positive average to simulate growth, with a standard deviation to represent market volatility.
  • Next, np.cumprod() calculates the cumulative product of these returns. This step chains the daily changes together, building a new price path where each day's value depends on the last.
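Because each path is random, a single simulation tells you little on its own. A common extension is to run many paths at once and summarize the spread of outcomes; this sketch (500 paths and the percentile choices are illustrative) vectorizes the same random-walk logic across a 2D array:

```python
import numpy as np

np.random.seed(42)  # fixed seed so the run is repeatable

# Simulate 500 independent one-year price paths in one vectorized call
n_paths, days = 500, 252
daily_returns = np.random.normal(0.0005, 0.01, size=(n_paths, days))
paths = 100 * np.cumprod(1 + daily_returns, axis=1)  # one row per path
final = paths[:, -1]

print(f"Median final price: ${np.median(final):.2f}")
print(f"5th-95th percentile: ${np.percentile(final, 5):.2f} - "
      f"${np.percentile(final, 95):.2f}")
```

Running paths in bulk like this is the core of a Monte Carlo simulation: the percentile band gives a far more honest picture of risk than any single path.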

Creating synthetic image data for ML training

NumPy arrays can represent simple images, allowing you to generate entire datasets of pixel data for training and testing computer vision models.

import numpy as np

# Generate a dataset of noisy images for machine learning
n_samples = 5
n_features = 16  # 4x4 images
X = np.random.rand(n_samples, n_features)  # Features (flattened images)
y = np.random.randint(0, 2, n_samples)  # Binary labels

# Reshape first image to 2D for visualization
first_image = X[0].reshape(4, 4)
print(f"Dataset: {n_samples} images, each {int(np.sqrt(n_features))}x{int(np.sqrt(n_features))}")
print(f"First image (pixel values):\n{first_image.round(2)}")
print(f"Labels: {y}")

This code builds a basic dataset for a machine learning classification task. It generates two key components: a feature matrix X and a target vector y.

  • The X matrix contains five samples, each with 16 features representing a flattened 4x4 image. The np.random.rand() function fills it with random pixel values.
  • The y vector holds a corresponding binary label (0 or 1) for each image, created with np.random.randint().

Finally, the code reshapes the first sample into a 4x4 grid to help you visualize the image structure.
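One caveat: because both the pixels and the labels above are drawn independently at random, there is no signal for a model to learn, so any classifier will hover at chance accuracy. If you want better-than-chance behavior, tie the label to a visible pattern. This sketch (the "brighter top half" rule is an arbitrary illustrative choice) brightens the top half of every class-1 image:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

n_samples, side = 100, 4
X = np.random.rand(n_samples, side * side)  # flattened 4x4 images
y = np.random.randint(0, 2, n_samples)      # binary labels

# Brighten the top half (first 8 pixels) of each class-1 image so the
# label is actually predictable from the pixels
top_half = slice(0, side * side // 2)
X[y == 1, top_half] += 0.5

mean_top_0 = X[y == 0][:, top_half].mean()
mean_top_1 = X[y == 1][:, top_half].mean()
print(f"Class 0 top-half brightness: {mean_top_0:.2f}")
print(f"Class 1 top-half brightness: {mean_top_1:.2f}")
```

A simple classifier trained on this data should now separate the classes easily, which makes it a useful smoke test for a training pipeline.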

Get started with Replit

Turn these techniques into a real tool. Tell Replit Agent to “build a stock price simulator using a random walk model” or “create a dashboard that generates time series data with adjustable trends.”

It writes the code, tests for errors, and deploys your app automatically. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
