How to load a dataset in Python

Learn how to load datasets in Python. Explore various methods, tips, real-world applications, and common error debugging for your projects.

Published on: Fri, Feb 20, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

Loading datasets in Python is a foundational skill for any data-driven project. It's the first step in data analysis and machine learning, and Python offers powerful, straightforward tools.

In this article, you'll explore various techniques to load datasets. We'll cover practical tips, real-world applications, and common debugging advice to help you handle data efficiently in your projects.

Using pandas to load CSV files

import pandas as pd
data = pd.read_csv('example.csv')
print(data.head())

Output:

   id  age   name  score
0   1   25   John   85.5
1   2   30  Alice   92.0
2   3   22    Bob   78.5
3   4   28  Carol   90.0
4   5   35   Dave   88.5

The pandas library is a cornerstone for data science in Python. Its read_csv() function is the workhorse for loading tabular data from a CSV file directly into a DataFrame. This two-dimensional structure is highly optimized for analysis and manipulation, making it the standard for most data tasks.

Calling data.head() is a common next step. It lets you quickly preview the first few rows of your DataFrame. This is a simple yet effective way to verify that the data loaded correctly and to get a feel for its columns and content without displaying the entire dataset.

Basic data loading approaches

While pandas is a powerhouse for CSVs, you'll also need methods for handling other common formats like Excel, JSON, and numpy arrays.

Reading data with numpy

import numpy as np
data = np.loadtxt('data.txt', delimiter=',')
print(data[:3])  # Print first 3 rows

Output:

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

For purely numerical data, numpy offers a fast and memory-efficient alternative. The loadtxt() function reads data from a text file directly into a multi-dimensional array, which is ideal for mathematical computations.

  • The delimiter parameter is key; it specifies the character separating values in your file, such as a comma.
  • This method works best with homogeneous data—where all values share the same numerical type—as it creates a fast numpy array rather than a flexible DataFrame.
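If your numeric file has gaps, loadtxt() will raise an error. A minimal sketch of the more forgiving np.genfromtxt(), using an in-memory buffer in place of a file:

```python
import numpy as np
from io import StringIO

# An in-memory stand-in for a text file with one missing value
text = StringIO("1,2,3\n4,,6\n7,8,9")

# genfromtxt fills gaps with nan instead of raising like loadtxt would
data = np.genfromtxt(text, delimiter=',')
print(int(np.isnan(data).sum()))  # 1
```

The empty field in the second row becomes nan, so you can detect and clean missing entries after loading.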

Loading Excel files

import pandas as pd
excel_data = pd.read_excel('example.xlsx')
print(f"Columns: {excel_data.columns.tolist()[:3]}...")
print(f"Rows: {excel_data.shape[0]}")

Output:

Columns: ['id', 'name', 'age']...
Rows: 100

Handling Excel files is just as simple as CSVs when you're using pandas. The read_excel() function works much like its CSV counterpart, loading your spreadsheet directly into a DataFrame for immediate use.

  • You can quickly inspect your data's structure by accessing the columns attribute to see column names.
  • The shape attribute gives you the dimensions of your dataset. Using shape[0] is a handy way to get the total row count.
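Workbooks often contain several sheets, and read_excel()'s sheet_name parameter selects which one to load. A small sketch that first writes a two-sheet demo workbook (the file and sheet names are hypothetical, and the openpyxl engine must be installed, as read_excel itself requires for .xlsx files):

```python
import pandas as pd

# Build a two-sheet demo workbook (hypothetical file and sheet names)
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Q1', index=False)
    pd.DataFrame({'a': [3, 4]}).to_excel(writer, sheet_name='Q2', index=False)

# sheet_name picks one sheet; sheet_name=None returns a dict of all sheets
q2 = pd.read_excel('demo.xlsx', sheet_name='Q2')
all_sheets = pd.read_excel('demo.xlsx', sheet_name=None)
print(q2['a'].tolist(), sorted(all_sheets))  # [3, 4] ['Q1', 'Q2']
```

Passing sheet_name=None is a handy way to discover what a workbook contains before deciding which sheets to analyze.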

Working with JSON data

import json
with open('data.json', 'r') as file:
    data = json.load(file)
print(f"Available keys: {list(data.keys())}")

Output:

Available keys: ['users', 'settings', 'metadata']

JSON is a popular format for structured data, often used in web APIs. Python handles it natively with its built-in json module. The standard approach involves using a with open() statement to read the file, then passing the file object to json.load().

  • This function parses the JSON content directly into a Python dictionary, which lets you access its structure immediately. For example, calling data.keys() is a great way to see the top-level keys and understand the data's organization.
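When the JSON is nested, pandas' json_normalize() flattens it into a tabular DataFrame. A short sketch using a hypothetical API-style payload:

```python
import pandas as pd

# Hypothetical nested payload, like a web API might return
payload = {"users": [
    {"name": "John", "address": {"city": "Boston", "zip": "02134"}},
    {"name": "Alice", "address": {"city": "Denver", "zip": "80202"}},
]}

# Nested dicts become dotted column names like 'address.city'
df = pd.json_normalize(payload["users"])
print(df.columns.tolist())  # ['name', 'address.city', 'address.zip']
```

This turns one record per list entry into one row per DataFrame entry, with nested fields promoted to their own columns.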

Advanced loading techniques

Moving beyond local files, your data loading toolkit should also include fetching data from built-in libraries, remote URLs, and databases. These advanced techniques are particularly useful when leveraging AI coding for data-driven applications.

Using built-in datasets from libraries

from sklearn.datasets import load_iris
iris = load_iris()
print(f"Feature names: {iris.feature_names}")
print(f"Number of samples: {iris.data.shape[0]}")

Output:

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Number of samples: 150

Many data science libraries like scikit-learn come with built-in datasets, which are perfect for practicing models without hunting for external files. The load_iris() function, for instance, returns the classic Iris dataset as a special object that bundles the data with its metadata.

  • This object lets you easily access details like column headers via the feature_names attribute.
  • The actual data is stored in the data attribute, typically as a numpy array, so you can quickly check its dimensions.
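If you prefer the familiar DataFrame interface, the bundled array and metadata combine easily (recent scikit-learn versions can also return a DataFrame directly via load_iris(as_frame=True)). A minimal sketch:

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Pair the numeric array with its feature names in a labeled DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df.shape)  # (150, 5)
```

From here, all the usual pandas tools—head(), describe(), groupby()—work on the dataset.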

Loading data from URLs

import pandas as pd
url = "https://raw.githubusercontent.com/datasets/iris/master/data/iris.csv"
data = pd.read_csv(url)
print(f"Remote data shape: {data.shape}")
print(data.iloc[0])

Output:

Remote data shape: (150, 5)
sepal.length       5.1
sepal.width        3.5
petal.length       1.4
petal.width        0.2
variety         setosa
Name: 0, dtype: object

The pandas library simplifies loading data directly from the web. Its versatile read_csv() function isn't limited to local files—you can pass a URL directly to it. This is perfect for pulling datasets from online repositories like GitHub without needing to download them first.

  • Once loaded, the data behaves just like any other DataFrame.
  • You can inspect it using familiar attributes like shape for dimensions or methods like iloc[] to access specific rows by their integer position.

Processing database data with SQLAlchemy

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///example.db')
query = "SELECT * FROM users LIMIT 5"
sql_data = pd.read_sql(query, engine)
print(sql_data)

Output:

   id   name  age
0   1   John   25
1   2  Alice   30
2   3    Bob   22
3   4  Carol   28
4   5   Dave   35

When your data lives in a database, you can use SQLAlchemy and pandas together to fetch it efficiently. The create_engine() function from SQLAlchemy establishes a connection to your database—in this case, a local SQLite file. This engine acts as the bridge for running queries.

  • The pandas function read_sql() takes your SQL query and the database engine you created.
  • It executes the query and loads the results directly into a DataFrame, making database interaction feel as simple as reading a CSV.
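You don't strictly need SQLAlchemy for SQLite: pandas also accepts a plain DB-API connection. A self-contained sketch that builds a throwaway in-memory database to query:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database as a stand-in for a real one
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, 'John', 25), (2, 'Alice', 30), (3, 'Bob', 22)])

# read_sql_query works with DB-API connections as well as engines
df = pd.read_sql_query("SELECT * FROM users WHERE age > 24", conn)
print(len(df))  # 2
```

For production databases (PostgreSQL, MySQL), the SQLAlchemy engine approach shown above remains the recommended path.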

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. This lets you move from learning individual techniques to building complete, working applications much faster.

Instead of piecing together data loading methods, you can use Agent 4 to build the entire app. Describe what you want to build, and Agent 4 will handle everything from writing the code and connecting to databases to deploying your project live. For example, you could build:

  • A mini-dashboard that automatically fetches a sales CSV from a URL and displays key metrics like total revenue and top-selling items.
  • A configuration validator that reads a JSON file from a project and confirms that all necessary settings are present and correctly formatted.
  • A simple admin panel that connects to a database and shows a live-updating table of the latest user sign-ups.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with the right tools, you might run into a few common roadblocks when loading your data, especially when handling large datasets.
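For files too large to fit in memory, read_csv()'s chunksize parameter streams the data in pieces. A small sketch, with an in-memory buffer standing in for a huge file:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for a very large CSV file
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# Process 4 rows at a time instead of loading everything at once
total = 0
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
print(total)  # 45
```

Each chunk is an ordinary DataFrame, so you can filter or aggregate it and keep only the running result.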

Handling missing values with pd.read_csv()

Real-world datasets are rarely perfect and often contain missing values. The pandas library is smart about this, automatically recognizing common placeholders like empty strings or NA, but sometimes you need to give it a little help.

If your dataset uses custom markers for missing data, like "Not Available" or -1, you can tell read_csv() what to look for.

  • Use the na_values parameter to provide a list of strings that should be interpreted as missing.
  • This ensures that your DataFrame correctly represents empty cells as NaN (Not a Number), making subsequent data cleaning and analysis much smoother.
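Here's a minimal sketch of na_values in action, with an in-memory buffer standing in for a file that marks missing data as "Not Available" or -1:

```python
import pandas as pd
from io import StringIO

# Stand-in CSV using custom markers for missing data
csv_text = "id,score\n1,85.5\n2,Not Available\n3,-1\n4,90.0\n"

# Tell read_csv which markers mean "missing"
data = pd.read_csv(StringIO(csv_text), na_values=['Not Available', -1])
print(int(data['score'].isna().sum()))  # 2
```

Both custom markers are now proper NaN values, so the score column loads as a clean numeric dtype.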

Resolving encoding errors when reading text files

A UnicodeDecodeError is a frequent hurdle when working with text files. This error pops up when the file's encoding (the way its characters are stored as bytes) doesn't match what Python expects, which is usually 'utf-8'.

The fix is typically straightforward. You just need to specify the correct encoding when you open the file.

  • In functions like pd.read_csv() or the built-in open(), you can use the encoding parameter.
  • If you're unsure of the correct encoding, common alternatives to try include 'latin1' or 'ISO-8859-1', which are more permissive.
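The same encoding parameter exists on the built-in open(). A quick sketch that writes a Latin-1 file and reads it back (trying to read it as UTF-8 would raise UnicodeDecodeError):

```python
# Write a file in Latin-1, a common non-UTF-8 encoding
with open('latin1_demo.txt', 'w', encoding='latin1') as f:
    f.write("café, naïve, Zürich")

# Reading it back works once the correct encoding is named
with open('latin1_demo.txt', 'r', encoding='latin1') as f:
    content = f.read()
print(content)  # café, naïve, Zürich
```

The bytes on disk never change; the encoding parameter only tells Python how to interpret them as text.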

Fixing data type issues with dtype parameter

Sometimes pandas might misinterpret a column's data type, or dtype. For example, it might read a column of numerical IDs as floating-point numbers or treat numbers with leading zeros as integers, dropping the zeros.

You can enforce the correct data types right at the loading stage using the dtype parameter in read_csv(). This gives you precise control over how your data is structured.

  • Simply pass a dictionary where keys are the column names and values are the desired types, like {'user_id': str, 'score': float}.
  • This proactive step prevents unexpected errors down the line and ensures your data is ready for analysis from the get-go.

Handling missing values with pd.read_csv()

Missing values can be sneaky. Even if pd.read_csv() runs without errors, unrecognized placeholders can linger in your DataFrame as text, leaving a column you believe is numeric with an object dtype. Arithmetic on that column then misbehaves, as the following code demonstrates.

import pandas as pd
data = pd.read_csv('missing_data.csv')
# Arithmetic misbehaves if the column still holds text placeholders
result = data['numeric_column'] * 2
print(result)

The multiplication goes wrong because numeric_column still contains text placeholders that pandas could not convert to numbers. The following code shows how to coerce the column to a numeric type and handle the missing values before doing any math.

import pandas as pd
data = pd.read_csv('missing_data.csv')
# Coerce text placeholders to NaN, then replace missing values with zero
data['numeric_column'] = pd.to_numeric(data['numeric_column'], errors='coerce').fillna(0)
result = data['numeric_column'] * 2
print(result)

Here pd.to_numeric() with errors='coerce' turns anything that isn't a valid number into NaN, and fillna(0) then replaces those missing values with zero.

  • With the column purely numeric, the * 2 operation succeeds.

This allows the calculation to behave predictably. It's a good habit to always check for and handle missing data right after loading it and before you start your analysis.

Resolving encoding errors when reading text files

You'll often encounter a UnicodeDecodeError when your data includes international characters or symbols. This error means Python's default text format can't interpret the file's contents. The code below shows how this problem appears when loading a simple CSV.

import pandas as pd
# This may raise UnicodeDecodeError if the file isn't UTF-8 encoded
data = pd.read_csv('international_data.csv')
print(data.head())

pandas reads files as UTF-8 by default, so the error means this file was saved with a different encoding. The following code shows how to name the file's actual encoding so pandas can interpret it correctly.

import pandas as pd
# Specify the file's actual encoding (Latin-1 in this example)
data = pd.read_csv('international_data.csv', encoding='latin1')
print(data.head())

The solution is to explicitly tell pandas which text format to expect via the encoding parameter of read_csv(). Matching the parameter to the file's real encoding prevents the UnicodeDecodeError.

  • Keep an eye out for this error when your data includes non-English text, currency symbols, or other special characters.
  • If you don't know the encoding, 'latin1' and 'cp1252' are common for files created on Windows, and libraries like chardet can detect it for you.

Fixing data type issues with dtype parameter

It's common for pandas to misinterpret data types, especially with columns that look like numbers but aren't. For instance, it might strip leading zeros from zip codes by treating them as integers. The code below shows this issue in action.

import pandas as pd
# Numbers with leading zeros get treated as numeric and lose the zeros
data = pd.read_csv('user_data.csv')
print(data['zip_code'].head()) # Zip codes like '02134' become 2134

The read_csv() function automatically infers the column is numeric, converting it to an integer and dropping the leading zeros. The code below demonstrates how you can explicitly define the data type to prevent this from happening.

import pandas as pd
# Specify the data type to preserve leading zeros
data = pd.read_csv('user_data.csv', dtype={'zip_code': str})
print(data['zip_code'].head()) # Zip codes like '02134' remain intact

The solution is to explicitly set the data type during loading. By passing a dictionary to the dtype parameter, like {'zip_code': str}, you tell pandas to treat that column as a string, not a number.

  • This simple fix preserves important formatting, like leading zeros in IDs or postal codes.
  • It's a crucial step whenever a column contains numerical-looking data that should be handled as text.

Real-world applications

Now that you can navigate common loading errors, you're ready to tackle more complex scenarios like combining multiple data sources.

Analyzing data from multiple CSV files with pandas

When your data is spread across multiple files, like monthly sales reports, you can use glob.glob() to find them and pd.concat() to merge everything into one DataFrame for analysis.

import pandas as pd
import glob

# Load all monthly sales reports
sales_files = glob.glob('monthly_sales/*.csv')
all_sales = pd.concat([pd.read_csv(f) for f in sales_files], ignore_index=True)

# Summarize sales by product category
category_totals = all_sales.groupby('category')['amount'].sum().sort_values(ascending=False)
print(category_totals.head(3))

This code automates the analysis of data from multiple reports. It starts by using glob.glob() to dynamically find all CSV files within a specific folder. A list comprehension then efficiently loads each of these files into a separate DataFrame.

  • The pd.concat() function stacks all these individual DataFrames into one master dataset.
  • With all the data combined, you can perform powerful aggregate operations like groupby() to summarize sales by category.
  • Finally, sort_values() organizes the results to quickly identify top-performing categories.

Creating a dataset by combining web data and local files

Often, you'll need to combine a static local file with dynamic data from the web, like using a live API to convert purchase amounts to a different currency.

import pandas as pd

# Load local customer data
customers = pd.read_csv('customers.csv')

# Load currency exchange rates from a public API
exchange_url = "https://open.er-api.com/v6/latest/USD"
exchange_rates = pd.read_json(exchange_url)['rates']

# Convert USD purchases to EUR
customers['amount_eur'] = customers['amount_usd'] * exchange_rates['EUR']
print(customers[['customer_id', 'amount_usd', 'amount_eur']].head())

This code merges data from a local file and a live web API. It first loads customers.csv using pd.read_csv(). Then, it pulls real-time currency data directly from a URL with pd.read_json(), showing how pandas handles remote sources just like local ones. This type of data integration is perfect for vibe coding approaches.

  • A new column, amount_eur, is created by multiplying the amount_usd values with the fetched EUR rate.
  • This operation demonstrates how you can easily perform calculations across different data sources to generate new insights within your DataFrame.

Get started with Replit

Turn your knowledge into a real tool with Replit Agent. Try prompts like, “Build a tool that merges monthly sales CSVs” or “Create a dashboard that fetches and displays live currency exchange rates from an API.”

Replit Agent writes the code, tests for errors, and deploys your app for you. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
