How to sort a dataframe based on a column in Python
Learn how to sort a pandas DataFrame by column in Python. Explore various methods, tips, real-world examples, and common error debugging.

To analyze data effectively in Python, you often need to sort a dataframe by a specific column. This step organizes your data for better interpretation and further processing.
In this article, you'll learn key techniques with functions like sort_values(). We'll cover practical tips, real-world applications, and debugging advice to help you master this essential data manipulation skill.
Basic sorting with sort_values()
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Age')
print(sorted_df)
Output:
Name Age Salary
0 Alice 25 50000
3 David 25 65000
1 Bob 30 60000
2 Charlie 35 75000
4 Eva 40 80000
The key to sorting is the sort_values() function. When you specify by='Age', you instruct pandas to rearrange all rows in the DataFrame according to the values in the 'Age' column. This method is fundamental for preparing your data for tasks like trend analysis or identifying outliers.
The resulting DataFrame is now ordered by age in ascending order, which is the default behavior. You'll notice the original index is kept with each row, so you can always trace data back to its initial position. Alice and David, who share the same age, appear here in their original relative order, but the default quicksort algorithm doesn't guarantee that for ties. Pass kind='stable' if you need tied rows to keep their original order.
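As a minimal sketch of the two points above (tie order and the retained index), the snippet below uses two optional sort_values() parameters, kind='stable' and ignore_index=True, on the same sample data. The variable names stable_df and renumbered_df are illustrative only, and a reasonably recent pandas version is assumed.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 25, 40],
        'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)

# kind='stable' guarantees that rows with equal ages keep their original relative order
stable_df = df.sort_values(by='Age', kind='stable')
print(stable_df.index.tolist())

# ignore_index=True relabels the sorted rows 0..n-1 instead of keeping the original index
renumbered_df = df.sort_values(by='Age', kind='stable', ignore_index=True)
print(renumbered_df)
```

Keeping the original index is handy for tracing rows back to the source. A renumbered index is usually more convenient when the sorted result is the final output.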
Basic sorting techniques
Beyond the default ascending sort, you can direct sort_values() with additional parameters to handle more nuanced data organization tasks.
Sorting in descending order with ascending=False
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
Output:
Name Age Salary
4 Eva 40 80000
2 Charlie 35 75000
1 Bob 30 60000
0 Alice 25 50000
3 David 25 65000
To sort your data from highest to lowest, you just need to add the ascending=False parameter. This simple addition to the sort_values() function reverses the default sort order.
- This is especially useful when you want to quickly find the top values in a column, like identifying the oldest people or highest salaries in your dataset.
As you can see in the output, the DataFrame is now arranged with the oldest person at the top.
Sorting by multiple columns
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Age', 'Salary'])
print(sorted_df)
Output:
Name Age Salary
0 Alice 25 50000
3 David 25 65000
1 Bob 30 60000
2 Charlie 35 75000
4 Eva 40 80000
When a single column isn't enough to organize your data, you can sort by multiple columns. Simply pass a list of column names to the by parameter, like by=['Age', 'Salary']. Pandas sorts the data hierarchically, starting with the first column in the list.
- The primary sort is based on the first column, 'Age'.
- For rows with the same age, the second column, 'Salary', is used to determine their order.
In the example, since Alice and David are both 25, they are then sorted by their salary, placing Alice first.
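In the sample data Alice already comes before David, so the effect of the second column is easy to miss. The sketch below uses a small hypothetical DataFrame (not part of the original example) in which David is listed first, so the tie-break on 'Salary' visibly changes the order.

```python
import pandas as pd

# Hypothetical data: David is listed before Alice, and both are 25
df = pd.DataFrame({'Name': ['David', 'Bob', 'Alice'],
                   'Age': [25, 30, 25],
                   'Salary': [65000, 60000, 50000]})

# Sort by Age first; rows with equal Age are then ordered by Salary
by_age_then_salary = df.sort_values(by=['Age', 'Salary'])
print(by_age_then_salary)
```

Alice now moves ahead of David because her salary is lower, while Bob stays last because the primary key, 'Age', takes precedence.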
Sorting with different directions for multiple columns
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(sorted_df)
Output:
Name Age Salary
3 David 25 65000
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 75000
4 Eva 40 80000
You can set different sort directions for each column by passing a list of booleans to the ascending parameter. The order of True and False values in this list must correspond to the order of columns you specified in the by parameter.
- In this example, 'Age' is sorted in ascending order because the first value is True.
- For rows with the same age, 'Salary' is sorted in descending order since the second value is False.
This is why David, with a higher salary, now appears before Alice, even though they are the same age.
Advanced sorting techniques
Beyond the foundational sort_values(), you can tackle more complex sorting challenges using efficient functions like nlargest(), custom logic with apply(), and specialized categorical types.
Using nlargest() and nsmallest() for efficient sorting
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
top_salaries = df.nlargest(3, 'Salary')
lowest_ages = df.nsmallest(2, 'Age')
print(f"Top 3 salaries:\n{top_salaries}\n\nLowest 2 ages:\n{lowest_ages}")
Output:
Top 3 salaries:
Name Age Salary
4 Eva 40 80000
2 Charlie 35 75000
3 David 25 65000
Lowest 2 ages:
Name Age Salary
0 Alice 25 50000
3 David 25 65000
When you only need the top or bottom records, sorting the entire DataFrame with sort_values() can be inefficient. The nlargest() and nsmallest() functions offer a more direct and memory-efficient approach. They quickly find the rows with the highest or lowest values in a column without sorting everything.
- The code uses nlargest(3, 'Salary') to pull the three employees with the top salaries.
- Similarly, nsmallest(2, 'Age') efficiently identifies the two youngest individuals in the dataset.
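To make the relationship with sort_values() concrete, here is a small sketch that compares nlargest() with a full sort followed by head(). It also shows the keep parameter, which controls how ties are handled. The variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 25, 40],
        'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)

# nlargest(3, ...) selects the same rows as a full descending sort followed by head(3)
top3 = df.nlargest(3, 'Salary')
top3_sorted = df.sort_values(by='Salary', ascending=False).head(3)
print(top3.index.tolist(), top3_sorted.index.tolist())

# keep='all' returns every row tied for the smallest value, even if that exceeds n
youngest_with_ties = df.nsmallest(1, 'Age', keep='all')
print(youngest_with_ties)
```

With keep='all', asking for one row still returns both Alice and David because they share the lowest age. The default, keep='first', would return only one of them.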
Sorting with a custom function using apply()
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)
df['Salary_per_year'] = df['Salary'] / df['Age']
sorted_df = df.sort_values(by='Salary_per_year', ascending=False)
print(sorted_df[['Name', 'Age', 'Salary', 'Salary_per_year']])
Output:
Name Age Salary Salary_per_year
3 David 25 65000 2600.00
2 Charlie 35 75000 2142.86
4 Eva 40 80000 2000.00
1 Bob 30 60000 2000.00
0 Alice 25 50000 2000.00
When you need to sort by a custom metric, a straightforward method is to create a new column that represents your logic. In this case, a Salary_per_year column is created by dividing the Salary by the Age. For simple arithmetic like this, a vectorized column operation is enough. apply() is only needed when the logic can't be expressed directly on whole columns. The new column serves as a key for your custom sort.
- The DataFrame is then sorted by this new column using sort_values().
This technique is powerful because it lets you order your data based on any calculated value, giving you full control over the sorting criteria.
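When the metric is easier to write as a function, DataFrame.apply() with axis=1 can compute it row by row. The sketch below reproduces the same ratio that way. The helper name salary_per_year_of_age is hypothetical, and assign() is used so the original df gains no extra column.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 25, 40],
        'Salary': [50000, 60000, 75000, 65000, 80000]}
df = pd.DataFrame(data)

def salary_per_year_of_age(row):
    """Hypothetical helper: salary divided by age, computed for a single row."""
    return row['Salary'] / row['Age']

# apply(..., axis=1) calls the function once per row and returns a Series
metric = df.apply(salary_per_year_of_age, axis=1)

# assign() adds the metric to a copy of df, which is then sorted by it
sorted_df = df.assign(Salary_per_year=metric).sort_values(by='Salary_per_year', ascending=False)
print(sorted_df)
```

Row-wise apply() is typically slower than vectorized arithmetic on large DataFrames, so it is best kept for logic that genuinely needs per-row branching.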
Sorting with categorical data using Categorical dtype
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Department': ['HR', 'IT', 'Finance', 'Marketing', 'IT']}
df = pd.DataFrame(data)
custom_order = ['IT', 'Finance', 'Marketing', 'HR']
df['Department'] = pd.Categorical(df['Department'], categories=custom_order, ordered=True)
sorted_df = df.sort_values('Department')
print(sorted_df)
Output:
Name Department
1 Bob IT
4 Eva IT
2 Charlie Finance
3 David Marketing
0 Alice HR
Sometimes, alphabetical sorting isn't what you need. For columns with a specific, non-alphabetical order, like job levels or product sizes, you can use pandas' Categorical data type. This tells pandas to sort based on a custom sequence you provide, not the default A-Z order.
- First, you define your desired order in a list, like custom_order.
- Then, you convert the column using pd.Categorical(), passing your list to the categories parameter and setting ordered=True.
Now, when you call sort_values(), the DataFrame is arranged according to your custom logic.
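If you'd rather not change the column's dtype, one alternative (a different technique from the Categorical approach above) is the key parameter of sort_values(), available in pandas 1.1 and later. This sketch maps each department to its position in custom_order. The rank dictionary is an illustrative helper.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Department': ['HR', 'IT', 'Finance', 'Marketing', 'IT']}
df = pd.DataFrame(data)

custom_order = ['IT', 'Finance', 'Marketing', 'HR']
# Map each department name to its position in the desired order
rank = {dept: position for position, dept in enumerate(custom_order)}

# key receives the whole column and must return a same-length Series to sort by
sorted_df = df.sort_values(by='Department',
                           key=lambda col: col.map(rank),
                           kind='stable')
print(sorted_df)
```

The Categorical approach stores the order with the column, so every later sort respects it. The key approach applies the order to a single call and leaves the column's dtype unchanged.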
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Instead of managing environments, you can focus on applying what you've learned.
Mastering individual functions like sort_values() is one thing, but building a complete application is another. This is where Agent 4 helps you move from piecing together techniques to building working products. It handles the coding, database connections, APIs, and deployment directly from your description.
- A sales dashboard that uses nlargest() to automatically display the top-performing deals each week.
- An HR tool that sorts employees first by department, then by salary in descending order, to analyze compensation structures.
- A project tracker that organizes tasks by a custom priority level—like 'Critical' or 'Low'—using categorical sorting logic.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with powerful tools, you'll run into a few common sorting snags; here’s how to navigate them with confidence.
Handling the "KeyError" when using sort_values() with nonexistent columns
A KeyError is Python's way of telling you it can't find the key you've specified. When using sort_values(), this almost always means the column name you provided in the by parameter doesn't exist. It's usually caused by a simple typo.
- First, double-check the spelling and capitalization of the column name.
- If you're unsure, you can always display a list of all available columns by running print(df.columns) to see the exact names.
Handling NaN values with the na_position parameter
Missing data, which pandas represents as NaN (Not a Number), can affect your sorting results. By default, sort_values() places all NaN values at the end of the sorted output. You can control this behavior with the na_position parameter.
- Set na_position='first' to group all rows with missing values at the beginning of your DataFrame.
- Use na_position='last' to explicitly move them to the end, which is the default behavior.
Fixing sort_values() when changes don't appear to take effect
If you've run sort_values() and your DataFrame appears unchanged, it's because the function returns a new, sorted copy by default. It doesn't alter the original DataFrame unless you tell it to. You have two options to make the sort permanent.
- Assign the sorted output to a new variable, like sorted_df = df.sort_values(by='your_column'). This is the most common and often safest approach.
- Use the inplace=True parameter to modify the original DataFrame directly: df.sort_values(by='your_column', inplace=True). Be mindful that this permanently changes your DataFrame.
Handling the "KeyError" when using sort_values() with nonexistent columns
You'll encounter a KeyError when you ask pandas to sort by a column it can't find. This common mistake usually boils down to a small typo in the column name or referencing a column that was never created in the first place.
The code below demonstrates this by trying to sort a DataFrame by the 'Salary' column, which doesn't exist in the data.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Salary')
print(sorted_df)
The code fails because the DataFrame df only contains 'Name' and 'Age' columns. Calling sort_values(by='Salary') on it triggers a KeyError since the specified column doesn't exist. The corrected code below addresses this issue.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
if 'Salary' in df.columns:
    sorted_df = df.sort_values(by='Salary')
else:
    sorted_df = df.sort_values(by='Age')
print(sorted_df)
To prevent a KeyError, the corrected code first checks if the 'Salary' column exists using if 'Salary' in df.columns:. This simple conditional statement is a powerful way to handle potential errors gracefully.
- If the column is found, the code sorts by it.
- If not, it defaults to sorting by 'Age', ensuring the program runs without crashing.
This is a great defensive practice when your data's structure might vary.
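A column can also look correct and still trigger a KeyError, for example when a name read from a file carries stray whitespace. The sketch below uses hypothetical data with a trailing space in 'Age ' to show one way to surface and fix that.

```python
import pandas as pd

# Hypothetical data with a stray trailing space in a column name
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age ': [25, 30, 35]})

try:
    df.sort_values(by='Age')
except KeyError as err:
    # Printing the column list makes the hidden space visible
    print(f"KeyError: {err}; available columns: {df.columns.tolist()}")

# Strip surrounding whitespace from every column name, then sort
df.columns = df.columns.str.strip()
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
```

Printing df.columns.tolist() rather than df.columns shows each name in quotes, which makes leading or trailing spaces much easier to spot.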
Handling NaN values with the na_position parameter
Missing data, represented as NaN, can disrupt your sorting logic. By default, pandas places these null values at the end of the sorted output, which might not be what you want. The code below shows this default behavior in action.
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, np.nan, 92, np.nan]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Score')
print(sorted_df)
The code sorts by the 'Score' column, but since two entries are missing (np.nan), they're automatically sent to the bottom. The next example shows how you can explicitly control the placement of these null values.
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, np.nan, 92, np.nan]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Score', na_position='first')
print(sorted_df)
By adding the na_position='first' parameter, you instruct pandas to place all rows with NaN values at the beginning of the sorted DataFrame. This gives you direct control over how missing data is presented.
- This is particularly useful when you need to quickly identify and handle incomplete records before proceeding with your analysis.
- It’s a common step in data cleaning pipelines where missing values require special attention.
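One detail worth checking for yourself is how na_position interacts with a descending sort. In this sketch on the same sample data, NaN rows stay at the end with ascending=False unless na_position='first' is given. The variable names are illustrative.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Score': [85, np.nan, 92, np.nan]})

# Descending sort: NaN rows still go last, because na_position defaults to 'last'
desc_default = df.sort_values(by='Score', ascending=False)

# The same descending sort, with NaN rows moved to the top
desc_nan_first = df.sort_values(by='Score', ascending=False, na_position='first')

print(desc_default)
print(desc_nan_first)
```

In other words, na_position is independent of the sort direction: it only decides which end of the result the missing values go to.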
Fixing sort_values() when changes don't appear to take effect
A frequent stumbling block is calling sort_values() and seeing no change in your DataFrame. This isn't a bug; it's how the function is designed. By default, it returns a new, sorted copy instead of modifying the original. The following code demonstrates this behavior.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df.sort_values(by='Age', ascending=False)
print("\nAfter sort_values() - still in original order:")
print(df)
The code calls df.sort_values() but doesn't save the result. That's why printing df again shows the original, unchanged DataFrame. The code below shows how to make the sort permanent.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df = df.sort_values(by='Age', ascending=False)
print("\nAfter reassigning sort_values() result:")
print(df)
To make your sort permanent, you must reassign the output of sort_values() back to a variable. The corrected code does exactly this by capturing the sorted result with df = df.sort_values(...), overwriting the original DataFrame with the newly sorted version.
- This is the most common and recommended way to apply sorting, as it makes the change explicit.
Always remember to reassign the result when using pandas functions that return a new DataFrame instead of modifying the original.
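The inplace=True option mentioned earlier has a related pitfall: it sorts the DataFrame itself and returns None. This short sketch shows why combining it with reassignment, as in df = df.sort_values(..., inplace=True), would leave df set to None. The variable name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# inplace=True sorts df itself; the call's return value is None
result = df.sort_values(by='Age', ascending=False, inplace=True)

print(result)  # the return value, not a DataFrame
print(df)      # df is now sorted in descending order of Age
```

Pick one style per call: either reassign the returned DataFrame or use inplace=True, but not both.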
Real-world applications
Now that you can navigate common sorting errors, you can apply these skills to real-world scenarios like sales analysis and anomaly detection.
Analyzing sales data to identify top-performing products
Sorting your sales data by the Revenue column is a straightforward way to identify your top-performing products.
import pandas as pd
sales_data = {
'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
'Category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
'Units_Sold': [120, 310, 95, 55, 75],
'Revenue': [120000, 155000, 47500, 16500, 7500]
}
sales_df = pd.DataFrame(sales_data)
top_products = sales_df.sort_values(by='Revenue', ascending=False)
print("Products ranked by revenue:")
print(top_products)
This code takes a dictionary of sales figures and converts it into a structured pandas DataFrame. The key action happens with sort_values(by='Revenue', ascending=False), which reorders the DataFrame so the highest-revenue products appear first.
- The by='Revenue' parameter tells pandas which column to use as the sorting key.
- Setting ascending=False is essential for this kind of ranking, as it flips the default order to show the highest values first.
The final, sorted data is assigned to the top_products variable, leaving the original sales_df untouched.
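A common follow-up question is which product leads each category. As a sketch under the same sample data, sorting by revenue and then taking the first row of each 'Category' group gives one answer. The variable name best_per_category is illustrative.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
    'Units_Sold': [120, 310, 95, 55, 75],
    'Revenue': [120000, 155000, 47500, 16500, 7500]
})

# Sort by revenue first, then keep the top row within each category
best_per_category = (
    sales_df.sort_values(by='Revenue', ascending=False)
            .groupby('Category')
            .head(1)
)
print(best_per_category)
```

Because the rows are sorted before grouping, head(1) picks the highest-revenue product in each category, and the result keeps the descending revenue order.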
Finding anomalies in time-series data with sort_values()
You can also use sort_values() to quickly isolate unusual data points, a common task in anomaly detection for time-series analysis.
import pandas as pd
dates = pd.date_range('2023-01-01', periods=10, freq='D')
temperatures = [20, 22, 21, 35, 23, 20, 19, 18, 21, 22] # 35 is an anomaly
weather_data = pd.DataFrame({
'Date': dates,
'Temperature': temperatures
})
sorted_temps = weather_data.sort_values(by='Temperature', ascending=False)
threshold = sorted_temps['Temperature'].mean() + sorted_temps['Temperature'].std()
anomalies = sorted_temps[sorted_temps['Temperature'] > threshold]
print("All data sorted by temperature:")
print(sorted_temps)
print("\nIdentified anomalies (temp > threshold):")
print(anomalies)
This code identifies outliers by first sorting the data by temperature in descending order with sort_values(). It then establishes a simple statistical threshold to define what's considered an anomaly.
- The threshold is calculated by taking the mean of all temperatures and adding one standard deviation.
- Finally, the code filters the DataFrame, keeping only the rows where the Temperature is greater than this calculated threshold.
This approach effectively isolates the unusually high temperature reading, which in this case is 35.
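The threshold above only catches unusually high readings. As a hedged variation, the sketch below ranks readings by how far they sit from the mean in either direction, using the key parameter of sort_values() (pandas 1.1 and later). The Z_score column name is an assumption introduced for illustration.

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=10, freq='D')
temperatures = [20, 22, 21, 35, 23, 20, 19, 18, 21, 22]
weather_data = pd.DataFrame({'Date': dates, 'Temperature': temperatures})

mean = weather_data['Temperature'].mean()
std = weather_data['Temperature'].std()

# Hypothetical column: distance from the mean, measured in standard deviations
weather_data['Z_score'] = (weather_data['Temperature'] - mean) / std

# Sort by absolute z-score so the most unusual readings, high or low, come first
ranked = weather_data.sort_values(by='Z_score',
                                  key=lambda col: col.abs(),
                                  ascending=False)
print(ranked.head(3))
```

The 35-degree reading still ranks first, and the coldest day follows it, which a one-sided threshold would never surface.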
Get started with Replit
Now, turn your sorting skills into a real tool. Describe what you want to Replit Agent, like "build a dashboard that ranks products by sales" or "create a script to find top customer reviews."
Replit Agent will write the code, test for errors, and deploy your application for you. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.