How to extract text from a PDF in Python

Learn to extract text from PDFs in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

Published on: Fri, Feb 20, 2026

Updated on: Mon, Apr 6, 2026

The Replit Team

Extracting text from PDF files is a key task in data processing and automation. Python's libraries provide robust tools for it, turning complex documents into usable plain text for your projects.

In this article, we'll cover several techniques for text extraction. You'll find practical tips, real-world applications, and debugging advice to help you select the right approach for your specific use case.

Basic extraction with `PyPDF2`

    import PyPDF2

    with open('sample.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        text = page.extract_text()
        print(text[:100])
    --OUTPUT--
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisi. Maecenas fermentum magna qu

The PyPDF2 library offers a direct way to pull text from PDFs. The code opens the file in binary read mode ('rb'), which is essential because PDFs have a complex binary structure, unlike plain text files. A PdfReader object is then created to parse the document.

The core of the extraction is the page.extract_text() method. It attempts to pull all text content from the selected page object—in this case, the first page accessed via reader.pages[0]. This approach is great for simple, text-based PDFs but may struggle with complex layouts or scanned images.
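The example above reads only the first page. As a minimal sketch, the same call can be applied to every page and the results joined. The helper names `extract_all_text` and `extract_all_text_from_file` are assumptions for illustration; the first accepts any reader object that exposes `.pages`, as `PyPDF2.PdfReader` does.

```python
def extract_all_text(reader):
    """Join the text of every page in a PdfReader-like object."""
    parts = []
    for page in reader.pages:
        # Treat a missing result as an empty string so join() never fails.
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


def extract_all_text_from_file(pdf_path):
    """Open a PDF with PyPDF2 and return the text of all pages."""
    import PyPDF2  # third-party; imported here so the helper above has no dependency

    with open(pdf_path, "rb") as file:
        return extract_all_text(PyPDF2.PdfReader(file))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long documents.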

Working with common PDF libraries

When you need more precision than PyPDF2 can offer, you can turn to specialized libraries like pdfminer.six, pdfplumber, and PyMuPDF for better results.

Using `pdfminer.six` for better text extraction

    from pdfminer.high_level import extract_text

    text = extract_text('sample.pdf')
    print(text[:100])
    --OUTPUT--
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisi. Maecenas fermentum magna qu

The pdfminer.six library simplifies text extraction with its high-level extract_text function. Unlike the multi-step process in PyPDF2, you don't need to manually open the file or iterate through pages—this function handles the entire document in a single call. It's generally more accurate because it analyzes the PDF's layout to better reconstruct text flow.

It excels at parsing complex documents with multiple columns.
It provides more precise character and spacing analysis, which prevents jumbled output.

Extracting text with `pdfplumber`

    import pdfplumber

    with pdfplumber.open('sample.pdf') as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        print(text[:100])
    --OUTPUT--
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisi. Maecenas fermentum magna qu

The pdfplumber library offers a great balance of simplicity and power. Its code structure is straightforward—you use pdfplumber.open() to access the file and page.extract_text() to pull content from a specific page, which may feel familiar. Where it really shines is in its ability to work with the visual layout of a PDF.

It's built on top of pdfminer.six, inheriting its robust layout analysis.
It's particularly effective for extracting tables and other structured data, not just blocks of text.
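To illustrate the table support mentioned above, here is a rough sketch that turns a table into CSV text. It assumes the table arrives as a list of rows, which is the shape `page.extract_tables()` returns in pdfplumber; the helper names `table_to_csv` and `first_page_tables_as_csv` are hypothetical.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells may come back as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def first_page_tables_as_csv(pdf_path):
    """Return CSV text for each table found on the first page."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        return [table_to_csv(t) for t in pdf.pages[0].extract_tables()]
```

Keeping the CSV step separate from the PDF step makes it easy to check the conversion with hand-written rows before pointing it at a real file.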

Using `PyMuPDF` (fitz) for fast text extraction

    import fitz

    doc = fitz.open('sample.pdf')
    text = ""
    for page in doc:
        text += page.get_text()
    print(text[:100])
    --OUTPUT--
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla facilisi. Maecenas fermentum magna qu

For raw performance, PyMuPDF (imported as fitz) is a top contender. The code uses fitz.open() to access the document, then iterates through each page to pull content with page.get_text(). This direct loop makes processing the entire document very efficient.

Its speed is its main advantage, as it’s a Python binding for MuPDF—a high-performance C library.
It's an excellent choice for applications that need to process a large number of PDFs quickly.

Advanced PDF text extraction techniques

Now that you're familiar with the core libraries, you can tackle more complex real-world problems like preserving layouts, handling password-protected files, and batch processing.

Extracting text with layout preservation

    import fitz

    doc = fitz.open('sample.pdf')
    page = doc[0]
    blocks = page.get_text("blocks")
    for block in blocks[:2]:  # First two blocks
        print(f"Block: {block[4][:50]}...")
    --OUTPUT--
    Block: Lorem ipsum dolor sit amet, consectetur adipis...
    Block: Nulla facilisi. Maecenas fermentum magna qu...

PyMuPDF can do more than just pull raw text. When you call page.get_text("blocks"), it doesn't just return a single string. Instead, you get a list where each item represents a distinct text block from the page, helping you preserve the document's original structure.

Each block is a tuple containing layout metadata like coordinates.
The actual text is stored at a specific index, which is why the code uses block[4] to access it.

This approach is perfect when you need to analyze the spatial relationship between different text elements.
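As one way to use those coordinates, the sketch below sorts blocks top to bottom and then left to right. It relies on the block tuple layout `(x0, y0, x1, y1, text, block_no, block_type)`; the function name and the sample blocks are made up for illustration and stand in for real `page.get_text("blocks")` output.

```python
def blocks_in_reading_order(blocks):
    """Return text blocks sorted top-to-bottom, then left-to-right."""
    # Each block is (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 marks a text block, 1 an image block.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Round y0 so blocks on (nearly) the same line compare by x0.
    return sorted(text_blocks, key=lambda b: (round(b[1]), b[0]))


# Hypothetical blocks, not taken from a real document.
sample_blocks = [
    (300.0, 72.0, 540.0, 90.0, "Right column\n", 1, 0),
    (72.0, 72.2, 280.0, 90.0, "Left column\n", 0, 0),
    (72.0, 40.0, 540.0, 60.0, "Title\n", 2, 0),
]
ordered = [b[4].strip() for b in blocks_in_reading_order(sample_blocks)]
print(ordered)
```

This simple ordering suits single-column pages; multi-column layouts usually need the columns grouped by x-position first.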

Working with password-protected PDFs

    import PyPDF2

    with open('protected.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        if reader.is_encrypted:
            reader.decrypt('password123')
        text = reader.pages[0].extract_text()
        print(text[:100])
    --OUTPUT--
    This is content from a password-protected PDF document. Access granted with correct password.

Handling password-protected files with PyPDF2 is straightforward. Before trying to extract text, you'll want to check if the document is encrypted using the reader.is_encrypted property.

If it returns True, you can unlock the file by calling the reader.decrypt() method and passing the password as a string.
After that, text extraction works exactly the same, allowing you to use extract_text() as you normally would.
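The snippet above assumes the password is correct. A small sketch of checking the outcome first is shown here; it relies on `decrypt()` returning a falsy value when the password does not match, and the helper name `unlock` is an assumption. It works on any reader-like object with `is_encrypted` and `decrypt()`.

```python
def unlock(reader, password):
    """Decrypt a PdfReader-like object; return True if it is readable."""
    if not reader.is_encrypted:
        return True
    # decrypt() returns a falsy value when the password does not match.
    return bool(reader.decrypt(password))
```

Calling `unlock(reader, 'password123')` before `extract_text()` lets a script skip or report files it cannot open instead of failing partway through.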

Batch processing multiple PDF files

    import os
    import PyPDF2

    pdf_folder = 'pdf_files'
    for filename in os.listdir(pdf_folder):
        if filename.endswith('.pdf'):
            with open(os.path.join(pdf_folder, filename), 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                print(f"{filename}: {len(reader.pages)} pages")
    --OUTPUT--
    document1.pdf: 5 pages
    report.pdf: 12 pages
    sample.pdf: 1 pages

Automating tasks across many files is a common need, and Python's os module is well suited to it. This script loops over every file in a directory and opens one PDF at a time, so you can apply the same logic to an entire collection without holding all of the files open at once.

The os.listdir() function gets a list of all items in the pdf_folder.
An if statement filters this list, ensuring only files ending with .pdf are processed.
Using os.path.join() is a reliable way to create the correct file path before opening each document.

From there, you can apply any PDF operation you need to each file in the sequence, similar to how you might iterate through files when reading CSV files in Python.
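As a variation on the loop above, this sketch uses `pathlib` to collect the PDF paths first. It differs from the original in two ways that are assumptions about what you might want: it matches the extension regardless of case, and it sorts the results so runs are repeatable. The function name `find_pdfs` is hypothetical.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return PDF paths in a folder, sorted by name, any extension case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path can be passed straight to `open(path, 'rb')` in place of the `os.path.join()` call.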

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. This environment lets you move beyond individual techniques, like using extract_text(), and focus on building complete applications.

Instead of piecing together scripts, you can use Agent 4 to build a finished product directly from your idea. Describe the app you want, and the Agent handles the rest—from writing the code to managing deployment.

An invoice parser that automatically extracts data from a folder of PDFs and organizes it into a spreadsheet.
A research assistant that ingests academic papers and generates summaries of their key findings.
A document converter that reads text from multiple PDFs and outputs it as clean, editable files.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

When extracting text from PDFs, you'll likely run into a few common issues, but they're all manageable with the right approach.

Handling file not found errors with `try-except`

A FileNotFoundError is one of the first hurdles you'll face. It happens when your script tries to open a PDF that doesn't exist at the specified path, often due to a simple typo or the file being in a different folder.

To prevent your program from crashing, wrap your file-opening logic in a try-except block. This allows you to "catch" the error and handle it gracefully, perhaps by printing a message to the user instead of halting execution.

Preventing `IndexError` when accessing PDF pages

An IndexError occurs if you try to access a page that isn't there—for example, requesting reader.pages[5] in a three-page document. This is a common slip-up when working with files of varying lengths.

You can avoid this by always checking the document's length with an expression like len(reader.pages) before attempting to access a specific page index. This simple check ensures your code only requests pages that actually exist.

Handling empty text extraction with fallback methods

Sometimes, a call to extract_text() returns an empty string even if the PDF visibly contains text. This often means the content is an image, not selectable text, or the layout is too complex for the library to parse correctly.

When this happens, your first step should be to try a different library. If PyPDF2 fails, PyMuPDF or pdfplumber might succeed due to their more advanced layout analysis. If all libraries return empty strings, the PDF likely requires Optical Character Recognition (OCR) to convert the image-based text into machine-readable characters.

Handling file not found errors with `try-except`

It's easy to mistype a filename, which will crash your script with a FileNotFoundError. This happens when the path is wrong or the file isn't where your code expects. Before exploring the fix, see what this error looks like in action. Understanding different methods for checking if a file exists in Python can help prevent these errors.

    import PyPDF2

    # This will raise FileNotFoundError if the file doesn't exist
    with open('non_existent.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = reader.pages[0].extract_text()
        print(text)

The script fails because the open() function can't find non_existent.pdf, which immediately stops the program. The following code demonstrates how to manage this common error without crashing your application.

    import PyPDF2
    import os

    pdf_path = 'non_existent.pdf'
    try:
        if os.path.exists(pdf_path):
            with open(pdf_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                text = reader.pages[0].extract_text()
                print(text)
        else:
            print(f"File not found: {pdf_path}")
    except Exception as e:
        print(f"Error processing PDF: {e}")

This solution prevents crashes by wrapping the logic in a try-except block. It proactively checks if a file exists with os.path.exists() before attempting to open it. If the file is missing, the code prints a helpful message instead of halting. This is essential when processing files from user input or in batches, where you can't always guarantee a file path is correct. The broad except also catches any other processing errors, similar to techniques used in handling multiple exceptions in Python.
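An alternative sketch catches `FileNotFoundError` directly instead of checking for the file beforehand, which also covers the case where the file disappears between the check and the `open()` call. The function name `first_page_text` is an assumption, and returning `None` for a missing file is just one possible convention.

```python
import io


def first_page_text(pdf_path):
    """Return the first page's text, or None if the file is missing."""
    try:
        with open(pdf_path, "rb") as file:
            data = file.read()
    except FileNotFoundError:
        print(f"File not found: {pdf_path}")
        return None

    import PyPDF2  # third-party; only needed once the file has been read

    reader = PyPDF2.PdfReader(io.BytesIO(data))
    return reader.pages[0].extract_text()
```

Keeping only the `open()` call inside the `try` block means parsing errors are not mistaken for a missing file.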

Preventing `IndexError` when accessing PDF pages

An IndexError occurs when you request a page that doesn't exist, like asking for page 10 in a five-page PDF. This often happens when looping with a fixed range that exceeds the document's actual length. The code below shows how this error can crash your script.

    import PyPDF2

    with open('sample.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # This causes IndexError if PDF has fewer than 5 pages
        for i in range(5):
            text = reader.pages[i].extract_text()
            print(f"Page {i+1}: {text[:20]}...")

The for i in range(5) loop forces the script to access five pages, even if the PDF has fewer. This hardcoded range is what triggers the IndexError. The code below shows how to handle this correctly.

    import PyPDF2

    with open('sample.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page_count = len(reader.pages)
        for i in range(min(5, page_count)):
            text = reader.pages[i].extract_text()
            print(f"Page {i+1}: {text[:20]}...")

This solution dynamically adjusts the loop's range to prevent errors. It first gets the document's total pages using len(reader.pages). The loop then iterates up to min(5, page_count), ensuring it never tries to access a page index that doesn't exist. This makes your code robust, especially when batch processing PDFs of unknown or varying lengths, as it won't crash on shorter documents. These principles apply broadly to iterating through a list in Python safely.
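The same length check applies when you want a single page rather than a range. A minimal sketch, with the hypothetical helper name `page_text_or_none`, that works on any reader-like object exposing `.pages`:

```python
def page_text_or_none(reader, index):
    """Return the text of page `index`, or None if the page does not exist."""
    if 0 <= index < len(reader.pages):
        return reader.pages[index].extract_text()
    return None
```

Rejecting negative indexes is a deliberate choice here: Python would otherwise count them from the end of the document, which is rarely what a page lookup intends.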

Handling empty text extraction with fallback methods

Sometimes, extract_text() returns an empty string, even when a PDF visibly contains text. This often happens with scanned documents where the text is part of an image, causing downstream functions to fail. The following code demonstrates this problem in action.

    import PyPDF2

    with open('scanned_document.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = reader.pages[0].extract_text()
        # This might be empty for scanned documents
        # process_text() stands in for your own downstream function
        analysis_result = process_text(text)
        print(analysis_result)

The problem is that process_text() receives an empty string when extract_text() finds no machine-readable text, which can cause errors. The code below shows how you can build a more resilient workflow.

    import PyPDF2
    import fitz  # PyMuPDF

    with open('scanned_document.pdf', 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = reader.pages[0].extract_text()

        if not text.strip():
            # Try alternative library if PyPDF2 extraction is empty
            doc = fitz.open('scanned_document.pdf')
            text = doc[0].get_text()

        # process_text() stands in for your own downstream function
        analysis_result = process_text(text)
        print(analysis_result)

This solution builds a fallback system for more reliable extraction. It first attempts to pull text with PyPDF2 and checks the result with if not text.strip(). If the result is blank, it tries again using fitz (PyMuPDF), whose parser can often read text that PyPDF2 misses. This layered approach lets your script handle a wider variety of documents without manual intervention. Keep in mind that a truly scanned PDF has no text layer at all, so the fallback will also come back empty; those files need OCR.
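The fallback idea generalizes to any number of libraries. The sketch below is one possible shape, not a fixed recipe: `first_non_empty` is a hypothetical helper, and each extractor is assumed to be a function you write that takes a path and returns text using one library.

```python
def first_non_empty(pdf_path, extractors):
    """Try each extractor in order; return the first non-blank result."""
    for extract in extractors:
        name = getattr(extract, "__name__", repr(extract))
        try:
            text = extract(pdf_path) or ""
        except Exception as exc:  # one parser failing should not stop the chain
            print(f"{name} failed: {exc}")
            continue
        if text.strip():
            return text
    return ""  # nothing found; the file may need OCR
```

A call might look like `first_non_empty('scanned_document.pdf', [with_pypdf2, with_pymupdf])`, where both functions are small wrappers around the snippets shown earlier.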

Real-world applications

With your extraction code now resilient against common errors, you can build powerful tools for searching documents and analyzing data using vibe coding.

Searching for keywords in PDF documents

You can easily build a search function by looping through each page’s extracted text and checking for your keyword.

    import PyPDF2

    def search_in_pdf(pdf_path, keyword):
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            results = []
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text().lower()
                if keyword.lower() in text:
                    results.append(page_num + 1)
            return results

    matches = search_in_pdf('report.pdf', 'important')
    print(f"Found 'important' on pages: {matches}")

The search_in_pdf function automates finding a keyword within a document. It processes the PDF page by page, using enumerate to get both the content and the page number simultaneously.

To ensure the search is case-insensitive, it converts both the extracted text and the keyword to lowercase with the .lower() method before comparing them.
When the keyword is found using the in operator, the function stores the corresponding page number in a list, adding 1 to correct for the zero-based index.

Finally, it returns a list of all page numbers where the keyword appears.
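If page numbers alone are not enough, a short snippet of surrounding text can help. This sketch works on a list of already-extracted page strings, so it is independent of the PDF library; the function name `find_keyword_snippets` and the `context` parameter are assumptions for illustration.

```python
def find_keyword_snippets(page_texts, keyword, context=30):
    """Return (page_number, snippet) for the first hit on each page."""
    hits = []
    needle = keyword.lower()
    for page_number, text in enumerate(page_texts, start=1):
        position = text.lower().find(needle)
        if position == -1:
            continue  # keyword not on this page
        start = max(0, position - context)
        end = position + len(keyword) + context
        hits.append((page_number, text[start:end].strip()))
    return hits
```

The page strings could come from `[page.extract_text() for page in reader.pages]` inside the same `with` block used by `search_in_pdf`.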

Extracting financial data from PDF reports

You can combine text extraction with regular expressions to pinpoint and pull specific financial figures, like revenue and profit, from a report.

The extract_financial_figures function uses Python’s re module to parse the raw text for specific data. It first reads the entire document and consolidates the text from all pages into a single string, creating a searchable block of content.

The function then deploys re.search() with a regular expression like r'Total Revenue:?\s*\$?([\d,]+\.?\d*)' to find and capture the numbers associated with financial labels.
Once a match is found, match.group(1) isolates the captured number string, which might still contain characters like commas.
To prepare the data for calculations, the code cleans the string using .replace(',', '') and converts it to a float.

For more complex pattern matching scenarios, you might want to explore using regex in Python for additional text extraction techniques.

This approach effectively turns unstructured text into usable data points. By extracting the revenue and profit, you can immediately perform further analysis, such as calculating the profit margin, directly within your script.

    import PyPDF2
    import re

    def extract_financial_figures(pdf_path):
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ""
            for page in reader.pages:
                text += page.extract_text()

        revenue_match = re.search(r'Total Revenue:?\s*\$?([\d,]+\.?\d*)', text)
        profit_match = re.search(r'Net Profit:?\s*\$?([\d,]+\.?\d*)', text)

        revenue = float(revenue_match.group(1).replace(',', '')) if revenue_match else None
        profit = float(profit_match.group(1).replace(',', '')) if profit_match else None
        return revenue, profit

    revenue, profit = extract_financial_figures('annual_report.pdf')
    if revenue is None or profit is None:
        # Formatting None with :,.2f would raise a TypeError
        print("Could not find both figures in the report")
    else:
        print(f"Revenue: ${revenue:,.2f}, Profit: ${profit:,.2f}")
        profit_margin = (profit / revenue) * 100 if revenue else 0
        print(f"Profit Margin: {profit_margin:.2f}%")

The extract_financial_figures function first consolidates text from all pages into a single string, allowing it to search the entire document at once. It then uses regular expressions to pinpoint specific financial figures like revenue and profit. The code is designed to be robust and handle real-world inconsistencies.

It assigns None when a figure isn't found, so a missing label doesn't crash the function when it calls match.group().
The final profit margin calculation checks for a valid revenue figure first, avoiding a potential division-by-zero error.
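The two nearly identical `re.search()` calls can be folded into one helper that takes the label as a parameter. This is a sketch under a couple of assumptions: the name `find_amount` is hypothetical, the pattern is a slightly stricter variant that requires the number to start with a digit, and the sample text is invented for the demonstration.

```python
import re


def find_amount(text, label):
    """Return the number following `label` as a float, or None."""
    # re.escape() keeps any special characters in the label literal.
    pattern = re.escape(label) + r":?\s*\$?(\d[\d,]*(?:\.\d+)?)"
    match = re.search(pattern, text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Invented sample text, standing in for text extracted from a report.
sample = "Total Revenue: $1,250,000.50\nNet Profit: $310,000"
revenue = find_amount(sample, "Total Revenue")
profit = find_amount(sample, "Net Profit")
missing = find_amount(sample, "Operating Costs")
```

Requiring a leading digit avoids capturing a lone comma, which would otherwise fail when converted to a float.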

Get started with Replit

Turn your knowledge into a tool with Replit Agent. Describe what you want: "Build a tool to extract tables from PDFs into CSVs" or "Create an app that finds financial figures in reports."

The Agent writes the code, tests for errors, and deploys your app from a simple description. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
