How to parse HTML in Python
Learn how to parse HTML in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

You often need to parse HTML in Python to scrape websites, extract data, and analyze content. Python offers powerful libraries to simplify navigation and manipulation of complex HTML documents.
In this article, we'll cover key techniques and libraries to parse HTML. We'll share practical tips, show real-world applications, and offer advice to help you debug your code and master data extraction.
Using BeautifulSoup for basic HTML parsing
```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)
print(soup.p.text)
```

Output:

```
Hello World
This is a paragraph.
```
This snippet creates a BeautifulSoup object, which turns the raw HTML string into a navigable Python object. By specifying 'html.parser', you're telling the library to use Python's built-in parser to build this object tree.
With the HTML parsed, you can access the first instance of a tag directly, like an attribute. For example, soup.h1 finds the <h1> tag. The .text property then extracts only the human-readable content, leaving the HTML markup behind. Learn more about using Beautiful Soup in Python.
Basic HTML parsing techniques
While direct tag access is convenient, you can gain more control and speed by using different parsers and finding elements with specific CSS selectors.
Parsing HTML with Python's built-in html.parser
```python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")

    def handle_data(self, data):
        if data.strip():
            print(f"Data: {data.strip()}")

parser = MyHTMLParser()
parser.feed("<html><body><h1>Hello World</h1></body></html>")
```

Output:

```
Start tag: html
Start tag: body
Start tag: h1
Data: Hello World
```
Python's built-in html.parser module offers a low-level, event-driven approach. You create a custom class that inherits from HTMLParser and override methods to define how to handle different parts of the HTML document as it's read.
- The handle_starttag() method is triggered for every opening tag, like <html> or <h1>.
- The handle_data() method processes the text content found between tags.
You then pass your HTML string to the parser's feed() method, which processes the document sequentially and calls your custom methods along the way.
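To see those callbacks doing useful work, here is a small sketch (the LinkCollector class name is my own) that uses the attrs argument of handle_starttag() to gather every link in a document:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/home">Home</a> and <a href="/about">About</a></p>')
print(collector.links)  # ['/home', '/about']
```

Because the parser is event-driven, it never builds a full document tree, which keeps memory use low on large inputs.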
Using lxml for faster HTML parsing
```python
from lxml import html

content = "<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>"
tree = html.fromstring(content)
h1_text = tree.xpath('//h1/text()')
p_text = tree.xpath('//p/text()')
print(h1_text[0])
print(p_text[0])
```

Output:

```
Hello World
This is a paragraph.
```
The lxml library is a high-performance parser, often faster than other options because it's built on C libraries and is memory-efficient. It uses the html.fromstring() function to convert your HTML into an element tree. This tree represents the document's structure, making it easy to navigate. Instead of direct attribute access, lxml uses XPath expressions to query this tree.
- You use the xpath() method with a specific query, like '//h1/text()', to precisely target the content you need.
- This method is especially useful for extracting data from complex or deeply nested HTML, and it returns its results in a list.
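For instance, this sketch (the markup is invented for illustration) shows how a single XPath expression can reach through several levels of nesting at once:

```python
from lxml import html

# Hypothetical nested markup for illustration
content = '''
<div class="catalog">
  <ul>
    <li><span class="name">Widget</span><span class="price">$5</span></li>
    <li><span class="name">Gadget</span><span class="price">$9</span></li>
  </ul>
</div>
'''
tree = html.fromstring(content)
# // descends through any number of levels; the predicate filters by class
names = tree.xpath('//div[@class="catalog"]//span[@class="name"]/text()')
prices = tree.xpath('//div[@class="catalog"]//span[@class="price"]/text()')
print(list(zip(names, prices)))  # [('Widget', '$5'), ('Gadget', '$9')]
```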
Finding elements with CSS selectors in BeautifulSoup
```python
from bs4 import BeautifulSoup

html = '<div class="content"><a href="https://example.com">Link</a><p>Text</p></div>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.select_one('div.content a')
print(link['href'])
print(soup.select('div p')[0].text)
```

Output:

```
https://example.com
Text
```
BeautifulSoup allows you to use CSS selectors for precise element targeting, a common method in web scraping. This approach gives you more flexibility than accessing tags directly.
- Use the select_one() method to find the first element that matches your query. For example, soup.select_one('div.content a') locates the first <a> tag inside a <div> with the class content.
- Use the select() method to get a list of all matching elements. You can then access items from this list, like soup.select('div p')[0], to work with a specific element.
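As one more sketch (the markup here is made up), select() pairs well with attribute selectors when you want every link on a page in one pass:

```python
from bs4 import BeautifulSoup

html = '<div class="content"><a href="/a">A</a><a href="/b">B</a><a>no href</a></div>'
soup = BeautifulSoup(html, 'html.parser')
# The a[href] selector matches only anchors that actually carry an href attribute
links = [a['href'] for a in soup.select('a[href]')]
print(links)  # ['/a', '/b']
```

Filtering with a[href] avoids a KeyError on anchors that lack the attribute.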
Advanced HTML parsing techniques
Building on these fundamentals, you can handle more demanding parsing jobs with libraries that offer jQuery-like syntax or focus on high-performance scraping.
Using PyQuery for jQuery-like HTML parsing
```python
from pyquery import PyQuery as pq

html = '<div class="content"><a href="https://example.com">Link</a><p>Text</p></div>'
doc = pq(html)
link = doc('div.content a')
print(link.attr('href'))
print(doc('div p').text())
```

Output:

```
https://example.com
Text
```
If you're familiar with jQuery, you'll find PyQuery intuitive. It lets you parse HTML using a similar API. After creating a PyQuery object from your HTML, you can chain methods to find and extract data.
- Select elements using CSS selectors, like doc('div.content a').
- Grab an attribute's value with the .attr() method.
- Extract clean text content using the .text() method.
Navigating complex HTML structures with BeautifulSoup
```python
from bs4 import BeautifulSoup

html = '<ul><li>Item 1</li><li>Item 2<ul><li>Subitem</li></ul></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
# Search only the outer <ul>'s direct children, skipping the nested list
items = [item.text for item in soup.ul.find_all('li', recursive=False)]
parent = soup.find('li', string='Subitem').parent.parent
print(items)
print(parent.find_previous('li').text.strip())
```

Output:

```
['Item 1', 'Item 2Subitem']
Item 1
```
BeautifulSoup excels at navigating nested HTML, letting you control search depth. By setting recursive=False in find_all(), you can limit the search to only direct children of a tag. This is useful for isolating top-level elements and ignoring those nested deeper inside.
You can also traverse the document tree in other directions:
- The .parent attribute moves you from a child element up to the tag that contains it.
- Methods like find_previous() help you locate elements that appear earlier in the document than your current tag.
High-performance parsing with selectolax
```python
from selectolax.parser import HTMLParser

html = '<div><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></div>'
parser = HTMLParser(html)
paragraphs = parser.css('p')
for i, p in enumerate(paragraphs):
    print(f"Paragraph {i+1}: {p.text()}")
print(f"Total paragraphs: {len(paragraphs)}")
```

Output:

```
Paragraph 1: Paragraph 1
Paragraph 2: Paragraph 2
Paragraph 3: Paragraph 3
Total paragraphs: 3
```
When speed is your top priority, selectolax is an excellent choice. It’s a fast, lightweight parser built on the Modest and Lexbor HTML engines, which are written in C. You simply pass your HTML to the HTMLParser class to create a traversable object.
- The css() method lets you find all elements matching a CSS selector, returning them as a list of nodes.
- You can then loop through these results and use the text() method to extract the content from each node.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Instead of piecing together parsing techniques, you can describe the app you want to build, and Agent 4 will take it from idea to working product. For example, you could build:
- A price monitor that scrapes e-commerce sites and notifies you when a product's price drops.
- A content dashboard that pulls the latest headlines and links from multiple news sites into a single, clean view.
- A data extractor that scrapes contact information from a directory of websites and organizes it into a CSV file.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Parsing HTML isn't always straightforward; you'll likely run into a few common roadblocks while extracting data.
Handling missing elements in BeautifulSoup
When an element you're searching for doesn't exist, methods like find() and select_one() return None. If you then try to access an attribute or method on this result, your script will crash with an AttributeError. Always check if the returned object is None before you try to work with it to prevent unexpected errors. For robust scraping applications, consider handling multiple exceptions in Python.
Fixing encoding issues with HTML documents
HTML documents can use different character encodings, and if your parser guesses the wrong one, you'll see garbled text instead of the correct characters. While libraries often auto-detect encoding, it’s not foolproof. If you encounter strange characters, you may need to specify the correct encoding manually when you load the document to ensure the text is parsed correctly.
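For example, BeautifulSoup accepts a from_encoding argument when you hand it raw bytes. This minimal sketch decodes Latin-1 content that auto-detection could plausibly misread:

```python
from bs4 import BeautifulSoup

# Raw bytes in Latin-1; the accented characters are easy to misdecode
raw = '<html><body><p>café crème brûlée</p></body></html>'.encode('latin-1')

# Naming the encoding up front removes the guesswork
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')
print(soup.p.text)  # café crème brûlée
```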
Dealing with NavigableString vs Tag objects
In BeautifulSoup, you'll work with two main object types. A Tag object represents an HTML element, like <p>, and has methods for searching, like find(). A NavigableString object represents the text inside a tag. Trying to call a Tag method on a NavigableString will fail, so it's important to know which type of object you're handling.
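A reliable way to tell the two apart is an isinstance() check, sketched here on a tiny mixed-content snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<div>loose text <b>bold</b></div>', 'html.parser')
kinds = []
for child in soup.div.children:
    if isinstance(child, Tag):
        kinds.append(('tag', child.name))
    elif isinstance(child, NavigableString):
        # NavigableString subclasses str, so string methods work on it
        kinds.append(('string', str(child).strip()))
print(kinds)  # [('string', 'loose text'), ('tag', 'b')]
```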
Handling missing elements in BeautifulSoup
It's a common pitfall: your script crashes because an element you're searching for doesn't exist. When you try to directly access a missing tag, like soup.h1, Python raises an AttributeError because you can't get a property from nothing. The code below demonstrates this exact error.
```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')
title = soup.h1.text  # AttributeError: 'NoneType' object has no attribute 'text'
print(title)
```
The script fails because the HTML has no <h1> tag, so soup.h1 is None. Trying to access .text on this None value triggers the AttributeError. The following snippet demonstrates how to handle this situation correctly.
```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.find('h1')
title = title_tag.text if title_tag else "No title found"
print(title)
```
To fix this, use the find() method, which returns None if the tag is missing instead of causing an AttributeError. The corrected code first assigns the result to a variable, like title_tag. It then checks if this variable is not None before trying to access its .text property. It's a crucial defensive coding practice whenever you're scraping web pages with inconsistent structures, ensuring your script doesn't crash unexpectedly. Another approach is using try and except in Python to catch these errors.
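That try/except variant might look like this minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')

try:
    title = soup.h1.text  # soup.h1 is None here, so .text raises AttributeError
except AttributeError:
    title = "No title found"

print(title)  # No title found
```

The conditional check and the try/except block are equivalent here; the exception-based form is handy when a chain of lookups could fail at any step.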
Fixing encoding issues with HTML documents
When a parser misinterprets a document's character encoding, you'll see garbled text instead of the content you expect. Libraries often guess the encoding, but it's not always accurate. The following code demonstrates this issue when parsing a live web page.
```python
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)
```
The problem is that response.text forces requests to guess the encoding. If that guess is wrong, BeautifulSoup receives garbled text it can't fix. The following snippet shows how to handle this correctly.
```python
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text)
```
The solution is to use response.content instead of response.text. The response.text property makes requests guess the encoding, which can be unreliable. By passing the raw bytes from response.content directly to BeautifulSoup, you allow the parser to perform its own, more accurate encoding detection. This is a safer bet when scraping live websites, as it helps prevent the garbled text that results from incorrect encoding. For more details on HTTP requests, see our guide on calling APIs in Python.
Dealing with NavigableString vs Tag objects
When parsing with BeautifulSoup, you'll handle two main object types: Tag for elements and NavigableString for text. A common mistake is treating a NavigableString like a Tag, which causes an error. The following code demonstrates this exact issue.
```python
from bs4 import BeautifulSoup

html = '<div>Text <span>inside span</span> more text</div>'
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.div.contents:
    print(tag.name)
```
This code fails because soup.div.contents contains text nodes (NavigableString objects) which don't have a .name attribute like Tag objects do. This mix of types causes an error. The following code demonstrates the correct approach.
```python
from bs4 import BeautifulSoup

html = '<div>Text <span>inside span</span> more text</div>'
soup = BeautifulSoup(html, 'html.parser')
for content in soup.div.contents:
    if hasattr(content, 'name'):
        print(content.name)
    else:
        print("Text node:", content.strip())
```
The solution is to check each item’s type before accessing its attributes. The soup.div.contents property returns a list containing both Tag objects and text nodes. Using hasattr(content, 'name') lets you test if an item is a tag, which has a .name, or just a string. This defensive check prevents errors when iterating through elements that contain a mix of tags and loose text—a common situation when parsing real-world HTML.
Real-world applications
By applying these robust parsing techniques and vibe coding, you can build useful tools that scrape weather data or compare prices across different websites.
Scraping weather data with BeautifulSoup
You can use BeautifulSoup's select() method for web scraping to target and extract specific data, like a daily weather forecast, directly from an HTML structure.
```python
from bs4 import BeautifulSoup

weather_html = '<div class="forecast"><ul><li>Monday: 72°F, Sunny</li><li>Tuesday: 68°F, Cloudy</li></ul></div>'
soup = BeautifulSoup(weather_html, 'html.parser')
forecast_items = soup.select('div.forecast li')
for item in forecast_items:
    print(item.text)
```
This snippet shows how to extract multiple items from an HTML document. After parsing the HTML, the code uses the select() method with a CSS selector—'div.forecast li'—to pinpoint specific data.
- This selector targets all <li> elements located anywhere inside a <div> with the class forecast.
- The select() method returns a list of all matching tags.
Finally, a loop iterates through this list, and the .text property extracts and prints the clean text from each list item, giving you the weather for each day.
Building a simple price comparison tool with BeautifulSoup
You can also use BeautifulSoup to extract and compare data from multiple similar elements, like product prices from different online stores.
The code parses HTML containing the same product listed by two different stores to find the better price. It uses the select() method with a CSS attribute selector, div[class^="store"], to target all <div> elements whose class name starts with "store". This approach is flexible because it works even if the full class names are different, like store1 and store2.
The script then uses two list comprehensions to quickly extract the necessary information:
- It builds a list of store names by grabbing the text from each <h2> tag.
- It creates a parallel list of prices by finding the text in each <p> tag, removing the dollar sign with replace(), and converting the result to a float for numerical comparison.
With the data organized, the code finds the lowest price using min(). It then uses the index of that minimum price to find the name of the corresponding store, effectively identifying and printing the best deal. You could then export this data using techniques for reading CSV files in Python.
```python
from bs4 import BeautifulSoup

html = '''<div class="store1"><h2>Store A</h2><div class="product"><h3>Headphones</h3><p>$89.99</p></div></div>
<div class="store2"><h2>Store B</h2><div class="product"><h3>Headphones</h3><p>$79.99</p></div></div>'''
soup = BeautifulSoup(html, 'html.parser')
stores = [store.h2.text for store in soup.select('div[class^="store"]')]
prices = [float(div.p.text.replace('$', '')) for div in soup.select('div[class^="store"] .product')]
best_deal = stores[prices.index(min(prices))]
print(f"Best price for Headphones: ${min(prices)} at {best_deal}")
```

Output:

```
Best price for Headphones: $79.99 at Store B
```
This code finds the best price by treating related data as parallel sequences. It systematically extracts information from the HTML to build two corresponding lists.
- First, it creates a list of all store names in the order they appear.
- Second, it builds a list of prices, cleaning the text to get pure numbers for comparison.
Because both lists share the same order, the script can find the lowest price and use its position to look up the correct store name. It’s an efficient way to identify the best deal.
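As a follow-up, you could write the results out with the standard library's csv module. This sketch assumes the same store and price data as above and a hypothetical prices.csv output file:

```python
import csv

# Hypothetical data matching the stores/prices lists from the comparison above
stores = ['Store A', 'Store B']
prices = [89.99, 79.99]

# newline='' prevents blank rows on Windows when writing CSV
with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['store', 'price'])
    writer.writerows(zip(stores, prices))
```

The same two-column layout is easy to reload later with csv.reader() or pandas for further analysis.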
Get started with Replit
Now, turn these techniques into a real tool. Tell Replit Agent: “Scrape the top posts from a news site” or “Extract all links from a URL and list them.”
The Agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.