How to parse HTML in Python
Learn how to parse HTML in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

You often need to parse HTML in Python to scrape websites, extract data, and analyze content. Python offers powerful libraries to simplify navigation and manipulation of complex HTML documents.
In this article, we'll cover key techniques and libraries to parse HTML. We'll share practical tips, show real-world applications, and offer advice to help you debug your code and master data extraction.
Using BeautifulSoup for basic HTML parsing
```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)
print(soup.p.text)
```

Output:

```
Hello World
This is a paragraph.
```
This snippet creates a BeautifulSoup object, which turns the raw HTML string into a navigable Python object. By specifying 'html.parser', you're telling the library to use Python's built-in parser to build this object tree.
With the HTML parsed, you can access the first instance of a tag directly, like an attribute. For example, soup.h1 finds the <h1> tag. The .text property then extracts only the human-readable content, leaving the HTML markup behind. Learn more about using Beautiful Soup in Python.
Basic HTML parsing techniques
While direct tag access is convenient, you can gain more control and speed by using different parsers and finding elements with specific CSS selectors.
Parsing HTML with Python's built-in html.parser
```python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")

    def handle_data(self, data):
        if data.strip():
            print(f"Data: {data.strip()}")

parser = MyHTMLParser()
parser.feed("<html><body><h1>Hello World</h1></body></html>")
```

Output:

```
Start tag: html
Start tag: body
Start tag: h1
Data: Hello World
```
Python's built-in html.parser module offers a low-level, event-driven approach. You create a custom class that inherits from HTMLParser and override methods to define how to handle different parts of the HTML document as it's read.
- The handle_starttag() method is triggered for every opening tag, like <html> or <h1>.
- The handle_data() method processes the text content found between tags.
You then pass your HTML string to the parser's feed() method, which processes the document sequentially and calls your custom methods along the way.
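To see those callbacks doing useful work, here is a small sketch (the LinkCollector class name is my own) that uses the attrs argument of handle_starttag() to gather every link in a document:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/home">Home</a> and <a href="/about">About</a></p>')
print(collector.links)  # ['/home', '/about']
```

Because the parser is event-driven, it never builds a full document tree, which keeps memory use low on large inputs.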
Using lxml for faster HTML parsing
```python
from lxml import html

content = "<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>"
tree = html.fromstring(content)
h1_text = tree.xpath('//h1/text()')
p_text = tree.xpath('//p/text()')
print(h1_text[0])
print(p_text[0])
```

Output:

```
Hello World
This is a paragraph.
```
The lxml library is a high-performance parser, often faster than other options because it's built on C libraries and is memory-efficient. It uses the html.fromstring() function to convert your HTML into an element tree. This tree represents the document's structure, making it easy to navigate. Instead of direct attribute access, lxml uses XPath expressions to query this tree.
- You use the xpath() method with a specific query, like '//h1/text()', to precisely target the content you need.
- This method is especially useful for extracting data from complex or deeply nested HTML, and it returns its results in a list.
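For instance, this sketch (the markup is invented for illustration) shows how a single XPath expression can reach through several levels of nesting at once:

```python
from lxml import html

# Hypothetical nested markup for illustration
content = '''
<div class="catalog">
  <ul>
    <li><span class="name">Widget</span><span class="price">$5</span></li>
    <li><span class="name">Gadget</span><span class="price">$9</span></li>
  </ul>
</div>
'''
tree = html.fromstring(content)
# // descends through any number of levels; the predicate filters by class
names = tree.xpath('//div[@class="catalog"]//span[@class="name"]/text()')
prices = tree.xpath('//div[@class="catalog"]//span[@class="price"]/text()')
print(list(zip(names, prices)))  # [('Widget', '$5'), ('Gadget', '$9')]
```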
Finding elements with CSS selectors in BeautifulSoup
```python
from bs4 import BeautifulSoup

html = '<div class="content"><a href="https://example.com">Link</a><p>Text</p></div>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.select_one('div.content a')
print(link['href'])
print(soup.select('div p')[0].text)
```

Output:

```
https://example.com
Text
```
BeautifulSoup allows you to use CSS selectors for precise element targeting, a common method in web scraping. This approach gives you more flexibility than accessing tags directly.
- Use the select_one() method to find the first element that matches your query. For example, soup.select_one('div.content a') locates the first <a> tag inside a <div> with the class content.
- Use the select() method to get a list of all matching elements. You can then access items from this list, like soup.select('div p')[0], to work with a specific element.
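As one more sketch (the markup here is made up), select() pairs well with attribute selectors when you want every link on a page in one pass:

```python
from bs4 import BeautifulSoup

html = '<div class="content"><a href="/a">A</a><a href="/b">B</a><a>no href</a></div>'
soup = BeautifulSoup(html, 'html.parser')
# The a[href] selector matches only anchors that actually carry an href attribute
links = [a['href'] for a in soup.select('a[href]')]
print(links)  # ['/a', '/b']
```

Filtering with a[href] avoids a KeyError on anchors that lack the attribute.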
Advanced HTML parsing techniques
Building on these fundamentals, you can handle more demanding parsing jobs with libraries that offer jQuery-like syntax or focus on high-performance scraping.
Using PyQuery for jQuery-like HTML parsing
```python
from pyquery import PyQuery as pq

html = '<div class="content"><a href="https://example.com">Link</a><p>Text</p></div>'
doc = pq(html)
link = doc('div.content a')
print(link.attr('href'))
print(doc('div p').text())
```

Output:

```
https://example.com
Text
```
If you're familiar with jQuery, you'll find PyQuery intuitive. It lets you parse HTML using a similar API. After creating a PyQuery object from your HTML, you can chain methods to find and extract data.
- Select elements using CSS selectors, like doc('div.content a').
- Grab an attribute's value with the .attr() method.
- Extract clean text content using the .text() method.
Navigating complex HTML structures with BeautifulSoup
```python
from bs4 import BeautifulSoup

html = '<ul><li>Item 1</li><li>Item 2<ul><li>Subitem</li></ul></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
# Search only the outer <ul>'s direct children, skipping the nested list
items = [item.text for item in soup.ul.find_all('li', recursive=False)]
parent = soup.find('li', string='Subitem').parent.parent
print(items)
print(parent.find_previous('li').text.strip())
```

Output:

```
['Item 1', 'Item 2Subitem']
Item 1
```
BeautifulSoup excels at navigating nested HTML, letting you control search depth. By setting recursive=False in find_all(), you can limit the search to only direct children of a tag. This is useful for isolating top-level elements and ignoring those nested deeper inside.
You can also traverse the document tree in other directions:
- The .parent attribute moves you from a child element up to the tag that contains it.
- Methods like find_previous() help you locate elements that appear earlier in the document than your current tag.
High-performance parsing with selectolax
```python
from selectolax.parser import HTMLParser

html = '<div><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></div>'
parser = HTMLParser(html)
paragraphs = parser.css('p')
for i, p in enumerate(paragraphs):
    print(f"Paragraph {i+1}: {p.text()}")
print(f"Total paragraphs: {len(paragraphs)}")
```

Output:

```
Paragraph 1: Paragraph 1
Paragraph 2: Paragraph 2
Paragraph 3: Paragraph 3
Total paragraphs: 3
```
When speed is your top priority, selectolax is an excellent choice. It’s a fast, lightweight parser built on the Modest and Lexbor HTML engines, which are written in C. You simply pass your HTML to the HTMLParser class to create a traversable object.
- The css() method lets you find all elements matching a CSS selector, returning them as a list of nodes.
- You can then loop through these results and use the text() method to extract the content from each node.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Instead of piecing together parsing techniques, you can describe the app you want to build, and Agent 4 will take it from idea to working product. For example, you could build:
- A price monitor that scrapes e-commerce sites and notifies you when a product's price drops.
- A content dashboard that pulls the latest headlines and links from multiple news sites into a single, clean view.
- A data extractor that scrapes contact information from a directory of websites and organizes it into a CSV file.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Parsing HTML isn't always straightforward; you'll likely run into a few common roadblocks while extracting data.
Handling missing elements in BeautifulSoup
When an element you're searching for doesn't exist, methods like find() and select_one() return None. If you then try to access an attribute or method on this result, your script will crash with an AttributeError. Always check if the returned object is None before you try to work with it to prevent unexpected errors. For robust scraping applications, consider handling multiple exceptions in Python.
Fixing encoding issues with HTML documents
HTML documents can use different character encodings, and if your parser guesses the wrong one, you'll see garbled text instead of the correct characters. While libraries often auto-detect encoding, it’s not foolproof. If you encounter strange characters, you may need to specify the correct encoding manually when you load the document to ensure the text is parsed correctly.
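For example, BeautifulSoup accepts a from_encoding argument when you hand it raw bytes. This minimal sketch decodes Latin-1 content that auto-detection could plausibly misread:

```python
from bs4 import BeautifulSoup

# Raw bytes in Latin-1; the accented characters are easy to misdecode
raw = '<html><body><p>café crème brûlée</p></body></html>'.encode('latin-1')

# Naming the encoding up front removes the guesswork
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')
print(soup.p.text)  # café crème brûlée
```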
Dealing with NavigableString vs Tag objects
In BeautifulSoup, you'll work with two main object types. A Tag object represents an HTML element, like <p>, and has methods for searching, like find(). A NavigableString object represents the text inside a tag. Trying to call a Tag method on a NavigableString will fail, so it's important to know which type of object you're handling.
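A reliable way to tell the two apart is an isinstance() check, sketched here on a tiny mixed-content snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<div>loose text <b>bold</b></div>', 'html.parser')
kinds = []
for child in soup.div.children:
    if isinstance(child, Tag):
        kinds.append(('tag', child.name))
    elif isinstance(child, NavigableString):
        # NavigableString subclasses str, so string methods work on it
        kinds.append(('string', str(child).strip()))
print(kinds)  # [('string', 'loose text'), ('tag', 'b')]
```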
Handling missing elements in BeautifulSoup
It's a common pitfall: your script crashes because an element you're searching for doesn't exist. When you try to directly access a missing tag, like soup.h1, Python raises an AttributeError because you can't get a property from nothing. The code below demonstrates this exact error.
```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')
title = soup.h1.text  # AttributeError: 'NoneType' object has no attribute 'text'
print(title)
```
The script fails because the HTML has no <h1> tag, so soup.h1 is None. Trying to access .text on this None value triggers the AttributeError. The following snippet demonstrates how to handle this situation correctly.
```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')
title_tag = soup.find('h1')
title = title_tag.text if title_tag else "No title found"
print(title)
```
To fix this, use the find() method, which returns None if the tag is missing instead of causing an AttributeError. The corrected code first assigns the result to a variable, like title_tag. It then checks if this variable is not None before trying to access its .text property. It's a crucial defensive coding practice whenever you're scraping web pages with inconsistent structures, ensuring your script doesn't crash unexpectedly. Another approach is using try and except in Python to catch these errors.
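That try/except variant might look like this minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<div><p>Some text</p></div>"
soup = BeautifulSoup(html, 'html.parser')

try:
    title = soup.h1.text  # soup.h1 is None here, so .text raises AttributeError
except AttributeError:
    title = "No title found"

print(title)  # No title found
```

The conditional check and the try/except block are equivalent here; the exception-based form is handy when a chain of lookups could fail at any step.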
Fixing encoding issues with HTML documents
When a parser misinterprets a document's character encoding, you'll see garbled text instead of the content you expect. Libraries often guess the encoding, but it's not always accurate. The following code demonstrates this issue when parsing a live web page.
```python
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)
```
The problem is that response.text forces requests to guess the encoding. If that guess is wrong, BeautifulSoup receives garbled text it can't fix. The following snippet shows how to handle this correctly.
```python
from bs4 import BeautifulSoup
import requests

response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title.text)
```
The solution is to use response.content instead of response.text. The response.text property makes requests guess the encoding, which can be unreliable. By passing the raw bytes from response.content directly to BeautifulSoup, you allow the parser to perform its own, more accurate encoding detection. This is a safer bet when scraping live websites, as it helps prevent the garbled text that results from incorrect encoding. For more details on HTTP requests, see our guide on calling APIs in Python.
Dealing with NavigableString vs Tag objects
When parsing with BeautifulSoup, you'll handle two main object types: Tag for elements and NavigableString for text. A common mistake is treating a NavigableString like a Tag, which causes an error. The following code demonstrates this exact issue.
```python
from bs4 import BeautifulSoup

html = '<div>Text <span>inside span</span> more text</div>'
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.div.contents:
    print(tag.name)
```
This code fails because soup.div.contents contains text nodes (NavigableString objects) which don't have a .name attribute like Tag objects do. This mix of types causes an error. The following code demonstrates the correct approach.
```python
from bs4 import BeautifulSoup

html = '<div>Text <span>inside span</span> more text</div>'
soup = BeautifulSoup(html, 'html.parser')
for content in soup.div.contents:
    if hasattr(content, 'name'):
        print(content.name)
    else:
        print("Text node:", content.strip())
```
The solution is to check each item’s type before accessing its attributes. The soup.div.contents property returns a list containing both Tag objects and text nodes. Using hasattr(content, 'name') lets you test if an item is a tag, which has a .name, or just a string. This defensive check prevents errors when iterating through elements that contain a mix of tags and loose text—a common situation when parsing real-world HTML.
Real-world applications
By applying these robust parsing techniques and vibe coding, you can build useful tools that scrape weather data or compare prices across different websites.
Scraping weather data with BeautifulSoup
You can use BeautifulSoup's select() method for web scraping to target and extract specific data, like a daily weather forecast, directly from an HTML structure.
```python
from bs4 import BeautifulSoup

weather_html = '<div class="forecast"><ul><li>Monday: 72°F, Sunny</li><li>Tuesday: 68°F, Cloudy</li></ul></div>'
soup = BeautifulSoup(weather_html, 'html.parser')
forecast_items = soup.select('div.forecast li')
for item in forecast_items:
    print(item.text)
```
This snippet shows how to extract multiple items from an HTML document. After parsing the HTML, the code uses the select() method with a CSS selector—'div.forecast li'—to pinpoint specific data.
- This selector targets all <li> elements located anywhere inside a <div> with the class forecast.
- The select() method returns a list of all matching tags.
Finally, a loop iterates through this list, and the .text property extracts and prints the clean text from each list item, giving you the weather for each day.
Building a simple price comparison tool with BeautifulSoup
You can also use BeautifulSoup to extract and compare data from multiple similar elements, like product prices from different online stores.
The code parses HTML containing the same product listed by two different stores to find the better price. It uses the select() method with a CSS attribute selector, div[class^="store"], to target all <div> elements whose class name starts with "store". This approach is flexible because it works even if the full class names are different, like store1 and store2.
The script then uses two list comprehensions to quickly extract the necessary information:
- It builds a list of store names by grabbing the text from each <h2> tag.
- It creates a parallel list of prices by finding the text in each <p> tag, removing the dollar sign with replace(), and converting the result to a float for numerical comparison.
With the data organized, the code finds the lowest price using min(). It then uses the index of that minimum price to find the name of the corresponding store, effectively identifying and printing the best deal. You could then export this data using techniques for reading CSV files in Python.
```python
from bs4 import BeautifulSoup

html = '''<div class="store1"><h2>Store A</h2><div class="product"><h3>Headphones</h3><p>$89.99</p></div></div>
<div class="store2"><h2>Store B</h2><div class="product"><h3>Headphones</h3><p>$79.99</p></div></div>'''
soup = BeautifulSoup(html, 'html.parser')
stores = [store.h2.text for store in soup.select('div[class^="store"]')]
prices = [float(div.p.text.replace('$', '')) for div in soup.select('div[class^="store"] .product')]
best_deal = stores[prices.index(min(prices))]
print(f"Best price for Headphones: ${min(prices)} at {best_deal}")
```

Output:

```
Best price for Headphones: $79.99 at Store B
```
This code finds the best price by treating related data as parallel sequences. It systematically extracts information from the HTML to build two corresponding lists.
- First, it creates a list of all store names in the order they appear.
- Second, it builds a list of prices, cleaning the text to get pure numbers for comparison.
Because both lists share the same order, the script can find the lowest price and use its position to look up the correct store name. It’s an efficient way to identify the best deal.
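As a follow-up, you could write the results out with the standard library's csv module. This sketch assumes the same store and price data as above and a hypothetical prices.csv output file:

```python
import csv

# Hypothetical data matching the stores/prices lists from the comparison above
stores = ['Store A', 'Store B']
prices = [89.99, 79.99]

# newline='' prevents blank rows on Windows when writing CSV
with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['store', 'price'])
    writer.writerows(zip(stores, prices))
```

The same two-column layout is easy to reload later with csv.reader() or pandas for further analysis.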
Get started with Replit
Now, turn these techniques into a real tool. Tell Replit Agent: “Scrape the top posts from a news site” or “Extract all links from a URL and list them.”
The Agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.