How to read an XML file in Python
Learn how to read XML files in Python with our guide. We cover various methods, tips, real-world applications, and common error debugging.

XML files are a standard for data exchange and configuration. To work with them, Python provides robust libraries that help you parse and extract information efficiently for various software applications.
In this article, you'll explore several techniques to read XML files. You'll find practical tips, see real-world applications, and get advice to debug common issues you might face.
Basic parsing with ElementTree
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    print(f"{child.tag}: {child.text}")
--OUTPUT--
item: First item
item: Second item
Python's built-in xml.etree.ElementTree module offers a direct way to handle XML. The ET.parse() function reads your file and builds a tree of element objects in memory, which you can then navigate in code, much as you would step through rows after reading a CSV file.
By calling tree.getroot(), you access the document's top-level element. The code then iterates through each direct child of this root. For each child, it prints the element's tag with child.tag and its text content with child.text. This approach is great for quickly extracting simple data from the first level of your XML structure.
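The article does not show the contents of sample.xml, so the following is a minimal sketch that assumes a root element holding two item elements; the id attribute is hypothetical and added only for illustration. It parses the same structure from a string with ET.fromstring(), which returns the root element directly, and reads each child's tag, text, and attributes.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for sample.xml; the id attribute is hypothetical.
SAMPLE_XML = """<root>
    <item id="1">First item</item>
    <item id="2">Second item</item>
</root>"""

root = ET.fromstring(SAMPLE_XML)  # returns the root element, not a tree

# Collect (tag, text, attributes) for each direct child of the root.
rows = [(child.tag, child.text, child.attrib) for child in root]
for tag, text, attrib in rows:
    print(f"{tag}: {text} {attrib}")
```

Parsing from a string like this is a convenient way to try the API before pointing it at a real file.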
Common XML parsing approaches
While ElementTree is a solid starting point, Python’s standard library and third-party packages offer more specialized tools for different XML parsing scenarios.
Using xml.dom.minidom for DOM parsing
from xml.dom import minidom
xmldoc = minidom.parse('sample.xml')
items = xmldoc.getElementsByTagName('item')
for item in items:
    print(item.firstChild.data)
--OUTPUT--
First item
Second item
The xml.dom.minidom module implements the Document Object Model (DOM) standard. This approach loads the entire XML file into memory, letting you navigate the full tree structure.
- The getElementsByTagName() method is particularly useful. It finds all elements with a specific tag, like 'item', no matter how deeply they're nested.
- Once you have an element, you can access its contents by navigating its child nodes, such as with item.firstChild.data to get the text.
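One caveat with item.firstChild.data is that firstChild is None for an empty element. The sketch below, which assumes a small inline document and a hypothetical helper named node_text, shows getElementsByTagName() reaching a nested item and a way to read text that tolerates empty elements.

```python
from xml.dom import minidom

# Assumed sample: one nested item, one top-level item, one empty item.
SAMPLE_XML = (
    "<root>"
    "<group><item>First item</item></group>"
    "<item>Second item</item>"
    "<item/>"
    "</root>"
)

doc = minidom.parseString(SAMPLE_XML)


def node_text(element):
    """Join the data of every direct text node of the element (hypothetical helper)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# getElementsByTagName() returns matches at any depth, in document order.
texts = [node_text(item) for item in doc.getElementsByTagName("item")]
print(texts)
```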
Using xml.sax for event-based parsing
import xml.sax
class XMLHandler(xml.sax.ContentHandler):
    def startElement(self, tag, attrs):
        self.content = ""  # start a fresh buffer for each element
    def characters(self, content):
        self.content += content  # text can arrive in several chunks
    def endElement(self, tag):
        if tag == "item":
            print(self.content)
parser = xml.sax.make_parser()
parser.setContentHandler(XMLHandler())
parser.parse('sample.xml')
--OUTPUT--
First item
Second item
The xml.sax module offers a memory-efficient, event-driven approach. It reads the file sequentially, triggering events as it encounters elements rather than loading the entire document at once. This makes it ideal for parsing very large files, similar to how iterating through lists processes elements one by one.
- You define a custom ContentHandler to tell the parser what to do for each event.
- In this code, the characters() method captures text content. The endElement() method then acts when it finds a closing </item> tag, printing the content it just saved.
Reading XML with the lxml library
from lxml import etree
tree = etree.parse('sample.xml')
root = tree.getroot()
for element in root.findall('.//item'):
    print(element.text)
--OUTPUT--
First item
Second item
The lxml library is a popular third-party alternative that combines the speed of C libraries with a Pythonic interface. It’s often faster and more feature-rich than the standard library's ElementTree, making it a go-to for performance-critical tasks.
- The code uses findall('.//item') to locate elements. This is an XPath expression where .// tells the parser to find all item elements anywhere in the tree, not just direct children of the root.
This powerful feature makes lxml excellent for navigating complex XML structures with deeply nested data.
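The difference between a bare tag and the .// prefix is easy to see on a small nested document. The sketch below uses the standard library's ElementTree rather than lxml, since it accepts the same './/item' path; the inline XML is an assumed example.

```python
import xml.etree.ElementTree as ET

# Assumed sample with one direct child and one nested item.
root = ET.fromstring(
    "<root>"
    "<item>First item</item>"
    "<group><item>Nested item</item></group>"
    "</root>"
)

direct = [e.text for e in root.findall("item")]       # direct children only
anywhere = [e.text for e in root.findall(".//item")]  # any depth below root
print(direct)
print(anywhere)
```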
Advanced XML processing techniques
Beyond the basics, Python offers powerful tools for tackling more demanding XML challenges, such as processing huge files, running complex queries, and ensuring data integrity. AI coding with Python can help automate these complex parsing workflows.
Working with large XML files using iterparse
import xml.etree.ElementTree as ET
context = ET.iterparse('large_sample.xml', events=('end',))
for event, elem in context:
    if elem.tag == 'item':
        print(elem.text)
        elem.clear()  # Free memory
--OUTPUT--
First item
Second item
...
When you're working with massive XML files, loading the entire document into memory isn't practical. The iterparse() function offers a memory-efficient solution by parsing the file incrementally, letting you process elements as they're read without consuming all your RAM.
- By setting events=('end',), you instruct the parser to act on elements only after they are fully read, giving you access to their content.
- The key to managing memory is the elem.clear() call. It discards the element after you're done, which keeps memory usage low and prevents your application from slowing down or crashing.
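To experiment with this pattern without a large file on disk, the sketch below feeds iterparse() an in-memory byte stream, which it accepts in place of a filename. The five-item document is an assumed stand-in for large_sample.xml.

```python
import io
import xml.etree.ElementTree as ET

# Assumed stand-in for large_sample.xml, built in memory.
data = io.BytesIO(
    b"<root>"
    + b"".join(b"<item>Item %d</item>" % i for i in range(1, 6))
    + b"</root>"
)

count = 0
last_text = None
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "item":
        count += 1
        last_text = elem.text  # read what you need before clearing
        elem.clear()           # drop the element's text, attributes, and children

print(count, last_text)
```

Note that clear() empties each element but the emptied elements remain attached to the root, so memory still grows slowly with the number of items.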
Using XPath queries to navigate XML
from lxml import etree
tree = etree.parse('sample.xml')
# Find all items with specific text
items = tree.xpath('//item[contains(text(), "Second")]')
for item in items:
    print(f"Found matching item: {item.text}")
--OUTPUT--
Found matching item: Second item
XPath is a powerful language for selecting specific nodes from an XML document. With lxml, you can use the xpath() method to run these queries, giving you far more flexibility than just searching by tag name.
- The expression '//item' tells the parser to find all item elements, no matter where they are in the document tree.
- The predicate [contains(text(), "Second")] acts as a filter, selecting only those elements whose text content includes the word "Second".
This approach is incredibly efficient for pinpointing exact data within complex or deeply nested XML structures.
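When lxml is not available, the standard library's ElementTree supports only a subset of XPath: attribute predicates work, but functions such as contains() do not. The sketch below shows one possible fallback, assuming an inline document with a hypothetical category attribute: filter by attribute in the path and do the substring match in Python.

```python
import xml.etree.ElementTree as ET

# Assumed sample; the category attribute is hypothetical.
root = ET.fromstring(
    "<root>"
    '<item category="a">First item</item>'
    '<item category="b">Second item</item>'
    "</root>"
)

# Attribute predicate, part of ElementTree's supported XPath subset.
by_attr = [e.text for e in root.findall(".//item[@category='b']")]

# Substring match done in Python, since contains() is not available here.
by_text = [e.text for e in root.iter("item") if e.text and "Second" in e.text]
print(by_attr, by_text)
```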
Validating XML against a schema
from lxml import etree
schema_root = etree.parse('schema.xsd')
schema = etree.XMLSchema(schema_root)
parser = etree.XMLParser(schema=schema)
try:
    tree = etree.parse('sample.xml', parser)
    print("XML is valid according to the schema")
except etree.XMLSyntaxError as e:
    print(f"XML validation error: {e}")
--OUTPUT--
XML is valid according to the schema
Validating your XML against a schema ensures its structure and content are correct, which is vital for data integrity. A schema, often an XSD file, acts as a blueprint for your XML document. This process uses lxml to check if your file follows the rules you've defined.
- First, you load the schema and create a validator object with etree.XMLSchema().
- You then create a special parser using etree.XMLParser(schema=schema) that incorporates these rules.
- When etree.parse() runs with this parser, it automatically validates the XML. If the file doesn't conform, it raises an etree.XMLSyntaxError.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Instead of just learning techniques, you can use Agent 4 to build a complete application from a simple description, handling everything from code and databases to APIs and deployment.
Instead of piecing together XML parsing methods, you can describe the app you want to build and let the Agent take it from concept to a working product:
- A configuration utility that parses an XML settings file and extracts specific values, like database credentials or API keys, based on user queries.
- A data migration tool that reads a large XML export from an old system, using an incremental parser like iterparse, and converts the records into a new format.
- A content dashboard that pulls product information from an XML feed, validates it against a schema, and displays the items and their attributes in a clean interface.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with the right tools, you might encounter issues like missing elements, namespace conflicts, or encoding errors when parsing XML files.
- Handling missing elements: When a method like find() can't locate an element, it returns None. If you try to use this result, your script will raise an AttributeError. To avoid this, always check if the result is None before you try to access its attributes or text, ensuring your code handles missing data gracefully.
- Resolving namespace issues: Namespaces can make element selection tricky because they change an element's official name. If a simple tag search fails, it’s likely because of a namespace. You'll need to define a dictionary that maps a prefix to the namespace URI and then use that prefix in your search queries to target elements correctly.
- Fixing encoding problems: A UnicodeDecodeError usually signals a mismatch between the file's encoding and the one your parser is using. Most XML files declare their encoding in the first line. If your parser doesn't pick it up automatically, you may need to specify it when opening the file to prevent errors.
Handling missing elements with find() method
When the find() method fails to locate an element, it returns None. Attempting to access an attribute like .text on this result will trigger an AttributeError, crashing your script. This is a common issue with inconsistent XML data. The code below shows this error in action.
import xml.etree.ElementTree as ET
tree = ET.parse('users.xml')
root = tree.getroot()
for user in root.findall('user'):
    name = user.find('name').text
    email = user.find('email').text
    print(f"User: {name}, Email: {email}")
The script crashes if a user element is missing an email tag, as user.find('email') returns None. Attempting to access .text on this result causes the error. The following code demonstrates a safe way to proceed.
import xml.etree.ElementTree as ET
tree = ET.parse('users.xml')
root = tree.getroot()
for user in root.findall('user'):
    name = user.find('name')
    email = user.find('email')
    name_text = name.text if name is not None else "Unknown"
    email_text = email.text if email is not None else "No email"
    print(f"User: {name_text}, Email: {email_text}")
The corrected code prevents crashes by first checking if the element returned by find() exists. It uses a conditional expression—name.text if name is not None else "Unknown"—to safely access the .text attribute only when the element isn't None.
If the element is missing, it assigns a default string instead. This defensive check is crucial when parsing XML from sources where data consistency isn't guaranteed, preventing unexpected errors in your application.
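ElementTree also provides findtext(), which folds the None check and the default into a single call. The sketch below is one way to use it; the inline XML is a hypothetical stand-in for users.xml in which the second user has no email element.

```python
import xml.etree.ElementTree as ET

# Hypothetical users.xml content; the second user has no <email>.
root = ET.fromstring(
    "<users>"
    "<user><name>Ada</name><email>ada@example.com</email></user>"
    "<user><name>Grace</name></user>"
    "</users>"
)

# findtext() returns the default when the child element is missing.
users = [
    (
        user.findtext("name", default="Unknown"),
        user.findtext("email", default="No email"),
    )
    for user in root.findall("user")
]
print(users)
```

The default applies only when the element is absent; an element that exists but is empty yields an empty string.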
Resolving namespace issues in XML parsing
Namespaces in XML prevent naming conflicts but can complicate parsing. When an element belongs to a namespace, a simple search like findall('element') won't work because the tag’s name is technically different. The following code demonstrates this common pitfall.
import xml.etree.ElementTree as ET
tree = ET.parse('namespace_data.xml')
root = tree.getroot()
elements = root.findall('element')
for elem in elements:
    print(elem.text)
This script produces no output because findall('element') returns an empty list. The parser doesn't recognize the elements without their namespace. The following code shows how to correctly target them.
import xml.etree.ElementTree as ET
tree = ET.parse('namespace_data.xml')
root = tree.getroot()
ns = {'ns': 'http://example.org/namespace'}
elements = root.findall('.//{http://example.org/namespace}element')
# Or alternatively:
# elements = root.findall('.//ns:element', ns)
for elem in elements:
    print(elem.text)
The corrected code works because it tells the parser how to handle the namespace. You can either include the full namespace URI directly in your query, like '{http://example.org/namespace}element', or define a namespace map. Using a map with a prefix, such as 'ns:element', often makes your queries cleaner. The map is an ordinary Python dictionary that pairs each prefix you choose with its namespace URI. This issue is common when you're parsing XML from web services or industry-standard formats, so always check for an xmlns declaration on the root element at the top of the file.
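The sketch below makes the renaming visible. It assumes namespace_data.xml declares http://example.org/namespace as its default namespace, which the article does not show, and prints the root's tag to reveal the URI that ElementTree stores in braces.

```python
import xml.etree.ElementTree as ET

# Assumed content for namespace_data.xml, using a default namespace.
root = ET.fromstring(
    '<data xmlns="http://example.org/namespace">'
    "<element>One</element><element>Two</element>"
    "</data>"
)

ns = {"ns": "http://example.org/namespace"}

print(root.tag)  # the parsed tag carries the URI in braces

plain = root.findall(".//element")  # no namespace given: no matches
mapped = [e.text for e in root.findall(".//ns:element", ns)]
print(plain, mapped)
```

The prefix in the dictionary is local to your code, so it does not need to match any prefix used inside the file.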
Fixing encoding problems when parsing XML files
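A UnicodeDecodeError is a classic sign that a file is being read with the wrong text encoding. XML files typically declare their encoding in the XML declaration on the first line. When no encoding is declared, the parser assumes UTF-8, and bytes saved in another encoding, such as Latin-1, are often not valid UTF-8. With ElementTree, the same mismatch can also surface as an ET.ParseError.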
The code below triggers this error when trying to parse a file containing special characters without the correct encoding specified.
import xml.etree.ElementTree as ET
tree = ET.parse('special_chars.xml')
root = tree.getroot()
for elem in root:
    print(elem.text)
The ET.parse() function fails because its default encoding can't process the file's characters, triggering the error. The corrected code below shows how to properly instruct the parser and prevent this crash.
import xml.etree.ElementTree as ET
# The encoding given here overrides the file's own declaration,
# so it must match how the file was actually saved.
parser = ET.XMLParser(encoding="utf-8")
try:
    tree = ET.parse('special_chars.xml', parser=parser)
    root = tree.getroot()
    for elem in root:
        print(elem.text)
except ET.ParseError as e:
    print(f"XML parsing error: {e}")
The corrected code prevents this crash by explicitly telling the parser which encoding to use. You create a custom parser with ET.XMLParser(encoding="utf-8") and pass it to the ET.parse() function. The encoding argument overrides whatever the file declares, so it must name the encoding the file was actually saved in; substitute a value such as "iso-8859-1" if that is what the source system produced. The try/except block also catches ET.ParseError, so a malformed file is reported instead of stopping the script. For more robust error handling, consider handling multiple exceptions that may occur during XML parsing. It's a crucial step when you're dealing with XML from different systems or international sources, where encoding isn't always standard.
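The three situations can be reproduced in memory. The sketch below assumes a small Latin-1 document, with ISO-8859-1 chosen only for illustration. It parses the bytes with a matching declaration, then without one, then with the encoding named on the parser.

```python
import xml.etree.ElementTree as ET

text = "<root><item>caf\u00e9</item></root>"

# Case 1: Latin-1 bytes with a matching declaration parse without help.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + text).encode("iso-8859-1")
declared_text = ET.fromstring(declared).find("item").text

# Case 2: the same bytes without a declaration are treated as UTF-8 and rejected.
undeclared = text.encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    failed = False
except ET.ParseError:
    failed = True

# Case 3: naming the real encoding on the parser repairs case 2.
parser = ET.XMLParser(encoding="iso-8859-1")
fixed_text = ET.fromstring(undeclared, parser=parser).find("item").text

print(declared_text, failed, fixed_text)
```

In practice the reliable fix is to find out which encoding the producing system used, since guessing the wrong one can yield silently garbled text rather than an error.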
Real-world applications
Beyond troubleshooting, these parsing methods are essential for everyday tasks like extracting information from RSS feeds or converting XML data into JSON format. With vibe coding, you can quickly prototype these data transformation utilities.
Extracting RSS news with ElementTree.parse()
The ElementTree.parse() function can also process data directly from a web request, which is ideal for parsing live content like an RSS news feed.
import xml.etree.ElementTree as ET
import urllib.request
url = 'http://feeds.bbci.co.uk/news/rss.xml'
response = urllib.request.urlopen(url)
tree = ET.parse(response)
root = tree.getroot()
for item in root.findall('./channel/item')[:2]:
    print(f"Headline: {item.find('title').text}")
This script shows how to target specific information within an online RSS feed. The key is the findall('./channel/item') method, which uses a path to navigate the XML tree and isolate only the news articles.
- It targets <item> elements nested inside a <channel> tag.
- The [:2] slice then limits the loop to just the first two articles, which keeps the output short.
For each article, the code finds the corresponding <title> tag and prints its text content as a headline.
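The same logic can be tried offline. The sketch below substitutes a hypothetical three-item feed held in memory for the network response; ET.parse() accepts any binary file object, so a real response would be passed the same way.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical feed used in place of a live HTTP response.
FEED = b"""<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Headline one</title></item>
<item><title>Headline two</title></item>
<item><title>Headline three</title></item>
</channel></rss>"""

tree = ET.parse(io.BytesIO(FEED))  # a file-like object instead of a filename
root = tree.getroot()

# findtext() supplies a fallback if an item has no <title>.
headlines = [
    item.findtext("title", default="(untitled)")
    for item in root.findall("./channel/item")[:2]
]
print(headlines)
```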
Transforming XML to JSON with findall() and json.dumps()
Another practical application is converting XML to JSON, which you can do by using findall() to structure your data into a list of dictionaries and then passing it to json.dumps() for the final conversion.
import xml.etree.ElementTree as ET
import json
tree = ET.parse('customers.xml')
root = tree.getroot()
customers = []
for customer in root.findall('./customer'):
    customer_data = {
        'id': customer.get('id'),
        'name': customer.find('name').text,
        'email': customer.find('email').text
    }
    customers.append(customer_data)
print(json.dumps(customers, indent=2))
This script systematically extracts customer details from an XML file by looping through each <customer> element found with root.findall(). It then organizes the data into a Python list of dictionaries, making it easy to work with.
- The customer.get('id') method pulls the value from an element's attribute.
- The customer.find() method retrieves nested elements like name and email, while .text accesses their content.
Finally, json.dumps() formats this list into a clean, readable string for printing.
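If some records are incomplete, find(...).text raises an AttributeError partway through the conversion. The variant below is a sketch under that assumption: it uses findtext() so a missing field becomes None, which json.dumps() writes as null. The inline XML is a hypothetical stand-in for customers.xml.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical customers.xml content; the second customer lacks an <email>.
root = ET.fromstring(
    "<customers>"
    '<customer id="1"><name>Ada</name><email>ada@example.com</email></customer>'
    '<customer id="2"><name>Grace</name></customer>'
    "</customers>"
)

customers = [
    {
        "id": c.get("id"),
        "name": c.findtext("name"),
        "email": c.findtext("email"),  # None here becomes JSON null
    }
    for c in root.findall("./customer")
]
output = json.dumps(customers, indent=2)
print(output)
```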
Get started with Replit
Now, turn theory into practice. Describe a tool to Replit Agent, like “build a utility to parse an RSS feed and display headlines” or “convert our product XML data into a JSON API.”
The Agent will write the code, test for errors, and deploy your application for you. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.



