Constructing An XML Parser In Python

XML, or Extensible Markup Language, is a markup language that defines a set of rules for encoding documents so that they are both human-readable and machine-readable. The World Wide Web Consortium has published a set of standards that define XML. You can reference the specifications here. Although XML was initially designed for documents, it is now used for many other kinds of data and file formats.

A well-formed XML document contains, among other things, elements and attributes. An element, as in HTML, is a logical document component that begins with a start-tag and ends with an end-tag. A start-tag is written as <tag_name> and an end-tag as </tag_name>. An empty-element tag combines both and is written as <tag_name />. An element can also have attributes inside its start-tag or empty-element tag. Attributes are name-value pairs attached to an element, and each name can appear only once per element. An example of an element with an attribute is <subtitle lang='en'>, where the subtitle element has the lang attribute with the value 'en'. At the top of the XML document is the root element, which is the entry point into the document.
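
As a small illustration of these terms, here is a minimal sketch that parses a tiny document with Python's standard xml.etree.ElementTree module (introduced in the sections that follow) and prints each child element's tag and attributes. The sample string and the variable names are made up for this example.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample: a root element holding one regular element
# and one empty-element tag, each with a single attribute.
sample = "<feed><subtitle lang='en'>Solutions</subtitle><link rel='alternate' /></feed>"

root = etree.fromstring(sample)
for element in root:
    # tag is the element's name; attrib is a dict of its name-value pairs
    print(element.tag, element.attrib)
```

Both the regular element and the empty-element tag show up as elements with a tag and an attribute dictionary.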

Now, in our code that parses XML we will only be dealing with elements and attributes.

Python has a built-in API for parsing XML. The module that implements it is xml.etree.ElementTree, so you have to import this module into your Python file to use the API.

What the xml.etree.ElementTree module contains

This module is a simple API for parsing and creating XML in Python. Although it is robust, it is not secure against maliciously constructed data, so be careful with XML from sources you do not trust. Among several classes of interest, our parsing activity will concentrate on two classes in this module: ElementTree, which represents the whole XML document as a tree, and Element, which represents a single node in the tree.

You can load an XML document from a file or pass it in as a string.

To import it from a file use the following code:

    
import xml.etree.ElementTree as etree
tree = etree.parse('data.xml')
root = tree.getroot()

To parse it directly from a string held in a variable, use the following code:

    
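import xml.etree.ElementTree as etree
# The string must hold well-formed XML; this short document is a placeholder
xml_data = '<feed><title>SolvingIt?</title></feed>'
# fromstring returns the root element directly
root = etree.fromstring(xml_data)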

The root variable above refers to the root element in the XML document.

The ElementTree constructor

We will be using the ElementTree constructor on the way to the root of our XML document, so it is worth mentioning here. The syntax for the constructor is xml.etree.ElementTree.ElementTree(element=None, file=None). The constructor accepts either an element, which becomes the root element, or a file that contains the XML document. It returns the XML document as a tree that you can interact with.

One interesting method of this class is the getroot() method. When you call this method on an ElementTree object, it returns the root element of the XML document. We will use the root element as our doorway into the XML document. So, take note of this method because we will be using it in our parsing code below.
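
Here is a minimal sketch of the constructor and getroot() working together. The short XML string and the variable names are placeholders chosen for this example.

```python
import xml.etree.ElementTree as etree

# Placeholder document for illustration
xml_data = "<feed><title>SolvingIt?</title></feed>"

# fromstring returns the root Element; passing it to the constructor
# wraps it in an ElementTree that represents the whole document.
tree = etree.ElementTree(etree.fromstring(xml_data))

# getroot() hands back the root element of the tree
root = tree.getroot()
print(root.tag)  # feed
```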

That’s all we need from ElementTree class. The next class we will need is the Element class.

Objects of the Element class

This class defines the Element interface. Its constructor is xml.etree.ElementTree.Element(tag, attrib={}, **extra). You use the constructor to create an element, but we will not be creating any elements here, only using their attributes and methods. Still, you can see from the constructor definition that an XML element has two things: a tag and a dictionary of attributes. Objects of this class represent every element in the XML document.

Some interesting attributes and methods we will be using from this class are:

a. Element.attrib: This is a dictionary that holds the attributes of the element. It contains the name-value pairs of the attributes of the Element, or what some call a Node in an XML document.

b. Element.iter(tag=None): This creates an iterator with the element as its starting point. It yields the element itself first, then all its children, and the children of its children, recursively, in depth-first (document) order. You can filter the results by providing a tag argument, in which case only elements whose tag equals that value are returned. If you do not want the descendants recursively but only the first-level children of an element, use the next option below.

c. list(element): This converts an element to a list. The result is a list of the element's direct children, first level only. It replaces the former Element.getchildren() method, which was deprecated and then removed in Python 3.9.
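
The three items above can be sketched as follows. The small document and the variable names are made up for this example.

```python
import xml.etree.ElementTree as etree

# Hypothetical document: entry has two direct children, one of them nested
xml_data = "<entry id='1'><author><name>Michael Odogwu</name></author><updated /></entry>"
root = etree.fromstring(xml_data)

# a. attrib: the element's attributes as a dictionary
print(root.attrib)  # {'id': '1'}

# b. iter(): the element itself plus every descendant, in document order
all_tags = [e.tag for e in root.iter()]
print(all_tags)  # ['entry', 'author', 'name', 'updated']

# iter(tag): only the elements whose tag matches
name_tags = [e.tag for e in root.iter("name")]
print(name_tags)  # ['name']

# c. list(element): direct children only
child_tags = [e.tag for e in list(root)]
print(child_tags)  # ['author', 'updated']
```

Note that iter() includes the starting element, while list() does not.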

So, I believe you now have a simple introduction into some of the features of the xml.etree.ElementTree module. Now, let’s implement this knowledge by parsing some XML documents.

The XML document we are going to parse is a feed for a blog. The XML document is given below:

    
<feed xml:lang='en'>
    <title>SolvingIt?</title>
    <subtitle lang='en'>
        Programming and Technology Solutions
    </subtitle>
    <link rel='alternate' type='text/html'
          href='https://emekadavid-solvingit.blogspot.com' />
    <updated>2020-09-12T12:00:00</updated>
    <entry>
        <author>
            <name>Michael Odogwu</name>
            <uri>
                https://emekadavid-solvingit.blogspot.com
            </uri>
        </author>
    </entry>
</feed>

You can refer back to this document while reading the code. You can see that the XML document has elements, or nodes, and that the root tag is named feed. Some of the elements also have attributes.

The first task is to find the score of the XML document. The score of the XML document is the sum of the scores of its elements. For any element, the score is equal to the number of attributes that it has.

The second task is to find the maximum depth of the XML document. That is, given an XML document, we need to find the maximum level of nesting in it.

So, here is the code that prints out the score and maximum depth of the XML document above. I want you to run the code and compare the result with what you would have calculated yourself. After that comes an explanation of the relevant points in the code, along with a link to download the script if you want to take an in-depth look at it.

Now, for an explanation of the relevant sections of the code. I will use the lines in the code above to explain it.

Line 1: We import the module, xml.etree.ElementTree and name it etree.

Lines 23-35: The XML document.

Lines 36-37: Using the fromstring method of the module, we parse the XML document and pass the result to the ElementTree constructor, which then constructs a tree of the document. Then, from the tree created, we get the root element (or node) so that we can parse the document starting from the root element.

Line 39: We pass the root element to our function, get_attr_number, that calculates the score of the XML document.

Lines 3-8: The get_attr_number function takes the root element, or node, and iterates over it with node.iter(), which yields the node itself and all its descendants, nested ones included. For each element, it calculates that element's score as the length of its attribute dictionary, len(i.attrib), and adds this score to the total score. It then returns the total score as the total variable.
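
The scoring step described above can be sketched as follows. This is a reconstruction from the description, not the original script, so its line numbers will not match the ones quoted in this walkthrough. The name get_attr_number comes from the explanation, while the FEED constant and the score variable are assumptions made for this example.

```python
import xml.etree.ElementTree as etree

# The blog feed from earlier in the post, with whitespace tidied
FEED = """<feed xml:lang='en'>
    <title>SolvingIt?</title>
    <subtitle lang='en'>Programming and Technology Solutions</subtitle>
    <link rel='alternate' type='text/html'
          href='https://emekadavid-solvingit.blogspot.com' />
    <updated>2020-09-12T12:00:00</updated>
    <entry>
        <author>
            <name>Michael Odogwu</name>
            <uri>https://emekadavid-solvingit.blogspot.com</uri>
        </author>
    </entry>
</feed>"""


def get_attr_number(node):
    """Return the number of attributes on node and all of its descendants."""
    total = 0
    for i in node.iter():  # node itself, then every nested element
        total += len(i.attrib)
    return total


tree = etree.ElementTree(etree.fromstring(FEED))
root = tree.getroot()
score = get_attr_number(root)
print(score)  # 5
```

The total of 5 comes from one attribute on feed, one on subtitle and three on link.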

Next is to find the maximum depth. In the XML tree, we take the root element, feed, to be a depth of 0. Take note.

Lines 41-42: Here the depth function is called with the root element of the tree and an initial level of -1, so that the root itself ends up at depth 0. Then maxdepth, a global variable, is printed out after the depth function has finished execution. I now describe the depth function.

Lines 12-20: When this function is called, it increases the level count by 1 and checks whether the level is greater than the maxdepth variable, updating maxdepth if so. Then, for each child of the element, obtained with list(elem), it calls the depth function recursively with the new level.

You can download the above code here, xmlparser.py.

Now, I believe you understand how the code works. I want you to be creative. Think of other ways you can use this module, such as creating your own XML documents, writing them out, and then parsing them in the manner done above. You can also check out another parser I wrote, this time an HTML parser.

Happy pythoning.
