Search

Parsing HTML Using Python

HTML, also known as Hypertext Markup Language, is used for creating web pages and is a standard markup language. If you have ever seen the code of any website or blog, you most probably was reading the HTML code of the page. In this post, we want to parse some HTML strings. By parsing, I mean analyzing the strings syntactically for specific symbols based on the components in the string.

python html parser

 

We will be using the python class HTMLParser that is contained in the module html.parser for this activity.

All instances or objects of the HTMLParser class are able to parse the markup of text that contains HTML and identify the tags and references that are contained in the text.

To use the HTMLParser class we need to first import it from html.parser module like this: from html.parser import HTMLParser. After that, any user has to inherit from the HTMLParser class and then override any method he desires to use.

But before we start showing that in code, I will first describe some of the methods of the HTMLParser class we will be overriding.

Methods of the HTMLParser class

The HTMLParser class has several methods which are divided into methods that belong to every instance of the class and methods that are called when HTML markup is encountered.

1. Methods of HTMLParser instance.

The two methods that are of interest to us for this post are the HTMLParser.feed() method and HTMLParser.close() method.

The HTMLParser.feed method has the syntax HTMLParser.feed(data) where the data is the text containing markup you want to parse. Processing takes place provided there are complete elements in data and the data is buffered when incomplete until the close() method is called.

The HTMLParser.close method processes all data as if the end-of-file marker has been encountered, thereby closing processing. This method can be redefined in user defined classes but when redefined, it must also be called again.

So, that’s it. Let’s move on to the methods that are used for handling encountered markup.

2. Methods for operating on HTML markup.

The HTMLParser class has several methods for operating on HTML markup, but I will deliberate on only those of interest in the code I will be writing in this blog. They are the common ones. When you know the basics, you can apply other methods to your choosing.

The methods are:

a. Method to handle start tags.

The syntax for this method is HTMLParser.handle_starttag(tag, attrs) and when a start tag in the markup is encountered, it is called with the tag and its corresponding attributes as arguments. So, in your handler code you can directly reference the tag. The attributes are denoted as a tuple of name, value like (name,value) enclosed in a list. So, you can use your for loop or any other method to extract the items in the tuple.

b. Method to handle end tags.

The syntax is HTMLParser.handle_endtag(tag) and the tag is always converted to lower case when markup is encountered.

c. Method to handle tags that have no corresponding end tags, or empty tags.

There are some HTML tags that do not have corresponding end tags like the <br \> tag. This method is designed to handle those. These tags are styled like in XHTML. The syntax for the method is HTMLParser.handle_startendtag(tag, attrs). The structure of the tags and attributes arguments are similar to that of the method to handle start tags.

d. Method to handle comments.

HTML markup contain comments which can be parsed. This method is used to handle them. The syntax is HTMLParser.handle_comment(data) where data is any text that is contained within the <!—Data --> comment.

e. Method to handle data between tags

When you have text that have to be rendered that exist between start and end tags, this method is used to handle those text. The syntax is HTMLParser.handle_data(data) where data is the text that is contained in between start and end tags.

Now that the methods are outlined, let me show how to apply them.

Let’s write a simple code that handles start and end tags when they are encountered, as well as tags that do not have corresponding end tags, or empty tags. The code will ignore comments when encountered in the markup. You can run the code to see the output.

Let me explain relevant parts of the code based on their lines. On line 1 I imported HTMLParser and then on line 3 I created a class, MyHtmlParser, that inherits from HTMLParser. Then inside the class I overrode the methods to handle start tags, empty tags, and end tags, along with comments. For the comments methods, I told the code to ignore all comments. For the end tag handler, I just printed the name of the end tag but preceded by the text, End. For the start tag and empty tags, I first printed out the tag and then printed out the attributes and value, taking a cue from the fact that the attribute and values are stored as pair of tuples.

That’s it. The driver code starts from lines 30 to 35 where I created an instance of MyHtmlParser class called parser and then fed the instance the html text in the variable, text, through the instance’s feed method.

That code was cool but plain simple.

Let’s take another simple code. This time a code that would count the tags and their frequencies in a markup. This promises to be more cool. Just run it to see how it goes.

This time in the code, since we are counting tags, we leave out the end tags handler since when we have identified a start tag that is enough so as not to double count. What the code does is that when any tag is identified, it checks to see if it is in the dictionary, parser_dict, and if it is not in it, it adds it and increments the count by 1 but if it is already in the dictionary, it only increments the count by 1. Just cool.

I hope you enjoyed my code for today. I promise to be bringing you more interesting code that shows how python works and how you can use python to its fullest. Just subscribe to the blog to get updates.

Happy pythoning.

No comments:

Post a Comment

Your comments here!

Matched content