Search

Parsing HTML Using Python

HTML, also known as Hypertext Markup Language, is used for creating web pages and is a standard markup language. If you have ever seen the code of any website or blog, you most probably was reading the HTML code of the page. In this post, we want to parse some HTML strings. By parsing, I mean analyzing the strings syntactically for specific symbols based on the components in the string.

python html parser

 

We will be using the python class HTMLParser that is contained in the module html.parser for this activity.

All instances or objects of the HTMLParser class are able to parse the markup of text that contains HTML and identify the tags and references that are contained in the text.

To use the HTMLParser class we need to first import it from html.parser module like this: from html.parser import HTMLParser. After that, any user has to inherit from the HTMLParser class and then override any method he desires to use.

But before we start showing that in code, I will first describe some of the methods of the HTMLParser class we will be overriding.

Methods of the HTMLParser class

The HTMLParser class has several methods which are divided into methods that belong to every instance of the class and methods that are called when HTML markup is encountered.

1. Methods of HTMLParser instance.

The two methods that are of interest to us for this post are the HTMLParser.feed() method and HTMLParser.close() method.

The HTMLParser.feed method has the syntax HTMLParser.feed(data) where the data is the text containing markup you want to parse. Processing takes place provided there are complete elements in data and the data is buffered when incomplete until the close() method is called.

The HTMLParser.close method processes all data as if the end-of-file marker has been encountered, thereby closing processing. This method can be redefined in user defined classes but when redefined, it must also be called again.

So, that’s it. Let’s move on to the methods that are used for handling encountered markup.

2. Methods for operating on HTML markup.

The HTMLParser class has several methods for operating on HTML markup, but I will deliberate on only those of interest in the code I will be writing in this blog. They are the common ones. When you know the basics, you can apply other methods to your choosing.

The methods are:

a. Method to handle start tags.

The syntax for this method is HTMLParser.handle_starttag(tag, attrs) and when a start tag in the markup is encountered, it is called with the tag and its corresponding attributes as arguments. So, in your handler code you can directly reference the tag. The attributes are denoted as a tuple of name, value like (name,value) enclosed in a list. So, you can use your for loop or any other method to extract the items in the tuple.

b. Method to handle end tags.

The syntax is HTMLParser.handle_endtag(tag) and the tag is always converted to lower case when markup is encountered.

c. Method to handle tags that have no corresponding end tags, or empty tags.

There are some HTML tags that do not have corresponding end tags like the <br \> tag. This method is designed to handle those. These tags are styled like in XHTML. The syntax for the method is HTMLParser.handle_startendtag(tag, attrs). The structure of the tags and attributes arguments are similar to that of the method to handle start tags.

d. Method to handle comments.

HTML markup contain comments which can be parsed. This method is used to handle them. The syntax is HTMLParser.handle_comment(data) where data is any text that is contained within the <!—Data --> comment.

e. Method to handle data between tags

When you have text that have to be rendered that exist between start and end tags, this method is used to handle those text. The syntax is HTMLParser.handle_data(data) where data is the text that is contained in between start and end tags.

Now that the methods are outlined, let me show how to apply them.

Let’s write a simple code that handles start and end tags when they are encountered, as well as tags that do not have corresponding end tags, or empty tags. The code will ignore comments when encountered in the markup. You can run the code to see the output.

Let me explain relevant parts of the code based on their lines. On line 1 I imported HTMLParser and then on line 3 I created a class, MyHtmlParser, that inherits from HTMLParser. Then inside the class I overrode the methods to handle start tags, empty tags, and end tags, along with comments. For the comments methods, I told the code to ignore all comments. For the end tag handler, I just printed the name of the end tag but preceded by the text, End. For the start tag and empty tags, I first printed out the tag and then printed out the attributes and value, taking a cue from the fact that the attribute and values are stored as pair of tuples.

That’s it. The driver code starts from lines 30 to 35 where I created an instance of MyHtmlParser class called parser and then fed the instance the html text in the variable, text, through the instance’s feed method.

That code was cool but plain simple.

Let’s take another simple code. This time a code that would count the tags and their frequencies in a markup. This promises to be more cool. Just run it to see how it goes.

This time in the code, since we are counting tags, we leave out the end tags handler since when we have identified a start tag that is enough so as not to double count. What the code does is that when any tag is identified, it checks to see if it is in the dictionary, parser_dict, and if it is not in it, it adds it and increments the count by 1 but if it is already in the dictionary, it only increments the count by 1. Just cool.

I hope you enjoyed my code for today. I promise to be bringing you more interesting code that shows how python works and how you can use python to its fullest. Just subscribe to the blog to get updates.

Happy pythoning.

Dynamic Programming In Python: Using Fibonacci Sequence

Dynamic programming is a programming method that aims to optimize solutions to problems. It was developed by Richard Bellman in the 1950s and has since become popular. In dynamic programming, we aim to break down a complicated bigger problem into simpler sub-problems in a recursive manner. If in breaking down the problem into sub-problems we encounter those sub-problems again, the problem is said to be overlapping. Then we try to find the optimal solutions to the sub-problems which solutions will then be combined with other sub-problems to get the overall solution. Thereby, the problem is said to have an optimal substructure.

dynamic programming in python

 

Most times when we have solved the smaller sub-problems we store the results of the solution so that when similar sub-problems are encountered, the results can be reused. This concept in dynamic programming is called memoization. Memoization not only speeds up computing, it also saves on computer resources.

In dynamic programming, we will be using the memoization technique a lot.

To illustrate how dynamic programming is used in practice let us take a problem that has both overlapping sub-problems and an optimal substructure. That is the problem of finding the Fibonacci sequence. We will be using the Fibonacci sequence as an example of dynamic programming in python.

The first ten Fibonacci sequence are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34. Each item in the sequence is derived from the sum of the two previous items except for the case when n is 0 or 1. Then in those instances, when n is 0 the sequence is 0 and when 1 the sequence is 1.

Traditionally, the Fibonacci sequence can simply be solved using a recursive algorithm. Just like the one below:

In the code, lines 4 and 5 are used to state the cases where n is either 0 or 1. Then if n is neither of these, then recursively we call the sum of the previous two Fibonacci numbers.

This solution or algorithm is simple but not elegant. This is because the complexity of the algorithm is exponential. Therefore, it does not scale for large Fibonacci numbers. If you call it on n being 120, it would take more than 250,000 years to finish. Quite some age. Therefore, it would do with some optimization.

Now, let’s see how dynamic programming with python can help.

First, we state that dynamic programming will be applicable when a problem has overlapping sub-problems and that each sub-problem has an optimal solution which will be combined to give the overall solution. That is what we are doing in the above recursive calls. Let’s illustrate it with some pictures. For example when we are looking for the Fibonacci of 6. Look at the number of calls that are generated.

python fibonacci sequence diagram


Each call is a sub-problem. You can see that fib 2 was called 5 times while fib 3 was called 3 times before getting the final result. Since we can generate solution for each of the sub-problems, it would be nice to just generate the solution once, an optimal solution for the sub-problem, and use that result for every time that sub-problem was called. The operation of storing this result and using it for subsequent sub-problems is called memoization. Creating a memo of previously encountered problems that was solved.

Using this concept, we can now write a second code for the Fibonacci sequence based on dynamic programming using python.

As you can see from the dynamic programming code above, we are using memoization with a dictionary, fib_dict, to store already known results of sub-problems and these results are called when similar sub-problems are encountered. In line 9 we created the dictionary and gave it the initial values of the Fibonacci sequence. Then in line 11 we called the Fibonacci function, fast_fib. Lines 1 to 7 contains the code for the Fibonacci computing function, fast_fib. In the code, we first check to see if the key, n, is already in the dictionary, if it is, we return it but if it is not we compute the Fibonacci sequence recursively and store the result so that it can be used in subsequent calls. This is done consecutively as any n is needed. Finally, it returns the value for the key, n, in the dictionary where n is the Fibonacci number we are looking for.

This dynamic programming implementation runs in linear time and it scales considerably. I just so love it. It is much better than the earlier recursive Fibonacci code.

I hope you now know how dynamic programming works and how to implement it in other problem spaces.

Happy pythoning.

Matched content