
A Microscopic View Of Python’s Lookbehind and Lookahead Regex Assertions

No discussion of regular expressions, or regex, is complete without the lookaround assertions. A lookaround assertion states that, at the current position in the string, a given pattern must (or must not) appear immediately before or immediately after that position. Note that the text examined by a lookaround is not consumed, and the current position in the string does not change.

We will look at four types of lookaround assertions today: the positive lookbehind, the negative lookbehind, the positive lookahead, and the negative lookahead.


The Positive and Negative lookbehind assertions.

In lookbehind assertions, we only check what precedes the current position in the string. The pattern in the lookbehind assertion does not participate in the match, or as it is said, is not consumed by the match. It only helps to assert that the match is valid. A lookbehind can be positive or negative. In a positive lookbehind assertion, we assert that the pattern is present immediately before the current position. In a negative lookbehind assertion, we assert that the pattern is not present immediately before the current position.

The syntax for the positive lookbehind assertion is (?<=foo). It succeeds if foo occurs immediately before the current position in the string, that is, if foo ends exactly at the current position.

Let's illustrate this with an example involving currency figures. Suppose we have a string like 'USD100' and we only want to match 100. We can assert that USD must come before the number with this pattern: (?<=USD)\d{3}, which matches exactly three digits that are immediately preceded by the string USD.
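To see the positive lookbehind at work, here is a minimal sketch using Python's re module. The variable names and the second test string, 'EUR100', are my own choices for illustration.

```python
import re

# (?<=USD) asserts that "USD" ends right at the current position;
# \d{3} then consumes exactly three digits.
pattern = re.compile(r"(?<=USD)\d{3}")

usd_match = pattern.search("USD100")
print(usd_match.group())  # 100
print(usd_match.span())   # (3, 6): "USD" is not part of the match

# With a different prefix the assertion fails, so nothing matches.
eur_match = pattern.search("EUR100")
print(eur_match)  # None
```

The span shows that the match starts after USD, which is what "not consumed" means in practice.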

The syntax for the negative lookbehind assertion is (?<!foo), which succeeds at the current position if foo does not occur immediately before it. Continuing with the string 'USD100', to say that EURO must not come before the number we could use the pattern (?<!EURO)\d{3}. It matches three digits only when they are not immediately preceded by EURO.

Now we will go on to the second set of lookaround assertions: the positive lookahead and the negative lookahead.
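A short sketch of the negative lookbehind follows. The string 'EURO100' is an assumption added only to show a case where the assertion blocks the match.

```python
import re

# (?<!EURO) asserts that "EURO" does NOT end at the current position.
not_euro = re.compile(r"(?<!EURO)\d{3}")

allowed = not_euro.search("USD100")
print(allowed.group())  # 100

# Here the three digits are preceded by "EURO", so the match is rejected.
blocked = not_euro.search("EURO100")
print(blocked)  # None
```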

The lookahead and negative lookahead assertions.

The lookahead assertions are the mirror image of the lookbehind assertions. They check for the presence or absence of a pattern immediately ahead of the current position in the string.

The positive lookahead assertion checks that the specified pattern occurs starting at the current position. Its syntax is (?=foo): from the current position in the string, we look ahead to see whether foo follows. Let's take our 'USD100' string again. If we want to look ahead and check that a number follows the currency code, we could use the pattern \w{3}(?=\d{3}). The number 100 is not consumed by the match; only USD is returned. We merely assert that 100 comes after USD.
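The positive lookahead can be tried out the same way. This is a small sketch; the second string, 'USD1', is an assumption chosen to show the assertion failing.

```python
import re

# \w{3} consumes three word characters; (?=\d{3}) only checks that
# three digits follow, without adding them to the match.
code_before_amount = re.compile(r"\w{3}(?=\d{3})")

found = code_before_amount.search("USD100")
print(found.group())  # USD
print(found.span())   # (0, 3): the digits are left untouched

# Only one digit follows here, so the lookahead fails everywhere.
missing = code_before_amount.search("USD1")
print(missing)  # None
```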

The negative lookahead assertion is the opposite of the positive lookahead, and its syntax is (?!foo). When included in a pattern, it asserts that foo does not occur immediately after the current position in the string. For example, if we have the string 'USD100' and we want to assert that it is not 'USD200', we could use the pattern \w{3}(?!200). It matches three word characters that are not immediately followed by the literal 200. Keep in mind that \w matches digits and underscores as well as letters, so a letters-only class such as [A-Za-z]{3} is the safer choice when only the currency code should match.
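Here is a sketch of the negative lookahead. It also compares \w{3} with a letters-only class, [A-Za-z]{3}, which is my own variation and not part of the pattern above. The comparison matters because \w accepts digits, so the regex engine can slide one character forward and still find a match in 'USD200'.

```python
import re

loose = re.compile(r"\w{3}(?!200)")          # pattern from the text
strict = re.compile(r"[A-Za-z]{3}(?!200)")   # letters-only variation

loose_100 = loose.search("USD100")
print(loose_100.group())  # USD

# "USD" is followed by 200, so it is rejected, but "SD2" is three word
# characters followed by "00", which is not 200, so \w{3} still matches.
loose_200 = loose.search("USD200")
print(loose_200.group())  # SD2

# With letters only, there is no other place to match.
strict_200 = strict.search("USD200")
print(strict_200)  # None

strict_100 = strict.search("USD100")
print(strict_100.group())  # USD
```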

So, that is what we can take from the lookaround assertions in Python. Now, let's use our knowledge to solve a problem.

Suppose you are given the string rabcdeefgyYhFjkIoomnpOeorteeeeet and you want to match every substring that contains two or more vowels, on the condition that each substring lies between two consonants and contains only vowels. How do you go about it?

If you look at the question, it calls for both lookbehind and lookahead assertions. That is, a consonant must come immediately before the vowels (lookbehind) and another consonant must come immediately after them (lookahead). When you understand this, your work is nearly done. Next we must define what a consonant is: any letter that is not in the set of vowels, [aeiou]. We want a case-insensitive match, so we pass the ignore-case flag, re.I, when searching.

Here is how the code is written:
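One way to write it is sketched below. Treat it as a minimal version built from the description above rather than the exact downloadable script. It assumes that [^aeiou] is an acceptable stand-in for "consonant", which holds for this all-letter string but would also accept digits or punctuation in other input.

```python
import re

text = "rabcdeefgyYhFjkIoomnpOeorteeeeet"

# (?<=[^aeiou])  a non-vowel must sit immediately before the run
# [aeiou]{2,}    the run itself: two or more vowels, nothing else
# (?=[^aeiou])   a non-vowel must sit immediately after the run
# re.I makes both the vowel set and its negation case-insensitive.
pattern = re.compile(r"(?<=[^aeiou])[aeiou]{2,}(?=[^aeiou])", re.I)

matches = pattern.findall(text)
print(matches)  # ['ee', 'Ioo', 'Oeo', 'eeeee']
```

The single a near the start is skipped because it is only one vowel, and the surrounding consonants never appear in the results because the lookarounds do not consume them.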

There is nothing new in the code. I have already explained most of the code in another blog post, entitled: The Big Advantage of Understanding Python Regex Methods.

You can download the script here if you want to take a deeper look at it.

I hope you have a nice day. If you want to keep receiving Python updates like this, just subscribe to the blog using your email.
