Have you ever wondered how to do things the easy way when dealing with strings? Maybe you want to match some strings and you get stumped.
For example, take this hackerrank.com challenge. The challenge says that given a string, find out if it is a float where floats have the following requirements:
1. Number can start with a +, -, or . symbol.
2. Number must contain at least one decimal digit.
3. Number must have exactly one . symbol.
4. Number must not give any exceptions when converted using the float(N) expression, where N is the string.
Now, those are some requirements. We could code this literally in Python, but the result would be clumsy and hard to follow. That is where the beauty of regular expressions comes into play.
What are regular expressions?
A regular expression, often shortened to regex, is a sequence of characters that defines a search pattern. For example, if you want to search for the word ‘bat’ in ‘battered’, you could use ‘bat’ as a regular expression. For the challenge above, we will use a regular expression to solve it simply and beautifully.
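As a small illustration of the ‘bat’ example, here is a minimal sketch using re.search from Python's standard re module. The variable names are only for illustration.

```python
import re

# re.search scans the string and returns a match object, or None if nothing matches.
match = re.search(r"bat", "battered")

found_text = match.group() if match else None
found_span = match.span() if match else None

print(found_text, found_span)  # bat (0, 3)
```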
Introduction to python regex.
You can get the full details in the documentation for Python's re module, but I will just briefly cover the main points.
In Python, regular expressions are built from metacharacters, together with symbols that repeat the character before them.
First the metacharacters.
The basic metacharacters are:
- [ ] is the set metacharacter. It matches any one of the characters listed inside it. For example, [mnc] means to search for the occurrence of an m, n, or c in the original string. The set can also be a range. For example, [a-c] matches any character between a and c, and [a-z] matches any lower case letter.
- Complementary to the above is the complement of the set, [^ ]. It matches any character that is not listed inside it. For example, [^cab] matches any character except the lower case letters c, a, and b.
- \d is the metacharacter that says match only decimal digits. For ASCII text it is equivalent to [0-9].
- \D says match any non-digit character. For ASCII text it is equivalent to the complement set, [^0-9].
- \s says match any whitespace character, while \S is the complement that matches any non-whitespace character.
- \w says match any alphanumeric character or underscore, while \W matches any character that is not one of those.
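The metacharacters above can be tried out with re.findall, which returns every non-overlapping match in a string. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# Sets and ranges
set_hits = re.findall(r"[mnc]", "chimney")         # ['c', 'm', 'n']
range_hits = re.findall(r"[a-c]", "abcdef")        # ['a', 'b', 'c']
complement_hits = re.findall(r"[^cab]", "cabbage")  # ['g', 'e']

# Digit and non-digit classes
digit_hits = re.findall(r"\d", "a1b22")      # ['1', '2', '2']
non_digit_hits = re.findall(r"\D", "a1b22")  # ['a', 'b']

# Whitespace and non-whitespace classes
space_hits = re.findall(r"\s", "a b\tc")      # [' ', '\t']
non_space_hits = re.findall(r"\S", "a b\tc")  # ['a', 'b', 'c']

# Word and non-word classes (note that \w also matches the underscore)
word_hits = re.findall(r"\w", "a_1!")      # ['a', '_', '1']
non_word_hits = re.findall(r"\W", "a_1!")  # ['!']
```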
Now, let’s talk about repeating matches.
There are four symbols for match repetition.
- The * metacharacter says that the preceding character matches zero or more times. That means it can match 0, 1, 2, or any larger number of times. For example, do*g will match dg, dog, doog, doooog, etc. Now you get it.
- The + metacharacter says match the preceding character one or more times. Note its difference from *: it can match 1, 2, 3, or more times, but never 0 times. Using the example above again, do+g will match dog, doog, dooog, etc., but not dg. Get it?
- The ? metacharacter says match the preceding character zero or one time. It is called the optional repeating character. For example, da?d will match dd and dad and nothing else.
- The last repeating symbol is {m,n}. It says match the preceding character at least m times and at most n times. For example, do{2,3}g will match doog and dooog, but not dog.
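The four repetition symbols can be checked against the examples above. This is a minimal sketch; matches_fully is a hypothetical helper written for this illustration, and it uses re.fullmatch so that the whole word has to fit the pattern.

```python
import re

def matches_fully(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [word for word in words if re.fullmatch(pattern, word)]

# * : zero or more o's
star = matches_fully(r"do*g", ["dg", "dog", "doog", "dig"])   # ['dg', 'dog', 'doog']

# + : one or more o's, so "dg" is rejected
plus = matches_fully(r"do+g", ["dg", "dog", "doog"])          # ['dog', 'doog']

# ? : zero or one a
optional = matches_fully(r"da?d", ["dd", "dad", "daad"])      # ['dd', 'dad']

# {m,n} : between 2 and 3 o's
bounded = matches_fully(r"do{2,3}g", ["dog", "doog", "dooog", "doooog"])  # ['doog', 'dooog']
```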
Now, I believe that is what we need to start building our matching pattern.
From our challenge above, we have a date with floats. Let’s build a pattern for a string that may start with a +, -, or . symbol, has exactly one . symbol, and has at least one decimal digit. That gives us the following regex pattern:
pattern = r'[+\-]?[0-9]*\.{1,1}\d+'
Let's explain the pattern slowly. [+\-]? says that the float can start with an optional + or -. Notice that we backslashed the - so that it is read as a literal character. Inside a set metacharacter, [ ], an unescaped - between two characters denotes a range. (A - placed last in the set is also taken literally, but the backslash makes the intent explicit.)
Next is [0-9]*. This part says that any digit can occur zero or more times, whether or not the string starts with + or -. Notice that we left the . character out of the first position. This is because if a float starts with the period, ., then there are no digits before it, and [0-9]* already allows for that. So that is why the period comes after the digits.
The \.{1,1} says that we will have at least and at most 1 period character. That is, exactly one. The period is backslashed because an unescaped . is itself a metacharacter that matches any character except a newline. Then the pattern says that it ends with digits. There must be at least one digit. That is what the ending pattern, \d+, means. Notice that I interchanged [0-9] and \d. I just wanted you to realize that they match the same digits here.
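One detail worth checking by hand is the period. An unescaped . matches any character except a newline, while \. matches only a literal period. The sketch below compares the two spellings; the test strings are made up for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

unescaped = re.compile(r"[+\-]?[0-9]*.{1,1}\d+")
escaped = re.compile(r"[+\-]?[0-9]*\.{1,1}\d+")

# "4x5" contains no period at all, yet the unescaped version accepts it,
# because "." is happy to match the "x".
unescaped_accepts_4x5 = unescaped.fullmatch("4x5") is not None  # True
escaped_accepts_4x5 = escaped.fullmatch("4x5") is not None      # False

# The escaped version accepts strings with a literal period followed by digits.
escaped_accepts_plain = escaped.fullmatch("4.5") is not None        # True
escaped_accepts_leading_dot = escaped.fullmatch("+.5") is not None  # True

# A trailing period with no digit after it is rejected, because \d+ needs one digit.
escaped_accepts_trailing_dot = escaped.fullmatch("4.") is not None  # False
```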
One tool that you can use to build your regex patterns is found at regex101.com. This is a great site for using regular expressions. It has all the tools you need to understand regular expressions.
Now that we have created the regular expressions according to the challenge above, let us test it to see that it works. To do so, I compiled the pattern and created a comparison_list for the list of strings we want to check the pattern against. Then using a for loop I went through each of the items in the list looking for matches.
You will notice that the challenge also requires that casting the string to a float raises no exception. I did not put that into the regex because that is not the job of a regex. Instead, the code checks for it in a try block.
Here is the code for the challenge.
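The original script is not reproduced here, so what follows is only a minimal sketch of the approach described above, not the exact solution. It assumes the period in the pattern is escaped so that it matches only a literal '.', and it uses fullmatch so that extra characters after the number are rejected. The name is_float and the contents of comparison_list are made up for illustration.

```python
import re

# Compiled once and reused for every string we test.
FLOAT_PATTERN = re.compile(r"[+\-]?[0-9]*\.{1,1}\d+")

def is_float(text):
    """Return True if text fits the pattern and converts with float()."""
    if FLOAT_PATTERN.fullmatch(text) is None:
        return False
    try:
        float(text)  # requirement 4: the conversion must not raise
    except ValueError:
        return False
    return True

# Hypothetical test strings, not the challenge's own input.
comparison_list = ["4.0O0", "-1.00", "+4.54", "SomeRandomStuff", ".5", "12.", "1.2.3", "7"]

results = [is_float(item) for item in comparison_list]
for item, result in zip(comparison_list, results):
    print(item, result)
```

With these inputs, only -1.00, +4.54, and .5 come out as True. The others fail because they contain a stray letter, lack a digit after the period, have more than one period, or have no period at all.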
You could create your own test suites and check your skill with regular expressions. They are a fun way to code. For other methods of the Python regex module, you could check out this post on python re match and python re search.
You could download the script for the solution here. Enjoy your date with regular expressions and floats.