Using Python Regex To Validate Roman Numerals

Python regex, also called python regular expressions, are patterns written in python to match specific sequences of characters in a string. Regular expressions are widely used in the world of UNIX and are provided by many programming languages; python is not left out. One advantage of python regex is that a single pattern can validate a whole class of input, which is what we will be doing in this post. It also keeps your code cleaner because it usually involves fewer lines of code and saves you the stress of writing numerous if else statements.

If you want a guide to regular expressions in python and the functions that come with them, I encourage you to read this post, which describes the basic syntax, and then this other post on the method we will be using, the python re match method.

In today’s post, we are going to show how to use python regex to validate Roman numerals based on their rules.

Roman Numerals and Their Rules

Roman numerals are a numeral system that originated in ancient Rome. They were popular and remained the usual way of writing numbers in Europe down to the late middle ages. The system uses letters of the Latin alphabet to represent numbers, and these letters are combined according to set rules. In the modern usage of Roman numerals, seven letters are used to designate numbers and they are:

Symbol	Value
I	1
V	5
X	10
L	50
C	100
D	500
M	1000

Some of the rules for writing valid Roman numerals which we will be using for validation are:

The Roman numerals I, X and C can be repeated up to 3 times in succession to form numbers, but repetition of V, L, or D is invalid.
To form numbers, a digit of lower value can be placed before or after a digit of higher value; the digits of lower value that can be placed before a higher one are I, X, and C.
You should add up all the digits in a group when a digit of lower value is placed after, or to the right of, a digit of higher value. Digits of the same value placed together are also added.
Subtract the value of the lower digit from the value of the higher digit when a digit of lower value is placed before, or to the left of, a digit of higher value. Note that V is never written to the left of X.

So, now that we have the rules we need to form the python regular expressions, let’s do the Roman numerals validation which is the juicy part.

Validating any Roman numeral

When you run the code below, you need to input a string as a Roman numeral when you are prompted. You will get a result indicating whether the string is a valid Roman numeral or not. If it is an invalid Roman numeral, you will get a message that says: “Invalid Roman Numeral” but if it is valid, you will get a message that says: “Your roman numeral was valid. Welcome.”

Now, let’s run it and have fun. After you have tried running it, I will give a brief explanation of the lines of code. Note that this code takes only 8 lines. If I had used python if else statements instead, it would have taken many more lines and would not be as clean.
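In case the embedded snippet does not load for you, here is a minimal sketch of what such a validator could look like. It uses the pattern explained further down and the two messages quoted above; the function name check_roman and the sample strings are my own assumptions, and I test a fixed list instead of prompting for input so it runs anywhere.

```python
import re

# The pattern is the one discussed in this post.
regex_pattern = r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"


def check_roman(text):
    """Return the message for a candidate Roman numeral (hypothetical helper)."""
    if re.match(regex_pattern, text):
        return "Your roman numeral was valid. Welcome."
    return "Invalid Roman Numeral"


# A few sample strings; swap in input() if you prefer to be prompted.
for candidate in ["MCMXCIX", "XLII", "IIII", "VX", ""]:
    print(repr(candidate), "->", check_roman(candidate))
```

The first two samples are valid numerals; the last three break the rules above (four Is in a row, V before X, and an empty string).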

Now that you have taken some time running the above code and seeing how it works, let me explain some of the parts. I don’t think I need to explain the python re match method because you have read about it from the link I gave above. So, I will just explain the pattern.

The key to the pattern matching above is the python regex pattern which is denoted as:

regex_pattern = r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"

The ^ symbol that starts the pattern states that we should start from the beginning of the string, while the corresponding $ symbol at the end says that we stop at the end of the string. So, we presume that each string passed to the code contains a single Roman numeral and nothing else; otherwise it will be reported as invalid. After the ^ symbol comes a lookahead assertion, (?=[MDCLXVI]). Read up this blog post on python lookahead assertions if you want a refresher.

What the python lookahead assertion does is look ahead from the beginning of the string and require that the next symbol is an M, D, C, L, X, V or I. Yes, the only symbols that should be allowed to start the string are the seven symbols of the Roman numerals and nothing else. This also rules out an empty string, which the rest of the pattern would otherwise accept because every part of it is optional. Note that the characters checked by a python lookahead assertion are not consumed. So, right now, we have not consumed any of the string.
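To see what the lookahead buys us, here is a small sketch that compares the pattern with and without it. The variable names are my own; the second pattern is just the first one with the lookahead removed.

```python
import re

with_lookahead = r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"
without_lookahead = r"^M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"

# Every later part of the pattern may match nothing at all, so an empty
# string gets through unless the lookahead demands one real symbol first.
empty_without = bool(re.match(without_lookahead, ""))
empty_with = bool(re.match(with_lookahead, ""))
print(empty_without, empty_with)  # True False

# The lookahead only peeks: on its own it matches without consuming text.
peek = re.match(r"(?=[MDCLXVI])", "XIV")
print(repr(peek.group()))  # ''
```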

Next we match the thousands place. I denote this with the pattern M*, which matches M zero or more times. If the number is below a thousand, M appears zero times; otherwise it appears once for every thousand. Unfortunately, standard Roman numerals stop at 3999 (MMMCMXCIX), because from 4000 upward a special thousands notation is needed which the pattern does not cover. M* does not enforce that limit, so a string such as MMMM would still be accepted. You can try 1999 (MCMXCIX) and see that it matches. Because of this limitation, we could replace M* above with M{0,3} to state that we cannot go beyond 3999.
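Here is a quick sketch of the difference between M* and M{0,3}. The names open_ended and capped are only for illustration.

```python
import re

open_ended = r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"
capped = r"^(?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"

# 3999 passes both patterns; four Ms only pass the open-ended one.
results = {}
for numeral in ["MMMCMXCIX", "MMMM"]:
    results[numeral] = (bool(re.match(open_ended, numeral)),
                        bool(re.match(capped, numeral)))
    print(numeral, results[numeral])
```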

The next part matches the hundreds place, from 100 to 900. I denote the hundreds place with the pattern (C[MD]|D?C{0,3}). What this pattern says is that, for the hundreds place, either a C (100) stands to the left of an M (1000) or a D (500), or an optional D (500) is followed by no more than three consecutive Cs.

The next is the tens place, which runs from 10 to 90. The pattern for it is: (X[CL]|L?X{0,3}). This states that the tens place is either an X (10) before a C (100) or an L (50), or an optional L (50) followed by no more than 3 consecutive Xs.

The next is the units place, which is between 1 and 9. Remember there is no 0 in Roman numerals. The pattern for it is: (I[XV]|V?I{0,3}). What it states is that the units place is either an I (1) to the left of an X (10) or a V (5), or an optional V (5) followed by no more than 3 Is.
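If you want to see which part of a numeral each group picked up, you can inspect the match object. This is only a sketch; the two sample numerals are my own choice. Note that M* is not inside parentheses, so the thousands do not show up as a group.

```python
import re

regex_pattern = r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$"

# groups() returns the hundreds, tens and units parts in that order.
parts_1999 = re.match(regex_pattern, "MCMXCIX").groups()
parts_2024 = re.match(regex_pattern, "MMXXIV").groups()
print(parts_1999)  # ('CM', 'XC', 'IX')
print(parts_2024)  # ('', 'XX', 'IV')
```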

Well, that is it. Enjoy validating your Roman numerals with this simple tool.

I hope you do leave a comment about your results.

Happy pythoning.

Using The Python String Format Method: Format Specifications Part 2

In an earlier post, I showed how to use field names and conversion fields to format values in replacement fields. Today, I will continue that discussion by showing how to use format specifications, the optional and last feature of replacement fields, in the python string format method.

The format specification for python string format method

The format specification describes how the value in the replacement field should be presented. It includes details such as the width of the field, its alignment, padding, conversion etc. Each value type is given its own specification. Also note that a format specification can include nested replacement fields, but the level of nesting should not be deep.

You use a colon, :, to denote the start of a format specification. The format specification has 8 flags and I will describe each of them in the order in which they appear. Note that each of the flags is optional.

The fill flag

This flag is used as the first flag. Use it to denote what you want to use to fill the space in the presentation of the value of the object. Any character can be used as the fill character and if it is omitted, it defaults to a space. Note that a curly brace cannot be a fill character unless it is supplied through a nested replacement field. The fill character kicks in when the value of the object cannot fill the specified width of the replacement field; otherwise it doesn’t apply. So, you use it with other flags.
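A short sketch of the fill flag at work. Since fill needs an alignment and a width to do anything, I borrow the < and ^ align options that are explained in the next section; the values and variable names are just for illustration.

```python
# Fill only shows up when the value is narrower than the field width.
dashed = "{:-<12}".format("abc")
no_room = "{:-<2}".format("abc")   # width is too small, so no fill appears

# A brace can only arrive as the fill through a nested replacement field.
braced = "{:{fill}^10}".format("hi", fill="{")

print(dashed)   # abc---------
print(no_room)  # abc
print(braced)   # {{{{hi{{{{
```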

The align flag

This represents the alignment of the value of the object. You could either right align, left align, center or cause a padding to fill the available space. The different options are presented below:

<	Used for left alignment of the value in the available space. The default for most objects
>	used for right alignment of the value in the available space. The default for numbers.
=	Forces a padding to be placed after the sign but before the digits. Only valid for numeric types. If you precede the field width (explained below) with 0, then this becomes the default.
^	forces the value to be centered within the available space.

To make the alignment option meaningful, you must specify a minimum field width. Here are some examples. They all come with minimum field width of 20.
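The following sketch tries each alignment option with a minimum field width of 20. The sample values and variable names are mine, and I wrap each result in square brackets so the padding is visible.

```python
word = "python"

left = "{:<20}".format(word)       # value first, padding after it
right = "{:>20}".format(word)      # padding first, value last
centered = "{:^20}".format(word)   # padding split on both sides
starred = "{:*^20}".format(word)   # centered, with * as the fill character
signed = "{:=+20}".format(42)      # sign, then padding, then the digits

for result in (left, right, centered, starred, signed):
    print("[" + result + "]")
```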

The sign flag

This is only used for numeric values. The various options are:

+	Use a sign for both positive and negative values.
-	Use a sign only for negative values (this is the default behavior)
Space	Show a leading space for positive numbers and a minus sign on negative numbers.

Here are some examples.
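A minimal sketch of the three sign options applied to a positive and a negative number (the numbers are arbitrary):

```python
# Each line shows the +, - and space options side by side.
positive = "[{:+}] [{:-}] [{: }]".format(7, 7, 7)
negative = "[{:+}] [{:-}] [{: }]".format(-7, -7, -7)

print(positive)  # [+7] [7] [ 7]
print(negative)  # [-7] [-7] [-7]
```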

The alternate flag, #.

Use this flag when you are doing value conversion and you want the alternate option to be specified. It is valid for integers, floats, decimal and complex types. We will come back to this when we get to the conversion flag and show how the alternate forms can be specified.

The grouping flag.

The grouping flag specifies the character to be used as a thousands separator. It has two options:

_	Use this as a thousands separator for the integer types when ‘d’ is specified as the type flag (to be explained later) and floating point types. When the type flag for integer types is either ‘b’, ‘o’, ‘x’, or ‘X’, the separator is inserted after every four digits.
,	Use a comma as the thousands separator. You could use the ‘n’ type flag instead if you want a locale aware separator.

Now some examples. I included the third example with ‘b’ as a type flag. ‘b’ as type flag means convert value to base 2. This will be explained below under type flags.
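Here is a sketch along those lines, using a number I picked for illustration:

```python
number = 1234567

with_comma = "{:,}".format(number)
with_underscore = "{:_}".format(number)
binary_groups = "{:_b}".format(number)      # groups of four binary digits
money = "{:,.2f}".format(1234567.891)       # grouping also works on floats

print(with_comma)       # 1,234,567
print(with_underscore)  # 1_234_567
print(binary_groups)    # 1_0010_1101_0110_1000_0111
print(money)            # 1,234,567.89
```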

The precision flag

The precision flag is a decimal number that indicates how many digits should be displayed “after” the decimal point for a floating point value that has the type flag ‘f’ or ‘F’, or before and after the decimal point for a floating point value that has the type flag ‘g’ or ‘G’. Note that there is no precision for integer types. If the value is a non-numeric type, then this indicates the maximum field size of the replacement field.

Now for some examples. Notice how it truncates the string type, s, when the precision is smaller than the number of characters.
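A small sketch of precision in action; the values are arbitrary. The last part shows what happens if you try to give an integer type a precision.

```python
fixed = "{:.2f}".format(3.14159)          # 2 digits after the decimal point
general = "{:.3g}".format(3.14159)        # 3 significant digits in total
general_big = "{:.3g}".format(1234.5678)  # switches to scientific notation
clipped = "{:.4s}".format("formatting")   # string truncated to 4 characters

print(fixed)        # 3.14
print(general)      # 3.14
print(general_big)  # 1.23e+03
print(clipped)      # form

# Precision is rejected for integer type flags such as 'd'.
try:
    "{:.2d}".format(42)
except ValueError as error:
    print("integer with precision:", error)
```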

The type flag

The type flag determines how the data should be presented. The type flag is specified for string types, integer types, and floating point types.

For string types: The available options are...

s	The default type for strings and may be omitted
None	The same as ‘s’

For integer presentation types: The options are...

b	Outputs the number in binary format
c	Converts the value to the corresponding Unicode character before printing.
d	Output the number in base 10 before printing.
o	Octal format. Output the number in base 8 and print.
x	Hex format. Output the number in base 16 using lower case letters for digits above 9
X	Hex format. Output the number in base 16 using Upper case letters for digit above 9.
n	Decimal format. The same as ‘d’ but it uses locale aware setting to insert appropriate thousands separator for the locale.
None	Same as ‘d’

Note that integers can also be formatted with the floating point presentation types listed below, except for ‘n’ and None. When you do that, the integer is converted to a floating point number before it is formatted.

Now, let’s use some examples.
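A sketch with two numbers chosen so the outputs are easy to check by hand:

```python
number = 65

as_binary = "{:b}".format(number)
as_char = "{:c}".format(number)
as_decimal = "{:d}".format(number)
as_octal = "{:o}".format(number)
as_hex = "{:x} {:X}".format(255, 255)
as_float = "{:.2f}".format(number)   # an integer with a floating point type

print(as_binary)   # 1000001
print(as_char)     # A
print(as_decimal)  # 65
print(as_octal)    # 101
print(as_hex)      # ff FF
print(as_float)    # 65.00
```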

When discussing the alternate flag, #, I stated that there are times when you want alternate conversion forms to be specified. For example, for binary, octal and hexadecimal outputs the alternate flag, #, prefixes the output with ‘0b’, ‘0o’, and ‘0x’ respectively. Let’s show this with examples.
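For instance, a sketch with arbitrary numbers:

```python
alt_binary = "{:#b}".format(10)
alt_octal = "{:#o}".format(10)
alt_hex = "{:#x}".format(255)
alt_hex_upper = "{:#X}".format(255)   # the prefix follows the case of the flag

print(alt_binary)     # 0b1010
print(alt_octal)      # 0o12
print(alt_hex)        # 0xff
print(alt_hex_upper)  # 0XFF
```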

The alternate flag can also be applied to floats and complex numbers.
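As a sketch of what that looks like for floats: with #, the decimal point is always kept, and the ‘g’ type keeps its trailing zeros. The values are my own.

```python
plain_fixed = "{:.0f}".format(3.0)
alt_fixed = "{:#.0f}".format(3.0)    # decimal point kept even with no digits
plain_general = "{:g}".format(1.5)
alt_general = "{:#g}".format(1.5)    # trailing zeros are not removed

print(plain_fixed, alt_fixed)        # 3 3.
print(plain_general, alt_general)    # 1.5 1.50000
```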

Now finally, the options for floating point presentation types are:

e	Exponent notation. Print the number in scientific notation using the exponent, e, to denote it. The default precision is 6.
E	Exponent notation. Print the number in scientific notation using the exponent, E, to denote it.
f	Displays the number in fixed point notation. The default precision is 6.
F	Fixed point notation, just like ‘f’ but converts nan to NAN and inf to INF.
g	This is the general format. Uses fixed point or scientific format depending on the magnitude of the number.
G	General format, but in uppercase.
n	Same as ‘g’ but is locale aware in inserting appropriate thousands separator.
%	Percentage. Multiplies the number by 100 and displays it in fixed format, ‘f’, with a percent sign (%) following it.
None	Similar to ‘g’ except that fixed point notation when used has at least one digit past the decimal point.

The following examples use precision 2 and then the default of 6.
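Here is a sketch with a value of my own choosing. I use the built-in format() function in the loop because it accepts the same format specification as the replacement field.

```python
value = 1234.56789

# Each type flag first with precision 2, then with the default precision.
results = {}
for spec in (".2e", "e", ".2f", "f", ".2g", "g"):
    results[spec] = format(value, spec)
    print("{:>4} -> {}".format(spec, results[spec]))

percent = "{:.2%} and {:%}".format(0.256, 0.256)
not_a_number = "{:f} vs {:F}".format(float("nan"), float("nan"))
print(percent)       # 25.60% and 25.600000%
print(not_a_number)  # nan vs NAN
```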

I hope you get creative in using these format specifications. They are very helpful when presenting values. Note that python’s literal string formatting, f-strings, uses the same format specifications as the python string format method described here, so what you learned here carries over.

Using The Python String Format Method Like A Pro Part 1

How you format your text is important in text processing and python is not left out, giving you several options to make your output appear presentable. I decided to delve into the issue of python formatting in today’s post while reading some code. I appreciated the way the author applied python string formatting. So, I decided to devote two posts to string formatting because I believe my readers would be interested in it.

python string format method makes output presentable

In python you format your output using the format method of the string class. It is also called the python str.format method (or python string format method) to differentiate it from python f-string literals. A format string contains two kinds of content: literal text and replacement fields. Replacement fields are surrounded by curly braces, {}, and refer to objects that have to be formatted, while literal text is whatever you want to leave unchanged in the output. So, what we are interested in are replacement fields.

To give you an idea of what replacement fields are, read and run the following code:
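A minimal sketch of such a snippet, with sample values of my own:

```python
name = 'Michael'
age = 29

# The two pairs of curly braces are the replacement fields.
greeting = 'Hello {}, you are {} years old.'.format(name, age)
print(greeting)  # Hello Michael, you are 29 years old.
```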

You will see that in the string part of the python format method in the code above, there are two curly braces and they serve as replacement fields whose values are provided by the parameters, name and age, of the python format method. We are going to be discussing how you can format your output based on the replacement fields and parameters.

The syntax of the python string format method

The syntax of the python string format method is: template.format(p0, p1, k0=v0, k1=v1) where template refers to the string you want to format. As I said before, the template consists of both literal text and replacement fields. Replacement fields are denoted by whatever is in curly brackets, {}. The arguments p0 and p1 are positional arguments while k0=v0 and k1=v1 are keyword arguments. Positional and keyword arguments are used to insert values into the replacement fields in the template. We will cover all these and give you ideas on how to use them.

The replacement fields have three optional features: field names, conversion fields that are preceded by an exclamation point, !, and format specifications. Today’s post will cover how to specify the field names and conversion fields while the next post will be on format specifications.

The field names in the string replacement fields.

The replacement field starts with an optional field name. The field name refers to the object whose value is to be inserted. The object is specified in the parameter of the format method. The field name is either a number or a keyword.

Where the field name is a number:

An example to illustrate this is below:


name = 'Michael'
age = 29
print('Hello, your name is {0} and your age is {1}'.format(name, age))

You can see that in the template above, there are two curly braces or replacement fields. The first has the number 0 and the second has the number 1. The curly brace with 0 refers to the first positional argument which is found as a parameter to the format method and here this is the variable, name, while the curly brace with 1 refers to the second positional argument which is the variable, age.

If you so desire, you can choose to leave out the numbering of the curly braces and python will insert them on your behalf. Like this:


name = 'Michael'
age = 29
print('Hello, your name is {} and your age is {}'.format(name, age))

Where the field name is a keyword.

The python string format method also lets you pass keyword arguments as parameters, in which case the replacement fields refer to them by keyword. An example is below:

print('Hello, your name is {name} and your age is {age}'.format(name='Michael', age=29))

You can see now that I have inserted the keywords into the curly braces because the parameters are keyword arguments.

Using keywords as arguments is super powerful. It gives you the ability to change the ordering of the parameters in the replacement fields. For example, instead of following the ordering of the positional arguments, I could order the replacement fields as it suits my fancy:
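A sketch of what that reordering could look like, reusing the name and age keywords from the example above (the wording of the template is my own):

```python
# The fields appear as age then name, while the arguments are name then age.
reordered = 'Your age is {age} and your name is {name}'.format(name='Michael', age=29)
print(reordered)  # Your age is 29 and your name is Michael
```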

Check out the code above and the one before it. See how I interchanged the ordering of the keyword arguments in the replacement fields. We could try another example to show you how powerful this is.

print('In {country}, there are {number} million people speaking {language}.'.format(language='English', number=300, country='USA'))


With keyword arguments you are not constrained to any sort of ordering. You choose how you want it to be. You can check out this post if you want a refresher on positional and keyword arguments.

Note: What if you want to have the brace as a literal text in the template? Simple, just double brace it.

print('This is doubling the braces {{{name}}} for {name}'.format(name='Michael'))

I doubled the braces around the first replacement field. When you run it, you will notice that the braces now appear literally in the output: This is doubling the braces {Michael} for Michael

Now, what if your parameters are lists or an object with attributes whose value you want to show on output? The next two sections below will show you how.

Where the parameter to format is a list.

To make the output appear as you want it to, you can pass the list as a keyword argument or as a positional argument. Look at the code below and see how. First, I pass it as a keyword argument. That means you pass the list under a name and index that name in the replacement field. But if you want it as a positional argument, you refer to the list by its position number in the replacement field and index that.

What python does when you specify it either way is to call the __getitem__() method of the list. I discussed this method in an earlier post on sequences.
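Here is a sketch of both ways, indexing inside the replacement field. The list of fruits is made up for the example.

```python
fruits = ['mango', 'orange', 'apple']

# Keyword argument: index the keyword inside the replacement field.
by_keyword = 'I like {fruits[0]} and {fruits[2]}'.format(fruits=fruits)

# Positional argument: index the position number instead.
by_position = 'I like {0[0]} and {0[2]}'.format(fruits)

print(by_keyword)   # I like mango and apple
print(by_position)  # I like mango and apple
```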

When the object has attributes with values.

When the object in the parameter has an attribute whose value you want to format, you can refer to the attribute directly in the replacement field. The code below shows how in the method get_fruit. What 0.index and 0.fruit do is call the getattr() function on the object, self, in order to get the required value. In the code below I created a fruit class with a class attribute, index, so that whenever a fruit is created it is tagged with an index (instead of creating a list) and then the index is incremented to tag the next fruit.
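A minimal sketch of such a class follows. The names Fruit, index, fruit and get_fruit come from the description above; starting the index at 1 and the exact wording of the output are my own assumptions.

```python
class Fruit:
    index = 1  # class attribute: the tag for the next fruit to be created

    def __init__(self, fruit):
        self.fruit = fruit
        self.index = Fruit.index   # tag this fruit with the current index
        Fruit.index += 1           # then move on to the next tag

    def get_fruit(self):
        # 0 refers to the first positional argument, which is self here.
        return 'Fruit number {0.index} is {0.fruit}'.format(self)


mango = Fruit('mango')
orange = Fruit('orange')
print(mango.get_fruit())   # Fruit number 1 is mango
print(orange.get_fruit())  # Fruit number 2 is orange
```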

Be creative. Play with your own objects to test how format calls attributes from the replacement field.

I think that’s all for field names. After the field name comes an optional conversion field.

Syntax of the conversion field

The conversion field is optional, but if specified, it is preceded by an exclamation point, !, to differentiate it from the field name. It causes type conversion before any formatting of the replacement fields takes place. But one may ask – doesn’t every object have a default __format__() method? Yes, they do. But the creators of python realized that sometimes you want to force a specific string representation of an object.

There are three types of specifiers for the conversion field: !s, !r, and !a specifiers.

The !s specifier:

The !s conversion specifier gives you a string representation of the object in the replacement field. What it does is call str() on the object in the replacement field, converting it to a string. This is the default string formatting.

The !r specifier

You can use this when you want the official string representation of an object, and not just its plain string output. For many objects this representation contains information such as the type of the object and, where no custom representation is defined, its address. This specifier calls the built-in repr() function on the object.

The !a specifier

This specifier also outputs the official string representation of an object but it replaces all non-ascii characters with \x, \u or \U escapes. This specifier calls the built-in ascii() function on the object. It works like the !r specifier if you have no non-ascii characters in the object.

Here is an example illustrating all three types. Notice how the object type appeared in the output for !r and !a.
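A sketch of the three specifiers on a small class of my own (City is a made-up name), with a non-ascii letter in the value so that !r and !a differ:

```python
class City:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

    def __repr__(self):
        return 'City({!r})'.format(self.name)


city = City('Zürich')

as_str = '{!s}'.format(city)     # uses str()
as_repr = '{!r}'.format(city)    # uses repr()
as_ascii = '{!a}'.format(city)   # uses ascii(): repr with escapes

print(as_str)    # Zürich
print(as_repr)   # City('Zürich')
print(as_ascii)  # City('Z\xfcrich')
```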

As another illustration, you can compare the output of the !s and !r in a string with quotes showing or not showing.
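For example, with a plain string (the word is arbitrary):

```python
without_quotes = '{!s}'.format('hello')
with_quotes = '{!r}'.format('hello')

print(without_quotes)  # hello
print(with_quotes)     # 'hello'
```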

In my use of the conversion fields, I have found that making them optional has served me well. So, they just come in for special cases of formatting.

Now, the third and last feature of the replacement field is the format specification, which is explained in this post. This is where the real juice of replacement fields is stored.