I recently ran into the problem of how to check whether an email address is valid. When you scrape the web or work with data sets, you handle emails a lot, and a handy function that does this check efficiently is useful. So, I decided to write one.
In this article, I will use a filter and a list of emails to do the check. I chose a list of emails because it makes it easier to follow what I am doing. Most times, if you are scraping data, you will have to implement a file handler for it instead. But not to worry, converting from one to the other is easy.
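As a rough sketch of that conversion, the snippet below reads one email per line from a text file into a list. The helper name load_emails and the file name emails.txt are made up for this illustration, and the one-address-per-line layout is an assumption about how your scraped data is stored.

```python
from pathlib import Path
import tempfile


def load_emails(path):
    """Read one email per line from a text file, skipping blank lines.

    Hypothetical helper: assumes a UTF-8 file with one address per line.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "emails.txt"
    sample.write_text("a@example.com\n\nb@site.io\n", encoding="utf-8")
    emails = load_emails(sample)

print(emails)
```

The resulting list can be passed to the functions in the rest of this article exactly like a hand-written list.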
What are valid emails?
Valid emails have the form username@websitename.extension. Each element has its own rules. For example, the username can usually contain letters, digits, dashes, or underscores. The website name can only have letters and digits, while the extension is usually at least 2 and at most 3 letters long. With that in mind, we will use regular expressions, or regex, to match valid emails. Remember the earlier posts, finding a match with regular expressions and understanding python search and match regex methods, where I told you that when you have to search strings you should think regex first? This is a handy application.
So, this is the regex we will be using. I have modularized it in a handy function called fun.
import re

def fun(s):
    # return True if s is a valid email, else return False
    pattern = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'
    m = re.match(pattern, s)
    if m:
        return True
    else:
        return False
Now let me explain the regex above in case you have not read the earlier posts on regex. Just look at the value of the variable pattern. I will explain each piece of the pattern in turn. "^[\w-]+" is the piece that matches the username. The "^" says start from the beginning of the string, which is also where re.match, used in the next line, always starts. The character class "[\w-]" matches a single letter, digit, or underscore (that is \w) or a dash (that is -), and the "+" after it means one or more of those characters in any mix. That’s the username validation. The next piece is "@". It just matches the @ symbol after the username. Now comes the website name. "[a-zA-Z0-9]+" states that after the @ symbol we should find a website name made of one or more letters or digits. Next comes "\.". What I want to match here is the single period after the website name, but I have to put a backslash before it. Without the backslash, "." means match any character; with the backslash, it means match a literal period. Lastly comes the end of the pattern, the extension. I wrote this as "[a-z]{2,3}$", which means match at least 2 and at most 3 lowercase letters, followed by the end of the string. Note that the extension must be at the end of the string. Sometimes we could get an extension like .server, and that would not match; extensions like .com, .co, .io, or .net would.
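To see each piece of the pattern at work, here is a small sketch that runs it against a few addresses. The sample addresses are made up for illustration, and the pattern is repeated so the snippet runs on its own.

```python
import re

# Same pattern as in fun(), repeated so this snippet is self-contained.
PATTERN = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'

# Made-up sample addresses, each chosen to exercise one part of the pattern.
samples = [
    "john_doe-1@example.com",  # username with underscore, dash, and digit
    "jane@site.io",            # two-letter extension
    "bob@example.server",      # extension longer than 3 letters
    "amy@my-site.com",         # dash in the website name
    "tom.smith@example.com",   # dot in the username
    "kim@example.COM",         # uppercase extension
    "@example.com",            # empty username
]

# Map each address to True or False depending on whether the pattern matches.
results = {s: bool(re.match(PATTERN, s)) for s in samples}

for address, ok in results.items():
    print(f"{address:25} {ok}")
```

Only the first two addresses pass. The last five are rejected by this pattern even though some of them, such as tom.smith@example.com, are perfectly usable addresses in practice, so treat the pattern as a deliberately simple filter rather than a full email validator.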
So, that’s it for validating a single email. Now, how does a filter come in handy here?
What the filter function does.
The syntax of the filter function is filter(function, iterable). It takes a sequence and filters it with the help of the given function. Each item in the sequence is tested against the function, and only the items for which the function returns a true value are kept. In Python 3, filter returns an iterator rather than a list, which is why the code below wraps it in list(). So, what we are doing here is taking each email and testing it against the fun function, which uses the regex pattern match. The filter code is just a one-liner: filter the emails using the given function.
def filter_mail(emails):
    '''Take a sequence of email strings and
    return a list of the elements for which
    fun returns True, i.e. the valid emails.'''
    return list(filter(fun, emails))
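As a quick illustration of why the list() call is there, the sketch below shows that in Python 3 filter gives back a lazy filter object, and that a list comprehension produces the same result. The sample emails are made up, and fun is repeated in a shortened form so the snippet runs on its own.

```python
import re


def fun(s):
    # Same check as fun() in this article, shortened with bool().
    return bool(re.match(r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$', s))


# Made-up sample data for illustration.
emails = ["good@example.com", "bad@@example.com", "also_good@site.net"]

lazy = filter(fun, emails)      # a filter object (an iterator), not a list
as_list = list(lazy)            # consuming the iterator builds the list

# An equivalent list comprehension, shown only for comparison.
comprehension = [e for e in emails if fun(e)]

print(as_list)
print(as_list == comprehension)
```

Either form works; I find the filter version reads closer to the idea of "keep what passes the check", but that is a matter of taste.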
In the complete code, I also included a print_email function. The print_email function takes a sequence of emails, calls filter_mail, which returns a list of the valid emails, sorts that list, and then prints it. I wanted to make print_email a helper function of filter_mail but decided it might not be readable that way, so I left it as is.
def print_email(emails):
    filtered_emails = filter_mail(emails)
    filtered_emails.sort()
    print(filtered_emails)
So, that is it – validating emails using a filter function with regular expressions. I have included two lists, one containing invalid emails and one containing valid emails, for you to run and see how it is done.
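If you do not have the script at hand, the following sketch puts the three functions together and runs them on two lists. The lists here are made-up stand-ins, not the ones from the downloadable script, and print_email also returns the list, which is an addition of this sketch so the result is easy to inspect.

```python
import re


def fun(s):
    # Return True if s matches the email pattern, else False.
    return bool(re.match(r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$', s))


def filter_mail(emails):
    # Keep only the emails for which fun returns True.
    return list(filter(fun, emails))


def print_email(emails):
    # Filter, sort, and print. The return value is an addition for this
    # sketch only; the function in the article just prints.
    filtered_emails = sorted(filter_mail(emails))
    print(filtered_emails)
    return filtered_emails


# Made-up stand-in lists for illustration.
valid = ["zed@example.com", "amy_1@site.io", "bob-b@mail.net"]
invalid = ["no_at_sign.com", "two@@example.com", "name@site.toolong"]

kept_valid = print_email(valid)      # all three survive, in sorted order
kept_invalid = print_email(invalid)  # nothing survives
```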
You can download the script if you want to run it on your machine.
If you have any questions, you can leave them in the comments. Also, be sure to subscribe to my blog so you can receive new articles as they are posted.
So useful to me. Thanks for this
Hello Emekadavid,
Greetings.
I am new to Python and was going through your blog.
pattern = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'
pattern = re.compile(r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$') -- This is not working in finding the emails.
pattern = re.compile(r'[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}') -- This is working.
The second pattern variable says the extension must be at the end of the string. The third pattern variable does not require the extension to be at the end of the string. Whether the second works depends on how the string you are using is framed.
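To make the difference concrete, here is a minimal sketch comparing the two compiled patterns. The sample text is made up, and the use of findall to pull emails out of a longer string is an assumption about what "finding the emails" means in the question.

```python
import re

anchored = re.compile(r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$')
unanchored = re.compile(r'[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}')

# Made-up input: emails embedded in a longer sentence.
text = "Contact ann@example.com or bob@site.io for details."

# The anchored pattern matches only when the whole string is one email.
whole = anchored.match("ann@example.com")

# On the sentence it finds nothing, because the text neither starts
# nor ends as an email.
in_text = anchored.search(text)

# The unanchored pattern can pull the emails out of the sentence.
found = unanchored.findall(text)

# Caveat: without the $ anchor, a long extension is silently cut short.
truncated = unanchored.findall("x@site.server")

print(bool(whole), in_text, found, truncated)
```

So the anchored pattern suits validating one candidate string at a time, while the unanchored one suits searching inside text, at the cost of accepting partial matches like the truncated one above.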
Excellent read, I just passed this on to a friend who was doing some research on that.