
Email Validity Check Using A Python Filter

I recently ran into the problem of checking emails for validity. When you scrape the web or work with data sets, you handle emails a lot, and a handy function that performs this check efficiently is useful. So, I decided to write one.

 

In this article, I will use a filter and a list of emails to do the check. I chose a list of emails because it makes it easier to follow what I am doing. Most times, if you are scraping data, you will have to read the emails from a file instead. But not to worry, converting from one to the other is easy.
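As a rough sketch of that conversion, assuming the scraped emails sit one per line in a text file, something like the following would turn the file into a list. The file name emails.txt and the helper load_emails are hypothetical, made up for this illustration.

```python
import os
import tempfile


def load_emails(path):
    """Read a text file with one email per line and return a list of strings."""
    with open(path, encoding="utf-8") as handle:
        # strip() drops the trailing newline; blank lines are skipped
        return [line.strip() for line in handle if line.strip()]


# Write a small throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "emails.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("lara@example.com\n\nbrian-23@example.com\n")
    emails = load_emails(path)

print(emails)
```

Once the emails are in a list, everything that follows in this article applies unchanged.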

What are valid emails?

Valid emails have the form username@websitename.extension. Each element has its own rules. For this article, the username can contain letters, digits, dashes, or underscores; the website name can contain only letters and digits; and the extension is at least 2 and at most 3 letters long. Real addresses can be more permissive than this (dots in the username, longer extensions), so treat these rules as a simplified working definition. Given these rules, we will use regular expressions, or regex, to match valid emails. In my earlier posts on finding a match with regular expressions and on understanding the Python search and match regex methods, I told you that when you have to search strings, think regex first. This is a handy application.

So, this is the regex we will be using. I have modularized it in a handy function called fun.


import re

def fun(s):
    # return True if s is a valid email, else return False
    pattern = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'
    m = re.match(pattern, s)
    # re.match returns a match object or None; bool() turns that into True/False
    return bool(m)

Now let me explain the regex above in case you have not read the earlier posts on regex. Just look at the value of the variable pattern. I will explain each piece of the pattern in turn.

"^[\w-]+" is the piece that matches the username. The ^ says to start from the beginning of the string. That is why we are using match in the next line. The character class [\w-] matches a letter, digit, or underscore (that is \w) or a dash (-), and the + means one or more of those characters. That's the username validation. The next piece is "@". Just match @ after the username. Now comes the website name. "[a-zA-Z0-9]+" states that after the @ symbol, we should look for a website name made of one or more letters or digits.

Next comes "\.". What I want to match here is the single period after the website name, but I have to backslash it. Without a backslash, "." means match any character; with a backslash, it means match the period character itself. Lastly comes the end of the pattern, the extension match. I denoted this with "[a-z]{2,3}$", which means match at least 2 and at most 3 lowercase letters at the end of the string. Note that this must be at the end of the string. An extension like .server would not be a match, while .com, .co, .io, .net and the like would.
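To see each piece of the pattern at work, here is a minimal sketch that runs it against a handful of addresses. The sample addresses are hypothetical, chosen only to exercise one rule each.

```python
import re

# Same pattern as in fun() above.
pattern = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'

# Hypothetical sample addresses, chosen to exercise each part of the pattern.
samples = [
    "lara@example.com",         # plain username, 3-letter extension
    "brian-23@example.io",      # dash and digits in the username
    "first.last@example.com",   # a dot is not in [\w-], so no match
    "britts54@example.server",  # extension longer than 3 letters
    "noatsign.example.com",     # missing @
]

results = {s: bool(re.match(pattern, s)) for s in samples}
for address, ok in results.items():
    print(address, ok)
```

Only the first two addresses pass. The third one shows how strict this pattern is: many real addresses have a dot in the username, so you may want to widen the character class for your own data.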

So, that's it for validating a single email. Now, how does a filter come in handy here?

What the filter function does.

The syntax of the filter function is filter(function, iterable). It takes a sequence and filters it with the help of the given function: each item in the sequence is passed to the function, and only the items for which the function returns True are kept. In Python 3, filter returns a lazy iterator rather than a list, which is why the code below wraps it in list(). So, what we are doing here is taking each email and testing it against the fun function with its regex pattern match. The filter code is just a one-liner: filter the emails using the given function.


def filter_mail(emails):
    ''' Input a sequence of email strings.
    Returns a list of the elements for which fun returns True,
    i.e. the emails that pass the validity check.'''
    return list(filter(fun, emails))
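If filter is new to you, this small sketch shows the same mechanics on numbers instead of emails. The predicate is_even and the sample list are made up for the illustration.

```python
def is_even(n):
    """Predicate: True for even integers."""
    return n % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]

lazy = filter(is_even, numbers)  # a filter object (an iterator), not a list
evens = list(lazy)               # consuming the iterator builds the list

print(evens)                     # [2, 4, 6]

# The same result written as a list comprehension.
evens_again = [n for n in numbers if is_even(n)]
```

A list comprehension would work just as well in filter_mail; filter simply keeps the code to one short line when the predicate already exists as a function.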

In the complete code, I also included a print_email function. What the print_email function does is take a list of emails, call the filter_mail function, which returns a list of the valid emails, sort that list, and then print it. I wanted to make the print_email function a helper function to the filter_mail function but decided it might not be readable that way, so I left it as is.


def print_email(emails):
    filtered_emails = filter_mail(emails)
    filtered_emails.sort()
    print(filtered_emails)

So, that is it: validating emails using a filter function with regular expressions. I have included two lists, one containing invalid emails and one containing valid emails, for you to run and see how it was done.
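The two lists from the script are not shown in this text, so the sketch below puts the three functions together with hypothetical stand-in lists. Treat the sample addresses as assumptions, not as the author's data.

```python
import re


def fun(s):
    # Return True if s is a valid email under the article's pattern.
    pattern = r'^[\w-]+@[a-zA-Z0-9]+\.[a-z]{2,3}$'
    return bool(re.match(pattern, s))


def filter_mail(emails):
    # Keep only the emails for which fun returns True.
    return list(filter(fun, emails))


def print_email(emails):
    filtered_emails = filter_mail(emails)
    filtered_emails.sort()
    print(filtered_emails)


# Hypothetical sample data, not the lists from the downloadable script.
valid = ["lara@example.com", "brian-23@example.com", "britts_54@example.io"]
invalid = ["its@example.com1", "mike13445@example.comm", "rase23@ha_ch.com"]

print_email(valid)    # all three survive, printed in sorted order
print_email(invalid)  # nothing survives, so an empty list is printed
```

Each invalid address breaks exactly one rule: a digit after the extension, an extension that is too long, and an underscore in the website name.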

You can download the script if you want to run it on your machine.

If you have any questions, you can leave them in the comments. Also, be sure to subscribe to my blog so you can receive new articles as they are posted.

Internet Security: Is There Such A Thing As An Unbreakable Code?

For centuries, people have searched for ways to keep information out of the hands of the public. Cryptography gave them an answer. It has been used, in both basic and sophisticated forms, to hide sensitive information. Egyptian hieroglyphics contain the first known and verified example of ancient cryptography. In our age, when the internet is everywhere and people want to keep their personal information private, cryptography is gaining traction. But all cryptography seems to go through the same cycle. First, someone finds a good code and starts using it. It stays effective for some time, and eventually someone somewhere breaks the code, rendering it ineffective. Because of this, people ask: Is there such a thing as an unbreakable code?

 

Can all encryption be broken?

To help answer this question, scientists came up with the concept of one-way functions. One-way functions are functions that are easy to compute on a given input but hard to invert. That is, given only the output, you cannot feasibly recover an input that produces it. One-way functions would make good candidates for codes that cannot be easily broken, because it would be close to impossible to find an algorithm that reverses the output. Unfortunately, the existence of one-way functions is still just a conjecture. But that conjecture has been behind many tools that have been built in cryptography, authentication, personal identification, and other data security applications.

Finding a workable one-way function would have huge ramifications in the internet age. It could solve the Internet security problem for good. Industries such as banking, telecommunications, and e-commerce would be in a hurry to apply it. Yes, such a function has been elusive, but that is not to say that there have not been candidates.

One well-known candidate for a one-way function involves the multiplication and factoring of prime numbers. Two prime numbers are given to a function and their product is computed. With the schoolbook method, this multiplication takes time quadratic in the number of digits. Going the other way, recovering the two primes from their product, is really hard: no efficient algorithm is known, and a naive search such as trial division takes time exponential in the number of digits. Another candidate is the Rabin function, which gave rise to the Rabin cryptosystem on the assumption that the Rabin function is one-way.
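The asymmetry between the two directions can be sketched in a few lines of Python. This is only an illustration with small primes and the most naive factoring method; the function names are made up here, and real attacks use far better algorithms than trial division.

```python
def multiply(p, q):
    """The easy direction: one multiplication."""
    return p * q


def factor_by_trial_division(n):
    """The hard direction, done naively: try every candidate divisor.

    Returns (p, q) with p <= q and p * q == n, or None if no divisor is found.
    The loop runs up to sqrt(n) times, which is exponential in the number
    of digits of n.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None


# Two small primes, picked only for illustration; real systems use primes
# hundreds of digits long, far beyond the reach of this loop.
p, q = 10007, 10009
n = multiply(p, q)
print(n)
print(factor_by_trial_division(n))
```

Multiplying takes a single step, while the factoring loop needs about ten thousand iterations even for these tiny primes. Every extra digit in the primes multiplies that work by roughly ten.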

The two candidates above could be broken, though, if a really good mathematician worked out an efficient algorithm for inverting them.

This is the problem that Rafael Pass, professor of computer science at Cornell Tech, wanted to tackle. He believes that if he could find a really good and valid one-way function, then all internet security problems could be solved, and internet encryption would be safe for all. In his analogy, a good one-way function is like lighting a match. After a match is lit, you cannot get the stick back; it is now ash. So, a good one-way function would give an encryption scheme in which decryption lies only in the hands of the person who encrypted it. To get a candidate, he looked to mathematics and to a field that seems unrelated to cryptography: quantifying the amount of randomness in a string of numbers, or what is known as Kolmogorov complexity.

The Kolmogorov complexity of an object is defined as the length of the shortest computer program that can generate that object as output. A string with a definite pattern, like ababababababab, which is ab written 7 times, has a short description and therefore a low Kolmogorov complexity. But what if you have some random-looking string, such as asdwer2345tgdhncjmckkjkd? How do you compute its Kolmogorov complexity in an efficient manner? It turns out that no algorithm can compute the Kolmogorov complexity of arbitrary strings. The time-bounded variant, which only counts programs that finish within a given time limit, can be computed in principle, but no efficient algorithm for it is known.
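A rough way to get a feel for the idea is to compare how well a general-purpose compressor shrinks a patterned string and a random-looking one. The sketch below uses zlib for that. Note the assumption: compressed size is only a crude upper-bound stand-in for description length, not the Kolmogorov complexity itself.

```python
import random
import zlib

# A string with an obvious pattern: "ab" repeated 500 times (1000 bytes).
patterned = b"ab" * 500

# 1000 pseudo-random bytes; the fixed seed just makes the run repeatable.
rng = random.Random(0)
scrambled = bytes(rng.randrange(256) for _ in range(1000))

# Compressed size is only a rough upper-bound proxy for description
# length; it is not the Kolmogorov complexity itself.
patterned_size = len(zlib.compress(patterned))
scrambled_size = len(zlib.compress(scrambled))

print(patterned_size, scrambled_size)
```

The patterned string shrinks to a small fraction of its length, while the random-looking bytes barely compress at all. That gap is the intuition behind saying one string is "simpler" than the other.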

Building on this idea, Professor Pass focused his research on whether an efficient algorithm can solve the time-bounded Kolmogorov complexity problem. If such an algorithm exists, his research posits, then all cryptography can be broken. On the other hand, if no efficient algorithm exists for time-bounded Kolmogorov complexity, then one-way functions do exist and they can be found.

His research has implications for encryption schemes that are widely used on the Internet. Popular social media platforms use encryption to make their platforms more secure, banks rely on encryption staying unbroken in their e-banking platforms, and overall, we all depend on it to keep our internet lives away from the prying public. So, Professor Pass' theory is of great interest, and only time will tell whether a really good algorithm can be found based on his research that would make sure our Internet security is not compromised, no matter what platform we are using.

The source for this article is Cornell University.

The Big Advantage Of Understanding Python Regex Methods

Regular expressions, or regex, in Python are fun. They are a very fast way to search through a string for a given pattern. Whenever I have to search and I am dealing with a string, the first thing I do is look for a solution in regex. If you know regular expressions, many string operations become easy.

My earlier post, How To Find A Match When You Are Dating Floats, explains the basic syntax and use of regex. In this post, I will highlight two functions in using regex that could confuse anyone unless they understand how they work. I will also add a third method that serves as an extension to the two. 

So, what are the two methods? They involve searching for a pattern in a string and the two methods are re.search() and re.match(). They both do the same thing: search for a pattern in a string. 

How python re.match() works.

The syntax of Python's re.match() is re.match(pattern, string, flags=0). It takes a pattern as the first argument and a string as the second argument and searches for the pattern in the string. You can also pass flags if you want to, for example re.MULTILINE to change how ^ and $ treat line breaks, or re.IGNORECASE to ignore letter case.

Now, the subtlety of re.match() is that it returns a match object only if the pattern is at the beginning of the string. Otherwise, it returns None. This is very important to remember because many unsuspecting Pythonistas have found themselves thinking their pattern was wrong when their match returned None.

Let me illustrate this with a little code. 
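The original listing is not reproduced in this text. A minimal sketch of the same idea, with a made-up sample string and laid out so that the two re.match() calls sit on lines 4 and 10, could look like this:

```python
import re

text = "cat sat on the mat"  # hypothetical sample string
at_start = re.match(r"cat", text)  # pattern sits at the beginning
print(at_start)  # a match object covering "cat"

# Second attempt: "mat" does occur in the string,
# but not at the beginning, so re.match() gives up
# without looking any further.
in_middle = re.match(r"mat", text)
print(in_middle)  # None
```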

 

In the code above, I used two different patterns. The first pattern, on line 4, occurs at the beginning of the string, and printing the result shows a match object. But the second pattern, on line 10, does not occur at the beginning of the string. When you run the code, you will notice that it prints None for this case.

So, always remember, re.match() searches for the pattern at the beginning of the string. 

How python re.search() works

Now, the second method for searching for patterns is re.search(). Its syntax is the same as that of re.match(), but it behaves differently because it searches for the pattern anywhere in the string. Even if the string is multiline, it still returns a match if the pattern exists in the string. But it does not return a match for every location where the pattern can be found. Rather, it returns only the first match for the pattern.
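The original listing is not reproduced in this text. A minimal sketch of the same behavior, with a made-up sample string in which one pattern appears twice, could look like this:

```python
import re

# Hypothetical sample string with "cat" at the start and again near the end.
text = "cat sat on the mat, cat naps"

at_start = re.search(r"cat", text)   # found at the beginning
in_middle = re.search(r"mat", text)  # found in the middle

print(at_start.span())   # (0, 3): only the first "cat" is reported
print(in_middle.span())  # (15, 18)
print(re.search(r"dog", text))  # None: the pattern is nowhere in the string
```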

 

If you run the code above, you can see that it gets a match both at the beginning of the string and in the middle of the string. It can find a match anywhere in the string, but it returns a match object that corresponds only to the first match.

So, remember the difference between these two useful methods and don’t make the mistake of fighting your terminal trying to understand why a pattern you thought was well formed turned out not to be giving you a match object. 

The bonus python regex method. 

This is a bonus method because it is the one I use most often. It is quite different from the earlier two. Remember that the earlier two only return a single match object, or None where there is no match. The bonus method is Python's re.findall(). This method scans the string from left to right and returns all the matches as a list of strings (or a list of tuples when the pattern contains more than one group). Not a match object, but a list. That comes in useful more often than you might think. I just love this method. Here is a little code to illustrate this.
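The original listing is not reproduced in this text. A minimal sketch with a made-up sample string, comparing all three methods side by side, could look like this:

```python
import re

# Hypothetical sample string, with "cat" appearing twice.
text = "cat sat on the mat, cat naps"

print(re.match(r"cat", text))     # match object for the "cat" at position 0
print(re.search(r"cat", text))    # match object for the first "cat" only
all_cats = re.findall(r"cat", text)
print(all_cats)                   # ['cat', 'cat']

rhymes = re.findall(r"[a-z]at", text)
print(rhymes)                     # ['cat', 'sat', 'mat', 'cat']

print(re.findall(r"dog", text))   # [] (an empty list, not None)
```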

 

Notice that I am using the same code but just changing the methods. 

So you can see how powerful re.findall() is. It gives you the ability to see all the matches in a list, something that re.match() and re.search() do not make possible. 

I limited this post to just the rudimentary functionality of all three methods. You can experiment with them now that you know how they work. Write your own code to try out the various concepts.

And don't forget to subscribe to my blog so that you can get updated articles as I publish them daily. The submit textbox is at the top right.
