Regular expressions are the business! {in progress}

Ever seen something like this floating around on the internet:

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$

Of course you haven’t. Who goes searching for random stuff like that?

…

You’ve probably already guessed that I do. That crazy jumble of characters above is called a regular expression (abbreviated to regex) and is, in this case, used to check whether an email address is in the correct format. Let me give you a scenario: Mrs Bloggs hires you to fix her pineapple sales mailing list. She is upset because people are signing up with silly email addresses like “gzgzgzgz@pineapples-suck.they-really-do” and “bananas@4@eva”, which clearly aren’t email addresses. How do you write a program that stops this? Sure, you could write a big mess of string searching tests, or you could use a single regular expression.

I know which one I would pick. A regular expression sets out the pattern to search for, so if we search for that pattern in the email address and get a result, the address must be in a valid format!
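To make the comparison concrete, here is a rough sketch of both approaches in Python 3. The function names are made up for this illustration, the hand-written checks are just one way to approximate the same rules, and the pattern is the one from the top of the post with lowercase letters also allowed in the domain part.

```python
import re
import string

# Characters the pattern allows on each side of the "@".
LOCAL_CHARS = set(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + ".-")
LETTERS = set(string.ascii_letters)


def looks_like_email_by_hand(address):
    """Hypothetical helper: the 'big mess of string tests' version."""
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")
    if not local or not set(local) <= LOCAL_CHARS:
        return False
    # Split the domain at its last dot: "pineapples.com" -> "pineapples", "com".
    name, dot, ending = domain.rpartition(".")
    if not dot or not name or not set(name) <= DOMAIN_CHARS:
        return False
    return 2 <= len(ending) <= 4 and set(ending) <= LETTERS


EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$")


def looks_like_email_by_regex(address):
    """Hypothetical helper: the single regular expression version."""
    return EMAIL_PATTERN.match(address) is not None


for address in [
    "mrs.bloggs@pineapples.com",
    "gzgzgzgz@pineapples-suck.they-really-do",
    "bananas@4@eva",
]:
    print(address, looks_like_email_by_hand(address), looks_like_email_by_regex(address))
```

Only the first address should pass either check; both of Mrs Bloggs’s problem addresses are rejected.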

(note: please don’t email me about this not being RFC 2822 compliant; I don’t really care :) )

Writing some regexes (Python style)

This is a crash course in regular expressions, so let’s crash right in:

.*
.+

are probably the most useful directives in regular expressions. A dot matches any single character (except a newline, by default), so searching the string “abcdef” for every match of the regular expression r"." would give the following 6 results:

a
b
c
d
e
f
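In Python, one way to collect every match like this is re.findall. The choice of findall is mine; it simply returns all non-overlapping matches as a list:

```python
import re

# Each "." matches exactly one character, so "abcdef" yields six matches.
single_chars = re.findall(r".", "abcdef")
print(single_chars)  # ['a', 'b', 'c', 'd', 'e', 'f']
```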

The star means “match the character before me as many times as you like (including 0 times)”. Matching “abcdef” with the expression r".*" would result in:

abcdef
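A quick way to see this for yourself, sketched in Python 3 with re.match (the sample string in the second call is made up for illustration):

```python
import re

# ".*" grabs as many characters as it can, so one match covers everything.
whole = re.match(r".*", "abcdef")
print(whole.group())  # abcdef

# The star only applies to the character right before it: here, the "b".
b_runs = re.findall(r"ab*", "a ab abbb")
print(b_runs)  # ['a', 'ab', 'abbb']
```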

Notice the difference? We get one match covering the whole string rather than six single-character ones. The plus character is similar, but it has to match the previous character at least once. That means if we match “abcdef” with the expression r".+", we will get:

abcdef

But if we match the empty string “” with r".+" we will get no results, whereas r".*" still matches (zero characters). Want some proof?

Python 2.6.4 (r264:75706, Jan 23 2010, 21:22:15)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import re

>>> re.match(r'.*', '')
<_sre.SRE_Match object at 0x6dad0c9476b0>

>>> re.match(r'.+', '')
>>>

Told you.
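The session above is from Python 2.6. The same check in Python 3 might look like the sketch below, which tests the results directly instead of relying on how the interpreter prints a match object. The variable names are mine.

```python
import re

star_on_empty = re.match(r".*", "")
plus_on_empty = re.match(r".+", "")

# The star is happy with zero characters, so we get a match on "".
print(star_on_empty is not None)  # True
print(repr(star_on_empty.group()))  # ''

# The plus needs at least one character, so there is no match at all.
print(plus_on_empty)  # None

# Give the plus something to chew on and it matches the whole string.
print(re.match(r".+", "abcdef").group())  # abcdef
```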