Saturday, February 23, 2013

How to (not) Validate an Email Address

A common requirement handed to a software developer runs along the lines of "accept a valid email address". Reasonably enough, a regular expression is usually considered the way to fulfill it. Unfortunately, the expression is often made more complicated than necessary, and the added complexity only causes further problems by making the check too restrictive.

This scenario is best explained by "I Knew How To Validate An Email Address Until I Read The RFC" (Phil Haack, 2007-08-21, haacked.com). I highly encourage you to read through the entire article, but here are a few significant quotes:

What I found out was surprising. Nearly 100% of regular expressions on the web purporting to validate an email address are too strict.

These are all valid email addresses!

"Abc\@def"@example.com
"Fred Bloggs"@example.com
"Joe\\Blow"@example.com
"Abc@def"@example.com
customer/department=shipping@example.com
$A12345@example.com
!def!xyz%abc@example.com
_somename@example.com

The article goes on to suggest the following regular expression (complete with unit tests!):

^(?!\.)("([^"\r\\]|\\["\r\\])*"|([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$

However, this still misrepresents the real issue. There is only one way to validate an email address: send a message containing a cryptographically secure, one-time token to the address in question, and prompt the user to provide it back (either by entering it on the web page or by clicking a link that invokes the validation). If and only if the user is able to complete this step can you be reasonably assured that the user owns, or at least controls, the provided email address.
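As a rough illustration of that round trip, the sketch below issues a random one-time token and checks it later. It assumes Node.js and its built-in crypto module. The in-memory `pending` map, the function names, and the 15-minute lifetime are hypothetical placeholders, and actually sending the email is left out.

```javascript
const { randomBytes, createHash, timingSafeEqual } = require("crypto");

// Hypothetical in-memory store: email -> { tokenHash, expiresAt }.
// A real system would persist this in a database.
const pending = new Map();

// Store only a hash, so a leaked store does not leak usable tokens.
const sha256 = (text) => createHash("sha256").update(text).digest();

// Create a token for the address and return it so the caller can email it.
function startVerification(email, ttlMs = 15 * 60 * 1000, now = Date.now()) {
  const token = randomBytes(32).toString("hex"); // 256 random bits, hex-encoded
  pending.set(email, { tokenHash: sha256(token), expiresAt: now + ttlMs });
  return token;
}

// Return true only for the right, unexpired token, and only once.
function completeVerification(email, token, now = Date.now()) {
  const entry = pending.get(email);
  if (!entry) return false;
  if (now > entry.expiresAt) {
    pending.delete(email); // expired: drop it
    return false;
  }
  // Both digests are 32 bytes; timingSafeEqual requires equal lengths.
  if (!timingSafeEqual(entry.tokenHash, sha256(String(token)))) return false;
  pending.delete(email); // one-time use
  return true;
}

const token = startVerification("user@example.com");
console.log(completeVerification("user@example.com", "guess")); // false
console.log(completeVerification("user@example.com", token));   // true
console.log(completeVerification("user@example.com", token));   // false (already used)
```

In this sketch the address would count as verified only once `completeVerification` returns true; until then the application would treat it as unconfirmed.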

As such, any related requirement should be relaxed to only ensure that whatever is being entered at least "looks" like an email address. This isn't so much for the purpose of ensuring that the address is valid - but that the proper data is being entered into the correct fields, such that an email address is being entered when prompted, instead of a name or a phone number, for example. To accomplish this, the following regular expression is sufficient:

.+@.+\..+

This basically indicates that in order for a given text to match against this pattern, it must contain one or more characters, followed by the '@' sign, followed by one or more characters, followed by a dot ('.'), followed by one or more characters. (Any of the "one or more characters" will also allow for one or more additional '@' signs or dots ('.').)

For example, here is a validation implemented in JavaScript that "makes sense":

const emailValidation = /.+@.+\..+/;
// test() returns a boolean directly, so no comparison against null is needed.
console.log(emailValidation.test("user@example.com")); // Result: true
console.log(emailValidation.test("www.example.com")); // Result: false
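To make the trade-off concrete, the following sketch runs the same relaxed pattern over the eight unusual addresses quoted from Phil Haack's article, and over a few inputs that belong in other fields. The helper name `looksLikeEmail` and the sample name and phone number are hypothetical; the last sample is a mistyped address with no dot after the '@'.

```javascript
// The relaxed pattern from the post, wrapped in a hypothetical helper.
// It is unanchored: it only needs to find "x@y.z" somewhere in the text.
function looksLikeEmail(text) {
  return /.+@.+\..+/.test(text);
}

// The addresses quoted from Phil Haack's article. String.raw keeps the
// backslashes exactly as written.
const unusualButValid = [
  String.raw`"Abc\@def"@example.com`,
  String.raw`"Fred Bloggs"@example.com`,
  String.raw`"Joe\\Blow"@example.com`,
  String.raw`"Abc@def"@example.com`,
  "customer/department=shipping@example.com",
  "$A12345@example.com",
  "!def!xyz%abc@example.com",
  "_somename@example.com",
];

// Input that belongs in some other field, plus one mistyped address.
const wrongField = [
  "Fred Bloggs",
  "555-0100",
  "www.example.com",
  "mark.ziesemer.myexamplecompany@com",
];

for (const text of unusualButValid) {
  console.log(looksLikeEmail(text), text); // true for every entry
}
for (const text of wrongField) {
  console.log(looksLikeEmail(text), text); // false for every entry
}
```

The pattern lets every valid oddity through while still catching input that is clearly not an email address, which is all a format check is being asked to do here.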

The only exception to this should probably be a system that validates email addresses for registration into an email system itself: for example, registering an address in a company's webmail system that is well known not to accept special characters in the address. Otherwise, why should a given site be coded to reject what otherwise would be a perfectly valid and legitimate email address, just because the local system's requirements don't "like" certain characters or formats? (Hint: If this is the case, the local system's requirements are probably due for review.)

Real-world issues

One of my reminders of this issue, and part of the inspiration for this post, comes from Liferay Portal. In older versions of Liferay, this was the regular expression (implemented within Java code, hence the doubled backslashes) for validating an email address:

([\\w-]+\\.)*[\\w-]+@([\\w-]+\\.)+[A-Za-z]+

This incorrectly caused email addresses containing a '+' sign to be rejected, even though such addresses are somewhat common, with Gmail for example. As an enterprise Liferay customer, I requested a patch, which replaced the above with the following regular expression to provide support for the '+' sign (and some other things):

([\\w!#%&-/=_`~\\Q.$*+?^{|}\\E]+)*@([\\w-]+\\.)+[A-Za-z]+
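The rejection of '+' by the older pattern is easy to reproduce. The sketch below translates that older expression from a Java string into a JavaScript regex literal; the `^...$` anchors are an assumption standing in for Java's `Matcher.matches()`, which must match the whole input, and the sample addresses are made up.

```javascript
// Older Liferay pattern, with single backslashes because this is a regex
// literal rather than a Java string. Anchors added to mimic matches().
const oldLiferayPattern = /^([\w-]+\.)*[\w-]+@([\w-]+\.)+[A-Za-z]+$/;

// The relaxed pattern from earlier in the post, for comparison.
const relaxedPattern = /.+@.+\..+/;

const plain = "first.last@example.com";
const plusTagged = "first.last+news@example.com";

console.log(oldLiferayPattern.test(plain));      // true
console.log(oldLiferayPattern.test(plusTagged)); // false: '+' is not in [\w-]
console.log(relaxedPattern.test(plusTagged));    // true
```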

Unfortunately, this patch only led to worse issues. The newer version of the pattern, at least, is a non-optimized regular expression (note the nested repetition, ([...]+)*) and is subject to catastrophic backtracking. (Please refer to http://www.regular-expressions.info/catastrophic.html and/or http://www.codinghorror.com/blog/2006/01/regex-performance.html for additional details regarding catastrophic backtracking.) In these situations, each additional character that needs to be backtracked over causes exponential growth in the CPU iterations required. This expression seems to "fall apart" starting at about 25 characters. We ran into periodic instances (both production and non-production) where our Liferay JVM would hang completely at 100% CPU utilization. I captured a stack trace that I provided to them, but the JVM was basically getting stuck on the following call: com.liferay.portal.kernel.util.Validator.isEmailAddress(String).

I proceeded to escalate this as a critical security issue, because of the potential for an easy-to-execute denial-of-service (DoS) attack against anyone running versions of Liferay with this patch, or newer versions containing this code.

Here is a minimal example that reproduces the issue outside of Liferay, using the patched regular expression shown above:

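import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern emailAddressPattern = Pattern.compile("([\\w!#%&-/=_`~\\Q.$*+?^{|}\\E]+)*@([\\w-]+\\.)+[A-Za-z]+");
// "Accidentally" type an email address in with the wrong punctuation.
Matcher m = emailAddressPattern.matcher("mark.ziesemer.myexamplecompany@com");
// matches() must exhaust every way of splitting the local part before it can return false.
System.out.println(m.matches());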
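To watch the growth instead of waiting on a hang, the sketch below times a pattern of the same shape on short inputs. It is not the Liferay pattern itself: `nested` is a simplified stand-in with the same `(x+)*` nesting and a smaller character class (JavaScript has no `\Q...\E` quoting), and `flat` is a hypothetical rewrite that accepts the same strings with a single quantifier on the local part. The lengths are kept small on purpose, and the timings will vary by machine and engine.

```javascript
// Simplified stand-in for the patched pattern: same "(x+)*" nesting.
// The ^...$ anchors mimic Java's Matcher.matches().
const nested = /^([\w.+-]+)*@([\w-]+\.)+[A-Za-z]+$/;

// Accepts the same strings, but the local part has one quantifier, so there
// is only one way to split it when the domain fails to match.
const flat = /^[\w.+-]*@([\w-]+\.)+[A-Za-z]+$/;

// Time a single match attempt, in milliseconds.
function timeMatch(pattern, input) {
  const start = Date.now();
  const matched = pattern.test(input);
  return { matched, ms: Date.now() - start };
}

// Inputs with a mistyped domain (no dot after the '@'), as in the Java example.
for (const length of [16, 18, 20, 22]) {
  const input = "a".repeat(length) + "@com";
  const slow = timeMatch(nested, input);
  const fast = timeMatch(flat, input);
  console.log(`length ${length}: nested ${slow.ms} ms, flat ${fast.ms} ms`);
}
```

Both patterns reject these inputs; the difference is that the nested form tries every way of carving the local part into groups before giving up, which is why a few extra characters are enough to turn a slow match into an apparent hang.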

I'm not even exactly sure what this new regular expression was aiming to accomplish but, quite honestly, I believe Liferay took the wrong approach with this modification. I also gave them everything I've described above, including the minimal regular expression. My advice was not immediately accepted. The latest patch uses the following pattern:

[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?

... which is almost as bad as Phil Haack's example (above). Actually, it is probably worse, since I doubt it was tested to the same standards as his (if Liferay even has unit tests around this at all). To their credit, they did file a feature request around this (LPS-30849), though it doesn't include any of the rationale for its existence, nor has any effort been demonstrated on the ticket yet.

1 comment:

yage said...

While academically that complicated regular expression might be the "correct" way to validate an email address, I propose that most "valid" email addresses are not things you would want to allow into your system. Sure, maybe "james smith"@example.com is a valid email address, but it's not something I am going to let my servers send email to.

I know this code could probably be optimised, but I prefer to do something like this, which allows you to provide slightly more useful feedback to the users:

https://github.com/corydoras/Base/blob/master/src/base/email/EmailAddressParse.java

The main point being we want to reject basically anything that is suspicious.