Foswiki > Tasks Web > Item11059 (17 Dec 2011, GeorgeClark)

Item11059: Email address followed by a dot generates email link with dot included

Priority: Normal
Current State: Closed
Released In: 1.1.4
Target Release: patch

Applies To: Engine
Component:
Branches:

Reported By: ArthurClemens
Waiting For:
Last Change By: GeorgeClark

-- ArthurClemens - 23 Aug 2011

Is this causing an issue? A trailing dot in a domain name is technically correct: it indicates that the name is fully qualified and, at least in a browser, should not be auto-completed using the DNS search path configuration.

http://www.dns-sd.org/TrailingDotsInDomainNames.html

However since SMTP requires that only fully qualified domain names be used, the trailing dot probably doesn't have any practical use. I tried sending some emails with the trailing dot and they seemed to be handled correctly by the servers in the path.

-- GeorgeClark - 23 Aug 2011

We are dealing with users that do not know the specs. It looks broken and it looks as if things will not work. For this reason I myself removed the dot from the sentence to be sure. But that was not what I wanted to achieve.

That user experience is why it should be fixed, regardless of the spec.

-- ArthurClemens - 23 Aug 2011

The site http://www.regular-expressions.info/email.html has some example regular expressions for email matching. I tried substituting the last example into Foswiki.pm

+    $regex{emailAddrRegex} = 
+      qr"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b"i;

It appears to resolve your issue and strengthens our email matching a bit. I spot-checked the unit tests and they appear to pass.

If this is used, it probably needs to be a configure setting to permit additions of new top level domains.

-- GeorgeClark - 23 Aug 2011

ManageDotPM unit tests fail with that example. However the following works:

      qr"\b[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.(([0-9]{1,3})|([a-z]{2,3})|(aero|asia|coop|info|jobs|mobi|museum|name))\b"i;

See http://regexlib.com/DisplayPatterns.aspx?cattabindex=0&categoryId=1

-- GeorgeClark - 24 Aug 2011

One more try. This regex works, and also supports quoted names ("Joe user"@blah.com) and IP addresses (joe@192.168.1.1). It doesn't pick up the trailing dot. I also added unit tests for these formats.

      qr`(?:(?:[_a-z0-9-]+(\.[_a-z0-9-]+)*)|(?:"[^"]+?"))@[a-z0-9-]+(\.[a-z0-9-]+)*\.(([0-9]{1,3})|([a-z]{2,3})|(aero|asia|coop|info|jobs|mobi|museum|name))\b`i;

mailto:pitiful@example.com
At endSentence@some.museum. Above regex omits . from link
byIP@192.168.1.10
badname.@192.168.1.10 Above regex doesn't match - bad user name
badformat@@example.com
badTLD@example.porn Above regex doesn't match
"Some Name"@blah.com Matched by regex

If there are other sample emails that we should test, please post here.

-- GeorgeClark - 24 Aug 2011
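The sample list in the comment above lends itself to a small check script. What follows is only a sketch, not the unit tests that were committed: it restates the 24 Aug regex with the /x flag and non-capturing groups for readability, and find_email is a hypothetical helper name.

```perl
use strict;
use warnings;

# Editorial restatement of the regex from the comment above
# (non-capturing groups, /x layout); not the committed Foswiki code.
my $email = qr{
    (?: [_a-z0-9-]+ (?: \. [_a-z0-9-]+ )*    # dot-separated atoms
      | "[^"]+?"                             # or a quoted local part
    )
    \@
    [a-z0-9-]+ (?: \. [a-z0-9-]+ )*          # host labels
    \.
    (?: [0-9]{1,3} | [a-z]{2,3}              # last IP octet or short TLD
      | aero | asia | coop | info | jobs | mobi | museum | name
    )
    \b
}xi;

# Hypothetical helper: return the first address found in $text, or undef.
sub find_email {
    my ($text) = @_;
    return $text =~ /($email)/ ? $1 : undef;
}

my @samples = (
    'mailto:pitiful@example.com',
    'At endSentence@some.museum. Above regex omits . from link',
    'byIP@192.168.1.10',
    'badname.@192.168.1.10',
    'badformat@@example.com',
    'badTLD@example.porn',
    '"Some Name"@blah.com',
);

for my $sample (@samples) {
    my $hit = find_email($sample);
    printf "%-60s => %s\n", $sample, defined $hit ? $hit : '(no match)';
}
```

Under this pattern the second sample yields endSentence@some.museum, leaving the sentence-ending dot outside the match, while the badname, badformat and badTLD samples produce no match.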

Good catch.

I think I've seen something of the form foo+somewhere.com@example.com before.

Is it possible to avoid [a-z] ranges altogether? They aren't I18N friendly in the slightest.

In these cases \X might usually be a better alternative than [a-z]

-- PaulHarvey - 24 Aug 2011

René.Descartes@example.com

-- PaulHarvey - 24 Aug 2011

Paul's suggestion using \p{Alphabetic} works, as does \p{Alnum}. This latest regex seems to handle all of the exceptions pointed out by RFC:3696, but could stand being split apart with white space and documented. One minor change to Render.pm was needed to encode % in an email address as %25. Note that $Foswiki::cfg{AntiSpam}{EntityEncode} = 1; breaks things: it needs to be disabled, or the extra encoding causes some issues.

      qr`(?:(?:[_\p{Alnum}\-\!\$+=/\<\>\#\%\{\}\|\\\^\~\`]+(?:\.[_\p{Alnum}\-\!\$+=/\<\>\#\%\{\}\|\\\^\~\`]+)*)|(?:"[^"]+?"))@[a-z0-9-]+(\.[a-z0-9-]+)*\.(([0-9]{1,3})|([a-z]{2,3})|(aero|asia|coop|info|jobs|mobi|museum|name))\b`i;

-- GeorgeClark - 24 Aug 2011
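To illustrate why \p{Alnum} picks up the René.Descartes example where an [a-z0-9] class does not, and what the % to %25 encoding amounts to, here is a minimal sketch. It assumes a reasonably recent Perl; escape_percent is a hypothetical helper and not the actual Render.pm change, and the local-part patterns are deliberately reduced to letters, digits and dots.

```perl
use strict;
use warnings;

# Reduced local-part patterns, for illustration only.
my $ascii_local   = qr/^[a-z0-9.]+$/i;
my $unicode_local = qr/^[\p{Alnum}.]+$/;

# "René.Descartes", with the e-acute written as an escape so the
# example does not depend on the encoding of the source file.
my $name = "Ren\x{e9}.Descartes";

my $ascii_ok   = $name =~ $ascii_local   ? 1 : 0;   # e-acute is outside a-z
my $unicode_ok = $name =~ $unicode_local ? 1 : 0;   # e-acute is alphabetic

print "ASCII class matches:   $ascii_ok\n";
print "Unicode class matches: $unicode_ok\n";

# Hypothetical helper: a literal % must become %25 inside a mailto: URL,
# otherwise it would be read as the start of a percent-escape.
sub escape_percent {
    my ($addr) = @_;
    ( my $escaped = $addr ) =~ s/%/%25/g;
    return "mailto:$escaped";
}

print escape_percent('def!xyz%abc@example.com'), "\n";
```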

Wow, thanks for the efforts so far!

-- ArthurClemens - 24 Aug 2011

Big question - is this too much overhead? The email matching is much better but at the expense of a much more complex regular expression. Here is the latest version:

    # Email regex, e.g. for WebNotify processing and email matching
    # during rendering.
    my $validChars = qr([\p{Alnum}\Q_:-!\$+=/<>#%{}|\^~`\E])i;   # Valid characters in email per RFC 3696
    my $validTLD = qr(aero|asia|coop|info|jobs|mobi|museum|name)i;

    $regex{emailAddrRegex} =
      qr(
       (?:                            # LEFT Side of Email address
         (?:$validChars+                  # Valid characters left side of email address
           (?:\.$validChars+)*            # And 0 or more groupings of valid characters following a dot.
         )
       |
         (?:"[^"]+?")                     # or a quoted string
       )
       @
       (?:                          # RIGHT side of Email address
         (?:                           # FQDN 
           [a-z0-9-]+                     # hostname part
           (?:\.[a-z0-9-]+)*              # 0 or more alphanumeric domains following a dot.
           \.(?:                          # TLD
              (?:[a-z]{2,3})                 # 2-3 letter TLD 
              |
              $validTLD                      # well known longer TLD's 
           )
         )
         |
           (?:[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})      # dotted-quad IP address
         )
         \b                               # Boundary
       )oxi;

And the list of tested addresses:

These all should work

mailto:pitiful@example.com
At endSentence@some.museum. trailing dot must not be in the link
~~byIP@192.168.1.10~~ wrong - needs square brackets: byIP@[192.168.1.10]
"Some Name"@blah.com - Quoted user name with spaces
colon:name@blah.com
_somename@example.com - Leading underscore okay - but messes up rendering
mailto:_somename@example.com Enter with explicit mailto if underscore on same line
$A12345@example.com
def!xyz%abc@example.com
blah%abc@example.com
blah!asdf@example.com
customer/department=shipping@example.com
user+mailbox@example.com
René.Descartes@example.com

-- GeorgeClark - 24 Aug 2011

Committed fix to trunk - This topic appears to render correctly now.

-- GeorgeClark - 25 Aug 2011

There is CPAN:Email::Validate, or at least CPAN:Mail::RFC822::Address, which builds THE proper regular expression (read "Mastering Regular Expressions" to know why I call it that) to validate email addresses as per RFC 822.

Either we do it right, or we don't do it at all. And I'd rather use something which has been around for a long time, scrutinized and tested, than try and come up with something that matches what I think it should. Hint: read RFC822 just to see how little you knew of what is allowed in an e-mail address.

-- OlivierRaginel - 25 Aug 2011

Fine. Reverting. BTW RFC:822 is obsoleted by RFC:2822, and further clarified in RFC:3696. I was not trying to validate every possible email address, only trying to find which ones in a topic should be clickable links, and to ensure that they are properly escaped and passed to an email client in a usable fashion. The suggested CPAN modules don't "find" addresses, they validate a string that has already been identified as an address.

I'll let someone else handle this one.

-- GeorgeClark - 25 Aug 2011

The more I look at this the more I believe that I was on the right track. You have clearly identified that the regex I've been building doesn't validate email addresses. And have pointed out some well tested validation routines. But that's not the problem I was trying to solve. And I don't think it needs solving. For whatever reason, you've chosen to make it a bit of a personal attack - "Read RFC822 just to see how little you know...". Thanks. That's really motivational. If I was actually writing code to validate email addresses you might have a point.

So now that I know your opinion on my knowledge and reading skills, let's review the challenge. I don't believe that we should find and auto-link every possibly valid email address. That's why we have square-bracket notation. The render routines should find and correctly auto-link common email addresses written in-line. Anything more complex can be handled with the square brackets.

RFC:2822 obsoletes several email formats defined in RFC:822. We should not support the obsoleted formats. This includes "mail routing" and other deprecated formats. RFC:2822 also permits an address built with a free-form Display Name followed by the address in angle-brackets. I don't believe we should support this format either. There is no clear starting point for the display name.

Please contact Joe the Admin <joe@blah.com>

(Where does the address begin?)

RFC:3696 Section 4.3 covers the encoding of email addresses into the MAILTO: URL. That is the RFC I was working from - as that's the purpose of our rendering code - to transform a casually written email address into a valid MAILTO: URL.

The changes I made:

Finds many common email addresses
Purposely does NOT try to find ALL supported formats.
Resolves the appended dot issue identified by ArthurClemens
Resolves an I18N issue identified by PaulHarvey
Is NOT a validation routine and does not claim to be

I'm going to re-apply my changes, and it would be helpful if you could make some constructive comments on the solution.

-- GeorgeClark - 27 Aug 2011
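The difference between finding addresses and validating them can be made concrete. The sketch below is an illustration under stated assumptions rather than the code in Render.pm: autolink is a hypothetical function, the pattern is a reduced ASCII-only variant of the regexes discussed above, and HTML escaping of the surrounding text is left out.

```perl
use strict;
use warnings;

# Reduced, ASCII-only pattern for "common" in-line addresses.
# It is meant to find likely addresses, not to validate them.
my $common_email = qr{
    [a-z0-9_+-]+ (?: \. [a-z0-9_+-]+ )*     # local part: dot-separated atoms
    \@
    [a-z0-9-]+ (?: \. [a-z0-9-]+ )*         # host labels
    \. [a-z]{2,}                            # final label, letters only
    \b
}xi;

# Hypothetical helper: wrap every address found in $text in a mailto link.
sub autolink {
    my ($text) = @_;
    $text =~ s{($common_email)}{<a href="mailto:$1">$1</a>}g;
    return $text;
}

print autolink('Mail joe@blah.com. Thanks'), "\n";
print autolink('Please contact Joe the Admin <joe@blah.com>'), "\n";
```

In the first line the sentence-ending dot stays outside the link. In the second, only the bare address inside the angle brackets is linked and the display name is left as plain text, which is the behaviour argued for above.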

I support your approach to this task, George. Very good work.

-- KennethLavrsen - 28 Aug 2011

Unfortunately the I18N example, René.Descartes@example.com, does not appear to be legal in any of the standards that I can find. The latest email standard, RFC:5322, still does not permit I18N characters either in atoms or inside quoted strings.

Paul, since I don't have any internationalization experience: are you sure that the accented characters in your example are legal? Do you know how they should be encoded or quoted? This speaks to the point that Olivier makes.

-- GeorgeClark - 30 Aug 2011

Some of the RFCs are either not up to date or wrong.

Even domain names can contain non A-Z characters. It is possible in Denmark to register a domain name with æøå. It is not recommended for obvious reasons, but some have done it.

So at all costs, avoid A-Za-z in the regexes, both before and after the @.

-- KennethLavrsen - 31 Aug 2011

Is just allowing these characters in the mailto: URL sufficient? And should it be configurable, depending upon whether or not localization is enabled? From what I could gather from Google, I18N support is primarily an email client requirement. It's up to the client to translate the non-ASCII domain name into xn-- punycode style domains.

I suspect that some of this is falling into "feature request" territory, but:

Make the Alnum vs. A-Z0-9 regex selected by {UserInterfaceInternationalisation}
Move the TLD list into the config? (Or just accept any TLD?)

Anything else?

There also seems to be a fallback email address format. < non-ascii@idn <fallback-ascii@ascii> > or something like that. Do we need to deal with this?

-- GeorgeClark - 31 Aug 2011
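The two configuration ideas listed in the comment above could be sketched roughly as follows. Everything here is hypothetical: the settings are passed in as a plain hash rather than read from $Foswiki::cfg, EmailTLDs is an invented key for the list of longer TLDs, and build_email_regex is not an existing Foswiki function.

```perl
use strict;
use warnings;

# Hypothetical builder: choose the character class and the TLD list
# from configuration instead of hard-coding them.
sub build_email_regex {
    my (%cfg) = @_;

    # I18N on: any Unicode letter or digit; off: ASCII only.
    my $alnum = $cfg{UserInterfaceInternationalisation}
      ? qr/\p{Alnum}/
      : qr/[a-z0-9]/i;

    # Longer TLDs come from the (invented) EmailTLDs setting.
    my $tlds = join '|', map { quotemeta($_) } @{ $cfg{EmailTLDs} || [] };
    my $tld  = length($tlds) ? qr/(?:[a-z]{2,3}|$tlds)/i : qr/[a-z]{2,3}/i;

    return qr{
        (?: $alnum | [_-] )+ (?: \. (?: $alnum | [_-] )+ )*
        \@
        [a-z0-9-]+ (?: \. [a-z0-9-]+ )*
        \. $tld \b
    }xi;
}

my @tlds     = qw(aero asia coop info jobs mobi museum name);
my $ascii_re = build_email_regex( EmailTLDs => \@tlds );
my $i18n_re  = build_email_regex(
    UserInterfaceInternationalisation => 1,
    EmailTLDs                         => \@tlds,
);

# "René.Descartes@example.com", with the e-acute written as an escape.
my $addr = "Ren\x{e9}.Descartes\@example.com";
print 'ASCII: ', ( $addr =~ /\A$ascii_re\z/ ? 'match' : 'no match' ), "\n";
print 'I18N:  ', ( $addr =~ /\A$i18n_re\z/  ? 'match' : 'no match' ), "\n";
```

The checks are anchored with \A and \z on purpose. Unanchored, the ASCII variant would still find the tail Descartes@example.com inside the longer address, which is one reason a partial link can be worse than no link.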

RFC:5335 tries to allow accented characters in the address, but I doubt anybody has implemented it yet. http://en.wikipedia.org/wiki/E-mail%5Faddress#Internationalization explains the background.

Also Kenneth, yes, DNS supports 'accented' characters, but as Wikipedia points out:

Although the Domain Name System supports non-ASCII characters,
applications such as e-mail and web browsers restrict the characters which can
be used as domain names for purposes such as a hostname. Strictly speaking it
is the network protocols these applications use that have restrictions on the
characters which can be used in domain names, not the applications that have
these limitations or the DNS itself. To retain backwards compatibility with the
installed base the IETF IDNA Working Group decided that internationalized
domain names should be converted to a suitable ASCII-based form that could be
handled by web browsers and other user applications. IDNA specifies how this
conversion between names written in non-ASCII characters and their ASCII-based
representation is performed.

So, I vote for: ASCII is fine for the next 5 years, and let's revisit once people start using funky email addresses (might happen the same year as IPv6).

-- OlivierRaginel - 31 Aug 2011

In thinking about this a bit, we probably need to be very careful about linking these addresses as well because of the security implications. See this blog for one description of the attack path:

Dodgy Domain Names: IDNs & how email clients deal with the massive threat

All we are doing is converting the email address into a clickable link. From that aspect we don't have any responsibility of dealing with the IDNA domains, or an I18N "local part" of the address. That responsibility falls on the email client that handles the link. However especially for public sites, it would be another vector for evil people to insert links that appear as one domain but send to another.

So I'll agree with Olivier, with one caveat: if we have a business customer who needs I18N email addresses linked for their internal email, then we should make it configurable.

-- GeorgeClark - 31 Aug 2011

There's something we're all forgetting when reading these RFCs. Many of them detail lower-level stuff that frankly, as a thing that merely generates (X)HTML documents, Foswiki shouldn't have to care about. None of the RFCs mentioned so far in this task are mentioned in the XHTML 1.0 or HTML 4.01 standards documents that I've briefly searched in.

AFAICT the only RFC we need to care about when generating (X)HTML documents, when it comes to generating links, is RFC:2396, which XHTML 1.0 refers to explicitly.
I think it's odd to try to second-guess what a web browser/E-mail client would/should do with I18N addresses. But as we're talking about autolinked email addresses, I guess an [a-z] type regex is probably fine. I think it's worth remembering, though, that even wget and curl know how to properly escape international characters into the required percent-encoded form (e.g. %20...) when building an HTTP request, regardless of terminal charset. I think we tend to worry too much (forgetting separation of concerns) when building (X)HTML markup in Foswiki.
Having said that, there seems to be some sentiment for deliberately crippling mailto <a ... links in an (X)HTML document over security concerns. If so, why doesn't this apply to http, ftp, https links?

Having said all this, I don't claim to have even 5% of the knowledge/experience required to make a confident call on this. I could be totally wrong.

-- PaulHarvey - 03 Sep 2011
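The escaping mentioned in the comment above can be illustrated in a few lines of Perl. This sketch assumes the address is held as a decoded character string and that non-ASCII characters are to be percent-encoded as UTF-8 bytes; percent_encode is a hypothetical helper, and whether a given mail client accepts the result is a separate question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a string for use in a URL.
sub percent_encode {
    my ($text) = @_;
    my $bytes = encode( 'UTF-8', $text );

    # Leave letters, digits, ". _ ~ -" and "@" alone; escape every other byte.
    $bytes =~ s/([^A-Za-z0-9._~\@-])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

# "René.Descartes@example.com", with the e-acute written as an escape.
print percent_encode("Ren\x{e9}.Descartes\@example.com"), "\n";
# prints Ren%C3%A9.Descartes@example.com
```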

For now I'm leaving the regex as standard ASCII. It would be a simple change to switch it to I18N alnum for the left or right side of the email address, or it can be made configurable. Checked into Release 1.1 now too.

-- GeorgeClark - 22 Sep 2011

ItemTemplate

Summary	Email address followed by a dot generates email link with dot included
ReportedBy	ArthurClemens
Codebase	1.1.3, trunk
SVN Range
AppliesTo	Engine
Component
Priority	Normal
CurrentState	Closed
WaitingFor
Checkins	distro:b4a66cb87b3d distro:a607d40be597 distro:60033e18fa3c distro:a9066ea841c9 distro:adb48741ae35
TargetRelease	patch
ReleasedIn	1.1.4