r75682 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r75681‎ | r75682 | r75683 >
Date:22:03, 29 October 2010
Author:hashar
Status:resolved (Comments)
Tags:brion 
Comment:
Follow up r75627. Implements r75670 in PHP to validate emails.

* Server side validation of email according to an HTML5 specifications provided by Simetrical :
http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-type-attribute.html#e-mail-state
* This is NOT a fix of bug 959 (which wants RFC 2822 validation)
* Basic unit tests
Modified paths:
  • /trunk/phase3/includes/User.php (modified) (history)
  • /trunk/phase3/maintenance/tests/phpunit/includes/UserIsValidEmailAddrTest.php (added) (history)

Diff [purge]

Index: trunk/phase3/maintenance/tests/phpunit/includes/UserIsValidEmailAddrTest.php
@@ -0,0 +1,63 @@
 2+<?php
 3+
 4+class UserIsValidEmailAddrTest extends PHPUnit_Framework_TestCase {
 5+
 6+ private function testEmail( $addr, $expected = true, $msg = '') {
 7+ $this->assertEquals(
 8+ $expected,
 9+ User::isValidEmailAddr( $addr ),
 10+ $msg
 11+ );
 12+ }
 13+ private function valid( $addr, $msg = '' ) {
 14+ $this->testEmail( $addr, true, $msg );
 15+ }
 16+ private function invalid( $addr, $msg = '' ) {
 17+ $this->testEmail( $addr, false, $msg );
 18+ }
 19+
 20+ function testEmailWellKnownUserAtHostDotTldAreValid() {
 21+ $this->valid( 'user@example.com' );
 22+ $this->valid( 'user@example.museum' );
 23+ }
 24+ function testEmailWithUpperCaseCharactersAreValid() {
 25+ $this->valid( 'USER@example.com' );
 26+ $this->valid( 'user@EXAMPLE.COM' );
 27+ $this->valid( 'user@Example.com' );
 28+ $this->valid( 'USER@eXAMPLE.com' );
 29+ }
 30+ function testEmailWithAPlusInUserName() {
 31+ $this->valid( 'user+sub@localdomain' );
 32+ }
 33+ function testEmailWithWhiteSpacesBeforeOrAfterAreInvalids() {
 34+ $this->invalid( " user@host" );
 35+ $this->invalid( "user@host " );
 36+ $this->invalid( "\tuser@host" );
 37+ $this->invalid( "user@host\t" );
 38+ }
 39+ function testEmailWithWhiteSpacesAreInvalids() {
 40+ $this->invalid( "User user@host" );
 41+ $this->invalid( "first last@mycompany" );
 42+ $this->invalid( "firstlast@my company" );
 43+ }
 44+ function testEmailDomainCanNotBeginWithDot() {
 45+ $this->invalid( "user@." );
 46+ $this->invalid( "user@.localdomain" );
 47+ $this->valid( "user@localdomain." );
 48+ $this->valid( "user.@localdomain" );
 49+ $this->valid( ".@localdomain" );
 50+ $this->valid( ".@a............" );
 51+ }
 52+ function testEmailWithFunnyCharacters() {
 53+ $this->valid( "\$user!ex{this}@123.com" );
 54+ }
 55+ function testEmailTopLevelDomainCanBeNumerical() {
 56+ $this->valid( "user@example.1234" );
 57+ }
 58+ function testEmailWithoutAtSignIsInvalid() {
 59+ $this->invalid( 'useràexample.com' );
 60+ }
 61+ function testEmailWithOneCharacterDomainIsInvalid() {
 62+ $this->invalid( 'user@a' );
 63+ }
 64+}
Index: trunk/phase3/includes/User.php
@@ -648,8 +648,19 @@
649649 if( !wfRunHooks( 'isValidEmailAddr', array( $addr, &$result ) ) ) {
650650 return $result;
651651 }
 652+ $rfc5322_atext = "a-z0-9!#$%&'*+-\/=?^_`{|}—~" ;
 653+ $rfc1034_ldh_str = "a-z0-9-" ;
652654
653 - return strpos( $addr, '@' ) !== false;
 655+ $HTML5_email_regexp = "/
 656+ ^ # start of string
 657+ [$rfc5322_atext\\.]+ # user part which is liberal :p
 658+ @ # 'apostrophe'
 659+ [$rfc1034_ldh_str] # Domain first character
 660+ [$rfc1034_ldh_str\\.]+ # Second char and following can include dot
 661+ $ # End of string
 662+ /ix" ; // case Insensitive, eXtended
 663+
 664+ return (bool) preg_match( $HTML5_email_regexp, $addr );
654665 }
655666
656667 /**

Follow-up revisions

RevisionCommit summaryAuthorDate
r75690Follow up r75682 : fix private function naming....hashar22:42, 29 October 2010
r75876Fix misinterpration of HTML5 specification for email validation....hashar20:39, 2 November 2010
r80694Hack invalid w3 spec to validate @localhost email...hashar18:01, 21 January 2011
r80913User::isValidEmailAddr comment update...hashar20:31, 24 January 2011

Past revisions this follows-up on

RevisionCommit summaryAuthorDate
r75627Lame user side email validator using JQuery....hashar21:01, 28 October 2010
r75670Follow up r75627 (code review comments):...hashar20:44, 29 October 2010

Comments

#Comment by Simetrical (talk | contribs)   00:28, 1 November 2010

See comments about the regex in code review for r75627.

#Comment by Hashar (talk | contribs)   20:12, 3 November 2010

The regexp should be fixed with r75876

#Comment by Raymond (talk | contribs)   09:55, 4 November 2010

newuser@localhost is a valid e-mail address but rejected :-(

#Comment by Simetrical (talk | contribs)   21:21, 4 November 2010

Seems like it would be nice if that were valid for testing purposes, yeah. I filed a bug against the spec: http://www.w3.org/Bugs/Public/show_bug.cgi?id=11225

#Comment by Reedy (talk | contribs)   19:30, 7 January 2011

Marking as resolved, added upstream and remote bug on bug 22449

#Comment by Raymond (talk | contribs)   19:41, 7 January 2011

Sorry, this still breaks every wiki that use @localhost. I propose to add a hack that allows @localhost.

#Comment by Hashar (talk | contribs)   15:47, 23 January 2011

This should be fixed by followup r80694

Marking 'new'

#Comment by Brion VIBBER (talk | contribs)   06:29, 24 January 2011

The doc comment for User::isValidEmailAddr() still says:

* There used to be a regular expression here, it got removed because it
* rejected valid addresses. Actually just check if there is '@' somewhere
* in the given address.

Are there any known failure cases that caused the original regex to be removed? If so, they should be added to the unit tests.

Either way, comment needs to be corrected.


The domain portion of the validation doesn't appear to match spec:

"A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5. [ABNF] [RFC5322] [RFC1034]"

ldh-str per [1] should probably be something like "(?:[a-z][a-z0-9-]{0,61}[a-z0-9])".

#Comment by Simetrical (talk | contribs)   13:14, 24 January 2011

How do you read ldh-str that way? It looks like [a-z0-9-]+ to me.

#Comment by Brion VIBBER (talk | contribs)   17:18, 24 January 2011

On closer look, you're right. It's label that actually follows the DNS restrictions; ldh-str is more permissive than DNS allows. I'm not sure it's correct for the HTML 5 spec to be listing ldh-str here then, but that's what it currently says. :)

That should leave us with just the comment fix.

#Comment by Hashar (talk | contribs)   20:31, 24 January 2011

Comment updated with r80913

Marking 'new'

#Comment by Brion VIBBER (talk | contribs)   21:31, 24 January 2011

Ok, with the comment updates it looks like we're now pretty golden. I went back into bugzilla:989 for what the probs with the earlier validation code were; looks like mainly it was the '+' usage which is covered in the test cases. We might watch out for IDN stuff in future, but we'll watch out for that separately.

Marking this as resolved.

#Comment by Simetrical (talk | contribs)   19:50, 26 January 2011

The spec notes that it's a willful violation of the RFC, because the RFC's definition is too strict, lax, and vague. Not to mention absurdly complicated; HTML5's definition is actually comprehensible.

#Comment by Brion VIBBER (talk | contribs)   20:00, 26 January 2011

For the domain part, matching the actual rules of DNS makes a reasonable amount of sense. :) But a superset of valid domain parts doesn't hurt much here as invalid ones will just fail to work, and the things it would catch probably aren't common typo cases.

#Comment by Brion VIBBER (talk | contribs)   20:02, 26 January 2011

(it's the local part that's the real pain to work with per the 'official' mail specs)

Status & tagging log