r98281 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r98280‎ | r98281 | r98282 >
Date:22:51, 27 September 2011
Author:brion
Status:ok (Comments)
Tags:
Comment:
* (bug 31187) Fix for user JavaScript validation to allow identifiers with valid Unicode letter characters

Followup r91591, r93020: patch to jsminplus to support Unicode chars and char escapes in identifiers

Fast-path check keeps runtime about the same on most scripts (eg jquery.js parsing was abround 4100ms both before and after on my test machine)
Slow-path code kicks in if plain ASCII word chars don't extend all the way to the next whitespace or punctuation char.
Using PCRE's Unicode properties magic to ensure that we're catching everything, following ECMA-262 edition 5.1 spec.

Note that identifiers using escapes don't get normalized to their UTF-8 form; this might be a nice thing to do as it saves a couple bytes, but currently there's no change made to output.


Added QUnit tests to verify that unicode letter & escapes work in identifiers in all supported browsers (ok back to IE 6, yay)
Modified paths:
  • /trunk/phase3/includes/libs/jsminplus.php (modified) (history)
  • /trunk/phase3/tests/phpunit/includes/libs/JavaScriptMinifierTest.php (modified) (history)
  • /trunk/phase3/tests/qunit/index.html (modified) (history)
  • /trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js (added) (history)

Diff [purge]

Index: trunk/phase3/tests/qunit/index.html
@@ -79,6 +79,7 @@
8080 <script src="data/testrunner.js"></script>
8181
8282 <!-- QUnit: Load test suites (maintain the same order as above please) -->
 83+ <script src="suites/resources/mediawiki/mediawiki.jscompat.test.js"></script>
8384 <script src="suites/resources/mediawiki/mediawiki.test.js"></script>
8485 <script src="suites/resources/mediawiki/mediawiki.user.test.js"></script>
8586
Index: trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js
@@ -0,0 +1,35 @@
 2+/* Some misc JavaScript compatibility tests, just to make sure the environments we run in are consistent */
 3+
 4+module( 'mediawiki.jscompat' );
 5+
 6+test( 'Variable with Unicode letter in name', function() {
 7+ expect(3);
 8+ var orig = "some token";
 9+ var ŝablono = orig;
 10+ deepEqual( ŝablono, orig, 'ŝablono' );
 11+ deepEqual( \u015dablono, orig, '\\u015dablono' );
 12+ deepEqual( \u015Dablono, orig, '\\u015Dablono' );
 13+});
 14+
 15+/*
 16+// Not that we need this. ;)
 17+// This fails on IE 6-8
 18+// Works on IE 9, Firefox 6, Chrome 14
 19+test( 'Keyword workaround: "if" as variable name using Unicode escapes', function() {
 20+ var orig = "another token";
 21+ \u0069\u0066 = orig;
 22+ deepEqual( \u0069\u0066, orig, '\\u0069\\u0066' );
 23+});
 24+*/
 25+
 26+/*
 27+// Not that we need this. ;)
 28+// This fails on IE 6-9
 29+// Works on Firefox 6, Chrome 14
 30+test( 'Keyword workaround: "if" as member variable name using Unicode escapes', function() {
 31+ var orig = "another token";
 32+ var foo = {};
 33+ foo.\u0069\u0066 = orig;
 34+ deepEqual( foo.\u0069\u0066, orig, 'foo.\\u0069\\u0066' );
 35+});
 36+*/
Property changes on: trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js
___________________________________________________________________
Added: svn:eol-style
137 + native
Index: trunk/phase3/tests/phpunit/includes/libs/JavaScriptMinifierTest.php
@@ -78,6 +78,12 @@
7979
8080 // newline insertion after 1000 chars: break after the "++", not before
8181 array( str_repeat( ';', 996 ) . "if(x++);", str_repeat( ';', 996 ) . "if(x++\n);" ),
 82+
 83+ // Unicode letter characters should pass through ok in identifiers (bug 31187)
 84+ array( "var KaŝSkatolVal = {}", 'var KaŝSkatolVal={}'),
 85+ // And also per spec unicode char escape values should work in identifiers,
 86+ // as long as it's a valid char. In future it might get normalized.
 87+ array( "var Ka\\u015dSkatolVal = {}", 'var Ka\\u015dSkatolVal={}'),
8288 );
8389 }
8490
Index: trunk/phase3/includes/libs/jsminplus.php
@@ -2029,13 +2029,55 @@
20302030 break;
20312031
20322032 default:
2033 - // FIXME: add support for unicode and unicode escape sequence \uHHHH
2034 - if (preg_match('/^[$\w]+/', $input, $match))
 2033+ // Fast path for identifiers: word chars followed by whitespace or various other tokens.
 2034+ // Note we don't need to exclude digits in the first char, as they've already been found
 2035+ // above.
 2036+ if (!preg_match('/^[$\w]+(?=[\s\/\|\^\&<>\+\-\*%=!.;,\?:~\[\]\{\}\(\)@])/', $input, $match))
20352037 {
2036 - $tt = in_array($match[0], $this->keywords) ? $match[0] : TOKEN_IDENTIFIER;
 2038+ // Character classes per ECMA-262 edition 5.1 section 7.6
 2039+ // Per spec, must accept Unicode 3.0, *may* accept later versions.
 2040+ // We'll take whatever PCRE understands, which should be more recent.
 2041+ $identifierStartChars = "\\p{L}\\p{Nl}" . # UnicodeLetter
 2042+ "\$" .
 2043+ "_";
 2044+ $identifierPartChars = $identifierStartChars .
 2045+ "\\p{Mn}\\p{Mc}" . # UnicodeCombiningMark
 2046+ "\\p{Nd}" . # UnicodeDigit
 2047+ "\\p{Pc}"; # UnicodeConnectorPunctuation
 2048+ $unicodeEscape = "\\\\u[0-9A-F-a-f]{4}";
 2049+ $identifierRegex = "/^" .
 2050+ "(?:[$identifierStartChars]|$unicodeEscape)" .
 2051+ "(?:[$identifierPartChars]|$unicodeEscape)*" .
 2052+ "/uS";
 2053+ if (preg_match($identifierRegex, $input, $match))
 2054+ {
 2055+ if (strpos($match[0], '\\') !== false) {
 2056+ // Per ECMA-262 edition 5.1, section 7.6 escape sequences should behave as if they were
 2057+ // the original chars, but only within the boundaries of the identifier.
 2058+ $decoded = preg_replace_callback('/\\\\u([0-9A-Fa-f]{4})/',
 2059+ array(__CLASS__, 'unicodeEscapeCallback'),
 2060+ $match[0]);
 2061+
 2062+ // Since our original regex didn't de-escape the originals, we need to check for validity again.
 2063+ // No need to worry about token boundaries, as anything outside the identifier is illegal!
 2064+ if (!preg_match("/^[$identifierStartChars][$identifierPartChars]*$/u", $decoded)) {
 2065+ throw $this->newSyntaxError('Illegal token');
 2066+ }
 2067+
 2068+ // Per spec it _ought_ to work to use these escapes for keywords words as well...
 2069+ // but IE rejects them as invalid, while Firefox and Chrome treat them as identifiers
 2070+ // that don't match the keyword.
 2071+ if (in_array($decoded, $this->keywords)) {
 2072+ throw $this->newSyntaxError('Illegal token');
 2073+ }
 2074+
 2075+ // TODO: save the decoded form for output?
 2076+ }
 2077+ }
 2078+ else
 2079+ throw $this->newSyntaxError('Illegal token');
20372080 }
2038 - else
2039 - throw $this->newSyntaxError('Illegal token');
 2081+ $tt = in_array($match[0], $this->keywords) ? $match[0] : TOKEN_IDENTIFIER;
20402082 }
20412083 }
20422084
@@ -2073,6 +2115,11 @@
20742116 {
20752117 return new Exception('Parse error: ' . $m . ' in file \'' . $this->filename . '\' on line ' . $this->lineno);
20762118 }
 2119+
 2120+ public static function unicodeEscapeCallback($m)
 2121+ {
 2122+ return html_entity_decode('&#x' . $m[1]. ';', ENT_QUOTES, 'UTF-8');
 2123+ }
20772124 }
20782125
20792126 class JSToken

Follow-up revisions

RevisionCommit summaryAuthorDate
r98285MFT r98281: (bug 31187) patch JSMin+ to support valid Unicode characters in J...brion23:15, 27 September 2011
r98303MFT r98281tstarling11:58, 28 September 2011

Past revisions this follows-up on

RevisionCommit summaryAuthorDate
r91591Followup r83885: add JSMin+ 1.3 to use its parser to verify output of JavaScr...brion20:02, 6 July 2011
r93020Update JSMin+ to the newly released 1.4...platonides21:46, 24 July 2011

Comments

#Comment by Brion VIBBER (talk | contribs)   23:18, 27 September 2011

Per bug 31187 a workaround is in use on eo.wikipedia.org for some affected scripts, but this'd be nice to get deployed. :) I'm not merging direct to 1.18wmf1 as the process seems to get confused when someone who can't immediately push code merges to the deployment branch.

Also needs a fix from r94614 for the QUnit test case.

Already merged to 1.18 with the above as r98285.

#Comment by Krinkle (talk | contribs)   01:29, 28 September 2011
+	public static function unicodeEscapeCallback($m)
+	{
+		return html_entity_decode('&#x' . $m[1]. ';', ENT_QUOTES, 'UTF-8');
+	}

Nothing serious but ahm, I hear those whitespace bells ringing, in a winter wonder web. ;D

#Comment by Brion VIBBER (talk | contribs)   01:35, 28 September 2011

Can you clarify? Note that this is a patch to an external library with its own coding style, which I ve tried to match.

#Comment by Krinkle (talk | contribs)   05:07, 28 September 2011

Which is why it wasn't serious. I just find it interesting whenever I see this coding style.

#Comment by Duplicatebug (talk | contribs)   19:25, 28 September 2011

"Note that identifiers using escapes don't get normalized to their UTF-8 form; this might be a nice thing to do as it saves a couple bytes, but currently there's no change made to output."

Sounds like a job for a Minifier (like JavaScriptMinifier.php), not for a Validator (jsminplus.php is used as that in the Resource Loader)

#Comment by Brion VIBBER (talk | contribs)   19:28, 28 September 2011

Indeed, I was in no hurry in part because it wouldn't actually affect our usage. ;)

#Comment by Duplicatebug (talk | contribs)   19:21, 30 September 2011

Ok. Tracked now as bug 31286

#Comment by Platonides (talk | contribs)   20:39, 18 October 2011

Send patch upstream?

#Comment by Brion VIBBER (talk | contribs)   00:27, 1 November 2011

I emailed it to the upstream maintainer at the time; haven't heard back. No source control or bug tracker to submit to.

#Comment by Platonides (talk | contribs)   00:32, 1 November 2011

That's odd, he answered me quite promptly.

#Comment by Brion VIBBER (talk | contribs)   00:37, 1 November 2011

I may have gotten spam-filtered or something. :) Feel free to send him another update since your email doesn't bounce!

#Comment by Platonides (talk | contribs)   16:30, 9 November 2011

Can you sent me a copy, so that I can forward it?

Status & tagging log