r98281 MediaWiki - Code Review archive

Repository:	MediaWiki
Revision:	< r98280‎ \| r98281 \| r98282 >
Date:	22:51, 27 September 2011
Author:	brion
Status:	ok (Comments)
Tags:
Comment:	* (bug 31187) Fix for user JavaScript validation to allow identifiers with valid Unicode letter characters Followup r91591, r93020: patch to jsminplus to support Unicode chars and char escapes in identifiers Fast-path check keeps runtime about the same on most scripts (eg jquery.js parsing was abround 4100ms both before and after on my test machine) Slow-path code kicks in if plain ASCII word chars don't extend all the way to the next whitespace or punctuation char. Using PCRE's Unicode properties magic to ensure that we're catching everything, following ECMA-262 edition 5.1 spec. Note that identifiers using escapes don't get normalized to their UTF-8 form; this might be a nice thing to do as it saves a couple bytes, but currently there's no change made to output. Added QUnit tests to verify that unicode letter & escapes work in identifiers in all supported browsers (ok back to IE 6, yay)
Modified paths:	/trunk/phase3/includes/libs/jsminplus.php (modified) (history) /trunk/phase3/tests/phpunit/includes/libs/JavaScriptMinifierTest.php (modified) (history) /trunk/phase3/tests/qunit/index.html (modified) (history) /trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js (added) (history)

Diff [purge]

Index: trunk/phase3/tests/qunit/index.html
—	—	@@ -79,6 +79,7 @@
80	80	<script src="data/testrunner.js"></script>
81	81
82	82	<!-- QUnit: Load test suites (maintain the same order as above please) -->
	83	+ <script src="suites/resources/mediawiki/mediawiki.jscompat.test.js"></script>
83	84	<script src="suites/resources/mediawiki/mediawiki.test.js"></script>
84	85	<script src="suites/resources/mediawiki/mediawiki.user.test.js"></script>
85	86
Index: trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js
—	—	@@ -0,0 +1,35 @@
	2	+/* Some misc JavaScript compatibility tests, just to make sure the environments we run in are consistent */
	3	+
	4	+module( 'mediawiki.jscompat' );
	5	+
	6	+test( 'Variable with Unicode letter in name', function() {
	7	+ expect(3);
	8	+ var orig = "some token";
	9	+ var ŝablono = orig;
	10	+ deepEqual( ŝablono, orig, 'ŝablono' );
	11	+ deepEqual( \u015dablono, orig, '\\u015dablono' );
	12	+ deepEqual( \u015Dablono, orig, '\\u015Dablono' );
	13	+});
	14	+
	15	+/*
	16	+// Not that we need this. ;)
	17	+// This fails on IE 6-8
	18	+// Works on IE 9, Firefox 6, Chrome 14
	19	+test( 'Keyword workaround: "if" as variable name using Unicode escapes', function() {
	20	+ var orig = "another token";
	21	+ \u0069\u0066 = orig;
	22	+ deepEqual( \u0069\u0066, orig, '\\u0069\\u0066' );
	23	+});
	24	+*/
	25	+
	26	+/*
	27	+// Not that we need this. ;)
	28	+// This fails on IE 6-9
	29	+// Works on Firefox 6, Chrome 14
	30	+test( 'Keyword workaround: "if" as member variable name using Unicode escapes', function() {
	31	+ var orig = "another token";
	32	+ var foo = {};
	33	+ foo.\u0069\u0066 = orig;
	34	+ deepEqual( foo.\u0069\u0066, orig, 'foo.\\u0069\\u0066' );
	35	+});
	36	+*/
Property changes on: trunk/phase3/tests/qunit/suites/resources/mediawiki/mediawiki.jscompat.test.js
___________________________________________________________________
Added: svn:eol-style
1	37	+ native
Index: trunk/phase3/tests/phpunit/includes/libs/JavaScriptMinifierTest.php
—	—	@@ -78,6 +78,12 @@
79	79
80	80	// newline insertion after 1000 chars: break after the "++", not before
81	81	array( str_repeat( ';', 996 ) . "if(x++);", str_repeat( ';', 996 ) . "if(x++\n);" ),
	82	+
	83	+ // Unicode letter characters should pass through ok in identifiers (bug 31187)
	84	+ array( "var KaŝSkatolVal = {}", 'var KaŝSkatolVal={}'),
	85	+ // And also per spec unicode char escape values should work in identifiers,
	86	+ // as long as it's a valid char. In future it might get normalized.
	87	+ array( "var Ka\\u015dSkatolVal = {}", 'var Ka\\u015dSkatolVal={}'),
82	88	);
83	89	}
84	90
Index: trunk/phase3/includes/libs/jsminplus.php
—	—	@@ -2029,13 +2029,55 @@
2030	2030	break;
2031	2031
2032	2032	default:
2033		~~- // FIXME: add support for unicode and unicode escape sequence \uHHHH~~
2034		~~- if (preg_match('/^[$\w]+/', $input, $match))~~
	2033	+ // Fast path for identifiers: word chars followed by whitespace or various other tokens.
	2034	+ // Note we don't need to exclude digits in the first char, as they've already been found
	2035	+ // above.
	2036	+ if (!preg_match('/^[$\w]+(?=[\s\/\\|\^\&<>\+\-\*%=!.;,\?:~\[\]\{\}@])/', $input, $match))
2035	2037	{
2036		~~- $tt = in_array($match[0], $this->keywords) ? $match[0] : TOKEN_IDENTIFIER;~~
	2038	+ // Character classes per ECMA-262 edition 5.1 section 7.6
	2039	+ // Per spec, must accept Unicode 3.0, may accept later versions.
	2040	+ // We'll take whatever PCRE understands, which should be more recent.
	2041	+ $identifierStartChars = "\\p{L}\\p{Nl}" . # UnicodeLetter
	2042	+ "\$" .
	2043	+ "_";
	2044	+ $identifierPartChars = $identifierStartChars .
	2045	+ "\\p{Mn}\\p{Mc}" . # UnicodeCombiningMark
	2046	+ "\\p{Nd}" . # UnicodeDigit
	2047	+ "\\p{Pc}"; # UnicodeConnectorPunctuation
	2048	+ $unicodeEscape = "\\\\u[0-9A-F-a-f]{4}";
	2049	+ $identifierRegex = "/^" .
	2050	+ "(?:[$identifierStartChars]\|$unicodeEscape)" .
	2051	+ "(?:[$identifierPartChars]\|$unicodeEscape)*" .
	2052	+ "/uS";
	2053	+ if (preg_match($identifierRegex, $input, $match))
	2054	+ {
	2055	+ if (strpos($match[0], '\\') !== false) {
	2056	+ // Per ECMA-262 edition 5.1, section 7.6 escape sequences should behave as if they were
	2057	+ // the original chars, but only within the boundaries of the identifier.
	2058	+ $decoded = preg_replace_callback('/\\\\u([0-9A-Fa-f]{4})/',
	2059	+ array(__CLASS__, 'unicodeEscapeCallback'),
	2060	+ $match[0]);
	2061	+
	2062	+ // Since our original regex didn't de-escape the originals, we need to check for validity again.
	2063	+ // No need to worry about token boundaries, as anything outside the identifier is illegal!
	2064	+ if (!preg_match("/^[$identifierStartChars][$identifierPartChars]*$/u", $decoded)) {
	2065	+ throw $this->newSyntaxError('Illegal token');
	2066	+ }
	2067	+
	2068	+ // Per spec it _ought_ to work to use these escapes for keywords words as well...
	2069	+ // but IE rejects them as invalid, while Firefox and Chrome treat them as identifiers
	2070	+ // that don't match the keyword.
	2071	+ if (in_array($decoded, $this->keywords)) {
	2072	+ throw $this->newSyntaxError('Illegal token');
	2073	+ }
	2074	+
	2075	+ // TODO: save the decoded form for output?
	2076	+ }
	2077	+ }
	2078	+ else
	2079	+ throw $this->newSyntaxError('Illegal token');
2037	2080	}
2038		~~- else~~
2039		~~- throw $this->newSyntaxError('Illegal token');~~
	2081	+ $tt = in_array($match[0], $this->keywords) ? $match[0] : TOKEN_IDENTIFIER;
2040	2082	}
2041	2083	}
2042	2084
—	—	@@ -2073,6 +2115,11 @@
2074	2116	{
2075	2117	return new Exception('Parse error: ' . $m . ' in file \'' . $this->filename . '\' on line ' . $this->lineno);
2076	2118	}
	2119	+
	2120	+ public static function unicodeEscapeCallback($m)
	2121	+ {
	2122	+ return html_entity_decode('&#x' . $m[1]. ';', ENT_QUOTES, 'UTF-8');
	2123	+ }
2077	2124	}
2078	2125
2079	2126	class JSToken

Follow-up revisions

Revision	Commit summary	Author	Date
r98285	MFT r98281: (bug 31187) patch JSMin+ to support valid Unicode characters in J...	brion	23:15, 27 September 2011
r98303	MFT r98281	tstarling	11:58, 28 September 2011

Past revisions this follows-up on

Revision	Commit summary	Author	Date
r91591	Followup r83885: add JSMin+ 1.3 to use its parser to verify output of JavaScr...	brion	20:02, 6 July 2011
r93020	Update JSMin+ to the newly released 1.4...	platonides	21:46, 24 July 2011

Comments

#Comment by Brion VIBBER (talk | contribs) 23:18, 27 September 2011

Per bug 31187 a workaround is in use on eo.wikipedia.org for some affected scripts, but this'd be nice to get deployed. :) I'm not merging direct to 1.18wmf1 as the process seems to get confused when someone who can't immediately push code merges to the deployment branch.

Also needs a fix from r94614 for the QUnit test case.

Already merged to 1.18 with the above as r98285.

#Comment by Krinkle (talk | contribs) 01:29, 28 September 2011

+	public static function unicodeEscapeCallback($m)
+	{
+		return html_entity_decode('&#x' . $m[1]. ';', ENT_QUOTES, 'UTF-8');
+	}

Nothing serious but ahm, I hear those whitespace bells ringing, in a winter wonder web. ;D

#Comment by Brion VIBBER (talk | contribs) 01:35, 28 September 2011

Can you clarify? Note that this is a patch to an external library with its own coding style, which I ve tried to match.

#Comment by Krinkle (talk | contribs) 05:07, 28 September 2011

Which is why it wasn't serious. I just find it interesting whenever I see this coding style.

#Comment by Duplicatebug (talk | contribs) 19:25, 28 September 2011

"Note that identifiers using escapes don't get normalized to their UTF-8 form; this might be a nice thing to do as it saves a couple bytes, but currently there's no change made to output."

Sounds like a job for a Minifier (like JavaScriptMinifier.php), not for a Validator (jsminplus.php is used as that in the Resource Loader)

#Comment by Brion VIBBER (talk | contribs) 19:28, 28 September 2011

Indeed, I was in no hurry in part because it wouldn't actually affect our usage. ;)

#Comment by Duplicatebug (talk | contribs) 19:21, 30 September 2011

Ok. Tracked now as bug 31286

#Comment by Platonides (talk | contribs) 20:39, 18 October 2011

Send patch upstream?

#Comment by Brion VIBBER (talk | contribs) 00:27, 1 November 2011

I emailed it to the upstream maintainer at the time; haven't heard back. No source control or bug tracker to submit to.

#Comment by Platonides (talk | contribs) 00:32, 1 November 2011

That's odd, he answered me quite promptly.

#Comment by Brion VIBBER (talk | contribs) 00:37, 1 November 2011

I may have gotten spam-filtered or something. :) Feel free to send him another update since your email doesn't bounce!

#Comment by Platonides (talk | contribs) 16:30, 9 November 2011

Can you sent me a copy, so that I can forward it?

Status & tagging log

20:36, 29 September 2011 Reedy (talk | contribs) changed the tags for r98281 [removed: 1.18wmf1]
11:44, 28 September 2011 Tim Starling (talk | contribs) changed the status of r98281 [removed: new added: ok]
23:18, 27 September 2011 Brion VIBBER (talk | contribs) changed the tags for r98281 [added: 1.18wmf1]