r93291 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r93290‎ | r93291 | r93292 >
Date:18:03, 27 July 2011
Author:hashar
Status:ok (Comments)
Tags:
Comment:
Unicode space separator characters (Zs) now terminates links

Fix 19052 which was only reporting the issue for U+3000 IDEOGRAPHIC SPACE.
Covers both external links and images links. See parser tests for examples.

Unicode 'Zs' includes all characters from the 'separator, space' category.
Characters part of this category are:

Char Name
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE


TEST PLAN:

$ php parserTests.php --quiet
This is MediaWiki version 1.19alpha (r93258).

Reading tests from "tests/parser/parserTests.txt"...
Reading tests from "tests/parser/extraParserTests.txt"...
Reading tests from "../mwexts/LabeledSectionTransclusion/lstParserTests.txt"...
Passed 686 of 686 tests (100%)... ALL TESTS PASSED!

Sounds good :-)
Modified paths:
  • /trunk/phase3/RELEASE-NOTES-1.19 (modified) (history)
  • /trunk/phase3/includes/parser/Parser.php (modified) (history)
  • /trunk/phase3/tests/parser/parserTests.txt (modified) (history)

Diff [purge]

Index: trunk/phase3/RELEASE-NOTES-1.19
@@ -20,6 +20,8 @@
2121 elements to work with DNSBLs that require keys, such as
2222 Project Honeypot.
2323 * (bug 30022) Add support for custom loadScript sources to ResourceLoader.
 24+* (bug 19052) Unicode space separator characters (Zs) now terminates external
 25+ links and images links.
2426
2527 === Bug fixes in 1.19 ===
2628 * $wgUploadNavigationUrl should be used for file redlinks if
Index: trunk/phase3/tests/parser/parserTests.txt
@@ -9248,7 +9248,33 @@
92499249 </p>
92509250 !! end
92519251
 9252+!! test
 9253+Bug 19052 U+3000 IDEOGRAPHIC SPACE should terminate free external links
 9254+!! input
 9255+http://www.example.org/ <-- U+3000 (vim: ^Vu3000)
 9256+!! result
 9257+<p><a rel="nofollow" class="external free" href="http://www.example.org/">http://www.example.org/</a> &lt;-- U+3000 (vim: ^Vu3000)
 9258+</p>
 9259+!! end
92529260
 9261+!! test
 9262+Bug 19052 U+3000 IDEOGRAPHIC SPACE should terminate bracketed external links
 9263+!! input
 9264+[http://www.example.org/ ideograms]
 9265+!! result
 9266+<p><a rel="nofollow" class="external text" href="http://www.example.org/">ideograms</a>
 9267+</p>
 9268+!! end
 9269+
 9270+!! test
 9271+Bug 19052 U+3000 IDEOGRAPHIC SPACE should terminate external images links
 9272+!! input
 9273+http://www.example.org/pic.png <-- U+3000 (vim: ^Vu3000)
 9274+!! result
 9275+<p><img src="http://www.example.org/pic.png" alt="pic.png" /> &lt;-- U+3000 (vim: ^Vu3000)
 9276+</p>
 9277+!! end
 9278+
92539279 TODO:
92549280 more images
92559281 more tables
Index: trunk/phase3/includes/parser/Parser.php
@@ -68,9 +68,11 @@
6969
7070 # Constants needed for external link processing
7171 # Everything except bracket, space, or control characters
72 - const EXT_LINK_URL_CLASS = '[^][<>"\\x00-\\x20\\x7F]';
73 - const EXT_IMAGE_REGEX = '/^(http:\/\/|https:\/\/)([^][<>"\\x00-\\x20\\x7F]+)
74 - \\/([A-Za-z0-9_.,~%\\-+&;#*?!=()@\\x80-\\xFF]+)\\.((?i)gif|png|jpg|jpeg)$/Sx';
 72+ # \p{Zs} is unicode 'separator, space' category. It covers the space 0x20
 73+ # as well as U+3000 is IDEOGRAPHIC SPACE for bug 19052
 74+ const EXT_LINK_URL_CLASS = '[^][<>"\\x00-\\x20\\x7F\p{Zs}]';
 75+ const EXT_IMAGE_REGEX = '/^(http:\/\/|https:\/\/)([^][<>"\\x00-\\x20\\x7F\p{Zs}]+)
 76+ \\/([A-Za-z0-9_.,~%\\-+&;#*?!=()@\\x80-\\xFF]+)\\.((?i)gif|png|jpg|jpeg)$/Sxu';
7577
7678 # State constants for the definition list colon extraction
7779 const COLON_STATE_TEXT = 0;
@@ -184,7 +186,7 @@
185187 $this->mConf = $conf;
186188 $this->mUrlProtocols = wfUrlProtocols();
187189 $this->mExtLinkBracketedRegex = '/\[((' . wfUrlProtocols() . ')'.
188 - '[^][<>"\\x00-\\x20\\x7F]+) *([^\]\\x00-\\x08\\x0a-\\x1F]*?)\]/S';
 190+ self::EXT_LINK_URL_CLASS.'+)\p{Zs}*([^\]\\x00-\\x08\\x0a-\\x1F]*?)\]/Su';
189191 if ( isset( $conf['preprocessorClass'] ) ) {
190192 $this->mPreprocessorClass = $conf['preprocessorClass'];
191193 } elseif ( defined( 'MW_COMPILED' ) ) {
@@ -1197,7 +1199,7 @@
11981200 (?: [0-9] [\ \-]? ){9} # 9 digits with opt. delimiters
11991201 [0-9Xx] # check digit
12001202 \b)
1201 - )!x', array( &$this, 'magicLinkCallback' ), $text );
 1203+ )!xu', array( &$this, 'magicLinkCallback' ), $text );
12021204 wfProfileOut( __METHOD__ );
12031205 return $text;
12041206 }
@@ -4963,7 +4965,7 @@
49644966 $value = true;
49654967 $validated = true;
49664968 } elseif ( preg_match( "/^$prots/", $value ) ) {
4967 - if ( preg_match( "/^($prots)$chars+$/", $value, $m ) ) {
 4969+ if ( preg_match( "/^($prots)$chars+$/u", $value, $m ) ) {
49684970 $paramName = 'link-url';
49694971 $this->mOutput->addExternalLink( $value );
49704972 if ( $this->mOptions->getExternalLinkTarget() ) {

Comments

#Comment by Brion VIBBER (talk | contribs)   18:11, 27 July 2011

The first and third test cases look like they would still terminate on the following '<' char, but that would produce different output so I think that's ok.

#Comment by Platonides (talk | contribs)   15:14, 28 July 2011

Maybe the tests should be on a different file, to lessen the chance of some editor replacing them with normal spaces.

What's the cost of changing the regex to /u ?

#Comment by Hashar (talk | contribs)   19:25, 28 July 2011

> Maybe the tests should be on a different file, to lessen the chance of some editor replacing them with normal spaces.

Seems unlikley to me, but feel free to split the tests if you are afraid of bad editors. Anyway, I will probably one day split the parserTests.txt file which is wayyy too long.

> What's the cost of changing the regex to /u ? That is something I have eluded, I am assuming the PCRE library has roughly the same performance with or without unicode. We might need some benchmarks. However, I have no idea how to bench it properly in our context.

#Comment by Platonides (talk | contribs)   13:07, 30 July 2011

Running both regex against eswiki-20100926-pages-meta-history.xml.7z (36942737 revisions)

CPU time with the old one: 342m45s CPU time with this one: 401m33s

This change is quite slower.

I suspect the results would be worse with a Chinese or Korean wiki.

#Comment by Platonides (talk | contribs)   20:30, 1 August 2011

Tested with zh_min_nanwiki-20101104-pages-meta-history.xml.7z, although it's much smaller:

Classic regex: 1m17.765s Unicode regex: 1m27.148s

Status & tagging log