r44000 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r44000 (previous r43999, next r44001)
Date: 20:11, 27 November 2008
Author: vyznev
Status: ok
Tags:
Comment:
(bug 6100) Strip Unicode BiDi embedding/override characters (U+202A - U+202E) from titles.
NOTE: run maintenance/cleanupImages.php and cleanupTitles.php ASAP after deploying this!
Modified paths:
  • /trunk/phase3/RELEASE-NOTES (modified)
  • /trunk/phase3/includes/Title.php (modified)

Diff

Index: trunk/phase3/includes/Title.php
@@ -2047,8 +2047,7 @@
 # Strip Unicode bidi override characters.
 # Sometimes they slip into cut-n-pasted page titles, where the
 # override chars get included in list displays.
-$dbkey = str_replace( "\xE2\x80\x8E", '', $dbkey ); // 200E LEFT-TO-RIGHT MARK
-$dbkey = str_replace( "\xE2\x80\x8F", '', $dbkey ); // 200F RIGHT-TO-LEFT MARK
+$dbkey = preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );

 # Clean up whitespace
 #
Index: trunk/phase3/RELEASE-NOTES
@@ -369,6 +369,10 @@
 * Honour unchecked "Leave a redirect behind" for moved subpages
 * (bug 16440) Broken 0-byte math renderings are now deleted and re-rendered
   when page is re-parsed.
+* (bug 6100) Unicode BiDi embedding/override characters (U+202A - U+202E) are
+  now automatically removed from titles; these characters can accidentally end
+  up in copy-and-pasted titles, and, by overriding normal bidirectional text
+  handling, can lead to annoying behavior such as text rendering backwards

 === API changes in 1.14 ===
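
The pattern added in the Title.php hunk works on raw UTF-8 bytes rather than code points: U+200E, U+200F and U+202A through U+202E all encode as the prefix E2 80 followed by one distinguishing byte, so a single byte-level character class covers them. The following is a minimal sketch of that behaviour outside MediaWiki, not the actual Title.php code path. The helper name `stripBidiControls` and the sample titles are hypothetical, and the `"\u{...}"` escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not part of MediaWiki): applies the same byte-level
// pattern as the r44000 hunk to a UTF-8 string.
function stripBidiControls( string $dbkey ): string {
	// \xE2\x80 is the shared two-byte UTF-8 prefix; the third byte selects
	// U+200E / U+200F (\x8E, \x8F) or U+202A..U+202E (\xAA..\xAE).
	return preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );
}

// Sample title containing LRM (U+200E), RLO (U+202E) and PDF (U+202C).
$title = "Main\u{200E}_Page_\u{202E}draft\u{202C}";
var_dump( stripBidiControls( $title ) === 'Main_Page_draft' ); // bool(true)

// U+FEFF encodes as EF BB BF, so this pattern leaves it in place.
$bom = "\u{FEFF}Title";
var_dump( stripBidiControls( $bom ) === $bom ); // bool(true)
```

Because E2 is always the lead byte of a three-byte sequence in valid UTF-8, the byte pattern cannot match across a character boundary, which is presumably why the `/u` modifier is not needed here.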

Follow-up revisions

Revision | Commit summary | Author | Date
r90264 | (part of bug 6100) Set the directionality based on user language instead of c... | robin | 11:32, 17 June 2011
r90320 | Follow-up to r90265: directionality improvements as part of bug 6100 (under $... | robin | 21:48, 17 June 2011
r90334 | Follow-up to r90265: directionality improvements as part of bug 6100 (under $... | robin | 13:12, 18 June 2011
r90517 | * Improvements as part of bug 6100: Use wfUILang() instead of $wgContLang whe... | robin | 10:14, 21 June 2011
r90581 | Directionality improvements as part of bug 6100 (under $wgBetterDirectionalit... | robin | 13:10, 22 June 2011
r90734 | (bug 12406) Pages with names in RTL scripts are not listed correctly in Speci... | robin | 20:25, 24 June 2011
r90742 | Directionality and language improvements as part of bug 6100 (under $wgBetter... | robin | 22:10, 24 June 2011
r90743 | Directionality improvements as part of bug 6100 (under $wgBetterDirectionality):... | robin | 23:01, 24 June 2011
r91315 | * Add release notes for my recent commits (bug 6100 and others like bugs 2803... | robin | 22:50, 1 July 2011
r91518 | (bug 6100; follow-up to r91315) Being bold and removing $wgBetterDirectionali... | robin | 02:26, 6 July 2011

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r14495 | * (bug 6100) BiDi: different directionality for user interface and wiki conte... | nikerabbit | 15:19, 31 May 2006

Comments

Comment by Brion VIBBER, 18:55, 11 December 2008

I... *think* this is ok. I hope. :(

Comment by Ilmari Karonen, 19:53, 11 December 2008

I did test it on a local wiki, and it seems to work as designed. It still doesn't strip all potentially confusing invisible Unicode characters — such as, for example, the Byte Order Mark (U+FEFF) — but the U+202A–U+202E range is arguably the most harmful, since they can override the normal directionality of strongly directional characters (‮such as Latin letters, like here‬), whereas U+200E and U+200F only affect weakly directional ones like punctuation (and only in their immediate vicinity). Indeed, if anything, I'd be more inclined to consider allowing U+200E / U+200F in titles, since they may have some legitimate uses.
