r86130 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r86130
Date:18:39, 15 April 2011
Author:bawolff
Status:resolved
Tags:
Comment:
(follow-up r69626) Make it so the intl normalizer_normalize function is not
fed an invalid sequence in UtfNormal::cleanUp

normalizer_normalize seems to return false if fed an invalid Unicode sequence (which is quite different
from what our built-in normalization functions do). So fall back to quickIsNFCVerify if it returns false.
(Noticed when investigating bug 28541).
Modified paths:
  • /trunk/phase3/RELEASE-NOTES (modified)
  • /trunk/phase3/includes/normal/UtfNormal.php (modified)

Diff

Index: trunk/phase3/includes/normal/UtfNormal.php
@@ -79,7 +79,7 @@
      * @return string a clean, shiny, normalized UTF-8 string
      */
     static function cleanUp( $string ) {
-        if( NORMALIZE_ICU || NORMALIZE_INTL ) {
+        if( NORMALIZE_ICU ) {
             # We exclude a few chars that ICU would not.
             $string = preg_replace(
                 '/[\x00-\x08\x0b\x0c\x0e-\x1f]/',
@@ -90,8 +90,24 @@
 
             # UnicodeString constructor fails if the string ends with a
             # head byte. Add a junk char at the end, we'll strip it off.
-            if ( NORMALIZE_ICU ) return rtrim( utf8_normalize( $string . "\x01", UNORM_NFC ), "\x01" );
-            if ( NORMALIZE_INTL ) return normalizer_normalize( $string, Normalizer::FORM_C );
+            return rtrim( utf8_normalize( $string . "\x01", UNORM_NFC ), "\x01" );
+        } elseif( NORMALIZE_INTL ) {
+            $norm = normalizer_normalize( $string, Normalizer::FORM_C );
+            if( $norm === null || $norm === false ) {
+                # normalizer_normalize will either return false or null
+                # (depending on which doc you read) if invalid utf8 string.
+                # quickIsNFCVerify cleans up invalid sequences.
+
+                if( UtfNormal::quickIsNFCVerify( $string ) ) {
+                    # if that's true, the string is actually already normal.
+                    return $string;
+                } else {
+                    # Now we are valid but non-normal
+                    return normalizer_normalize( $string, Normalizer::FORM_C );
+                }
+            } else {
+                return $norm;
+            }
         } elseif( UtfNormal::quickIsNFCVerify( $string ) ) {
             # Side effect -- $string has had UTF-8 errors cleaned up.
             return $string;
Index: trunk/phase3/RELEASE-NOTES
@@ -237,6 +237,7 @@
 * (bug 27473) Fix regression: bold, italic no longer interfere with linktrail for ca, kaa
 * (bug 28444) Fix regression: edit-on-doubleclick retains revision id again
 * ' character entity is now allowed in wikitext
+* UtfNormal::cleanUp on an invalid utf-8 sequence no longer returns false if intl installed.
 
 === API changes in 1.18 ===
 * (bug 26339) Throw warning when truncating an overlarge API result
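
The fallback that the UtfNormal.php hunk above introduces can be illustrated outside MediaWiki with a rough, standalone sketch. It is not the code from this revision: cleanUpSketch and replaceInvalidUtf8 are hypothetical names, and the byte-level regex is only an assumed stand-in for the cleanup that UtfNormal::quickIsNFCVerify performs. When the intl extension is missing, the sketch only repairs invalid bytes and does not normalize.

```php
<?php
// Hypothetical helper (not part of MediaWiki): replace every byte that is not
// part of a well-formed UTF-8 sequence with U+FFFD. Assumed stand-in for the
// cleanup side effect of UtfNormal::quickIsNFCVerify.
function replaceInvalidUtf8( string $s ): string {
    // Group 1 matches one well-formed UTF-8 character; the bare "." (byte
    // mode, /s flag) catches any stray byte that group 1 rejected.
    $pattern = '/([\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})|./s';
    return preg_replace_callback(
        $pattern,
        function ( array $m ): string {
            // $m[1] is only set when a well-formed character matched.
            return ( isset( $m[1] ) && $m[1] !== '' ) ? $m[1] : "\xEF\xBF\xBD";
        },
        $s
    );
}

// Hypothetical sketch of the fallback idea: try intl first, and only if it
// rejects the input, clean the string up and normalize again.
function cleanUpSketch( string $s ): string {
    if ( !function_exists( 'normalizer_normalize' ) ) {
        // Assumption: without intl this sketch repairs bytes but skips NFC.
        return replaceInvalidUtf8( $s );
    }
    $norm = normalizer_normalize( $s, Normalizer::FORM_C );
    if ( $norm !== false && $norm !== null ) {
        return $norm;
    }
    // intl rejected the input: repair it, then normalize the repaired string.
    $clean = replaceInvalidUtf8( $s );
    $norm = normalizer_normalize( $clean, Normalizer::FORM_C );
    return ( $norm === false || $norm === null ) ? $clean : $norm;
}

var_dump( bin2hex( cleanUpSketch( "abc\xFFdef" ) ) );
```

Checking for both false and null mirrors the strict comparison in the diff, so the caller always receives a string rather than an error value.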

Follow-up revisions

Revision | Commit summary | Author | Date
r86210 | (follow-up r86130) the normalizer_normalize function doesn't replace things l... | bawolff | 15:32, 16 April 2011
r86257 | API: BREAKING CHANGE: (bug 28541) Output of binary ICU sortkeys is broken. Ch... | catrope | 12:41, 17 April 2011
r86464 | 1.17wmf1: MFT r85377, r85555, r85583, r86100, r86121, r86130, r86142, r86146,... | catrope | 11:27, 20 April 2011
r86474 | 1.17: MFT r81731, r85377, r85547, r85555, r85583, r85803, r85881, r86100, r86... | catrope | 13:22, 20 April 2011

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r69626 | Prefer the intl PECL extension for ICU Unicode | mah | 15:41, 20 July 2010

Comments

Comment by Bawolff, 18:42, 15 April 2011

Adding tags 1.17, 1.17wmf1. (which probably means I should not have touched the release notes, whoops)
