r70126 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r70126
Date:02:44, 29 July 2010
Author:mah
Status:resolved (Comments)
Tags:
Comment:
Add detection for unicode normalization. Next step: use what we find! :)
I think I want to point to an as-yet-to-be-created page on MediaWiki.org to help people understand what to do if they're stuck with pure PHP normalization, but any pointers here would help.
Modified paths:
  • /trunk/phase3/includes/installer/Installer.i18n.php (modified)
  • /trunk/phase3/includes/installer/Installer.php (modified)

Diff

Index: trunk/phase3/includes/installer/Installer.php
@@ -88,6 +88,7 @@
 		'envCheckExtension',
 		'envCheckShellLocale',
 		'envCheckUploadsDirectory',
+		'envCheckLibicu'
 	);
 
 	/**
@@ -812,6 +813,69 @@
 	}
 
 	/**
+	 * Convert a hex string representing a Unicode code point to that code point.
+	 * @param string $c
+	 * @return string
+	 */
+	protected function unicodeChar( $c ) {
+		$c = hexdec($c);
+		if ($c <= 0x7F) {
+			return chr($c);
+		} else if ($c <= 0x7FF) {
+			return chr(0xC0 | $c >> 6) . chr(0x80 | $c & 0x3F);
+		} else if ($c <= 0xFFFF) {
+			return chr(0xE0 | $c >> 12) . chr(0x80 | $c >> 6 & 0x3F)
+				. chr(0x80 | $c & 0x3F);
+		} else if ($c <= 0x10FFFF) {
+			return chr(0xF0 | $c >> 18) . chr(0x80 | $c >> 12 & 0x3F)
+				. chr(0x80 | $c >> 6 & 0x3F)
+				. chr(0x80 | $c & 0x3F);
+		} else {
+			return false;
+		}
+	}
+
+
+	/**
+	 * Check the libicu version
+	 */
+	public function envCheckLibicu() {
+		$utf8 = function_exists( 'utf8_normalize' );
+		$intl = function_exists( 'normalizer_normalize' );
+
+		/**
+		 * This needs to be updated something that the latest libicu
+		 * will properly normalize. This normalization was found at
+		 * http://www.unicode.org/versions/Unicode5.2.0/#Character_Additions
+		 * Note that we use the hex representation to create the code
+		 * points in order to avoid any Unicode-destroying during transite.
+		 */
+		$not_normal_c = $this->unicodeChar("FA6C");
+		$normal_c = $this->unicodeChar("242EE");
+
+		$useNormalizer = 'config-unicode-php';
+
+		/**
+		 * We're going to prefer the pecl extension here unless
+		 * utf8_normalize is more up to date.
+		 */
+		if( $utf8 ) {
+			$utf8 = utf8_normalize( $not_normal_c, UNORM_NFC );
+			$useNormalizer = 'config-unicode-utf8';
+		}
+		if( $intl ) {
+			$intl = normalizer_normalize( $not_normal_c, Normalizer::FORM_C );
+			$useNormalizer = 'config-unicode-intl';
+		}
+
+		$this->showMessage( $useNormalizer );
+		if( $useNormalizer === 'config-unicode-php' ) {
+			$this->showMessage( 'config-unicode-pure-php-warning' );
+		}
+	}
+
+
+	/**
 	 * Search a path for any of the given executable names. Returns the
 	 * executable name if found. Also checks the version string returned
 	 * by each executable.
Index: trunk/phase3/includes/installer/Installer.i18n.php
@@ -79,6 +79,10 @@
 	'config-env-latest-old' => "'''Warning:''' You are installing an outdated version of Mediawiki.",
 	'config-env-latest-help' => 'You are installing version $1, but the latest version is $2.
 You are advised to use the latest release, which can be downloaded from [http://www.mediawiki.org/wiki/Download mediawiki.org]',
+	'config-unicode-php' => "Using pure PHP to normalize Unicode characters.",
+	'config-unicode-pure-php-warning' => "'''Warning''': Either the PECL Intl extension is not available, or it uses an older version of [http://site.icu-project.org/ the ICU project's] library for handling Unicode normalization. If you run a high-traffic site, you should read a little on [http://www.mediawiki.org/wiki/Unicode_normalization_considerations Unicode normalization].",
+	'config-unicode-utf8' => "Using Brion Vibber's utf8_normalize.so for UTF",
+	'config-unicode-intl' => "Using the [http://pecl.php.net/intl intl PECL extension] for UTF-8 normalization.",
 	'config-no-db' => 'Could not find a suitable database driver!',
 	'config-no-db-help' => 'You need to install a database driver for PHP.
 The following database types are supported: $1.

Follow-up revisions

Revision | Commit summary | Author | Date
r70168 | follow-up r70126 - better warnings | mah | 19:28, 29 July 2010

Comments

#Comment by Nikerabbit (talk | contribs)   05:08, 29 July 2010
+ * points in order to avoid any Unicode-destroying during transite.

Is there a spelling error in transite?

+'config-unicode-php'              => "Using pure PHP to normalize Unicode characters.",
+'config-unicode-utf8'             => "Using Brion Vibber's utf8_normalize.so for UTF",
+'config-unicode-intl'             => "Using the intl PECL extension for UTF-8 normalization.",

Correct me if I'm wrong, but all these messages are saying the same thing: Using X for Unicode normalization? If so, they should use the same wording for that. The proper term seems to be wikipedia:Unicode normalization.

I'd also perhaps replace pure PHP with fallback PHP implementation or something similar.

Were you planning on making this a configuration setting? I'd go for runtime detection of the best option. It's still a good thing to check it in the installer and add a warning if the best option would be slow or buggy, like you are doing now.
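
A minimal sketch of what runtime detection could look like, assuming the same preference order as the diff (intl first, then utf8_normalize, then the PHP fallback). The function name pickNormalizer and its $exists parameter are hypothetical and not part of r70126; the parameter exists only so the choice can be exercised without either extension installed. The return values reuse the message keys from the diff.

```php
<?php
/**
 * Hypothetical helper: pick the best available normalizer at runtime.
 * $exists is any callable that reports whether a function is defined;
 * it defaults to PHP's own function_exists().
 */
function pickNormalizer( $exists = 'function_exists' ) {
	if ( call_user_func( $exists, 'normalizer_normalize' ) ) {
		// PECL intl extension is loaded.
		return 'config-unicode-intl';
	}
	if ( call_user_func( $exists, 'utf8_normalize' ) ) {
		// utf8_normalize.so is loaded.
		return 'config-unicode-utf8';
	}
	// Neither extension is present: fall back to the PHP implementation.
	return 'config-unicode-php';
}

echo pickNormalizer(), "\n";
```

The installer check could then keep its warning for the 'config-unicode-php' case while the wiki itself makes the same choice on each request.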

#Comment by MarkAHershberger (talk | contribs)   18:35, 29 July 2010

I was thinking that it would be useful to check to see which was more up-to-date, utf8_normalize.so or the PECL extension, but I'll follow your advice and just check both and notify the user if they're out of date.
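
A rough sketch of the "check both and notify" idea, under the assumption that "out of date" means the normalizer does not know the Unicode 5.2 mapping the diff uses as its probe (U+FA6C to U+242EE). The helper name normalizerIsUpToDate is hypothetical. The "\u{...}" escapes assume PHP 7 or later; on older PHP the unicodeChar() method from the diff would build the same strings.

```php
<?php
/**
 * Hypothetical helper: report whether a normalizer maps the probe
 * character U+FA6C to U+242EE under NFC.
 * $normalize is any callable taking a UTF-8 string and returning its NFC form.
 */
function normalizerIsUpToDate( $normalize ) {
	// ASCII-only escapes, so the source file itself carries no raw Unicode.
	$notNormal = "\u{FA6C}";
	$normal = "\u{242EE}";
	return call_user_func( $normalize, $notNormal ) === $normal;
}

if ( function_exists( 'normalizer_normalize' ) ) {
	$ok = normalizerIsUpToDate( function ( $s ) {
		return normalizer_normalize( $s, Normalizer::FORM_C );
	} );
	echo $ok ? "intl: up to date\n" : "intl: out of date\n";
} else {
	echo "intl: not available\n";
}
```

The same helper could wrap utf8_normalize( $s, UNORM_NFC ) when that extension is loaded, so each normalizer gets its own out-of-date notice.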
