r69961 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r69960‎ | r69961 | r69962 >
Date:19:27, 26 July 2010
Author:simetrical
Status:resolved (Comments)
Tags:
Comment:
Reconcept cl_raw_sortkey as cl_sortkey_prefix

In response to feedback by Phillipe Verdy on bug 164. Now if a bunch of
pages have [[Category:Foo| ]], they'll sort amongst themselves according
to page name, instead of in basically random order as it is currently.
This also makes storage more elegant and intuitive: instead of giving
NULL a magic meaning when there's no custom sortkey specified, we just
store an empty string, since there's no prefix.

This means {{defaultsort:}} really now means {{defaultsortprefix:}},
which is slightly confusing, and a lot of code is now slightly
misleading or poorly named. But it should all work fine.

Also, while I was at it, I made updateCollation.php work as a transition
script, so you can apply the SQL patch and then run updateCollation.php
and things will work. However, with the new schema it's not trivial to
reverse this -- you'd have to recover the raw sort keys with some PHP.
Conversion goes at about a thousand rows a second for me, and seems to
be CPU-bound. Could probably be optimized.

I also adjusted the transition script so it will fix rows with collation
versions *greater* than the current one, as well as less. Thus if some
site wants to use their own collation, they can call it 137 or
something, and if they later want to switch back to MediaWiki stock
collation 7, it will work.

Also fixed a silly bug in updateCollation.php where it would say "1000
done" if it did nothing, and changed $res->numRows() >= self::BATCH_SIZE
to == so people don't wonder how it could be bigger (since it can't, I
hope).
Modified paths:
  • /trunk/phase3/includes/CategoryPage.php (modified) (history)
  • /trunk/phase3/includes/DefaultSettings.php (modified) (history)
  • /trunk/phase3/includes/LinksUpdate.php (modified) (history)
  • /trunk/phase3/includes/Title.php (modified) (history)
  • /trunk/phase3/includes/parser/Parser.php (modified) (history)
  • /trunk/phase3/languages/Language.php (modified) (history)
  • /trunk/phase3/maintenance/archives/patch-categorylinks-better-collation.sql (modified) (history)
  • /trunk/phase3/maintenance/updateCollation.php (modified) (history)

Diff [purge]

Index: trunk/phase3/maintenance/archives/patch-categorylinks-better-collation.sql
@@ -8,9 +8,9 @@
99 -- to work for MySQL for now, without table prefixes, possibly other random
1010 -- limitations.
1111 ALTER TABLE categorylinks
12 - ADD COLUMN cl_raw_sortkey varchar(255) binary NULL default NULL,
 12+ ADD COLUMN cl_sortkey_prefix varchar(255) binary NOT NULL default '',
1313 ADD COLUMN cl_collation tinyint NOT NULL default 0,
14 - ADD COLUMN cl_type ENUM('page', 'subcat', 'file') NOT NULL,
 14+ ADD COLUMN cl_type ENUM('page', 'subcat', 'file') NOT NULL default 'page',
1515 ADD INDEX (cl_collation),
1616 DROP INDEX cl_sortkey,
1717 ADD INDEX cl_sortkey (cl_to, cl_type, cl_sortkey, cl_from);
Index: trunk/phase3/maintenance/updateCollation.php
@@ -1,6 +1,6 @@
22 <?php
33 /**
4 - * @file
 4+ * @file
55 * @ingroup Maintenance
66 * @author Aryeh Gregor (Simetrical)
77 */
@@ -17,10 +17,10 @@
1818
1919 global $wgCollationVersion;
2020 $this->mDescription = <<<TEXT
21 -This script will find all rows in the categorylinks table whose collation is
22 -out-of-date (cl_collation < $wgCollationVersion) and repopulate cl_sortkey
23 -using cl_raw_sortkey. If everything's collation is up-to-date, it will do
24 -nothing.
 21+This script will find all rows in the categorylinks table whose collation is
 22+out-of-date (cl_collation != $wgCollationVersion) and repopulate cl_sortkey
 23+using the page title and cl_sortkey_prefix. If everything's collation is
 24+up-to-date, it will do nothing.
2525 TEXT;
2626
2727 #$this->addOption( 'force', 'Run on all rows, even if the collation is supposed to be up-to-date.' );
@@ -32,8 +32,8 @@
3333 $dbw = wfGetDB( DB_MASTER );
3434 $count = $dbw->estimateRowCount(
3535 'categorylinks',
36 - array( 'cl_from', 'cl_to', 'cl_raw_sortkey' ),
37 - 'cl_collation < ' . $dbw->addQuotes( $wgCollationVersion ),
 36+ array( 'cl_from', 'cl_to', 'cl_sortkey_prefix' ),
 37+ 'cl_collation != ' . $dbw->addQuotes( $wgCollationVersion ),
3838 __METHOD__
3939 );
4040
@@ -42,21 +42,50 @@
4343 $count = 0;
4444 do {
4545 $res = $dbw->select(
46 - 'categorylinks',
47 - array( 'cl_from', 'cl_to', 'cl_raw_sortkey' ),
48 - 'cl_collation < ' . $dbw->addQuotes( $wgCollationVersion ),
 46+ array( 'categorylinks', 'page' ),
 47+ array( 'cl_from', 'cl_to', 'cl_sortkey_prefix', 'cl_collation',
 48+ 'cl_sortkey', 'page_namespace', 'page_title'
 49+ ),
 50+ array(
 51+ 'cl_collation != ' . $dbw->addQuotes( $wgCollationVersion ),
 52+ 'cl_from = page_id'
 53+ ),
4954 __METHOD__,
5055 array( 'LIMIT' => self::BATCH_SIZE )
5156 );
5257
5358 $dbw->begin();
5459 foreach ( $res as $row ) {
55 - # TODO: Handle the case where cl_raw_sortkey is null.
 60+ $title = Title::newFromRow( $row );
 61+ $rawSortkey = $title->getCategorySortkey();
 62+ if ( $row->cl_collation == 0 ) {
 63+ # This is an old-style row, so the sortkey needs to be
 64+ # converted.
 65+ if ( $row->cl_sortkey == $rawSortkey ) {
 66+ $prefix = '';
 67+ } else {
 68+ # Custom sortkey, use it as a prefix
 69+ $prefix = $row->cl_sortkey;
 70+ }
 71+ } else {
 72+ $prefix = $row->cl_sortkey_prefix;
 73+ }
 74+ # cl_type will be wrong for lots of pages if cl_collation is 0,
 75+ # so let's update it while we're here.
 76+ if ( $title->getNamespace() == NS_CATEGORY ) {
 77+ $type = 'subcat';
 78+ } elseif ( $title->getNamespace() == NS_FILE ) {
 79+ $type = 'file';
 80+ } else {
 81+ $type = 'page';
 82+ }
5683 $dbw->update(
5784 'categorylinks',
5885 array(
59 - 'cl_sortkey' => $wgContLang->convertToSortkey( $row->cl_raw_sortkey ),
60 - 'cl_collation' => $wgCollationVersion
 86+ 'cl_sortkey' => $wgContLang->convertToSortkey( $prefix . $rawSortkey ),
 87+ 'cl_sortkey_prefix' => $prefix,
 88+ 'cl_collation' => $wgCollationVersion,
 89+ 'cl_type' => $type,
6190 ),
6291 array( 'cl_from' => $row->cl_from, 'cl_to' => $row->cl_to ),
6392 __METHOD__
@@ -64,9 +93,9 @@
6594 }
6695 $dbw->commit();
6796
68 - $count += self::BATCH_SIZE;
 97+ $count += $res->numRows();
6998 $this->output( "$count done.\n" );
70 - } while ( $res->numRows() >= self::BATCH_SIZE );
 99+ } while ( $res->numRows() == self::BATCH_SIZE );
71100 }
72101 }
73102
Index: trunk/phase3/includes/CategoryPage.php
@@ -270,7 +270,7 @@
271271 foreach ( array( 'page', 'subcat', 'file' ) as $type ) {
272272 $res = $dbr->select(
273273 $tables,
274 - array_merge( $fields, array( 'cl_raw_sortkey' ) ),
 274+ array_merge( $fields, array( 'cl_sortkey_prefix' ) ),
275275 $conds + array( 'cl_type' => $type ) + ( $type == 'page' ? array( $pageCondition ) : array() ),
276276 __METHOD__,
277277 $opts + ( $type == 'page' ? array( 'LIMIT' => $this->limit + 1 ) : array() ),
@@ -286,14 +286,15 @@
287287 }
288288
289289 $title = Title::newFromRow( $row );
 290+ $rawSortkey = $row->cl_sortkey_prefix . $title->getCategorySortkey();
290291
291292 if ( $title->getNamespace() == NS_CATEGORY ) {
292293 $cat = Category::newFromRow( $row, $title );
293 - $this->addSubcategoryObject( $cat, $row->cl_raw_sortkey, $row->page_len );
 294+ $this->addSubcategoryObject( $cat, $rawSortkey, $row->page_len );
294295 } elseif ( $this->showGallery && $title->getNamespace() == NS_FILE ) {
295 - $this->addImage( $title, $row->cl_raw_sortkey, $row->page_len, $row->page_is_redirect );
 296+ $this->addImage( $title, $rawSortkey, $row->page_len, $row->page_is_redirect );
296297 } else {
297 - $this->addPage( $title, $row->cl_raw_sortkey, $row->page_len, $row->page_is_redirect );
 298+ $this->addPage( $title, $rawSortkey, $row->page_len, $row->page_is_redirect );
298299 }
299300 }
300301 }
Index: trunk/phase3/includes/parser/Parser.php
@@ -5056,12 +5056,8 @@
50575057 global $wgCategoryPrefixedDefaultSortkey;
50585058 if ( $this->mDefaultSort !== false ) {
50595059 return $this->mDefaultSort;
5060 - } elseif ( $this->mTitle->getNamespace() == NS_CATEGORY ||
5061 - !$wgCategoryPrefixedDefaultSortkey )
5062 - {
5063 - return $this->mTitle->getText();
50645060 } else {
5065 - return $this->mTitle->getPrefixedText();
 5061+ return $this->mTitle->getCategorySortkey();
50665062 }
50675063 }
50685064
Index: trunk/phase3/includes/LinksUpdate.php
@@ -441,14 +441,31 @@
442442 } else {
443443 $type = 'page';
444444 }
445 - $convertedSortkey = $wgContLang->convertToSortkey( $sortkey );
446 - # TODO: Set $sortkey to null if it's redundant
 445+
 446+ # TODO: This is kind of wrong, because someone might set a sort
 447+ # key prefix that's the same as the default sortkey for the
 448+ # title. This should be fixed by refactoring code to replace
 449+ # $sortkey in this array by a prefix, but it's basically harmless
 450+ # (Title::moveTo() has had the same issue for a long time).
 451+ if ( $this->mTitle->getCategorySortkey() == $sortkey ) {
 452+ $prefix = '';
 453+ $sortkey = $wgContLang->convertToSortkey( $sortkey );
 454+ } else {
 455+ # Treat custom sortkeys as a prefix, so that if multiple
 456+ # things are forced to sort as '*' or something, they'll
 457+ # sort properly in the category rather than in page_id
 458+ # order or such.
 459+ $prefix = $sortkey;
 460+ $sortkey = $wgContLang->convertToSortkey(
 461+ $prefix . $this->mTitle->getCategorySortkey() );
 462+ }
 463+
447464 $arr[] = array(
448465 'cl_from' => $this->mId,
449466 'cl_to' => $name,
450 - 'cl_sortkey' => $convertedSortkey,
 467+ 'cl_sortkey' => $sortkey,
451468 'cl_timestamp' => $this->mDb->timestamp(),
452 - 'cl_raw_sortkey' => $sortkey,
 469+ 'cl_sortkey_prefix' => $prefix,
453470 'cl_collation' => $wgCollationVersion,
454471 'cl_type' => $type,
455472 );
Index: trunk/phase3/includes/Title.php
@@ -4137,4 +4137,22 @@
41384138
41394139 return $types;
41404140 }
 4141+
 4142+ /**
 4143+ * Returns what the default sort key for categories would be, if
 4144+ * {{defaultsort:}} isn't used. This is the same as getText() for
 4145+ * categories, and for everything if $wgCategoryPrefixedDefaultSortkey is
 4146+ * false; otherwise it's the same as getPrefixedText().
 4147+ *
 4148+ * @return string
 4149+ */
 4150+ public function getCategorySortkey() {
 4151+ global $wgCategoryPrefixedDefaultSortkey;
 4152+ if ( $this->getNamespace() == NS_CATEGORY
 4153+ || !$wgCategoryPrefixedDefaultSortkey ) {
 4154+ return $this->getText();
 4155+ } else {
 4156+ return $this->getPrefixedText();
 4157+ }
 4158+ }
41414159 }
Index: trunk/phase3/includes/DefaultSettings.php
@@ -4476,8 +4476,8 @@
44774477 /**
44784478 * A version indicator for collations that will be stored in cl_collation for
44794479 * all new rows. Used when the collation algorithm changes: a script checks
4480 - * for all rows where cl_collation < $wgCollationVersion and regenerates
4481 - * cl_sortkey based on cl_raw_sortkey.
 4480+ * for all rows where cl_collation != $wgCollationVersion and regenerates
 4481+ * cl_sortkey based on the page name and cl_sortkey_prefix.
44824482 */
44834483 $wgCollationVersion = 1;
44844484
Index: trunk/phase3/languages/Language.php
@@ -2939,7 +2939,9 @@
29402940 * Given a string, convert it to a (hopefully short) key that can be used
29412941 * for efficient sorting. A binary sort according to the sortkeys
29422942 * corresponds to a logical sort of the corresponding strings. Applying
2943 - * this to cl_raw_sortkey produces cl_sortkey.
 2943+ * this to cl_sortkey_prefix concatenated with the page title (possibly
 2944+ * with namespace prefix, depending on $wgCategoryPrefixedDefaultSortkey)
 2945+ * gives you cl_sortkey.
29442946 *
29452947 * @param string $string UTF-8 string
29462948 * @return string Binary sortkey

Follow-up revisions

RevisionCommit summaryAuthorDate
r70007Followup r69961. Remove now unused global.platonides12:19, 27 July 2010
r70065Followup to r69961, update PG schemaoverlordq05:08, 28 July 2010
r70716Update tables.sql for category sorting changes...simetrical16:51, 8 August 2010
r80436Change the default collation from strtoupper to Language::uc, so that non-asc...bawolff06:27, 17 January 2011

Comments

#Comment by X! (talk | contribs)   18:29, 7 August 2010

It seems that you forgot to update tables.sql, breaking any new installation.

#Comment by Simetrical (talk | contribs)   16:51, 8 August 2010

Um, right! Forgot about that bit. I was surprised that no one reported me breaking anything major. :D Updated tables.sql in r70716.

Status & tagging log