r83993 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r83992‎ | r83993 | r83994 >
Date:02:53, 15 March 2011
Author:saper
Status:reverted (Comments)
Tags:
Comment:
r83812, r83814: Don't use cl_type at all when paging categorylinks

* Remove cl_type from paging in categorylinks - it's not
really needed there. Although cl_type is in WHERE but not
in ORDER BY clause the worst thing that can happen
is to have a filesort going again through <500 entries
selected by index. Or will FORCE INDEX work anyway?

* Revert schema change, as we don't need cl_type there
anyway (or even if we wanted to compare, it should
work as expected by using INT values against ENUM).
Modified paths:
  • /trunk/phase3/includes/api/ApiQueryCategoryMembers.php (modified) (history)
  • /trunk/phase3/includes/installer/MysqlUpdater.php (modified) (history)
  • /trunk/phase3/maintenance/archives/patch-cl_type.sql (deleted) (history)
  • /trunk/phase3/maintenance/tables.sql (modified) (history)

Diff [purge]

Index: trunk/phase3/maintenance/archives/patch-cl_type.sql
@@ -1,6 +0,0 @@
2 -
3 -ALTER TABLE /*_*/categorylinks MODIFY cl_type varchar(6) NOT NULL default 'page';
Index: trunk/phase3/maintenance/tables.sql
@@ -521,9 +521,7 @@
522522 -- paginate the three categories separately. This never has to be updated
523523 -- after the page is created, since none of these page types can be moved to
524524 -- any other.
525 - -- This used to be ENUM('page', 'subcat', 'file') but was changed to a
526 - -- varchar because of the weird semantics of < and > when used on enums
527 - cl_type varchar(6) NOT NULL default 'page'
 525+ cl_type ENUM('page', 'subcat', 'file') NOT NULL default 'page'
528526 ) /*$wgDBTableOptions*/;
529527
530528 CREATE UNIQUE INDEX /*i*/cl_from ON /*_*/categorylinks (cl_from,cl_to);
Index: trunk/phase3/includes/installer/MysqlUpdater.php
@@ -175,7 +175,6 @@
176176 array( 'dropIndex', 'archive', 'ar_page_revid', 'patch-archive_kill_ar_page_revid.sql' ),
177177 array( 'addIndex', 'archive', 'ar_revid', 'patch-archive_ar_revid.sql' ),
178178 array( 'doLangLinksLengthUpdate' ),
179 - array( 'doClTypeVarcharUpdate' ),
180179 );
181180 }
182181
@@ -829,18 +828,4 @@
830829 $this->output( "...ll_lang is up-to-date.\n" );
831830 }
832831 }
833 -
834 - protected function doClTypeVarcharUpdate() {
835 - $categorylinks = $this->db->tableName( 'categorylinks' );
836 - $res = $this->db->query( "SHOW COLUMNS FROM $categorylinks LIKE 'cl_type'" );
837 - $row = $this->db->fetchObject( $res );
838 -
839 - if ( $row && substr( $row->Type, 0, 4 ) == 'enum' ) {
840 - $this->output( 'Changing cl_type from enum to varchar...' );
841 - $this->applyPatch( 'patch-cl_type.sql' );
842 - $this->output( "done.\n" );
843 - } else {
844 - $this->output( "...cl_type is up-to-date.\n" );
845 - }
846 - }
847832 }
Index: trunk/phase3/includes/api/ApiQueryCategoryMembers.php
@@ -122,27 +122,39 @@
123123 $this->addOption( 'USE INDEX', 'cl_timestamp' );
124124 } else {
125125 if ( $params['continue'] ) {
126 - // type|from|sortkey
127 - $cont = explode( '|', $params['continue'], 3 );
128 - if ( count( $cont ) != 3 ) {
 126+ // from|sortkey
 127+ $cont = explode( '|', $params['continue'], 2 );
 128+ if ( count( $cont ) != 2 ) {
129129 $this->dieUsage( 'Invalid continue param. You should pass the original value returned '.
130130 'by the previous query', '_badcontinue'
131131 );
132132 }
133 - $escType = $this->getDB()->addQuotes( $cont[0] );
134 - $from = intval( $cont[1] );
135 - $escSortkey = $this->getDB()->addQuotes( $cont[2] );
136 - $op = $dir == 'newer' ? '>' : '<';
137 - $this->addWhere( "cl_type $op $escType OR " .
138 - "(cl_type = $escType AND " .
139 - "(cl_sortkey $op $escSortkey OR " .
140 - "(cl_sortkey = $escSortkey AND " .
141 - "cl_from $op= $from)))"
142 - );
 133+ list ( $from, $contsortkey ) = $cont;
 134+ if ( intval( $from ) == 0 ) {
 135+ $this->dieUsage( 'Invalid continue param. You should pass the original value returned '.
 136+ 'by the previous query', '_badcontinue'
 137+ );
 138+ }
 139+ $where_outer = array();
 140+ $where_inner = array();
 141+ $db = $this->getDB();
 142+ $op = ( $dir === 'newer' ? '>' : '<' );
 143+ $sortdir = ( $dir === 'newer' ? 'asc' : 'desc' );
 144+ $where_outer[] = 'cl_sortkey ' . $op . ' ' .
 145+ $db->addQuotes( $contsortkey );
 146+ // OR
 147+ $where_inner[] = 'cl_sortkey = ' .
 148+ $db->addQuotes( $contsortkey );
 149+ // AND
 150+ $where_inner[] = 'cl_from ' . $op . '= '. $from;
 151+
 152+ $where_outer[] = $db->makeList( $where_inner, LIST_AND );
 153+ $this->addWhere( $db->makeList( $where_outer, LIST_OR ) );
 154+ $this->addOption( 'ORDER BY',
 155+ 'cl_sortkey ' . $sortdir .', cl_from ' . $sortdir );
143156
144157 } else {
145 - // The below produces ORDER BY cl_type, cl_sortkey, cl_from, possibly with DESC added to each of them
146 - $this->addWhereRange( 'cl_type', $dir, null, null );
 158+ // The below produces ORDER BY cl_sortkey, cl_from, possibly with DESC added to each of them
147159 $this->addWhereRange( 'cl_sortkey',
148160 $dir,
149161 $params['startsortkey'],
@@ -171,7 +183,7 @@
172184 // because we don't have to worry about pipes in the sortkey that way
173185 // (and type and from can't contain pipes anyway)
174186 $this->setContinueEnumParameter( 'continue',
175 - "{$row->cl_type}|{$row->cl_from}|{$row->cl_sortkey}"
 187+ "{$row->cl_from}|{$row->cl_sortkey}"
176188 );
177189 }
178190 break;
@@ -213,7 +225,7 @@
214226 $this->setContinueEnumParameter( 'start', wfTimestamp( TS_ISO_8601, $row->cl_timestamp ) );
215227 } else {
216228 $this->setContinueEnumParameter( 'continue',
217 - "{$row->cl_type}|{$row->cl_from}|{$row->cl_sortkey}"
 229+ "{$row->cl_from}|{$row->cl_sortkey}"
218230 );
219231 }
220232 break;

Follow-up revisions

RevisionCommit summaryAuthorDate
r84071Revert r83993reedy00:13, 16 March 2011

Past revisions this follows-up on

RevisionCommit summaryAuthorDate
r83812Schema change: change cl_type from ENUM('page', 'subcat', 'file') to varchar(...catrope10:35, 13 March 2011
r83814(bug 27965) Paging in list=categorymembers was completely broken. It was pagi...catrope10:39, 13 March 2011

Comments

#Comment by Tim Starling (talk | contribs)   02:59, 15 March 2011

Why do you think the filesort would be <500 entries? Wouldn't it have to sort the whole category?

#Comment by Reedy (talk | contribs)   11:22, 15 March 2011

Also, you're going to be breaking this for anyone who did the schema update. Granted, this should only be development sites.

But it also possibly affects Translate Wiki, who do run trunk, and are likely to have upgraded. This then breaks it completely for them....

#Comment by Reedy (talk | contribs)   11:32, 15 March 2011

Also, the API is now doing unindexed queries...

#Comment by Catrope (talk | contribs)   20:29, 15 March 2011

I strongly suggest this be reverted.

Removing the ordering on cl_type makes the rest of the ordering unindexed (in the general case at least; it's fine if cmtype is set to a single value, because ordering by cl_type does nothing in that case, but this isn't the default) and, as Tim points out, that means the entire category has to be sorted to get the first 50 (or 500 or whatever) entries. We have categories with hundreds of thousands of members on enwiki (largest is 800k IIRC) and AFAICT this is going to totally explode when confronted with such categories.

#Comment by Saper (talk | contribs)   21:14, 15 March 2011

Ok, I see your point.

I looked at query execution plans on some smaller wikis and on the toolserver, and indeed there is too much risk.

I looked also at the alternative solution, which works much better - converting cl_type to int and using a small auxiliary name-value table to join when mapping from the text value is needed.

Here is a sample query using the auxiliary table:

mysql> EXPLAIN SELECT           
  cl_from,page_namespace,page_title,page_id,cl_sortkey,cl_timestamp   
FROM `page`,`categorylinks`,catlinks_types  
WHERE cl_to = 'Loga' AND ctl_id=cl_type AND
      ctl_value IN ('page','subcat','file') 
 AND (cl_from >= '1474') AND (cl_from=page_id)   ORDER BY cl_type, cl_sortkey, cl_from LIMIT 51;
+----+-------------+----------------+--------+-------------------------------------------+------------+---------+-------------------------------------------------+-----------------------------------------------------------+
| id | select_type | table          | type   | possible_keys                             | key        | key_len | ref                                       | rows | Extra                                                     |
+----+-------------+----------------+--------+-------------------------------------------+------------+---------+-------------------------------------------------+-----------------------------------------------------------+
|  1 | SIMPLE      | catlinks_types | index  | cl_types                                  | cl_types   | 22      | NULL                                      |    3 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | categorylinks  | ref    | cl_from,cl_sortkey,cl_timestamp,ctl_types | cl_sortkey | 771     | const,plwikimedia_p.catlinks_types.ctl_id |    5 | Using where                                               |
|  1 | SIMPLE      | page           | eq_ref | PRIMARY                                   | PRIMARY    | 4       | plwikimedia_p.categorylinks.cl_from       |    1 |                                                           |
+----+-------------+----------------+--------+-------------------------------------------+------------+---------+-------------------------------------------------+-----------------------------------------------------------+
3 rows in set (0.00 sec)

I think this is much better.

But then, it seems like that cl_type is directly derived from cl_from and and its namespace. I look at its usage and except for picking up subcategories and files first in CategoryPage (and few extensions doing similar stuff). As those join the page table anyway, using namespace from there or the immediate table looks much better for me. Do we really need cl_type at all? What do you think?

#Comment by Catrope (talk | contribs)   15:23, 16 March 2011

Yes, we need cl_type. It's true that the information comes directly from the page table, but that's the exact problem: it's in another table. So you can't put page_namespace and cl_sortkey in one index, so you can't order by it or efficiently filter by it while also doing ordering/filtering based on fields in categorylinks. Try rewriting the query above to use, say, page_namespace=6 instead of cl_type='fi;e' and you'll see nasty filesorts or table scans happen.

Why is making cl_type numeric so much better? Your paste shows that the relevant query filesorts and uses a temporary table.

In an ideal world it would've been numeric from the start, but making it a varchar makes migration trivial, and the difference in space requirements is negligible (a few bytes per row, at most, for rows that can be up to 700+ bytes each).

Status & tagging log