r102951 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r102950‎ | r102951 | r102952 >
Date:09:13, 14 November 2011
Author:catrope
Status:ok (Comments)
Tags:
Comment:
(bug 29854) Store protocol-relative links twice in the externallinks table, one with http: in el_index and once with https: . Modified patch by Brad Jorsch
Modified paths:
  • /trunk/phase3/includes/AutoLoader.php (modified) (history)
  • /trunk/phase3/includes/GlobalFunctions.php (modified) (history)
  • /trunk/phase3/includes/LinksUpdate.php (modified) (history)
  • /trunk/phase3/includes/api/ApiQueryExternalLinks.php (modified) (history)
  • /trunk/phase3/includes/installer/DatabaseUpdater.php (modified) (history)
  • /trunk/phase3/maintenance/fixExtLinksProtocolRelative.php (added) (history)
  • /trunk/phase3/tests/phpunit/includes/GlobalFunctions/GlobalTest.php (modified) (history)

Diff [purge]

Index: trunk/phase3/maintenance/fixExtLinksProtocolRelative.php
@@ -0,0 +1,81 @@
 2+<?php
 3+/**
 4+ * Fixes any entries for protocol-relative URLs in the externallinks table,
 5+ * replacing each protocol-relative entry with two entries, one for http
 6+ * and one for https.
 7+ *
 8+ * This program is free software; you can redistribute it and/or modify
 9+ * it under the terms of the GNU General Public License as published by
 10+ * the Free Software Foundation; either version 2 of the License, or
 11+ * (at your option) any later version.
 12+ *
 13+ * This program is distributed in the hope that it will be useful,
 14+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
 15+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 16+ * GNU General Public License for more details.
 17+ *
 18+ * You should have received a copy of the GNU General Public License along
 19+ * with this program; if not, write to the Free Software Foundation, Inc.,
 20+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 21+ * http://www.gnu.org/copyleft/gpl.html
 22+ *
 23+ * @ingroup Maintenance
 24+ */
 25+
 26+require_once( dirname( __FILE__ ) . '/Maintenance.php' );
 27+
 28+class FixExtLinksProtocolRelative extends LoggedUpdateMaintenance {
 29+ public function __construct() {
 30+ parent::__construct();
 31+ $this->mDescription = "Fixes any entries in the externallinks table containing protocol-relative URLs";
 32+ }
 33+
 34+ protected function getUpdateKey() {
 35+ return 'fix protocol-relative URLs in externallinks';
 36+ }
 37+
 38+ protected function updateSkippedMessage() {
 39+ return 'protocol-relative URLs in externallinks table already fixed.';
 40+ }
 41+
 42+ protected function doDBUpdates() {
 43+ $db = wfGetDB( DB_MASTER );
 44+ if ( !$db->tableExists( 'externallinks' ) ) {
 45+ $this->error( "externallinks table does not exist" );
 46+ return false;
 47+ }
 48+ $this->output( "Fixing protocol-relative entries in the externallinks table...\n" );
 49+ $res = $db->select( 'externallinks', array( 'el_from', 'el_to', 'el_index' ),
 50+ array( 'el_index' . $db->buildLike( '//', $db->anyString() ) ),
 51+ __METHOD__
 52+ );
 53+ $count = 0;
 54+ foreach ( $res as $row ) {
 55+ $count++;
 56+ if ( $count % 100 == 0 ) {
 57+ $this->output( $count );
 58+ wfWaitForSlaves();
 59+ }
 60+ $db->insert( 'externallinks',
 61+ array(
 62+ array(
 63+ 'el_from' => $row->el_from,
 64+ 'el_to' => $row->el_to,
 65+ 'el_index' => "http:{$row->el_index}",
 66+ ),
 67+ array(
 68+ 'el_from' => $row->el_from,
 69+ 'el_to' => $row->el_to,
 70+ 'el_index' => "https:{$row->el_index}",
 71+ )
 72+ ), __METHOD__, array( 'IGNORE' )
 73+ );
 74+ $db->delete( 'externallinks', array( 'el_index' => $row->el_index ), __METHOD__ );
 75+ }
 76+ $this->output( "Done, $count rows updated.\n" );
 77+ return true;
 78+ }
 79+}
 80+
 81+$maintClass = "FixExtLinksProtocolRelative";
 82+require_once( RUN_MAINTENANCE_IF_MAIN );
Property changes on: trunk/phase3/maintenance/fixExtLinksProtocolRelative.php
___________________________________________________________________
Added: svn:eol-style
183 + native
Index: trunk/phase3/tests/phpunit/includes/GlobalFunctions/GlobalTest.php
@@ -831,42 +831,42 @@
832832 }
833833
834834 /**
835 - * @dataProvider provideMakeUrlIndex()
 835+ * @dataProvider provideMakeUrlIndexes()
836836 */
837 - function testMakeUrlIndex( $url, $expected ) {
838 - $index = wfMakeUrlIndex( $url );
839 - $this->assertEquals( $expected, $index, "wfMakeUrlIndex(\"$url\")" );
 837+ function testMakeUrlIndexes( $url, $expected ) {
 838+ $index = wfMakeUrlIndexes( $url );
 839+ $this->assertEquals( $expected, $index, "wfMakeUrlIndexes(\"$url\")" );
840840 }
841841
842 - function provideMakeUrlIndex() {
 842+ function provideMakeUrlIndexes() {
843843 return array(
844844 array(
845845 // just a regular :)
846846 'https://bugzilla.wikimedia.org/show_bug.cgi?id=28627',
847 - 'https://org.wikimedia.bugzilla./show_bug.cgi?id=28627'
 847+ array( 'https://org.wikimedia.bugzilla./show_bug.cgi?id=28627' )
848848 ),
849849 array(
850850 // mailtos are handled special
851851 // is this really right though? that final . probably belongs earlier?
852852 'mailto:wiki@wikimedia.org',
853 - 'mailto:org.wikimedia@wiki.',
 853+ array( 'mailto:org.wikimedia@wiki.' )
854854 ),
855855
856856 // file URL cases per bug 28627...
857857 array(
858858 // three slashes: local filesystem path Unix-style
859859 'file:///whatever/you/like.txt',
860 - 'file://./whatever/you/like.txt'
 860+ array( 'file://./whatever/you/like.txt' )
861861 ),
862862 array(
863863 // three slashes: local filesystem path Windows-style
864864 'file:///c:/whatever/you/like.txt',
865 - 'file://./c:/whatever/you/like.txt'
 865+ array( 'file://./c:/whatever/you/like.txt' )
866866 ),
867867 array(
868868 // two slashes: UNC filesystem path Windows-style
869869 'file://intranet/whatever/you/like.txt',
870 - 'file://intranet./whatever/you/like.txt'
 870+ array( 'file://intranet./whatever/you/like.txt' )
871871 ),
872872 // Multiple-slash cases that can sorta work on Mozilla
873873 // if you hack it just right are kinda pathological,
@@ -875,6 +875,15 @@
876876 //
877877 // Those will survive the algorithm but with results that
878878 // are less consistent.
 879+
 880+ // protocol-relative URL cases per bug 29854...
 881+ array(
 882+ '//bugzilla.wikimedia.org/show_bug.cgi?id=28627',
 883+ array(
 884+ 'http://org.wikimedia.bugzilla./show_bug.cgi?id=28627',
 885+ 'https://org.wikimedia.bugzilla./show_bug.cgi?id=28627'
 886+ )
 887+ ),
879888 );
880889 }
881890
Index: trunk/phase3/includes/GlobalFunctions.php
@@ -647,12 +647,12 @@
648648 }
649649
650650 /**
651 - * Make a URL index, appropriate for the el_index field of externallinks.
 651+ * Make URL indexes, appropriate for the el_index field of externallinks.
652652 *
653653 * @param $url String
654 - * @return String
 654+ * @return array
655655 */
656 -function wfMakeUrlIndex( $url ) {
 656+function wfMakeUrlIndexes( $url ) {
657657 $bits = wfParseUrl( $url );
658658
659659 // Reverse the labels in the hostname, convert to lower case
@@ -692,7 +692,12 @@
693693 if ( isset( $bits['fragment'] ) ) {
694694 $index .= '#' . $bits['fragment'];
695695 }
696 - return $index;
 696+
 697+ if ( $prot == '' ) {
 698+ return array( "http:$index", "https:$index" );
 699+ } else {
 700+ return array( $index );
 701+ }
697702 }
698703
699704 /**
Index: trunk/phase3/includes/installer/DatabaseUpdater.php
@@ -43,7 +43,8 @@
4444 'DeleteDefaultMessages',
4545 'PopulateRevisionLength',
4646 'PopulateRevisionSha1',
47 - 'PopulateImageSha1'
 47+ 'PopulateImageSha1',
 48+ 'FixExtLinksProtocolRelative',
4849 );
4950
5051 /**
Index: trunk/phase3/includes/LinksUpdate.php
@@ -456,11 +456,13 @@
457457 $arr = array();
458458 $diffs = array_diff_key( $this->mExternals, $existing );
459459 foreach( $diffs as $url => $dummy ) {
460 - $arr[] = array(
461 - 'el_from' => $this->mId,
462 - 'el_to' => $url,
463 - 'el_index' => wfMakeUrlIndex( $url ),
464 - );
 460+ foreach( wfMakeUrlIndexes( $url ) as $index ) {
 461+ $arr[] = array(
 462+ 'el_from' => $this->mId,
 463+ 'el_to' => $url,
 464+ 'el_index' => $index,
 465+ );
 466+ }
465467 }
466468 return $arr;
467469 }
Index: trunk/phase3/includes/api/ApiQueryExternalLinks.php
@@ -69,6 +69,11 @@
7070 $this->addOption( 'ORDER BY', 'el_from' );
7171 }
7272
 73+ // If we're querying all protocols, use DISTINCT to avoid repeating protocol-relative links twice
 74+ if ( $protocol === null ) {
 75+ $this->addOption( 'DISTINCT' );
 76+ }
 77+
7378 $this->addOption( 'LIMIT', $params['limit'] + 1 );
7479 $offset = isset( $params['offset'] ) ? $params['offset'] : 0;
7580 if ( $offset ) {
Index: trunk/phase3/includes/AutoLoader.php
@@ -861,6 +861,7 @@
862862 'FakeMaintenance' => 'maintenance/Maintenance.php',
863863 'LoggedUpdateMaintenance' => 'maintenance/Maintenance.php',
864864 'Maintenance' => 'maintenance/Maintenance.php',
 865+ 'FixExtLinksProtocolRelative' => 'maintenance/fixExtLinksProtocolRelative.php',
865866 'PopulateCategory' => 'maintenance/populateCategory.php',
866867 'PopulateImageSha1' => 'maintenance/populateImageSha1.php',
867868 'PopulateLogSearch' => 'maintenance/populateLogSearch.php',

Sign-offs

UserFlagDate
Hasharinspected17:33, 14 November 2011

Follow-up revisions

RevisionCommit summaryAuthorDate
r102954Fix for r102949: don't delete too muchcatrope09:16, 14 November 2011
r103029REL1_18 MFT r99236, r102301, r102303, r102498, r102710, r102751, r102951, r10...reedy20:58, 14 November 2011
r103056MFT r102951, r102954reedy22:40, 14 November 2011
r104033Followup r102951: per CR, remove duplicated empty string from API extlinks mo...catrope13:41, 23 November 2011

Comments

#Comment by Bryan (talk | contribs)   09:34, 14 November 2011

indices?

#Comment by Catrope (talk | contribs)   09:41, 14 November 2011

Hmm, right, I guess so. I don't care much either way, feel free to rename if it bothers you :)

#Comment by Tim Starling (talk | contribs)   03:06, 12 December 2011
$ ack -i indexes phase3/includes | wc -l
100
$ ack -i indices phase3/includes | wc -l
18

I think it can stay how it is.

#Comment by Hashar (talk | contribs)   17:33, 14 November 2011

Hopefull the "WHERE (el_index LIKE '//%' ESCAPE '\' )" will run fine thanks to the index :) This script could have make use of the batch system for those wikis with millions of relative URLs :-D

TODO: still need to make sure the DISTINCT keyword added in the API call will not throw away the live DB. I had issues a long time ago with DISTINCT unconditionally skipping indexes under MySQL.

Looks good overall.

#Comment by Duplicatebug (talk | contribs)   19:00, 17 November 2011

You have to remove the search term '//' from the special page and from the api (The empty string has special meaning in the api, but on action=paraminfo&querymodules=extlinks the empty string is shown twice, so you should remove that).

#Comment by Catrope (talk | contribs)   11:06, 19 November 2011

I don't understand this comment, could you explain this with an example?

#Comment by Duplicatebug (talk | contribs)   09:55, 21 November 2011
#Comment by Catrope (talk | contribs)   20:47, 22 November 2011

Right, good point. This is because it's using $wgUrlProtocols, of course. Will poke tomorrow.

Status & tagging log