r90754 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r90753 | r90754 | r90755
Date: 04:45, 25 June 2011
Author: kbrown
Status: deferred
Tags:
Comment:
Start the maintenance script that will do the actual spidering of links. Make the Internet Archive the default if $wgArchiveLinksConfig['archive_service'] is not set to a valid option.
Modified paths:
  • /trunk/extensions/ArchiveLinks/ArchiveLinks.php (modified)
  • /trunk/extensions/ArchiveLinks/Spider.php (added)

Diff

Index: trunk/extensions/ArchiveLinks/ArchiveLinks.php
@@ -54,6 +54,7 @@
 $wgArchiveLinksConfig = array (
 	'archive_service' => 'wikiwix',
 	'use_multiple_archives' => false,
+	'run_spider_in_loop' => false,
 );
 
 class ArchiveLinks {
@@ -121,12 +122,14 @@
 			case 'wikiwix':
 				$link_to_archive = 'http://archive.wikiwix.com/cache/?url=' . $url;
 				break;
+			case 'webcitation':
+				$link_to_archive = 'http://webcitation.org/query?url=' . $url;
+				break;
 			case 'internet_archive':
+			default:
 				$link_to_archive = 'http://wayback.archive.org/web/*/' . $url;
 				break;
-			case 'webcitation':
-				$link_to_archive = 'http://webcitation.org/query?url=' . $url;
-				break;
+
 		}
 	}
 	//Note to self: need to fix this to use Html.php instead of direct html
Index: trunk/extensions/ArchiveLinks/Spider.php
@@ -0,0 +1,79 @@
+<?php
+/**
+ * This class is for the actual spidering and will be calling wget
+ */
+
+$path = getenv( 'MW_INSTALL_PATH' );
+if ( strval( $path ) === '' ) {
+	$path = dirname( __FILE__ ) . '/../..';
+}
+
+require_once "$path/maintenance/Maintenance.php";
+
+class ArchiveLinksSpider extends Maintenance {
+	private $db_master;
+	private $db_slave;
+	private $db_result;
+
+	public function execute() {
+		global $wgArchiveLinksConfig;
+
+		$this->db_master = $this->getDB( DB_MASTER );
+		$this->db_slave = $this->getDB( DB_SLAVE );
+		$this->db_result = array();
+
+		if ( $wgArchiveLinksConfig['run_spider_in_loop'] ) {
+			while ( TRUE ) {
+				if ( ( $url = $this->check_queue() ) !== false ) {
+					//do stuff
+				}
+				sleep(1);
+			}
+		} else {
+			if ( ( $url = $this->check_queue() ) !== false ) {
+				//do stuff
+			}
+		}
+		return null;
+	}
+
+	private function check_queue() {
+		$this->db_result['job-fetch'] = $this->db_slave->select( 'el_archive_queue', '*',
+			'`el_archive_queue`.`delay_time` <= ' . time()
+			. ' AND `el_archive_queue`.`in_progress` = 0'
+			. ' ORDER BY `el_archive_queue`.`queue_id` ASC'
+			. ' LIMIT 1');
+
+		if ( $this->db_result['job-fetch']->numRows() > 0 ) {
+			$row = $this->db_result['job-fetch']->fetchRow();
+
+			//Since we querried the slave to check for dups when we insterted instead of the master let's check
+			//that the job isn't in the queue twice, we don't want to archive it twice
+			$this->db_result['dup-check'] = $this->db_slave->select( 'el_archive_queue', '*', '`el_archive_queue`.`url` = "' . $row['url']
+				. '" ORDER BY `el_archive_queue`.`queue_id` ASC' );
+
+			if ( $this->db_result['dup-check']->numRows() > 1 ) {
+				//keep only the original jobs and remove all duplicates
+				$this->db_result['dup-check']->fetchRow();
+				while ( $del_row = $this->db_result['dup-check']->fetchRow() ) {
+					echo 'you have a dup ';
+					var_dump( $del_row );
+					//this is commented for testing purposes, so I don't have to keep readding the duplicate to my test db
+					//in other words this has a giant "remove before flight" ribbon hanging from it...
+					//$this->db_master->delete( 'el_archive_queue', '`el_archive_queue`.`queue_id` = ' . $del_row['queue_id'] );
+				}
+
+			}
+
+			return $row['url'];
+		} else {
+			//there are no jobs to do right now
+			return false;
+		}
+	}
+}
+
+$maintClass = 'ArchiveLinksSpider';
+require_once RUN_MAINTENANCE_IF_MAIN;
+
+?>
\ No newline at end of file
Property changes on: trunk/extensions/ArchiveLinks/Spider.php
___________________________________________________________________
Added: svn:eol-style
   + native
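
For anyone trying the new behaviour described in the commit comment, a minimal LocalSettings.php sketch (values assumed; only the keys visible in the diff above are used) — any 'archive_service' value other than 'wikiwix', 'webcitation' or 'internet_archive' now falls through to the Wayback Machine default:

$wgArchiveLinksConfig = array(
	'archive_service'       => 'webcitation',  // or 'wikiwix' / 'internet_archive'; anything else falls back to the Internet Archive
	'use_multiple_archives' => false,
	'run_spider_in_loop'    => false,          // new in this revision; true keeps Spider.php polling the queue
);

The spider itself is a standard Maintenance-class script, so it would normally be run as php extensions/ArchiveLinks/Spider.php from the wiki's root directory.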

Comments

#Comment by Bawolff (talk | contribs)   05:00, 25 June 2011

I'm not overly familiar with your code, so this might be a stupid comment, but this kind of looks like something that could maybe be handled by the job queue?

+	$this->db_result['job-fetch'] = $this->db_slave->select( 'el_archive_queue', '*', 
+		'`el_archive_queue`.`delay_time` <= ' . time()
+		. ' AND `el_archive_queue`.`in_progress` = 0'
+		. ' ORDER BY `el_archive_queue`.`queue_id` ASC'
+		. ' LIMIT 1');

This would be better structured like

select( 'el_archive_queue', '*',
      array(
           'delay_time <= ' . time(),
           'in_progress' => 0
      ),
      __METHOD__,
      array(
           'LIMIT' => 1,
           'ORDER BY' => 'queue_id'
      )
);

It might also be a good idea to explicitly list the fields you want instead of using *, but that's a matter of preference I suppose.
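
For example (just a sketch — the field names are taken from the queries in this revision), the whole fetch could become:

$res = $this->db_slave->select(
	'el_archive_queue',
	array( 'queue_id', 'url', 'delay_time', 'in_progress' ),
	array(
		'delay_time <= ' . time(),
		'in_progress' => 0
	),
	__METHOD__,
	array(
		'ORDER BY' => 'queue_id',
		'LIMIT' => 1
	)
);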

Cheers.

#Comment by Kevin Brown (talk | contribs)   05:48, 25 June 2011

hmm okay, I'll fix it to use arrays. I thought the other way looked cleaner though.

As far as using the job queue goes, I can't really do that: in addition to the long time it will take to run each job (probably several seconds, and in some cases maybe over 10 or 15 seconds), there will be jobs in the queue to recheck a URL after an unsuccessful archival attempt. As far as I know the job queue does not really include any way to prevent these jobs from running before they are supposed to. A workaround would of course be to re-add the job to the queue if it wasn't ready yet, but that just seems like a bad way of doing it.
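
For instance (purely hypothetical sketch, not in this revision, and the retry interval is made up), a failed archival attempt could just push its own row's delay_time forward, and the existing delay_time <= time() condition in check_queue() would keep it from being retried too early:

$this->db_master->update(
	'el_archive_queue',
	array( 'delay_time' => time() + 3600, 'in_progress' => 0 ),  // retry in an hour
	array( 'queue_id' => $row['queue_id'] ),
	__METHOD__
);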

#Comment by Nikerabbit (talk | contribs)   07:49, 25 June 2011

We also don't use ?> at the end of the file anymore.
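
That is, the script can simply end after the maintenance boilerplate, with no closing tag:

$maintClass = 'ArchiveLinksSpider';
require_once RUN_MAINTENANCE_IF_MAIN;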
