r94726 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r94725‎ | r94726 | r94727 >
Date:02:51, 17 August 2011
Author:kbrown
Status:deferred (Comments)
Tags:
Comment:
*Changes to spider.php to use wfShellExec instead of shell_exec
*Updates to log parsing regexes, including using lazy instead of greedy where applicable
*Fix results array to be in correct format
Modified paths:
  • /trunk/extensions/ArchiveLinks/spider.php (modified) (history)

Diff [purge]

Index: trunk/extensions/ArchiveLinks/spider.php
@@ -16,6 +16,7 @@
1717 private $db_slave;
1818 private $db_result;
1919 private $jobs;
 20+ private $downloaded_files;
2021
2122 public function execute( ) {
2223 global $wgArchiveLinksConfig, $wgLoadBalancer, $path;
@@ -63,56 +64,71 @@
6465
6566 private function call_wget( $url ) {
6667 global $wgArchiveLinksConfig, $path;
 68+
 69+ //Check Configuration
 70+ if ( isset( $wgArchiveLinksConfig['file_types'] ) ) {
 71+ if ( is_array( $wgArchiveLinksConfig['file_types']) ){
 72+ $accept_file_types = '-A ' . implode( ',', $wgArchiveLinksConfig['file_types'] );
 73+ } else {
 74+ $accept_file_types = '-A ' . $wgArchiveLinksConfig['file_types'];
 75+ }
 76+ } else {
 77+ //we should set a default, for now we will disable this for testing purposes, but this should be closed sometime later...
 78+ $accept_file_types = '';
 79+ }
 80+ //At the current time we are only adding support for the local filestore, but swift support is something that will be added later
 81+ //Add shutup operator for PHP notice, it's okay if this is not set as it's an optional config value
 82+ switch( @$wgArchiveLinksConfig['filestore'] ) {
 83+ case 'local':
 84+ default:
 85+ if ( isset( $wgArchiveLinksConfig['subfolder_name'] ) ) {
 86+ $dir = $path . $wgArchiveLinksConfig['subfolder_name'];
 87+ } elseif ( isset( $wgArchiveLinksConfig['content_path'] ) ) {
 88+ $dir = realpath( $wgArchiveLinksConfig['content_path'] );
 89+ if ( !$dir ) {
 90+ die ( 'The path you have set for $wgArchiveLinksConfig[\'content_path\'] does not exist. ' .
 91+ 'This makes the spider a very sad panda. Please either create it or use a different setting.');
 92+ }
 93+ } else {
 94+ $dir = $path . '/archived_content/';
 95+ }
 96+ $dir = $dir . sha1( time() . ' - ' . $url );
 97+ mkdir( $dir, 0644, TRUE );
 98+ $log_dir = $dir . '/log.txt';
 99+ $log_dir_esc = escapeshellarg($log_dir);
 100+ $dir = escapeshellarg( $dir );
 101+ $sanitized_url = escapeshellarg( $url );
 102+ }
 103+
 104+ if ( ! isset( $wgArchiveLinksConfig['wget_quota'] ) ) {
 105+ //We'll set the default max quota for any specific web page for 8 mb, which is kind of a lot but should allow for large images
 106+ $wgArchiveLinksConfig['wget_quota'] = '8m';
 107+ }
 108+
 109+ if ( !isset( $wgArchiveLinksConfig['retry_times'] ) ) {
 110+ //by default wget is set to retry something 20 times which is probably *way* too high for our purposes
 111+ //this has the potential to really slow it down as --waitretry is set to 10 seconds by default, meaning that it would take
 112+ //serveral minutes to go through all the retries which has the potential to stall the spider unnecessarily
 113+ $wgArchiveLinksConfig['retry_times'] = '3';
 114+ }
 115+
 116+
 117+ //Do stuff with wget
67118 if ( isset( $wgArchiveLinksConfig['wget_path'] ) && file_exists( $wgArchiveLinksConfig['wget_path'] ) ) {
68119 die ( 'Support is not yet added for wget in a different directory' );
69120 } elseif ( file_exists( "$path/wget.exe" ) ) {
70 - if ( isset( $wgArchiveLinksConfig['file_types'] ) ) {
71 - if ( is_array( $wgArchiveLinksConfig['file_types']) ){
72 - $accept_file_types = '-A ' . implode( ',', $wgArchiveLinksConfig['file_types'] );
73 - } else {
74 - $accept_file_types = '-A ' . $wgArchiveLinksConfig['file_types'];
 121+ wfShellExec( "cd $path" );
 122+ //echo "\n\nwget.exe -nv -p -H -E -k -t {$wgArchiveLinksConfig['retry_times']} -Q{$wgArchiveLinksConfig['retry_times']} -o $log_dir -P $dir $accept_file_types $sanitized_url\n\n";
 123+ wfShellExec( "wget.exe -nv -p -H -E -k -t {$wgArchiveLinksConfig['retry_times']} -Q {$wgArchiveLinksConfig['wget_quota']} -o $log_dir_esc -P $dir $accept_file_types $sanitized_url" );
 124+ $this->parse_wget_log( $log_dir, $url );
 125+ /*foreach( $this->downloaded_files as $file ) {
 126+ if ( $file['status'] === 'success' ) {
 127+
 128+ } elseif ( $file['status'] === 'failure' ) {
 129+ echo 'bar';
75130 }
76 - } else {
77 - //we should set a default, for now we will disable this for testing purposes, but this should be closed sometime later...
78 - $accept_file_types = '';
79 - }
80 - //At the current time we are only adding support for the local filestore, but swift support is something that will be added later
81 - //Add shutup operator for PHP notice, it's okay if this is not set as it's an optional config value
82 - switch( @$wgArchiveLinksConfig['filestore'] ) {
83 - case 'local':
84 - default:
85 - if ( isset( $wgArchiveLinksConfig['subfolder_name'] ) ) {
86 - $dir = $path . $wgArchiveLinksConfig['subfolder_name'];
87 - } elseif ( isset( $wgArchiveLinksConfig['content_path'] ) ) {
88 - $dir = realpath( $wgArchiveLinksConfig['content_path'] );
89 - if ( !$dir ) {
90 - die ( 'The path you have set for $wgArchiveLinksConfig[\'content_path\'] does not exist. ' .
91 - 'This makes the spider a very sad panda. Please either create it or use a different setting.');
92 - }
93 - } else {
94 - $dir = $path . '/archived_content/';
95 - }
96 - $dir = $dir . sha1( time() . ' - ' . $url );
97 - mkdir( $dir, 0644, TRUE );
98 - $dir = escapeshellarg( $dir );
99 - $sanitized_url = escapeshellarg( $url );
100 - }
101 -
102 - if ( ! isset( $wgArchiveLinksConfig['wget_quota'] ) ) {
103 - //We'll set the default max quota for any specific web page for 8 mb, which is kind of a lot but should allow for large images
104 - $quota = '8m';
105 - }
106 -
107 - if ( !isset( $wgArchiveLinksConfig['retry_times'] ) ) {
108 - //by default wget is set to retry something 20 times which is probably *way* too high for our purposes
109 - //this has the potential to really slow it down as --waitretry is set to 10 seconds by default, meaning that it would take
110 - //serveral minutes to go through all the retries which has the potential to stall the spider unnecessarily
111 - $wgArchiveLinksConfig['retry_times'] = '3';
112 - }
113 -
114 - shell_exec( "cd $path" );
115 - shell_exec( "wget.exe -nv -p -H -E -k -t {$wgArchiveLinksConfig['retry_times']} -Q{$wgArchiveLinksConfig['retry_times']} -o $dir/log.txt -P $dir $accept_file_types $sanitized_url" );
116 - $this->parse_wget_log( "$dir/log.txt", $url );
 131+ }*/
 132+ $this->db_master->insert( $this->downloaded_files[0]['url'] );
117133 } else {
118134 //this is primarily designed with windows in mind and no built in wget, so yeah, *nix support should be added, in other words note to self...
119135 die ( 'wget must be installed in order for the spider to function in wget mode' );
@@ -121,11 +137,11 @@
122138
123139 private function check_queue( ) {
124140 //need to fix this to use arrays instead of what I'm doing now
125 - $this->db_result['job-fetch'] = $this->db_slave->select('el_archive_queue', '*', '`el_archive_queue`.`delay_time` <= ' . time()
126 - . ' AND `el_archive_queue`.`in_progress` = 0'
127 - . ' ORDER BY `el_archive_queue`.`queue_id` ASC'
128 - . ' LIMIT 1');
129 -
 141+ $this->db_result['job-fetch'] = $this->db_slave->select( 'el_archive_queue', '*',
 142+ array( 'delay_time' => ' >=' . time(), 'in_progress' => '0'),
 143+ __METHOD__,
 144+ array( 'ORDER BY' => 'queue_id ASC', 'LIMIT' => '1' ));
 145+
130146 if ( $this->db_result['job-fetch']->numRows() > 0 ) {
131147 $row = $this->db_result['job-fetch']->fetchRow();
132148
@@ -244,37 +260,52 @@
245261 }
246262
247263 private function parse_wget_log( $log_path, $url ) {
248 - $fp = fopen( $log_path, 'r' ) or die( 'can\'t find wget log file to parse' );
 264+ //We have a die statement here, PHP error unnecessary
 265+ @$fp = fopen( $log_path, 'r' ) or die( 'can\'t find wget log file to parse' );
249266
250 - $downloaded_files = array ( 'failed' => array(), 'success' => array() );
 267+ $this->downloaded_files = array ( );
251268
 269+ $line_regexes = array (
 270+ 'url' => '%^\d{4}-(?:\d{2}(?:-|:| )?){5}URL:(http://.*?) \[.+?\] ->%',
 271+ 'finish' => '%^Downloaded: \d+ files, (\d(?:.\d)?+(?:K|M)).*%',
 272+ 'sole_url' => '%^(http://.*):%',
 273+ 'error' => '%^\d{4}-(?:\d{2}-?){2} (?:\d{2}:?){3} ERROR (\d){3}:(.+)%',
 274+ 'quota_exceed' => '%^Download quota of .*? EXCEEDED!%',
 275+ 'finish_line' => '%^FINISHED --(\d{4}-(?:\d{2}(?:-|:| )){5})-%',
 276+ );
 277+
252278 while ( $line = fgets( $fp ) ) {
253 - $line_regexes = array (
254 - 'url' => '%\^d{4}-(?:\d{2}-?){2} (?:\d{2}:?){3} URL:(http://.*) \[.+\] ->%',
255 - 'finish' => '%^Downloaded: \d+ files, (\d+(?:K|M)).*%',
256 - 'sole_url' => '%^(http://.*):%',
257 - 'error' => '%^\d{4}-(?:\d{2}-?){2} (?:\d{2}:?){3} ERROR (\d){3}:(.+)%',
258 -
259 - );
260279 foreach( $line_regexes as $line_type => $regex ) {
261280 if ( preg_match( $regex, $line, $matches ) ) {
262281 switch ( $line_type ) {
263282 case 'url':
264 - $downloaded_files['success'][] = $matches[1];
 283+ $this->downloaded_files[] = array (
 284+ 'status' => 'success',
 285+ 'url' => $matches[1]
 286+ );
265287 $last_line = 'url';
266288 break;
267289 case 'sole_url':
268 - $downloaded_files['failed'][]['url'] = $matches[1];
 290+ $this->downloaded_files[] = array (
 291+ 'status' => 'failed',
 292+ 'url' => $matches[1]
 293+ );
269294 break;
270295 case 'error':
271 - end( $downloaded_files['failed'] );
272 - $array_key = key( $downloaded_files['failed'] );
273 - $downloaded_files['failed'][$array_key]['error_code'] = $matches[1];
274 - $downloaded_files['failed'][$array_key]['error_text'] = $matches[2];
 296+ //this is a contination of the previous line, so just add stuff to that
 297+ end( $this->downloaded_files );
 298+ $array_key = key( $this->downloaded_files );
 299+ $this->downloaded_files[$array_key]['error_code'] = $matches[1];
 300+ $this->downloaded_files[$array_key]['error_text'] = $matches[2];
275301 break;
276302 case 'finish':
277303 $finish_time = $matches[1];
278304 break;
 305+ case 'finish_line':
 306+ //this is kind of useless, it contains the date/time stamp of when the download finished
 307+ break;
 308+ case 'quote_exceed':
 309+ break;
279310 default:
280311 //we missed a line type, this is mainly for testing purposes and shouldn't happen when parsing the log
281312 echo "\n\nUNKNOWN LINE: $line\n\n";
@@ -283,6 +314,8 @@
284315 }
285316 }
286317 }
 318+
 319+ return $this->downloaded_files;
287320 }
288321 }
289322

Follow-up revisions

RevisionCommit summaryAuthorDate
r95208*add rudimentary and probably broken tests for link rewriter function...kbrown15:12, 22 August 2011

Comments

#Comment by NeilK (talk | contribs)   16:03, 19 August 2011

@$wgArchiveLinksConfig['filestore']

For your switches you are using the 'shutup' operator a lot. We don't use the "shutup" operator unless it's absolutely necessary. Just wrap it in another if statement.

Please don't commit debug code like the //echo "\n\nwget.exe and the next foreach after that

    @$fp = fopen( $log_path, 'r' ) or die( 'can\'t find wget log file to parse' );

This is confusing, what is the @ symbol even for if you die?

Otherwise looks okay

#Comment by 😂 (talk | contribs)   16:05, 19 August 2011

@ is never necessary, it's forbidden :)

#Comment by NeilK (talk | contribs)   17:38, 19 August 2011

^demon: I had some idea that I'd seen it once in MediaWiki code (wrapped in some other complex operations), but I can't find that now... okay, amend comment to never necessary.

#Comment by 😂 (talk | contribs)   17:44, 19 August 2011

If you can find it, I can point out why it's probably incorrect ;-)

Status & tagging log