r49669 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r49668‎ | r49669 | r49670 >
Date:18:00, 20 April 2009
Author:thomasv
Status:reverted (Comments)
Tags:
Comment:
extract text layer from djvu file (see bug 18046)
Modified paths:
  • /trunk/phase3/includes/DefaultSettings.php (modified) (history)
  • /trunk/phase3/includes/media/DjVu.php (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/media/DjVu.php
@@ -52,6 +52,8 @@
5353 $m = false;
5454 if ( preg_match( '/^page(\d+)-(\d+)px$/', $str, $m ) ) {
5555 return array( 'width' => $m[2], 'page' => $m[1] );
 56+ } else if ( preg_match( '/^page(\d+)-djvutxt$/', $str, $m ) ) {
 57+ return array( 'djvutxt' => 1, 'page' => $m[1] );
5658 } else {
5759 return false;
5860 }
@@ -64,8 +66,22 @@
6567 );
6668 }
6769
 70+
 71+ function normaliseParams( $image, &$params ) {
 72+ global $wgDjvuTxt;
 73+ if( $params['djvutxt'] && $wgDjvuTxt) {
 74+ if ( !isset( $params['page'] ) ) {
 75+ $params['page'] = 1;
 76+ }
 77+ $params['width'] = 0;
 78+ $params['height'] = 0;
 79+ return true;
 80+ }
 81+ else return parent::normaliseParams( $image, $params );
 82+ }
 83+
6884 function doTransform( $image, $dstPath, $dstUrl, $params, $flags = 0 ) {
69 - global $wgDjvuRenderer, $wgDjvuPostProcessor;
 85+ global $wgDjvuRenderer, $wgDjvuPostProcessor, $wgDjvuTxt, $wgSed;
7086
7187 // Fetch XML and check it, to give a more informative error message than the one which
7288 // normaliseParams will inevitably give.
@@ -96,12 +112,22 @@
97113
98114 # Use a subshell (brackets) to aggregate stderr from both pipeline commands
99115 # before redirecting it to the overall stdout. This works in both Linux and Windows XP.
100 - $cmd = '(' . wfEscapeShellArg( $wgDjvuRenderer ) . " -format=ppm -page={$page} -size={$width}x{$height} " .
101 - wfEscapeShellArg( $srcPath );
102 - if ( $wgDjvuPostProcessor ) {
103 - $cmd .= " | {$wgDjvuPostProcessor}";
 116+
 117+ if( $params['djvutxt'] && $wgDjvuTxt && $wgSed ) {
 118+ #Read text from djvu
 119+ $cmd = '(' . wfEscapeShellArg( $wgDjvuTxt ) . " --page={$page} " . wfEscapeShellArg( $srcPath );
 120+ #Escape < > & characters
 121+ $cmd .= ' | ' . wfEscapeShellArg( $wgSed ) . ' "s/\&/\&amp;/g ; s/</\&lt;/g ; s/>/\&gt;/g ; s/\"/\&quot;/g "';
 122+ $cmd .= ' > ' . wfEscapeShellArg($dstPath) . ') 2>&1';
104123 }
105 - $cmd .= ' > ' . wfEscapeShellArg($dstPath) . ') 2>&1';
 124+ else {
 125+ $cmd = '(' . wfEscapeShellArg( $wgDjvuRenderer ) . " -format=ppm -page={$page} -size={$width}x{$height} " .
 126+ wfEscapeShellArg( $srcPath );
 127+ if ( $wgDjvuPostProcessor ) {
 128+ $cmd .= " | {$wgDjvuPostProcessor}";
 129+ }
 130+ $cmd .= ' > ' . wfEscapeShellArg($dstPath) . ') 2>&1';
 131+ }
106132 wfProfileIn( 'ddjvu' );
107133 wfDebug( __METHOD__.": $cmd\n" );
108134 $err = wfShellExec( $cmd, $retval );
Index: trunk/phase3/includes/DefaultSettings.php
@@ -1898,6 +1898,11 @@
18991899 $wgDiff = '/usr/bin/diff';
19001900
19011901 /**
 1902+ * Path to the GNU sed utility.
 1903+ */
 1904+$wgSed = '/bin/sed';
 1905+
 1906+/**
19021907 * We can also compress text stored in the 'text' table. If this is set on, new
19031908 * revisions will be compressed on page save if zlib support is available. Any
19041909 * compressed revisions will be decompressed on load regardless of this setting
@@ -3533,6 +3538,13 @@
35343539 $wgDjvuRenderer = null;
35353540
35363541 /**
 3542+ * Path of the djvutxt DJVU text extraction utility
 3543+ * Enable this and $wgDjvuDump to enable text layer extraction from djvu files
 3544+ */
 3545+# $wgDjvuTxt = 'djvutxt';
 3546+$wgDjvuTxt = null;
 3547+
 3548+/**
35373549 * Path of the djvutoxml executable
35383550 * This works like djvudump except much, much slower as of version 3.5.
35393551 *

Follow-up revisions

RevisionCommit summaryAuthorDate
r49805follow-up to r49669thomasv08:58, 24 April 2009
r50026Revert r49669, r49670 "extract text layer from djvu file (see bug 18046)"...brion22:54, 28 April 2009
r50050extract djvu text (bug 18046); escape possible script with htmlspecialchars i...thomasv19:33, 29 April 2009
r50051fetch djvu text (bug 18046); using parser hook instead of ajax, and Http::get...thomasv19:37, 29 April 2009

Comments

#Comment by Simetrical (talk | contribs)   16:43, 21 April 2009

I'm admittedly not getting any of the context here, but this seems like if you don't have sed installed (e.g., Windows), this will fail, and this isn't documented AFAICT. If $wgDjvuTxt requires $wgSed to have a valid setting, shouldn't that at least be mentioned in its comment in DefaultSettings?

#Comment by ThomasV (talk | contribs)   08:23, 23 April 2009

indeed; I guess the default value for $wgSed should be set to null.

#Comment by Simetrical (talk | contribs)   13:18, 23 April 2009

Or you could file_exists() it before trying to run it. You still need to document that you can't use $wgDjvuTxt without a working $wgSed ― that's pretty significant for Windows users.

#Comment by Brion VIBBER (talk | contribs)   22:41, 28 April 2009

I don't really like the use of sed here; it's an extra Unix dependency which seems unnecessary, when we could trivially do the escaping in the code.

The ajax fetch added in r49670 also looks like a big security hole. Will revert them both for now...

#Comment by Brion VIBBER (talk | contribs)   22:54, 28 April 2009

Reverted for now in r50026

Status & tagging log