r57266 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r57266
Date:23:29, 1 October 2009
Author:brion
Status:ok
Tags:
Comment:
Cleanup for r56413 - PDF text extraction support:
* use UtfNormal::cleanUp() for UTF-8 and control char cleanup instead of iconv() and a manual strip
* remove the htmlspecialchars() which looks like it shouldn't be here; this is for internal data storage not HTML output
Modified paths:
  • /trunk/extensions/PdfHandler/PdfHandler.image.php (modified)

Diff

Index: trunk/extensions/PdfHandler/PdfHandler.image.php
@@ -101,14 +101,14 @@
 	wfDebug( __METHOD__.": $cmd\n" );
 	$txt = wfShellExec( $cmd, $retval );
 	wfProfileOut( 'pdftotext' );
-	if( $retval == 0) {
-		# Get rid of invalid UTF-8, strip control characters
-		wfSuppressWarnings();
-		$txt = iconv( "UTF-8","UTF-8//IGNORE", $txt );
-		wfRestoreWarnings();
-		$txt = preg_replace( "/[\013\035\037]/", "", $txt );
-		$txt = htmlspecialchars($txt);
-		$pages = preg_split("/\f/s", $txt );
+	if( $retval == 0 ) {
+		$txt = str_replace( "\r\n", "\n", $txt );
+		$pages = explode( "\f", $txt );
+		foreach( $pages as $page => $pageText ) {
+			# Get rid of invalid UTF-8, strip control characters
+			# Note we need to do this per page, as \f page feed would be stripped.
+			$pages[$page] = UtfNormal::cleanUp( $pageText );
+		}
 		$data['text'] = $pages;
 	}
 }
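
The ordering constraint noted in the new code comment (split on \f first, then clean each page) can be sketched outside MediaWiki as follows. This is a minimal illustration, not the extension's code: stripControlChars() is a hypothetical stand-in for UtfNormal::cleanUp() that only removes control characters and does no UTF-8 validation or normalization, and the sample string is made up.

```php
<?php
// Hypothetical stand-in for UtfNormal::cleanUp(), for illustration only.
// It strips C0 control characters except tab, LF and CR. It does NOT
// validate or normalize UTF-8, which is what the real call is used for.
function stripControlChars( string $text ): string {
	return preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text );
}

// Same order as the revised code: normalize line endings, split on
// form feed, then clean each page separately.
function splitThenClean( string $txt ): array {
	$txt = str_replace( "\r\n", "\n", $txt );
	$pages = explode( "\f", $txt );
	foreach ( $pages as $page => $pageText ) {
		$pages[$page] = stripControlChars( $pageText );
	}
	return $pages;
}

// Opposite order: cleaning first removes the \f separators, so
// explode() has nothing left to split on.
function cleanThenSplit( string $txt ): array {
	return explode( "\f", stripControlChars( $txt ) );
}

// Made-up sample in the shape of pdftotext output: two pages, each
// terminated by a form feed, with a stray control character on page two.
$sample = "Page one\r\n\fPage\x1F two\f";
$good = splitThenClean( $sample );
$bad = cleanThenSplit( $sample );
echo count( $good ), " vs ", count( $bad ), "\n"; // prints "3 vs 1"
```

The count of 3 includes a trailing empty element, because the sample ends with a form feed; the code in the diff would produce the same trailing entry for such input.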

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r56413 | extract text layer from pdf | thomasv | 13:50, 16 September 2009

Status & tagging log