r85377 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r85376 | r85377 | r85378
Date: 20:59, 4 April 2011
Author: brion
Status: ok (Comments)
Tags:
Comment:
Workaround for bug 28146: running out of memory during Unicode validation/normalization when uploading DjVu file with lots of embedded page text

This provisional workaround runs a page at a time through UtfNormal::cleanUp() instead of running the entire file's dumped text at once. This avoids exploding memory too much during the preg_match_all() used to divide up ASCII and non-ASCII runs for validation, which is very wasteful for long texts in Latin languages with many mixed-in non-ASCII characters (like French and German text).
This won't fix legitimate cases of huge texts, such as really long page text, which would still be run through normalization at web input time in one giant chunk.
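
The per-page approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from DjVuImage.php: `cleanUpStandIn()` and `pageTextCallbackSketch()` are hypothetical names, the dump format and regex are simplified, and `mb_scrub()` (PHP 7.2+ with the mbstring extension) only stands in for `UtfNormal::cleanUp()`, which also normalizes the text.

```php
<?php
// Hypothetical stand-in for UtfNormal::cleanUp(). It only replaces invalid
// UTF-8 byte sequences; the real method also applies Unicode normalization.
function cleanUpStandIn( string $text ): string {
	return mb_scrub( $text, 'UTF-8' );
}

// Called once per matched page, so the cleanup only ever sees the text of a
// single page rather than the whole file's dumped text.
function pageTextCallbackSketch( array $matches ): string {
	return '<PAGE value="' . htmlspecialchars( cleanUpStandIn( $matches[1] ) ) . '" />';
}

// Simplified, assumed dump format: one (page "...") group per page.
// The second page contains an invalid byte (0xFF).
$dump = "(page \"Fran\xC3\xA7ais\")\n(page \"bad \xFF byte\")";

// No /u modifier: the pattern works on raw bytes, so invalid UTF-8 in the
// input does not make the match fail before cleanup has run.
$xml = preg_replace_callback( '/\(page\s+"([^"]*)"\)/s', 'pageTextCallbackSketch', $dump );

echo $xml, "\n";
```

Peak memory for the cleanup step then scales with the longest single page rather than with the whole document, which is also why a single very long text is not helped by this change.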
Modified paths:
  • /trunk/phase3/includes/DjVuImage.php (modified)

Diff

Index: trunk/phase3/includes/DjVuImage.php
@@ -254,8 +254,7 @@
 $txt = wfShellExec( $cmd, $retval );
 wfProfileOut( 'djvutxt' );
 if( $retval == 0) {
- # Get rid of invalid UTF-8, strip control characters
- $txt = UtfNormal::cleanUp( $txt );
+ # Strip some control characters
 $txt = preg_replace( "/[\013\035\037]/", "", $txt );
 $reg = <<<EOR
 /\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*"
@@ -279,7 +278,8 @@
 }
 
 function pageTextCallback( $matches ) {
- return '<PAGE value="' . htmlspecialchars( $matches[1] ) . '" />';
+ # Get rid of invalid UTF-8, strip control characters
+ return '<PAGE value="' . htmlspecialchars( UtfNormal::cleanUp( $matches[1] ) ) . '" />';
 }
 
 /**

Follow-up revisions

Revision | Commit summary | Author | Date
r86464 | 1.17wmf1: MFT r85377, r85555, r85583, r86100, r86121, r86130, r86142, r86146,... | catrope | 11:27, 20 April 2011
r86474 | 1.17: MFT r81731, r85377, r85547, r85555, r85583, r85803, r85881, r86100, r86... | catrope | 13:22, 20 April 2011

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r85155 | Memory stress test for UtfNormal issue re bug 28146... | brion | 21:14, 1 April 2011

Comments

Comment by Brion VIBBER, 15:52, 6 April 2011

Marking as needing merge to 1.17; the issue this fixes does occur on live sites.
