r52158 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r52158
Date:09:51, 19 June 2009
Author:thomasv
Status:reverted
Tags:
Comment:
encode string to utf8 before converting to xml
Modified paths:
  • /trunk/phase3/includes/DjVuImage.php (modified) (history)

Diff

Index: trunk/phase3/includes/DjVuImage.php
@@ -250,6 +250,7 @@
 		$txt = wfShellExec( $cmd, $retval );
 		wfProfileOut( 'djvutxt' );
 		if( $retval == 0) {
+			$txt = utf8_encode($txt);
 			$txt = htmlspecialchars($txt);
 			$txt = preg_replace( "/\((page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\&quot;([^<]*?)\&quot;\s*|)\)/s", "<PAGE value=\"$2\" />", $txt );
 			$txt = "<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n" . $txt . "</BODY>\n</DjVuTxt>\n";
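As context for the later revert, here is a minimal sketch (illustrative only, not part of the patch) of why calling utf8_encode() on text that is already UTF-8 corrupts it: utf8_encode() assumes ISO-8859-1 input, so each byte of an existing multibyte sequence gets re-encoded on its own.

```php
<?php
// Sketch (assumption: demonstrates the double-encoding behind the revert;
// this is not code from DjVuImage.php). utf8_encode() treats its input as
// ISO-8859-1, so applying it to text that is already UTF-8 re-encodes
// each byte of a multibyte sequence separately. (utf8_encode() is
// deprecated in modern PHP, but was current in 2009.)
$alreadyUtf8 = "\xC3\xA9";              // "é" encoded as UTF-8
$double = utf8_encode($alreadyUtf8);    // 0xC3 -> C3 83, 0xA9 -> C2 A9

echo bin2hex($alreadyUtf8), "\n";       // c3a9
echo bin2hex($double), "\n";            // c383c2a9  (mojibake: "Ã©")
```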

Follow-up revisions

RevisionCommit summaryAuthorDate
r55745Revert r52158 per CR: was attempting to convert text to UTF-8 that was alread...tstarling06:34, 2 September 2009

Comments

#Comment by Brion VIBBER (talk | contribs)   17:00, 25 August 2009

Hmmm.... man page for djvutxt says:

"Program djvutxt decodes the hidden text layer of a DjVu document inputdjvufile and prints the UTF8 encoded text into file outputtxtfile or the standard output. No output is produced if the file contains no hidden text layer. The hidden text layer is usually generated with the help of an optical character recognition software."

If it's UTF-8-encoded to begin with, this will just corrupt the data. Can you provide a sample file?

#Comment by ThomasV (talk | contribs)   06:06, 26 August 2009

Here is a sample file exhibiting the problem I was trying to fix: http://commons.wikimedia.org/wiki/File:Lettres_persanes_I.djvu

#Comment by Tim Starling (talk | contribs)   06:32, 2 September 2009

That file was encoded with all non-ASCII characters replaced with delete characters (0x7f). There's no way to recover the original text.

#Comment by ThomasV (talk | contribs)   08:56, 2 September 2009

OK, but this does not solve the problem. There are other DjVu files that have the same problem; see bug 9327.

Should delete characters just be removed?
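One possible answer, sketched here as an assumption rather than the fix that was eventually committed: since the DEL bytes carry no recoverable text, they can simply be stripped before the XML wrapping.

```php
<?php
// Sketch (assumption, not the committed fix): 0x7F (DEL) bytes left
// behind where non-ASCII characters used to be carry no recoverable
// text, so one option is to drop them outright.
$txt = "libert\x7F";                 // e.g. "liberté" with "é" lost to DEL
$txt = str_replace("\x7F", '', $txt);

var_dump($txt);                      // string(6) "libert"
```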

#Comment by ThomasV (talk | contribs)   10:29, 2 September 2009

The problem is that the output of djvutxt contains some non-ASCII characters that are not accepted by the XML parser. I do not think that these characters are valid UTF-8: for example, in the sample file, we have \234 appearing alone at the beginning of the file.

I am not sure if this is caused by djvutxt or by ob_get_contents() called in wfShellExec.
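As a sketch of one possible cleanup (an illustration under assumptions, not the committed fix), PHP's mbstring extension can drop byte sequences that are not valid UTF-8, such as the stray \234 (0x9C) byte described above:

```php
<?php
// Sketch (hypothetical cleanup, not the committed fix): discard bytes
// that are not valid UTF-8 instead of re-encoding the whole string.
mb_substitute_character('none');     // drop invalid sequences entirely
$txt = "abc\x9Cdef";                 // stray 0x9C (octal \234), as in the sample
$clean = mb_convert_encoding($txt, 'UTF-8', 'UTF-8');

var_dump($clean);                    // string(6) "abcdef"
```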

#Comment by Tim Starling (talk | contribs)   20:43, 2 September 2009

The format output by djvutxt bears almost no relation to the parsing code contained in DjVuImage::retreiveMetaData(). It's lucky that you get anything at all out of it. For instance, if the input text contains a double quote, retreiveMetaData() will truncate the string at that point. Escape sequences are used to encode quotes, most control characters and all non-ASCII characters, but there is no decoding for them. Control characters are inserted to delimit pages, columns, regions and paragraphs; these characters need to be stripped. The format is documented on the man page of djvused.
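As an illustration of the stripping described above, here is a sketch under the assumption that \013, \035 and \037 are the column, region and paragraph separators listed on the djvused man page (this is not code from MediaWiki):

```php
<?php
// Sketch (assumption): replace the djvused separator control characters
// with spaces before handing the text to an XML parser.
// \013 (VT) = column separator, \035 (GS) = region separator,
// \037 (US) = paragraph separator.
$txt = "one\x0Btwo\x1Dthree\x1Ffour";
$txt = str_replace(array("\x0B", "\x1D", "\x1F"), ' ', $txt);

var_dump($txt);                      // string(18) "one two three four"
```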

#Comment by ThomasV (talk | contribs)   06:58, 3 September 2009

Why would the input contain a double quote? The manual mentions control characters \013, \035 and \037, and indeed they should be stripped. But that's not enough: there are other characters that cause the XML parser to fail, like \234 in the example I provided. I do not see anything about it in the documentation.

#Comment by ThomasV (talk | contribs)   14:21, 3 September 2009

I committed r55768; it should fix it.
