r70103 MediaWiki - Code Review archive

Repository:	MediaWiki
Revision:	< r70102‎ \| r70103 \| r70104 >
Date:	19:24, 28 July 2010
Author:	hartman
Status:	ok (Comments)
Tags:
Comment:	Recognize webm and matroska files. See also Bug 23888 Currently all webm files are stored as video/webm. It is not possible to detect wether this file is an audio file without using a full parser. This is why We should really move mime and mediatype accessors to the MediaHandlers. Using video/x-matroska for MKV files. There is no official mime for MKV (though the webm isn't official either, but everyone is already using it apparently).
Modified paths:	/trunk/phase3/includes/MimeMagic.php (modified) (history) /trunk/phase3/includes/mime.info (modified) (history) /trunk/phase3/includes/mime.types (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/MimeMagic.php
—	—	@@ -396,7 +396,8 @@
397	397	'xbm',
398	398
399	399	// Formats we recognize magic numbers for
400		~~- 'djvu', 'ogx', 'ogg', 'ogv', 'oga', 'spx', 'mid', 'pdf', 'wmf', 'xcf',~~
	400	+ 'djvu', 'ogx', 'ogg', 'ogv', 'oga', 'spx',
	401	+ 'mid', 'pdf', 'wmf', 'xcf', 'webm', 'mkv', 'mka',
401	402
402	403	// XML formats we sure hope we recognize reliably
403	404	'svg',
—	—	@@ -468,6 +469,24 @@
469	470	}
470	471	}
471	472
	473	+ /* Look for WebM and Matroska files */
	474	+ if( strncmp( $head, pack( "C4", 0x1a, 0x45, 0xdf, 0xa3 ), 4 ) == 0 ) {
	475	+ $doctype = strpos( $head, "\x42\x82" );
	476	+ if( $doctype ) {
	477	+ // Next byte is datasize, then data (sizes larger than 1 byte are very stupid muxers)
	478	+ $data = substr($head, $doctype+3, 8);
	479	+ if( strncmp( $data, "matroska", 8 ) == 0 ) {
	480	+ wfDebug( __METHOD__ . ": recognized file as video/x-matroska\n" );
	481	+ return "video/x-matroska";
	482	+ } else if ( strncmp( $data, "webm", 4 ) == 0 ) {
	483	+ wfDebug( __METHOD__ . ": recognized file as video/webm\n" );
	484	+ return "video/webm";
	485	+ }
	486	+ }
	487	+ wfDebug( __METHOD__ . ": unknown EBML file\n" );
	488	+ return "unknown/unknown";
	489	+ }
	490	+
472	491	/*
473	492	* Look for PHP. Check for this before HTML/XML... Warning: this is a
474	493	* heuristic, and won't match a file with a lot of non-PHP before. It
Index: trunk/phase3/includes/mime.types
—	—	@@ -64,7 +64,9 @@
65	65	audio/midi mid midi kar
66	66	audio/mpeg mpga mp2 mp3
67	67	audio/ogg oga ogg spx
	68	+audio/webm webm
68	69	audio/x-aiff aif aiff aifc
	70	+audio/x-matroska mka mkv
69	71	audio/x-mpegurl m3u
70	72	audio/x-ogg oga ogg spx
71	73	audio/x-pn-realaudio ram rm
—	—	@@ -114,7 +116,9 @@
115	117	video/ogg ogv ogm ogg
116	118	video/quicktime qt mov
117	119	video/vnd.mpegurl mxu
	120	+video/webm webm
118	121	video/x-flv flv
	122	+video/x-matroska mkv mka
119	123	video/x-msvideo avi
120	124	video/x-ogg ogv ogm ogg
121	125	video/x-sgi-movie movie
Index: trunk/phase3/includes/mime.info
—	—	@@ -35,11 +35,15 @@
36	36	audio/x-aiff [AUDIO]
37	37	audio/x-pn-realaudio [AUDIO]
38	38	audio/x-realaudio [AUDIO]
	39	+audio/webm [AUDIO]
	40	+audio/x-matroska [AUDIO]
39	41
40	42	video/mpeg application/mpeg [VIDEO]
41	43	video/ogg [VIDEO]
42	44	video/x-sgi-video [VIDEO]
43	45	video/x-flv [VIDEO]
	46	+video/webm [VIDEO]
	47	+video/x-matroska [VIDEO]
44	48
45	49	application/ogg application/x-ogg audio/ogg audio/x-ogg video/ogg video/x-ogg [MULTIMEDIA]
46	50

Comments

#Comment by Bryan (talk | contribs) 08:38, 20 December 2010

Why are you using pack( "C4", 0x1a, 0x45, 0xdf, 0xa3 ) instead of "\0x1a\0x45\0xdf\0xa3"?

I don't fully understand EBML, but shouldn't you use substr($head, $doctype+2,1) to determine the length of the doctype?

#Comment by Platonides (talk | contribs) 18:46, 20 December 2010

Looking at http://ebml.sourceforge.net/specs/ there could be more headers before the doctype one.

I think hartman is right and we should create an EBML parser. This has the advantage that we can make it so it doesn't try to load all the file/metadata in memory at once, as we suffered from other formats.

#Comment by Bryan (talk | contribs) 19:09, 20 December 2010

We're only loading 1K, not the entire file. It could of course be possible that the doctype is outside this first 1K, but that is not very likely.

#Comment by Platonides (talk | contribs) 20:17, 20 December 2010

Oh, I wasn't thinking on an exhaustion on this place, but we would want to reuse such parser for metadata extraction at another point.

#Comment by TheDJ (talk | contribs) 19:11, 21 December 2010

I tried using just strncmp, and it didn't work.

Regarding size: yes, you should, but then you need to decode the size as well.

See datasize: http://ebml.sourceforge.net/specs/

It can be implemented, but bitparsers in PHP are annoying to write in my experience. I'm not sure it's worth the effort.

#Comment by Bryan (talk | contribs) 21:39, 21 December 2010

utf like integer encoding... wtf?

Marking as ok.

Status & tagging log

21:39, 21 December 2010 Bryan (talk | contribs) changed the status of r70103 [removed: new added: ok]