r70103 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r70102‎ | r70103 | r70104 >
Date:19:24, 28 July 2010
Author:hartman
Status:ok (Comments)
Tags:
Comment:
Recognize webm and matroska files. See also Bug 23888

Currently all webm files are stored as video/webm. It is not possible to detect
wether this file is an audio file without using a full parser. This is why We should
really move mime and mediatype accessors to the MediaHandlers.

Using video/x-matroska for MKV files. There is no official mime for MKV (though the
webm isn't official either, but everyone is already using it apparently).
Modified paths:
  • /trunk/phase3/includes/MimeMagic.php (modified) (history)
  • /trunk/phase3/includes/mime.info (modified) (history)
  • /trunk/phase3/includes/mime.types (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/MimeMagic.php
@@ -396,7 +396,8 @@
397397 'xbm',
398398
399399 // Formats we recognize magic numbers for
400 - 'djvu', 'ogx', 'ogg', 'ogv', 'oga', 'spx', 'mid', 'pdf', 'wmf', 'xcf',
 400+ 'djvu', 'ogx', 'ogg', 'ogv', 'oga', 'spx',
 401+ 'mid', 'pdf', 'wmf', 'xcf', 'webm', 'mkv', 'mka',
401402
402403 // XML formats we sure hope we recognize reliably
403404 'svg',
@@ -468,6 +469,24 @@
469470 }
470471 }
471472
 473+ /* Look for WebM and Matroska files */
 474+ if( strncmp( $head, pack( "C4", 0x1a, 0x45, 0xdf, 0xa3 ), 4 ) == 0 ) {
 475+ $doctype = strpos( $head, "\x42\x82" );
 476+ if( $doctype ) {
 477+ // Next byte is datasize, then data (sizes larger than 1 byte are very stupid muxers)
 478+ $data = substr($head, $doctype+3, 8);
 479+ if( strncmp( $data, "matroska", 8 ) == 0 ) {
 480+ wfDebug( __METHOD__ . ": recognized file as video/x-matroska\n" );
 481+ return "video/x-matroska";
 482+ } else if ( strncmp( $data, "webm", 4 ) == 0 ) {
 483+ wfDebug( __METHOD__ . ": recognized file as video/webm\n" );
 484+ return "video/webm";
 485+ }
 486+ }
 487+ wfDebug( __METHOD__ . ": unknown EBML file\n" );
 488+ return "unknown/unknown";
 489+ }
 490+
472491 /*
473492 * Look for PHP. Check for this before HTML/XML... Warning: this is a
474493 * heuristic, and won't match a file with a lot of non-PHP before. It
Index: trunk/phase3/includes/mime.types
@@ -64,7 +64,9 @@
6565 audio/midi mid midi kar
6666 audio/mpeg mpga mp2 mp3
6767 audio/ogg oga ogg spx
 68+audio/webm webm
6869 audio/x-aiff aif aiff aifc
 70+audio/x-matroska mka mkv
6971 audio/x-mpegurl m3u
7072 audio/x-ogg oga ogg spx
7173 audio/x-pn-realaudio ram rm
@@ -114,7 +116,9 @@
115117 video/ogg ogv ogm ogg
116118 video/quicktime qt mov
117119 video/vnd.mpegurl mxu
 120+video/webm webm
118121 video/x-flv flv
 122+video/x-matroska mkv mka
119123 video/x-msvideo avi
120124 video/x-ogg ogv ogm ogg
121125 video/x-sgi-movie movie
Index: trunk/phase3/includes/mime.info
@@ -35,11 +35,15 @@
3636 audio/x-aiff [AUDIO]
3737 audio/x-pn-realaudio [AUDIO]
3838 audio/x-realaudio [AUDIO]
 39+audio/webm [AUDIO]
 40+audio/x-matroska [AUDIO]
3941
4042 video/mpeg application/mpeg [VIDEO]
4143 video/ogg [VIDEO]
4244 video/x-sgi-video [VIDEO]
4345 video/x-flv [VIDEO]
 46+video/webm [VIDEO]
 47+video/x-matroska [VIDEO]
4448
4549 application/ogg application/x-ogg audio/ogg audio/x-ogg video/ogg video/x-ogg [MULTIMEDIA]
4650

Comments

#Comment by Bryan (talk | contribs)   08:38, 20 December 2010

Why are you using pack( "C4", 0x1a, 0x45, 0xdf, 0xa3 ) instead of "\0x1a\0x45\0xdf\0xa3"?

I don't fully understand EBML, but shouldn't you use substr($head, $doctype+2,1) to determine the length of the doctype?

#Comment by Platonides (talk | contribs)   18:46, 20 December 2010

Looking at http://ebml.sourceforge.net/specs/ there could be more headers before the doctype one.

I think hartman is right and we should create an EBML parser. This has the advantage that we can make it so it doesn't try to load all the file/metadata in memory at once, as we suffered from other formats.

#Comment by Bryan (talk | contribs)   19:09, 20 December 2010

We're only loading 1K, not the entire file. It could of course be possible that the doctype is outside this first 1K, but that is not very likely.

#Comment by Platonides (talk | contribs)   20:17, 20 December 2010

Oh, I wasn't thinking on an exhaustion on this place, but we would want to reuse such parser for metadata extraction at another point.

#Comment by TheDJ (talk | contribs)   19:11, 21 December 2010

I tried using just strncmp, and it didn't work.

Regarding size: yes, you should, but then you need to decode the size as well.

See datasize: http://ebml.sourceforge.net/specs/

It can be implemented, but bitparsers in PHP are annoying to write in my experience. I'm not sure it's worth the effort.

#Comment by Bryan (talk | contribs)   21:39, 21 December 2010

utf like integer encoding... wtf?

Marking as ok.

Status & tagging log