r44248 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r44247‎ | r44248 | r44249 >
Date:05:35, 5 December 2008
Author:tstarling
Status:ok (Comments)
Tags:
Comment:
New file upload content type checks using a simulation of IE's content type detection algorithm.
Modified paths:
  • /trunk/phase3/RELEASE-NOTES (modified) (history)
  • /trunk/phase3/includes/DefaultSettings.php (modified) (history)
  • /trunk/phase3/includes/MimeMagic.php (modified) (history)
  • /trunk/phase3/includes/specials/SpecialUpload.php (modified) (history)
  • /trunk/phase3/languages/messages/MessagesEn.php (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/MimeMagic.php
@@ -808,4 +808,413 @@
809809
810810 return MEDIATYPE_UNKNOWN;
811811 }
 812+
 813+ /**
 814+ * Get the MIME type from ieMimeFromData(), but convert the result from IE's
 815+ * idiosyncratic private types into something MediaWiki will understand.
 816+ *
 817+ * @param string $fileName The file name (unused at present)
 818+ * @param string $chunk The first 256 bytes of the file
 819+ * @param string $proposed The MIME type proposed by the server
 820+ */
 821+ public function getIEMimeType( $fileName, $chunk, $proposed ) {
 822+ static $table = array(
 823+ 'image/pjpeg' => 'image/jpeg',
 824+ 'image/x-png' => 'image/png',
 825+ 'image/x-wmf' => 'application/x-msmetafile',
 826+ 'image/bmp' => 'image/x-bmp',
 827+ 'application/x-zip-compressed' => 'application/zip',
 828+ 'application/x-compressed' => 'application/x-compress',
 829+ 'application/x-gzip-compressed' => 'application/x-gzip',
 830+ 'audio/mid' => 'audio/midi',
 831+ );
 832+
 833+ $type = $this->ieMimeFromData( $fileName, $chunk, $proposed );
 834+ if ( isset( $table[$type] ) ) {
 835+ $type = $table[$type];
 836+ }
 837+ return $type;
 838+ }
 839+
 840+ /**
 841+ * Do a MIME type check similar to IE's FindMimeFromData(). This is used to
 842+ * ensure that a file won't be detected as a blacklisted type such as text/html,
 843+ * thus opening up an XSS vulnerability.
 844+ *
 845+ * Based on a disassembly of urlmon.dll from IE 7.
 846+ *
 847+ * @param string $fileName The file name (unused at present)
 848+ * @param string $chunk The first 256 bytes of the file
 849+ * @param string $proposed The MIME type proposed by the server
 850+ */
 851+ public function ieMimeFromData( $fileName, $chunk, $proposed ) {
 852+ // IE puts a null character at byte 255 (the 256th byte)
 853+ $chunk = substr( $chunk, 0, 255 );
 854+
 855+ $types = array(
 856+ 'ambiguous' /*1*/ => array(
 857+ 'text/plain',
 858+ 'application/octet-stream',
 859+ 'application/x-netcdf', // [sic]
 860+ 'unknown/unknown', // for MediaWiki
 861+ ),
 862+ 'text' /*3*/ => array(
 863+ 'text/richtext', 'image/x-bitmap', 'application/postscript', 'application/base64',
 864+ 'application/macbinhex40', 'application/x-cdf', 'text/scriptlet', 'text/xml',
 865+ 'application/xml',
 866+
 867+ ),
 868+ 'binary' /*4*/ => array(
 869+ 'application/pdf', 'audio/x-aiff', 'audio/basic', 'audio/wav', 'image/gif',
 870+ 'image/pjpeg', 'image/jpeg', 'image/tiff', 'image/x-png', 'image/png', 'image/bmp',
 871+ 'image/x-jg', 'image/x-art', 'image/x-emf', 'image/x-wmf', 'video/avi',
 872+ 'video/x-msvideo', 'video/mpeg', 'application/x-compressed',
 873+ 'application/x-zip-compressed', 'application/x-gzip-compressed', 'application/java',
 874+ 'application/x-msdownload'
 875+ ),
 876+ 'html' /*5*/ => array( 'text/html' ),
 877+ );
 878+
 879+ if ( in_array( $proposed, $types['text'] ) ) {
 880+ return $proposedType;
 881+ }
 882+ if ( !in_array( $proposed, $types['binary'] )
 883+ && !in_array( $proposed, $types['ambiguous'] )
 884+ && !in_array( $proposed, $types['html'] ) )
 885+ {
 886+ // Unknown
 887+ return $proposed;
 888+ }
 889+
 890+ // CContentAnalyzer::SampleData
 891+ $found = array();
 892+ $numControl = 0;
 893+ $numHigh = 0;
 894+ $numLow = 0;
 895+ $numLF = 0;
 896+ $numCR = 0;
 897+ $numFF = 0;
 898+ $htmlTags = array(
 899+ 'html',
 900+ 'head',
 901+ 'title',
 902+ 'body',
 903+ 'script',
 904+ 'a href',
 905+ 'pre',
 906+ 'img',
 907+ 'plaintext',
 908+ 'table'
 909+ );
 910+ $rdfUrl = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
 911+ $rdfPurl = 'http://purl.org/rss/1.0/';
 912+ $xbmMagic1 = '#define';
 913+ $xbmMagic2 = '_width';
 914+ $xbmMagic3 = '_bits';
 915+ $binhexMagic = 'converted with BinHex';
 916+
 917+ for ( $offset = 0; $offset < strlen( $chunk ); $offset++ ) {
 918+ $curChar = $chunk[$offset];
 919+ if ( $curChar == "\x0a" ) {
 920+ $numLF++;
 921+ continue;
 922+ } elseif ( $curChar == "\x0d" ) {
 923+ $numCR++;
 924+ continue;
 925+ } elseif ( $curChar == "\x0c" ) {
 926+ $numFF++;
 927+ continue;
 928+ } elseif ( $curChar == "\t" ) {
 929+ $numLow++;
 930+ continue;
 931+ } elseif ( ord( $curChar ) < 32 ) {
 932+ $numControl++;
 933+ continue;
 934+ } elseif ( ord( $curChar ) >= 128 ) {
 935+ $numHigh++;
 936+ continue;
 937+ }
 938+
 939+ $numLow++;
 940+ if ( $curChar == '<' ) {
 941+ // XML
 942+ $remainder = substr( $chunk, $offset + 1 );
 943+ if ( !strncasecmp( $remainder, '?XML', 4 ) ) {
 944+ $nextChar = substr( $chunk, $offset + 5, 1 );
 945+ if ( $nextChar == ':' || $nextChar == ' ' || $nextChar == "\t" ) {
 946+ $found['xml'] = true;
 947+ }
 948+ }
 949+ // Scriptlet (JSP)
 950+ if ( !strncasecmp( $remainder, 'SCRIPTLET', 9 ) ) {
 951+ $found['scriptlet'] = true;
 952+ break;
 953+ }
 954+ // HTML
 955+ foreach ( $htmlTags as $tag ) {
 956+ if ( !strncasecmp( $remainder, $tag, strlen( $tag ) ) ) {
 957+ $found['html'] = true;
 958+ }
 959+ }
 960+ // Skip broken check for additional tags (HR etc.)
 961+
 962+ // RSS
 963+ if ( !strncasecmp( $remainder, 'RSS', 3 ) ) {
 964+ $found['rss'] = true;
 965+ break; // return from SampleData
 966+ }
 967+ if ( !strncasecmp( $remainder, 'rdf:RDF', 7 ) ) {
 968+ $found['rdf-tag'] = true;
 969+ // no break
 970+ }
 971+ if ( !strncasecmp( $remainder, 'FEED', 4 ) ) {
 972+ $found['feed'] = true;
 973+ break;
 974+ }
 975+ continue;
 976+ }
 977+ // Skip broken check for -->
 978+
 979+ // RSS URL checks
 980+ // For some reason both URLs must appear before a break is triggered
 981+ $remainder = substr( $chunk, $offset );
 982+ if ( !strncasecmp( $remainder, $rdfUrl, strlen( $rdfUrl ) ) ) {
 983+ $found['rdf-url'] = true;
 984+ if ( isset( $found['rdf-tag'] )
 985+ && isset( $found['rdf-purl'] ) ) // [sic]
 986+ {
 987+ break;
 988+ }
 989+ continue;
 990+ }
 991+
 992+ if ( !strncasecmp( $remainder, $rdfPurl, strlen( $rdfPurl ) ) ) {
 993+ if ( isset( $found['rdf-tag'] )
 994+ && isset( $found['rdf-url'] ) ) // [sic]
 995+ {
 996+ break;
 997+ }
 998+ continue;
 999+ }
 1000+
 1001+ // XBM checks
 1002+ if ( !strncasecmp( $remainder, $xbmMagic1, strlen( $xbmMagic1 ) ) ) {
 1003+ $found['xbm1'] = true;
 1004+ continue;
 1005+ }
 1006+ if ( $curChar == '_' ) {
 1007+ if ( isset( $found['xbm2'] ) ) {
 1008+ if ( !strncasecmp( $remainder, $xbmMagic3, strlen( $xbmMagic3 ) ) ) {
 1009+ $found['xbm'] = true;
 1010+ break;
 1011+ }
 1012+ } elseif ( isset( $found['xbm1'] ) ) {
 1013+ if ( !strncasecmp( $remainder, $xbmMagic2, strlen( $xbmMagic2 ) ) ) {
 1014+ $found['xbm2'] = true;
 1015+ }
 1016+ }
 1017+ }
 1018+
 1019+ // BinHex
 1020+ if ( !strncasecmp( $remainder, $binhexMagic, strlen( $binhexMagic ) ) ) {
 1021+ $found['binhex'] = true;
 1022+ }
 1023+ } // end main loop
 1024+
 1025+ // CContentAnalyzer::FindMimeFromData
 1026+
 1027+ // IE does the Check*Headers() calls last, and instead does the following image
 1028+ // type checksby directly looking for the magic numbers. What I do here should
 1029+ // have the same effect since the magic number checks are identical in both cases.
 1030+ $binaryType = $this->ieCheckBinaryHeaders( $chunk );
 1031+ $textType = $this->ieCheckTextHeaders( $chunk );
 1032+
 1033+ if ( $proposed == 'text/html' && isset( $found['html'] ) ) {
 1034+ return 'text/html';
 1035+ }
 1036+ if ( $proposed == 'image/gif' && $binaryType == 'image/gif' ) {
 1037+ return 'image/gif';
 1038+ }
 1039+ if ( ( $proposed == 'image/pjpeg' || $proposed == 'image/jpeg' )
 1040+ && $binaryType == 'image/pjpeg' )
 1041+ {
 1042+ return $proposed;
 1043+ }
 1044+ if ( ( $proposed == 'image/x-png' || $proposed == 'image/png' )
 1045+ && $binaryType == 'image/x-png' )
 1046+ {
 1047+ return $proposed;
 1048+ }
 1049+
 1050+ if ( isset( $found['rss'] ) ) {
 1051+ return 'application/rss+xml';
 1052+ }
 1053+ if ( isset( $found['rdf-tag'] )
 1054+ && isset( $found['rdf-url'] )
 1055+ && isset( $found['rdf-purl'] ) )
 1056+ {
 1057+ return 'application/rss+xml';
 1058+ }
 1059+ // Skip apparently unimplemented atom flag
 1060+ if ( isset( $found['xml'] ) ) {
 1061+ // Skip check for proposed MIME type
 1062+ // IE may continue on here to further checks if a feature is enabled
 1063+ // and if the server proposes something other than text/html or text/xml
 1064+ return 'text/xml';
 1065+ }
 1066+ if ( isset( $found['html'] ) ) {
 1067+ // Skip complex XML-related check
 1068+ return 'text/html';
 1069+ }
 1070+ if ( isset( $found['xbm'] ) ) {
 1071+ return 'image/x-bitmap';
 1072+ }
 1073+ if ( isset( $found['binhex'] ) ) {
 1074+ return 'application/macbinhex40';
 1075+ }
 1076+ if ( isset( $found['scriptlet'] ) ) {
 1077+ // Skip complex HTML-related check
 1078+ return 'text/scriptlet';
 1079+ }
 1080+
 1081+ // Freaky heuristics to determine check order
 1082+ // Probably doesn't matter since the checks appear to be orthogonal
 1083+ // The heuristic is of course broken for non-ASCII text
 1084+ if ( $numControl != 0 && ( $numFF + $numLow ) < ( $numControl + $numHigh ) * 16 ) {
 1085+ $kindOfBinary = true;
 1086+ $type = $binaryType ? $binaryType : $textType;
 1087+ } else {
 1088+ $kindOfBinary = false;
 1089+ $type = $textType ? $textType : $binaryType;
 1090+ }
 1091+ if ( $type !== false ) {
 1092+ return $type;
 1093+ }
 1094+
 1095+ // FormatAgreesWithData()
 1096+ if ( in_array( $proposed, $types['text'] ) && !$kindOfBinary ) {
 1097+ return $proposed;
 1098+ }
 1099+ if ( in_array( $proposed, $types['binary'] ) && $kindOfBinary ) {
 1100+ return $proposed;
 1101+ }
 1102+ if ( in_array( $proposed, $types['html'] ) ) {
 1103+ return $proposed;
 1104+ }
 1105+
 1106+ // TODO: extension checks
 1107+ return $proposed;
 1108+ }
 1109+
 1110+ private function ieCheckTextHeaders( $chunk ) {
 1111+ $chunk2 = substr( $chunk, 0, 2 );
 1112+ $chunk4 = substr( $chunk, 0, 4 );
 1113+ $chunk5 = substr( $chunk, 0, 5 );
 1114+ if ( $chunk4 == '%PDF' ) {
 1115+ return 'application/pdf';
 1116+ }
 1117+ if ( $chunk2 == '%!' ) {
 1118+ return 'application/postscript';
 1119+ }
 1120+ if ( $chunk5 == '{\\rtf' ) {
 1121+ return 'text/richtext';
 1122+ }
 1123+ if ( $chunk5 == 'begin' ) {
 1124+ return 'application/base64';
 1125+ }
 1126+ return false;
 1127+ }
 1128+
 1129+ private function ieCheckBinaryHeaders( $chunk ) {
 1130+ $chunk2 = substr( $chunk, 0, 2 );
 1131+ $chunk3 = substr( $chunk, 0, 3 );
 1132+ $chunk4 = substr( $chunk, 0, 4 );
 1133+ $chunk5 = substr( $chunk, 0, 5 );
 1134+ $chunk8 = substr( $chunk, 0, 8 );
 1135+ if ( $chunk5 == 'GIF87' || $chunk5 == 'GIF89' ) {
 1136+ return 'image/gif';
 1137+ }
 1138+ if ( $chunk2 == "\xff\xd8" ) {
 1139+ return 'image/pjpeg'; // actually plain JPEG but this is what IE returns
 1140+ }
 1141+ if ( $chunk2 == 'BM'
 1142+ && substr( $chunk, 6, 2 ) == "\000\000"
 1143+ && substr( $chunk, 8, 2 ) != "\000\000" )
 1144+ {
 1145+ return 'image/bmp'; // another non-standard MIME
 1146+ }
 1147+ if ( $chunk4 == 'RIFF'
 1148+ && substr( $chunk, 8, 4 ) == 'WAVE' )
 1149+ {
 1150+ return 'audio/wav';
 1151+ }
 1152+ // These were integer literals in IE
 1153+ // Perhaps the author was not sure what the target endianness was
 1154+ if ( $chunk4 == ".sd\000"
 1155+ || $chunk4 == ".snd"
 1156+ || $chunk4 == "\000ds."
 1157+ || $chunk4 == "dns." )
 1158+ {
 1159+ return 'audio/basic';
 1160+ }
 1161+ if ( $chunk3 == "MM\000" ) {
 1162+ return 'image/tiff';
 1163+ }
 1164+ if ( $chunk2 == 'MZ' ) {
 1165+ return 'application/x-msdownload';
 1166+ }
 1167+ if ( $chunk8 == "\x89PNG\x0d\x0a\x1a\x0a" ) {
 1168+ return 'image/x-png'; // [sic]
 1169+ }
 1170+ if ( strlen( $chunk ) >= 5 ) {
 1171+ $byte2 = ord( $chunk[2] );
 1172+ $byte4 = ord( $chunk[4] );
 1173+ if ( $byte2 >= 3 && $byte2 <= 31 && $byte4 == 0 && $chunk2 == 'JG' ) {
 1174+ return 'image/x-jg';
 1175+ }
 1176+ }
 1177+ // More endian confusion
 1178+ if ( $chunk4 == 'MROF' ) {
 1179+ return 'audio/x-aiff';
 1180+ }
 1181+ $chunk4_8 = substr( $chunk, 8, 4 );
 1182+ if ( $chunk4 == 'FORM' && ( $chunk4_8 == 'AIFF' || $chunk4_8 == 'AIFC' ) ) {
 1183+ return 'audio/x-aiff';
 1184+ }
 1185+ if ( $chunk4 == 'RIFF' && $chunk4_8 == 'AVI ' ) {
 1186+ return 'video/avi';
 1187+ }
 1188+ if ( $chunk4 == "\x00\x00\x01\xb3" || $chunk4 == "\x00\x00\x01\xba" ) {
 1189+ return 'video/mpeg';
 1190+ }
 1191+ if ( $chunk4 == "\001\000\000\000"
 1192+ && substr( $chunk, 40, 4 ) == ' EMF' )
 1193+ {
 1194+ return 'image/x-emf';
 1195+ }
 1196+ if ( $chunk4 == "\xd7\xcd\xc6\x9a" ) {
 1197+ return 'image/x-wmf';
 1198+ }
 1199+ if ( $chunk4 == "\xca\xfe\xba\xbe" ) {
 1200+ return 'application/java';
 1201+ }
 1202+ if ( $chunk2 == 'PK' ) {
 1203+ return 'application/x-zip-compressed';
 1204+ }
 1205+ if ( $chunk2 == "\x1f\x9d" ) {
 1206+ return 'application/x-compressed';
 1207+ }
 1208+ if ( $chunk2 == "\x1f\x8b" ) {
 1209+ return 'application/x-gzip-compressed';
 1210+ }
 1211+ // Skip redundant check for ZIP
 1212+ if ( $chunk5 == "MThd\000" ) {
 1213+ return 'audio/mid';
 1214+ }
 1215+ if ( $chunk4 == '%PDF' ) {
 1216+ return 'application/pdf';
 1217+ }
 1218+ return false;
 1219+ }
 1220+
8121221 }
Index: trunk/phase3/includes/DefaultSettings.php
@@ -1885,6 +1885,8 @@
18861886 'application/x-php', 'text/x-php',
18871887 # Other types that may be interpreted by some servers
18881888 'text/x-python', 'text/x-perl', 'text/x-bash', 'text/x-sh', 'text/x-csh',
 1889+ # Client-side script on Internet Explorer
 1890+ 'text/scriptlet',
18891891 # Windows metafile, client-side vulnerability on some systems
18901892 'application/x-msmetafile',
18911893 # A ZIP file may be a valid Java archive containing an applet which exploits the
Index: trunk/phase3/includes/specials/SpecialUpload.php
@@ -1324,11 +1324,11 @@
13251325 $magic = MimeMagic::singleton();
13261326 $mime = $magic->guessMimeType($tmpfile,false);
13271327
 1328+
13281329 #check mime type, if desired
13291330 global $wgVerifyMimeType;
13301331 if ($wgVerifyMimeType) {
1331 -
1332 - wfDebug ( "\n\nmime: <$mime> extension: <$extension>\n\n");
 1332+ wfDebug ( "\n\nmime: <$mime> extension: <$extension>\n\n");
13331333 #check mime type against file extension
13341334 if( !self::verifyExtension( $mime, $extension ) ) {
13351335 return new WikiErrorMsg( 'uploadcorrupt' );
@@ -1336,9 +1336,20 @@
13371337
13381338 #check mime type blacklist
13391339 global $wgMimeTypeBlacklist;
1340 - if( isset($wgMimeTypeBlacklist) && !is_null($wgMimeTypeBlacklist)
1341 - && $this->checkFileExtension( $mime, $wgMimeTypeBlacklist ) ) {
1342 - return new WikiErrorMsg( 'filetype-badmime', htmlspecialchars( $mime ) );
 1340+ if( isset($wgMimeTypeBlacklist) && !is_null($wgMimeTypeBlacklist) ) {
 1341+ if ( $this->checkFileExtension( $mime, $wgMimeTypeBlacklist ) ) {
 1342+ return new WikiErrorMsg( 'filetype-badmime', htmlspecialchars( $mime ) );
 1343+ }
 1344+
 1345+ # Check IE type
 1346+ $fp = fopen( $tmpfile, 'rb' );
 1347+ $chunk = fread( $fp, 256 );
 1348+ fclose( $fp );
 1349+ $extMime = $magic->guessTypesForExtension( $extension );
 1350+ $ieType = $magic->getIEMimeType( $tmpfile, $chunk, $extMime );
 1351+ if ( $this->checkFileExtension( $ieType, $wgMimeTypeBlacklist ) ) {
 1352+ return new WikiErrorMsg( 'filetype-bad-ie-mime', $ieType );
 1353+ }
13431354 }
13441355 }
13451356
@@ -1402,6 +1413,7 @@
14031414 }
14041415 }
14051416
 1417+
14061418 /**
14071419 * Heuristic for detecting files that *could* contain JavaScript instructions or
14081420 * things that may look like HTML to a browser and are thus
Index: trunk/phase3/languages/messages/MessagesEn.php
@@ -1801,6 +1801,7 @@
18021802 Please rename the file and try uploading it again.',
18031803 'badfilename' => 'File name has been changed to "$1".',
18041804 'filetype-badmime' => 'Files of the MIME type "$1" are not allowed to be uploaded.',
 1805+'filetype-bad-ie-mime' => 'Can\'t upload this file because Internet Explorer would detect it as "$1", which is a disallowed and potentially dangerous file type.',
18051806 'filetype-unwanted-type' => "'''\".\$1\"''' is an unwanted file type.
18061807 Preferred {{PLURAL:\$3|file type is|file types are}} \$2.",
18071808 'filetype-banned-type' => "'''\".\$1\"''' is not a permitted file type.
Index: trunk/phase3/RELEASE-NOTES
@@ -216,6 +216,8 @@
217217 * Moved password reset form from Special:Preferences to Special:ResetPass
218218 * Added Special:ChangePassword as a special page alias for Special:ResetPass
219219 * Added complimentary function for addHandler() called removeHandler() for removing events
 220+* Improved security of file uploads for IE clients, using a reverse-engineered
 221+ algorithm very similar to IE's content detection algorithm.
220222
221223 === Bug fixes in 1.14 ===
222224

Follow-up revisions

RevisionCommit summaryAuthorDate
r44249Follow up on r44248:...siebrand08:10, 5 December 2008

Comments

#Comment by Brion VIBBER (talk | contribs)   22:19, 10 December 2008

I would recommend breaking this out to another class to make it more easily redistributable/reusable by other apps...

Note this version is based on IE7's type detection, which is different in some details from older IE versions. (In particular, in IE 5 and 6 HTML detection overrides PNG, whereas in 7 this was finally reversed.) As long as our other checks are already dealing with the PNG thing that's ok, but we may need to adjust it if we replace the older check code.

Status & tagging log