r59024 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r59023‎ | r59024 | r59025 >
Date:20:45, 13 November 2009
Author:daniel
Status:deferred (Comments)
Tags:
Comment:
RDFa/microdata follow-up
Modified paths:
  • /trunk/phase3/includes/DefaultSettings.php (modified) (history)
  • /trunk/phase3/includes/Sanitizer.php (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/Sanitizer.php
@@ -43,7 +43,7 @@
4444 $attrib = '[A-Za-z0-9]';
4545 $space = '[\x09\x0a\x0d\x20]';
4646 define( 'MW_ATTRIBS_REGEX',
47 - "/(?:^|$space)($attrib+)
 47+ "/(?:^|$space)((?:xml:|xmlns:)?$attrib+)
4848 ($space*=$space*
4949 (?:
5050 # The attribute value: quoted or alone
@@ -59,9 +59,14 @@
6060 /**
6161 * Regular expression to match URIs that could trigger script execution
6262 */
63 -define( 'MW_SCRIPT_URL_PATTERN', '/(^|\s)(javascript|vbscript)[^\w]/i' );
 63+define( 'MW_EVIL_URI_PATTERN', '!(^|\s|\*/\s*)(javascript|vbscript)([^\w]|$)!i' );
6464
6565 /**
 66+ * Regular expression to match namespace attributes
 67+ */
 68+define( 'MW_XMLNS_ATTRIBUTE_PATTRN', "/^xmlns:$attrib+$/" );
 69+
 70+/**
6671 * List of all named character entities defined in HTML 4.01
6772 * http://www.w3.org/TR/html4/sgml/entities.html
6873 * @private
@@ -614,9 +619,21 @@
615620
616621 $out = array();
617622 foreach( $attribs as $attribute => $value ) {
 623+ #allow XML namespace declaration. Useful especially with RDFa
 624+ print "($attribute=$value)";
 625+
 626+ if ( preg_match( MW_XMLNS_ATTRIBUTE_PATTRN, $attribute ) ) {
 627+ if ( !preg_match( MW_EVIL_URI_PATTERN, $value ) ) {
 628+ $out[$attribute] = $value;
 629+ }
 630+
 631+ continue;
 632+ }
 633+
618634 if( !isset( $whitelist[$attribute] ) ) {
619635 continue;
620636 }
 637+
621638 # Strip javascript "expression" from stylesheets.
622639 # http://msdn.microsoft.com/workshop/author/dhtml/overview/recalc.asp
623640 if( $attribute == 'style' ) {
@@ -633,12 +650,14 @@
634651 $wgEnforceHtmlIds ? 'noninitial' : 'xml' );
635652 }
636653
637 - //RDFa properties allow URIs. check them
 654+ //RDFa and microdata properties allow URIs. check them
638655 if ( $attribute === 'rel' || $attribute === 'rev' ||
639656 $attribute === 'about' || $attribute === 'property' || $attribute === 'resource' ||
640 - $attribute === 'datatype' || $attribute === 'typeof' ) {
 657+ $attribute === 'datatype' || $attribute === 'typeof' ||
 658+ $attribute === 'item' || $attribute === 'itemprop' || $attribute === 'subject' ) {
 659+
641660 //Paranoia. Allow "simple" values but suppress javascript
642 - if ( preg_match( MW_SCRIPT_URL_PATTERN, $value ) ) {
 661+ if ( preg_match( MW_EVIL_URI_PATTERN, $value ) ) {
643662 continue;
644663 }
645664 }
@@ -1180,11 +1199,24 @@
11811200 * @return Array
11821201 */
11831202 static function setupAttributeWhitelist() {
1184 - $common = array( 'id', 'class', 'lang', 'dir', 'title', 'style',
1185 - #RDFa attributes as specified in section 9 of http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014
1186 - 'about', 'property', 'resource', 'datatype', 'typeof',
1187 - );
 1203+ global $wgAllowRdfaAttributes, $wgHtml5, $wgAllowItemAttributes;
11881204
 1205+ $common = array( 'id', 'class', 'lang', 'dir', 'title', 'style', 'xml:lang' );
 1206+
 1207+ if ( $wgAllowRdfaAttributes ) {
 1208+ #RDFa attributes as specified in section 9 of http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014
 1209+ $common = array_merge( $common, array(
 1210+ 'about', 'property', 'resource', 'datatype', 'typeof',
 1211+ ) );
 1212+ }
 1213+
 1214+ if ( $wgHtml5 && $wgAllowItemAttributes ) {
 1215+ # add HTML5 microdata tages as pecified by http://www.w3.org/TR/html5/microdata.html
 1216+ $common = array_merge( $common, array(
 1217+ 'item', 'itemprop', 'subject'
 1218+ ) );
 1219+ }
 1220+
11891221 $block = array_merge( $common, array( 'align' ) );
11901222 $tablealign = array( 'align', 'char', 'charoff', 'valign' );
11911223 $tablecell = array( 'abbr',
Index: trunk/phase3/includes/DefaultSettings.php
@@ -915,6 +915,16 @@
916916 $wgHtml5 = true;
917917
918918 /**
 919+ * Enabled RDFa attributes for use in wikitext.
 920+ */
 921+$wgAllowRdfaAttributes = true;
 922+
 923+/**
 924+ * Enabled HTML 5 data attributes for use in wikitext, if $wgHtml5 is also true.
 925+ */
 926+$wgAllowItemAttributes = true;
 927+
 928+/**
919929 * Should we try to make our HTML output well-formed XML? If set to false,
920930 * output will be a few bytes shorter, and the HTML will arguably be more
921931 * readable. If set to true, life will be much easier for the authors of

Comments

#Comment by Duesentrieb (talk | contribs)   20:59, 13 November 2009

this change adresses several points raised for r58712:

  • allow xmlns:* (and xml:lang)
  • make RDFa support optional (on per default)
  • add support for HTML 5 microdata (optional, on per default)

namespace definitions as well as all attribute values taht can be URIs or CURIs are run still filtered for javascript/vbscript using MW_EVIL_URI_PATTERN. Could be removed, but it does make me feel better.

#Comment by Simetrical (talk | contribs)   21:12, 13 November 2009

The needed attributes for microdata are: itemid, itemprop, itemref, itemscope, itemtype. subject and item are not valid attributes. This is based off the latest editor's draft at whatwg.org; the W3C Working Draft is always months outdated, and should not be used.

A better name for $wgAllowItemAttributes would be $wgAllowMicrodataAttributes.

You shouldn't allow xmlns:* unless RDFa is enabled. It's pointless, and in HTML5 also illegal.

I don't see the reason to allow xml:lang. Again, especially in HTML5, this is pointless and often illegal.

Have to run now, more review later.

#Comment by Duesentrieb (talk | contribs)   22:14, 13 November 2009

new spec for microdata is addressed in r59035. also renames $wgAllowItemAttributes to $wgAllowMicrodataAttributes

allowing xmlns:* ony if RDFa is enabled is adressed in r59032

xml:lang is allowed because some specs say to use both @lang and @xml:lang at the same time. it's also already used on-wiki with the {{{2}}} template (and shows in the resulting html, though i don't understand why).


#Comment by Duesentrieb (talk | contribs)   22:19, 13 November 2009

addendum: that would be the {{lang}} template on commons.

#Comment by Simetrical (talk | contribs)   00:33, 16 November 2009

Some more thoughts:

  • I still don't like the script filtering thing. It gives a false sense of security at best, and is WTFish at worst.
  • If xml:lang="" is used, it's certainly only valid if it matches lang="" on the same attribute. So why not only allow lang="", and just spit out a matching xml:lang="" every time if we want that? Frankly I think it's pointless; it clutters the output for no reason. Anything that's processing the document and cares about the content language will be an HTML5 processor and will recognize lang="" fine.
  • You may as well allow $wgAllowMicrodataAttributes to be true even if $wgHtml5 is false. It simplifies the code to only check one variable (cf. $wgWellFormedXml = false being allowed even if $wgHtml5 == false). If some crazy sysadmin wants to output microdata with an XHTML1 doctype, that's their business.
  • For XHTML1, we should surely be outputting a different doctype or extra namespace declarations or whatever magic is needed for RDFa to be valid, right?
  • "print "($attribute=$value)";" looks like debug code, but you removed it in r59027. It would be easier if you mentioned this commit's revision number in your follow-up commit messages so that Code Review picked them up automagically.

No other comments for now.

#Comment by Simetrical (talk | contribs)   00:34, 16 November 2009

I meant: ". . . will be an HTML processor . . ."

#Comment by Simetrical (talk | contribs)   11:25, 9 February 2010

Disabled by default in r61985, so marking deferred.

Status & tagging log