r58717 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r58716‎ | r58717 | r58718 >
Date:16:46, 7 November 2009
Author:daniel
Status:resolved (Comments)
Tags:
Comment:
adding support for <a> tags as a parser tag hook, in order to support rdfa output
Modified paths:
  • /trunk/phase3/includes/Linker.php (modified) (history)
  • /trunk/phase3/includes/Sanitizer.php (modified) (history)
  • /trunk/phase3/includes/parser/Parser.php (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/parser/Parser.php
@@ -129,7 +129,7 @@
130130 $this->mFunctionHooks = array();
131131 $this->mFunctionTagHooks = array();
132132 $this->mFunctionSynonyms = array( 0 => array(), 1 => array() );
133 - $this->mDefaultStripList = $this->mStripList = array( 'nowiki', 'gallery' );
 133+ $this->mDefaultStripList = $this->mStripList = array( 'nowiki', 'gallery', 'a' );
134134 $this->mUrlProtocols = wfUrlProtocols();
135135 $this->mExtLinkBracketedRegex = '/\[(\b(' . wfUrlProtocols() . ')'.
136136 '[^][<>"\\x00-\\x20\\x7F]+) *([^\]\\x0a\\x0d]*?)\]/S';
@@ -3284,6 +3284,9 @@
32853285 case 'gallery':
32863286 $output = $this->renderImageGallery( $content, $attributes );
32873287 break;
 3288+ case 'a':
 3289+ $output = $this->renderHyperlink( $content, $attributes, $frame );
 3290+ break;
32883291 case 'math':
32893292 if ( $this->mOptions->getUseTeX() ) {
32903293 $output = $wgContLang->armourMath(
@@ -4333,6 +4336,35 @@
43344337 }
43354338
43364339 /**
 4340+ * Tag hook handler for 'a'. Renders a HTML &lt;a&gt; tag, allowing most attributes, filtering href against
 4341+ * allowed protocols and spam blacklist.
 4342+ **/
 4343+ function renderHyperlink( $text, $params, $frame = false ) {
 4344+ foreach ( $params as $name => $value ) {
 4345+ $params[ $name ] = $this->replaceVariables( $value, $frame );
 4346+ }
 4347+
 4348+ $whitelist = Sanitizer::attributeWhitelist( 'a' );
 4349+ $params = Sanitizer::validateAttributes( $params, $whitelist );
 4350+
 4351+ $content = $this->recursiveTagParse( trim( $text ), $frame );
 4352+
 4353+ if ( isset( $params[ 'href' ] ) ) {
 4354+ $href = $params[ 'href' ];
 4355+ $this->mOutput->addExternalLink( $href );
 4356+ unset( $params[ 'href' ] );
 4357+ } else {
 4358+ # Non-link <a> tag
 4359+ return Xml::openElement( 'a', $params ) . $content . Xml::closeElement( 'a' );
 4360+ }
 4361+
 4362+ $sk = $this->mOptions->getSkin();
 4363+ $html = $sk->makeExternalLink( $href, $content, false, '', $params );
 4364+
 4365+ return $html;
 4366+ }
 4367+
 4368+ /**
43374369 * Renders an image gallery from a text with one line per image.
43384370 * text labels may be given by using |-style alternative text. E.g.
43394371 * Image:one.jpg|The number "1"
Index: trunk/phase3/includes/Linker.php
@@ -760,7 +760,10 @@
761761 * hook play with them, *then* expand it all at once.
762762 */
763763 function makeExternalLink( $url, $text, $escape = true, $linktype = '', $attribs = array() ) {
764 - $attribsText = $this->getExternalLinkAttributes( 'external ' . $linktype );
 764+ if ( isset( $attribs[ 'class' ] ) ) $class = $attribs[ 'class' ]; # yet another hack :(
 765+ else $class = 'external ' . $linktype;
 766+
 767+ $attribsText = $this->getExternalLinkAttributes( $class );
765768 $url = htmlspecialchars( $url );
766769 if( $escape ) {
767770 $text = htmlspecialchars( $text );
Index: trunk/phase3/includes/Sanitizer.php
@@ -610,6 +610,8 @@
611611 */
612612 static function validateAttributes( $attribs, $whitelist ) {
613613 $whitelist = array_flip( $whitelist );
 614+ $hrefExp = '/^(' . wfUrlProtocols() . ')[^\s]+$/';
 615+
614616 $out = array();
615617 foreach( $attribs as $attribute => $value ) {
616618 if( !isset( $whitelist[$attribute] ) ) {
@@ -641,6 +643,15 @@
642644 }
643645 }
644646
 647+ # NOTE: even though elements using href/src are not allowed directly, supply
 648+ # validation code that can be used by tag hook handlers, etc
 649+ if ( $attribute === 'href' || $attribute === 'src' ) {
 650+ if ( !preg_match( $hrefExp, $value ) ) {
 651+ continue; //drop any href or src attributes not using an allowed protocol.
 652+ //NOTE: this also drops all relative URLs
 653+ }
 654+ }
 655+
645656 // If this attribute was previously set, override it.
646657 // Output should only have one attribute of each name.
647658 $out[$attribute] = $value;
@@ -1279,6 +1290,9 @@
12801291 'td' => array_merge( $common, $tablecell, $tablealign ),
12811292 'th' => array_merge( $common, $tablecell, $tablealign ),
12821293
 1294+ # 12.2 # NOTE: <a> is not allowed directly, but the attrib whitelist is used from the Parser object
 1295+ 'a' => array_merge( $common, array( 'href', 'rel', 'rev' ) ), # rel/rev esp. for RDFa
 1296+
12831297 # 13.2
12841298 # Not usually allowed, but may be used for extension-style hooks
12851299 # such as <math> when it is rasterized

Follow-up revisions

RevisionCommit summaryAuthorDate
r61905Remove <a> tag hook for now, pending resolution of implementation issues as d...tstarling05:07, 3 February 2010
r109723Introducing optional support for <a> tags, to be used with microdata resp. RD...daniel05:58, 22 January 2012

Comments

#Comment by Tim Starling (talk | contribs)   02:34, 17 December 2009

Why is it a tag extension, calling the inefficient recursiveTagParse()? It could just be passed through Sanitizer like any other HTML tag.

Note that you omit registering the link in $parser->mOutput, providing a vulnerability which allows spammers to evade any external link controls like the captcha trigger. In my scheme, that task, and the modification of the class attribute, would need to be done in a separate pass in internalParse().

There's also the problem of invalid HTML generation, caused by nested anchor tags. In my scheme, if the new pass were done before replaceInternalLinks(), then user-specified anchor tags could be replaced by LinkHolderArray placeholders, allowing nested links to be suppressed.

An extra pass should be quite cheap in the worst case, because most of the scanning can be done in PCRE or strcspn(). By re-entering the preprocessor with text that has already been preprocessed once, you get O(N^2) worst-case performance, and costs due to tight PHP loops and high setup time.

#Comment by Simetrical (talk | contribs)   02:26, 11 January 2010

(Don't see a subscribe link, so replying to get follow-ups.)

#Comment by Happy-melon (talk | contribs)   23:26, 15 January 2010

<slightly tangential question to subscribe to thread>What's 'your scheme', Tim?</stqtstt>

#Comment by Tim Starling (talk | contribs)   00:55, 18 January 2010

Sanitizer would pass anchor tags through to a separate pass executed before replaceInternalLinks, which would add link placeholders for the anchor tags, register them in $parser->mOutput, and modify their attributes appropriately.

#Comment by Duesentrieb (talk | contribs)   16:47, 18 January 2010

When I first wrote this, I did pass <a> just through the sanatizer. Then, while testing it, *some* really good reason come up to handle it as a magic tag. But as these things go, I can't remember what it was :)

#Comment by Tim Starling (talk | contribs)   00:07, 19 January 2010

Was it because (as you said on r58711) it didn't register an external link?

#Comment by Duesentrieb (talk | contribs)   08:54, 19 January 2010

yes, right, I think that was indeed the main problem. Thanks for finding it :) I suppose to overcome this, the Sanitizer would have to have access to the ParserOutput...

#Comment by Tim Starling (talk | contribs)   00:57, 20 January 2010

Or, as I said above, you could let the Sanitizer do what it normally does, and add a separate pass for link registration to be executed before replaceInternalLinks.

#Comment by Platonides (talk | contribs)   23:43, 15 January 2010

Interesting topic, indeed. Anyone remembers the name of the thread at wikitech-l? I don't think it was conclusive.

#Comment by Bryan (talk | contribs)   22:24, 17 January 2010

(We definitely need a subscribe link)

With the debate still going on on wikitech-l, we might want to hold back this feature until we've actually decided what and how we're going to implement it.

#Comment by Tim Starling (talk | contribs)   00:58, 18 January 2010

All I see on wikitech-l is discussion about whether RDFa is suitable for Wikipedia. This feature can exist independently of the answer to that question. We can switch it off with a configuration variable if need be.

#Comment by Simetrical (talk | contribs)   01:09, 18 January 2010

FWIW, this would be useful to microdata too, not just RDFa. Also would be nice generally, to avoid hacks like class="plainlinks" on a wrapper span.

#Comment by Happy-melon (talk | contribs)   01:22, 18 January 2010

Any move towards regularly adding raw <a> tags in wikitext is Bad. I don't see that as a hack, particularly; it's a perfectly sane and sensible use of HTML and CSS. What would be hackish is the rearguard actions we'll be fighting for months to patch up the security vulnerabilities in enabling raw <a> tags (at least with this implementation), not to mention cleaning up the blood from all the Usability people who commit suicide when they learn about the latest twist in the wikitext Gordian knot.

Oh and don't forget, what's going to happen to this post if/when raw <a> tags become whitelisted? I haven't nowiki'd them, because there's no need to; I know they're escaped by Sanitizer. How many wiki pages have made the same assumption??

#Comment by Simetrical (talk | contribs)   02:02, 18 January 2010
  • Gratuitous wrapper spans are a hack. They produce completely unnecessarily HTML.
  • There should be no security vulnerabilities in adding <a>, certainly no more than style="" or whatnot. The allowed output URLs are exactly the same as for ordinary external links.
  • There are no grounds for thinking usability would be hurt – these wouldn't be expected to appear in ordinary wikitext any more than, say, style="", or complex template markup. If there are common uses, we can reliably expect big wikis to wrap them in templates, like they do for everything else in existence (and adding a few more templates can't make things much worse than they are). Wikitext is so massively complicated that it's sort of silly to argue at this point that we shouldn't make it slightly less usable.
  • Yes, pages that use <a> literally will break. Probably not your post, since you have no closing tag/href/etc. so it will likely be escaped anyway. This is a one-time cost that we'll see whenever we add any new tag, and is perfectly tolerable.
#Comment by Tim Starling (talk | contribs)   02:09, 18 January 2010

We enabled <span> without mass-suicide (see r6796), we should be able to do it with <a> as well.

#Comment by Platonides (talk | contribs)   16:31, 18 January 2010

Currently any post of <a> tag is a spambot or a blind paste. This change will make it more difficult to discern if it really was a person trying to add a link with a good idea using the "complex" syntax.

And why does external links need rel/rev? I'd expect them to be primarily used on [[internal links]], not external.

#Comment by Duesentrieb (talk | contribs)   16:41, 18 January 2010

At least in some notable cases, such links would point to the canonical URL of some standardized entity, such as a creative commons license.

#Comment by Simetrical (talk | contribs)   17:02, 18 January 2010

Yeah, there will be a bit of extra spam problems, but not enough to outweigh even a modest benefit. It's pretty easily to tell spammers anyway.

Some of these links might actually want to be internal . . . maybe this would be a good time to hack in some logic removing class="external" from external links that point to an internal domain?

#Comment by Tim Starling (talk | contribs)   05:08, 3 February 2010

Removed the tag hook for now.

Status & tagging log