r102587 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r102586‎ | r102587 | r102588 >
Date:22:44, 9 November 2011
Author:gicode
Status:resolved (Comments)
Tags:core 
Comment:
Add wfRemoveDotSegments and unit tests. This is a sane step towards fixing
bug 32168. This implements RFC3986 Section 5.2.4.
http://tools.ietf.org/html/rfc3986#section-5.2.4

This is important because you need to remove dot segments in order to safely
compare URLs when limiting URLs to a particular path.
Modified paths:
  • /trunk/phase3/RELEASE-NOTES-1.19 (modified) (history)
  • /trunk/phase3/includes/GlobalFunctions.php (modified) (history)
  • /trunk/phase3/tests/phpunit/includes/GlobalFunctions/wfRemoveDotSegmentsTest.php (added) (history)

Diff [purge]

Index: trunk/phase3/RELEASE-NOTES-1.19
@@ -125,6 +125,7 @@
126126 from recent changes feeds
127127 * (bug 30232) add current time to message wlnote on Special:Watchlist
128128 * (bug 29110) $wgFeedDiffCutoff did not affect new pages
 129+* (bug 32168) Add wfRemoveDotSegments for use in wfExpandUrl
129130
130131 === API changes in 1.19 ===
131132 * (bug 19838) siprop=interwikimap can now use the interwiki cache.
Index: trunk/phase3/tests/phpunit/includes/GlobalFunctions/wfRemoveDotSegmentsTest.php
@@ -0,0 +1,86 @@
 2+<?php
 3+/**
 4+ * Unit tests for wfRemoveDotSegments()
 5+ */
 6+
 7+class wfRemoveDotSegments extends MediaWikiTestCase {
 8+ /** @dataProvider providePaths */
 9+ public function testWfRemoveDotSegments( $inputPath, $outputPath ) {
 10+ $actualPath = wfRemoveDotSegments( $inputPath );
 11+ $message = "Testing $inputPath expands to $outputPath";
 12+ echo $message . "\n";
 13+ $this->assertEquals( $outputPath, $actualPath, $message );
 14+ }
 15+
 16+ /**
 17+ * Provider of URL paths for testing wfRemoveDotSegments()
 18+ *
 19+ * @return array
 20+ */
 21+ public function providePaths() {
 22+ return array(
 23+ array( '/a/b/c/./../../g', '/a/g' ),
 24+ array( 'mid/content=5/../6', 'mid/6' ),
 25+ array( '/a//../b', '/a/b' ),
 26+ array( '', '' ),
 27+ array( '/', '/' ),
 28+ array( '//', '//' ),
 29+ array( '.', '' ),
 30+ array( '..', '' ),
 31+ array( '/.', '/' ),
 32+ array( '/..', '/' ),
 33+ array( './', '' ),
 34+ array( '../', '' ),
 35+ array( './a', 'a' ),
 36+ array( '../a', 'a' ),
 37+ array( '../../a', 'a' ),
 38+ array( '.././a', 'a' ),
 39+ array( './../a', 'a' ),
 40+ array( '././a', 'a' ),
 41+ array( '../../', '' ),
 42+ array( '.././', '' ),
 43+ array( './../', '' ),
 44+ array( '././', '' ),
 45+ array( '../..', '' ),
 46+ array( '../.', '' ),
 47+ array( './..', '' ),
 48+ array( './.', '' ),
 49+ array( '/../../a', '/a' ),
 50+ array( '/.././a', '/a' ),
 51+ array( '/./../a', '/a' ),
 52+ array( '/././a', '/a' ),
 53+ array( '/../../', '/' ),
 54+ array( '/.././', '/' ),
 55+ array( '/./../', '/' ),
 56+ array( '/././', '/' ),
 57+ array( '/../..', '/' ),
 58+ array( '/../.', '/' ),
 59+ array( '/./..', '/' ),
 60+ array( '/./.', '/' ),
 61+ array( 'b/../../a', '/a' ),
 62+ array( 'b/.././a', '/a' ),
 63+ array( 'b/./../a', '/a' ),
 64+ array( 'b/././a', 'b/a' ),
 65+ array( 'b/../../', '/' ),
 66+ array( 'b/.././', '/' ),
 67+ array( 'b/./../', '/' ),
 68+ array( 'b/././', 'b/' ),
 69+ array( 'b/../..', '/' ),
 70+ array( 'b/../.', '/' ),
 71+ array( 'b/./..', '/' ),
 72+ array( 'b/./.', 'b/' ),
 73+ array( '/b/../../a', '/a' ),
 74+ array( '/b/.././a', '/a' ),
 75+ array( '/b/./../a', '/a' ),
 76+ array( '/b/././a', '/b/a' ),
 77+ array( '/b/../../', '/' ),
 78+ array( '/b/.././', '/' ),
 79+ array( '/b/./../', '/' ),
 80+ array( '/b/././', '/b/' ),
 81+ array( '/b/../..', '/' ),
 82+ array( '/b/../.', '/' ),
 83+ array( '/b/./..', '/' ),
 84+ array( '/b/./.', '/b/' ),
 85+ );
 86+ }
 87+}
Index: trunk/phase3/includes/GlobalFunctions.php
@@ -477,9 +477,9 @@
478478
479479 $defaultProtoWithoutSlashes = substr( $defaultProto, 0, -2 );
480480
481 - if( substr( $url, 0, 2 ) == '//' ) {
 481+ if ( substr( $url, 0, 2 ) == '//' ) {
482482 return $defaultProtoWithoutSlashes . $url;
483 - } elseif( substr( $url, 0, 1 ) == '/' ) {
 483+ } elseif ( substr( $url, 0, 1 ) == '/' ) {
484484 // If $serverUrl is protocol-relative, prepend $defaultProtoWithoutSlashes, otherwise leave it alone
485485 return ( $serverHasProto ? '' : $defaultProtoWithoutSlashes ) . $serverUrl . $url;
486486 } else {
@@ -488,6 +488,47 @@
489489 }
490490
491491 /**
 492+ * Remove all dot-segments in the provided URL path. For example,
 493+ * '/a/./b/../c/' becomes '/a/c/'. For details on the algorithm, please see
 494+ * RFC3986 section 5.2.4.
 495+ *
 496+ * @todo Need to integrate this into wfExpandUrl (bug 32168)
 497+ *
 498+ * @param $urlPath String URL path, potentially containing dot-segments
 499+ * @return string URL path with all dot-segments removed
 500+ */
 501+function wfRemoveDotSegments( $urlPath ) {
 502+ $output = '';
 503+
 504+ while ( $urlPath ) {
 505+ $matches = null;
 506+ if ( preg_match('%^\.\.?/%', $urlPath, $matches) ) {
 507+ # Step A
 508+ $urlPath = substr( $urlPath, strlen( $matches[0] ) );
 509+ } elseif ( preg_match( '%^/\.(/|$)%', $urlPath, $matches ) ) {
 510+ # Step B
 511+ $start = strlen( $matches[0] );
 512+ $urlPath = '/' . substr( $urlPath, $start );
 513+ } elseif ( preg_match( '%^/\.\.(/|$)%', $urlPath, $matches ) ) {
 514+ # Step C
 515+ $start = strlen( $matches[0] );
 516+ $urlPath = '/' . substr( $urlPath, $start );
 517+ $output = preg_replace('%(^|/)[^/]*$%', '', $output);
 518+ } elseif ( preg_match( '%^\.\.?$%', $urlPath, $matches ) ) {
 519+ # Step D
 520+ $urlPath = substr( $urlPath, strlen( $matches[0] ) );
 521+ } else {
 522+ # Step E
 523+ preg_match( '%^/?[^/]*%', $urlPath, $matches );
 524+ $urlPath = substr( $urlPath, strlen( $matches[0] ) );
 525+ $output .= $matches[0];
 526+ }
 527+ }
 528+
 529+ return $output;
 530+}
 531+
 532+/**
492533 * Returns a regular expression of url protocols
493534 *
494535 * @param $includeProtocolRelative bool If false, remove '//' from the returned protocol list.

Follow-up revisions

RevisionCommit summaryAuthorDate
r102671Follow-up r102587. Add details to comments and add a couple more tests.gicode18:02, 10 November 2011
r103199Add wfAssembleUrl and unit tests. This is the next step towards fixing...gicode17:38, 15 November 2011
r103208Make wfExpandUrl use wfRemoveDotSegments on the resulting path. This finishes...gicode18:53, 15 November 2011
r109720Follow-up 102587 to address performance concerns in wfRemoveDotSegments.gicode04:57, 22 January 2012

Comments

#Comment by 😂 (talk | contribs)   15:21, 10 November 2011
  • I already fixed the echo's (leftover from debugging?) in the followup
  • Could you add some slightly more informative comments than "Step A/Step B/etc?"
#Comment by GICodeWarrior (talk | contribs)   15:59, 10 November 2011

Thanks for the echo fix. I was trying to debug something, but I couldn't see the output and forgot about it. Do you know if there is any way to see the names of tests as they are executed? Or if there is any specific way to debug tests?

The comments map directly to the details in the RFC. You are right though, I will add some non-normative remarks inline.

#Comment by 😂 (talk | contribs)   16:00, 10 November 2011

There might be some verbosity options within phpunit that we could play with :)

#Comment by GICodeWarrior (talk | contribs)   16:04, 10 November 2011

Ok, I looked through the documentation I could find, but I didn't see any. It looks like we need to implement an entirely new PHPUnit_Util_Printer. :-/

#Comment by Hashar (talk | contribs)   12:28, 12 December 2011

Use --tap to show the test names :-)

#Comment by Hashar (talk | contribs)   12:29, 12 December 2011

Example output:

$ ./phpunit.php --tap
TAP version 13
ok 1 - ArticleTablesTest::testbug14404
ok 2 - ArticleTest::testImplementsGetMagic
ok 3 - ArticleTest::testImplementsSetMagic
ok 4 - ArticleTest::testImplementsCallMagic
ok 5 - ArticleTest::testGetOrSetOnNewProperty
ok 6 - ArticleTest::testStaticFunctions
...
#Comment by Hashar (talk | contribs)   12:49, 12 December 2011
if( preg_match('%^\.\.?/%', $urlPath, $matches) ) {

# Step A $urlPath = substr( $urlPath, strlen( $matches[0] ) );

}

Will match './' which is matched in Step B as well. You could just use substr() == to compare. Would save you a preg_match call and a strlen call (since '../' is known to be 3 chars).

} elseif ( preg_match( '%^/\.(/|$)%', $urlPath, $matches ) ) {

# Step B $start = strlen( $matches[0] ); $urlPath = '/' . substr( $urlPath, $start );

Same there, isn't strlen( $matches[0] ) always 3 here? Plus setting $start and using it only once after is not that helpful.


I am wondering if there is an existent PEAR implementation for this algorithm. Will avoid us reinventing the wheel and creating new bugs. We could ship the upstream lib in ./includes/libs and still run test on it.

#Comment by GICodeWarrior (talk | contribs)   16:04, 12 December 2011

Step A matches ./ and ../ each only if they are at the beginning of the string. Step B matches /./ or /. followed by the end of the string.

Both steps are matching two different possible strings.

Even if they were each matching only one, I am pretty sure a preg_match anchored to the beginning of the string while using only ? as a wildcard and | as a combinator followed by a boolean test is much faster than creating new strings and running a string comparison. Anchored, greedy regular expressions are generally quite fast.

If you think this implementation is particularly slow or in a critical path, perhaps we should run a performance test.

The usage of $start keeps the subsequent line under 80 characters. They can certainly be inlined if that is the preferred style.

If you find an implementation in PEAR, please let me know. I did a fair bit of searching for a PHP implementation of this RFC. The best attempt I could find was a ragged/"clever" patch for something Drupal related.

#Comment by Hashar (talk | contribs)   16:12, 12 December 2011

I have found an implementation for removing dot segment in PEAR. The module is Net::URL2 : http://svn.php.net/viewvc/pear/packages/Net_URL2/trunk/Net/URL2.php?view=markup Look for 'function removeDotSegments'.

#Comment by GICodeWarrior (talk | contribs)   16:20, 12 December 2011

Nice find. It appears correct.

Feel free to swap it in if you prefer.

#Comment by GICodeWarrior (talk | contribs)   16:24, 12 December 2011

Actually, please patch out the counter if you do. It opens a security vulnerability.

#Comment by MarkAHershberger (talk | contribs)   21:55, 20 December 2011

not really a FIXME, more of a todo.

#Comment by MarkAHershberger (talk | contribs)   21:55, 20 December 2011

not really a FIXME, more of a todo.

#Comment by Tim Starling (talk | contribs)   20:31, 21 January 2012

Consuming a string by repeatedly taking trailing substrings is not an appropriate way to parse in PHP, since it necessarily leads to O(N^2) runtime. substr() is O(N) in the size of the remaining substring, and you do O(N) substrings, so the overall runtime is O(N^2). Most string functions in PHP have an offset parameter which allow you to avoid this style of string processing.

Note that, at least last time I tested it, anchoring a regex with "^" doesn't work with an offset parameter, so the /A modifier needs to be used instead.

So you will have a style like:

$p = 0;
while ( $p < strlen( $urlPath ) ) {
   if ( preg_match( '%...%A', $urlPath, $matches, 0, $p ) ) {
      ...
      $p += strlen( $matches[0] ); 
   }
   ...
}
+	while ( $urlPath ) {

Do not convert strings to boolean like this, because strings such as "0000" will convert to false. See Manual:Coding conventions/PHP#Pitfalls for more tips like this.

I think using substr() for your leading substring comparisons, like in Hashar's comment, would improve performance slightly. You could save a few substrings to local variables:

$lead1 = substr( $urlPath, $p, 1 );
$lead2 = substr( $urlPath, $p, 2 );
$lead3 = substr( $urlPath, $p, 3 );

then most of your regexes would just become comparisons between these local variables and string literals. The regex in step E could be replaced with an strpos() search for "/". It would probably also improve the readability of the code, since your regexes are fairly dense.

#Comment by GICodeWarrior (talk | contribs)   05:09, 22 January 2012

Thanks for the detailed feedback. :-)

I have posted a followup, but I messed up the linking. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/109720

Status & tagging log