r44560 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r44559‎ | r44560 | r44561 >
Date:05:47, 14 December 2008
Author:vyznev
Status:resolved (Comments)
Tags:
Comment:
Add a new FileCache class to wrap RepoGroup::findFile() and findFiles(), and make wfFindFile() use it by default. This should improve performance for pages that refer to the same image several times, but the real benefit is that it allows batch file existence checks, à la LinkBatch, by collecting a set of titles (or DB keys) and calling FileCache::findFiles() on them to prefill the cache.
XXX: The code seems to more or less work, but it obviously needs more testing, regarding both stability and memory usage. In particular, I have not tested file uploads yet -- there may be consistency issues there.
Modified paths:
  • /trunk/phase3/RELEASE-NOTES (modified) (history)
  • /trunk/phase3/includes/AutoLoader.php (modified) (history)
  • /trunk/phase3/includes/GlobalFunctions.php (modified) (history)
  • /trunk/phase3/includes/filerepo/FileCache.php (added) (history)

Diff [purge]

Index: trunk/phase3/includes/GlobalFunctions.php
@@ -2719,10 +2719,15 @@
27202720 * current version. An image object will be returned which
27212721 * was created at the specified time.
27222722 * @param mixed $flags FileRepo::FIND_ flags
 2723+ * @param boolean $bypass Bypass the file cache even if it could be used
27232724 * @return File, or false if the file does not exist
27242725 */
2725 -function wfFindFile( $title, $time = false, $flags = 0 ) {
2726 - return RepoGroup::singleton()->findFile( $title, $time, $flags );
 2726+function wfFindFile( $title, $time = false, $flags = 0, $bypass = false ) {
 2727+ if( !$time && !$flags && !$bypass ) {
 2728+ return FileCache::singleton()->findFile( $title );
 2729+ } else {
 2730+ return RepoGroup::singleton()->findFile( $title, $time, $flags );
 2731+ }
27272732 }
27282733
27292734 /**
Index: trunk/phase3/includes/filerepo/FileCache.php
@@ -0,0 +1,156 @@
 2+<?php
 3+/**
 4+ * Cache of file objects, wrapping some RepoGroup functions to avoid redundant
 5+ * queries. Loosely inspired by the LinkCache / LinkBatch classes for titles.
 6+ *
 7+ * ISSUE: Merge with RepoGroup?
 8+ *
 9+ * @ingroup FileRepo
 10+ */
 11+class FileCache {
 12+ var $repoGroup;
 13+ var $cache = array(), $notFound = array();
 14+
 15+ protected static $instance;
 16+
 17+ /**
 18+ * Get a FileCache instance. Typically, only one instance of FileCache
 19+ * is needed in a MediaWiki invocation.
 20+ */
 21+ static function singleton() {
 22+ if ( self::$instance ) {
 23+ return self::$instance;
 24+ }
 25+ self::$instance = new FileCache( RepoGroup::singleton() );
 26+ return self::$instance;
 27+ }
 28+
 29+ /**
 30+ * Destroy the singleton instance, so that a new one will be created next
 31+ * time singleton() is called.
 32+ */
 33+ static function destroySingleton() {
 34+ self::$instance = null;
 35+ }
 36+
 37+ /**
 38+ * Set the singleton instance to a given object
 39+ */
 40+ static function setSingleton( $instance ) {
 41+ self::$instance = $instance;
 42+ }
 43+
 44+ /**
 45+ * Construct a group of file repositories.
 46+ * @param RepoGroup $repoGroup
 47+ */
 48+ function __construct( $repoGroup ) {
 49+ $this->repoGroup = $repoGroup;
 50+ }
 51+
 52+
 53+ /**
 54+ * Add some files to the cache. This is a fairly low-level function,
 55+ * which most users should not need to call. Note that any existing
 56+ * entries for the same keys will not be replaced. Call clearFiles()
 57+ * first if you need that.
 58+ * @param array $files array of File objects, indexed by DB key
 59+ */
 60+ function addFiles( $files ) {
 61+ wfDebug( "FileCache adding ".count( $files )." files\n" );
 62+ $this->cache += $files;
 63+ }
 64+
 65+ /**
 66+ * Remove some files from the cache, so that their existence will be
 67+ * rechecked. This is a fairly low-level function, which most users
 68+ * should not need to call.
 69+ * @param array $remove array indexed by DB keys to remove (the values are ignored)
 70+ */
 71+ function clearFiles( $remove ) {
 72+ wfDebug( "FileCache clearing data for ".count( $remove )." files\n" );
 73+ $this->cache = array_diff_keys( $this->cache, $remove );
 74+ $this->notFound = array_diff_keys( $this->notFound, $remove );
 75+ }
 76+
 77+ /**
 78+ * Mark some DB keys as nonexistent. This is a fairly low-level
 79+ * function, which most users should not need to call.
 80+ * @param array $dbkeys array of DB keys
 81+ */
 82+ function markNotFound( $dbkeys ) {
 83+ wfDebug( "FileCache marking ".count( $dbkeys )." files as not found\n" );
 84+ $this->notFound += array_fill_keys( $dbkeys, true );
 85+ }
 86+
 87+
 88+ /**
 89+ * Search the cache for a file.
 90+ * @param mixed $title Title object or string
 91+ * @return File object or false if it is not found
 92+ * @todo Implement searching for old file versions(?)
 93+ */
 94+ function findFile( $title ) {
 95+ if( !( $title instanceof Title ) ) {
 96+ $title = Title::makeTitleSafe( NS_FILE, $title );
 97+ }
 98+ if( !$title ) {
 99+ return false; // invalid title?
 100+ }
 101+
 102+ $dbkey = $title->getDBkey();
 103+ if( array_key_exists( $dbkey, $this->cache ) ) {
 104+ wfDebug( "FileCache HIT for $dbkey\n" );
 105+ return $this->cache[$dbkey];
 106+ }
 107+ if( array_key_exists( $dbkey, $this->notFound ) ) {
 108+ wfDebug( "FileCache negative HIT for $dbkey\n" );
 109+ return false;
 110+ }
 111+
 112+ // Not in cache, fall back to a direct query
 113+ $file = $this->repoGroup->findFile( $title );
 114+ if( $file ) {
 115+ wfDebug( "FileCache MISS for $dbkey\n" );
 116+ $this->cache[$dbkey] = $file;
 117+ } else {
 118+ wfDebug( "FileCache negative MISS for $dbkey\n" );
 119+ $this->notFound[$dbkey] = true;
 120+ }
 121+ return $file;
 122+ }
 123+
 124+ /**
 125+ * Search the cache for multiple files.
 126+ * @param array $titles Title objects or strings to search for
 127+ * @return array of File objects, indexed by DB key
 128+ */
 129+ function findFiles( $titles ) {
 130+ $titleObjs = array();
 131+ foreach ( $titles as $title ) {
 132+ if ( !( $title instanceof Title ) ) {
 133+ $title = Title::makeTitleSafe( NS_FILE, $title );
 134+ }
 135+ if ( $title ) {
 136+ $titleObjs[$title->getDBkey()] = $title;
 137+ }
 138+ }
 139+
 140+ $result = array_intersect_key( $this->cache, $titleObjs );
 141+
 142+ $unsure = array_diff_key( $titleObjs, $result, $this->notFound );
 143+ if( $unsure ) {
 144+ wfDebug( "FileCache MISS for ".count( $unsure )." files out of ".count( $titleObjs )."...\n" );
 145+ // XXX: We assume the array returned by findFiles() is
 146+ // indexed by DBkey; this appears to be true, but should
 147+ // be explicitly documented.
 148+ $found = $this->repoGroup->findFiles( $unsure );
 149+ $result += $found;
 150+ $this->addFiles( $found );
 151+ $this->markNotFound( array_keys( array_diff_key( $unsure, $found ) ) );
 152+ }
 153+
 154+ wfDebug( "FileCache found ".count( $result )." files out of ".count( $titleObjs )."\n" );
 155+ return $result;
 156+ }
 157+}
Index: trunk/phase3/includes/AutoLoader.php
@@ -365,6 +365,7 @@
366366 # includes/filerepo
367367 'ArchivedFile' => 'includes/filerepo/ArchivedFile.php',
368368 'File' => 'includes/filerepo/File.php',
 369+ 'FileCache' => 'includes/filerepo/FileCache.php',
369370 'FileRepo' => 'includes/filerepo/FileRepo.php',
370371 'FileRepoStatus' => 'includes/filerepo/FileRepoStatus.php',
371372 'ForeignAPIFile' => 'includes/filerepo/ForeignAPIFile.php',
Index: trunk/phase3/RELEASE-NOTES
@@ -229,6 +229,7 @@
230230 your language) instead of Wikipedia.
231231 * (bug 16635) The "view and edit watchlist" page (Special:Watchlist/edit) now
232232 includes a table of contents
 233+* File objects returned by wfFindFile() are now cached by default
233234
234235 === Bug fixes in 1.14 ===
235236

Follow-up revisions

RevisionCommit summaryAuthorDate
r44567Fix regression in parserTests.inc caused by r44560.vyznev07:38, 14 December 2008
r55082* Per my CR comments on r44560: merged FileCache into RepoGroup and fixed wfF...tstarling09:59, 15 August 2009

Comments

#Comment by Aaron Schulz (talk | contribs)   06:14, 14 December 2008

Would be nice if $time was supported (FlaggedRevs uses it)

#Comment by Mr.Z-man (talk | contribs)   06:34, 14 December 2008

This breaks 41 image-related parser tests with output generally something like

Running test Simple image... FAILED!
--- Expected ---
<p><a href="https://www.mediawiki.org/wiki/File:Foobar.jpg" class="image" title="Image:foobar.jpg"><img alt="Image:foobar.jpg" src="[http://example.com/images/3/3a/Foobar.jpg http://example.com/images/3/3a/Foobar.jpg]" width="1941" height="220" border="0" /></a>
</p>
--- Actual ---
<p><a href="https://www.mediawiki.org/index.php?title=Special:Upload&wpDestFile=Foobar.jpg" class="new" title="File:Foobar.jpg">File:Foobar.jpg</a>
</p>

Also, if this is kept, the simple file object cache in the parser (Parser.php ~Line:4265) should probably be removed.

#Comment by Ilmari Karonen (talk | contribs)   07:26, 14 December 2008

I'm planning to eventually rewrite that bit of the parser so that all images get checked in one batch (like links currently are). Now I just need to figure out why those tests are breaking...

#Comment by Ilmari Karonen (talk | contribs)   07:40, 14 December 2008

The parser test breakage should be fixed by r44567. Please report any other regressions.

#Comment by Brion VIBBER (talk | contribs)   00:12, 17 December 2008

So far so good :D

#Comment by Tim Starling (talk | contribs)   15:42, 26 May 2009

OK, looks like my theory on single-line global function shortcut being short enough to avoid bloating out beyond all belief has been proven wrong. Please move this crap out of GlobalFunctions.php before I lose my cool and delete the function and migrate all callers to use RepoGroup::singleton()->findFile() directly.

#Comment by Ilmari Karonen (talk | contribs)   08:14, 27 May 2009

So do you think that:

  • a) the whole thing should be reverted — we don' need no stinkin' caches,
  • b) the cache could be used for code that wants to use it explicitly (e.g. if I ever got around to finishing the parser tweaks mentioned above), but wfFindFile() should not use it, or
  • c) the caching should be done inside RepoGroup (or possibly in individual repos)?
#Comment by Tim Starling (talk | contribs)   09:02, 27 May 2009

c. It should be done in RepoGroup::findFile(). It can certainly see the need for a separate cache class, but the relevant object can be a member of the RepoGroup instead of a static variable. The main issue is that global functions cannot be autoloaded and so their size needs to be kept to a minimum. Also, the modularity of the codebase needs to be preserved, to ease reading and reviewing. Five lines is tolerable, but it never stops there, once a function starts to grow it's just too tempting to pile in more logic.

By making the FileCache a static variable instead of an object member, you're giving up the flexibility of the OOP model and reducing the potential functionality that instantiating new RepoGroup objects might have. If an application for an uncached RepoGroup arises, that can be implemented as a RepoGroup constructor option.

#Comment by Tim Starling (talk | contribs)   10:05, 15 August 2009

Rewritten in r55082.

#Comment by Tim Starling (talk | contribs)   00:53, 9 July 2010

Was there some use case where this cache actually helped performance in some way? It's causing OOM errors on Wikimedia at the moment.

#Comment by Ilmari Karonen (talk | contribs)   12:03, 9 July 2010

I can't dig up any specific references, but my original reason for adding this was the observation that some pages displaying the same image (often an icon of some sort) many times could take a long time to parse. I'm sure there was discussion somewhere, but I'm not finding it right now. However, off the top of my head, consider e.g. commons:COM:FPC, which right now includes 338 copies of File:Symbol support vote.svg and 147 copies of File:Symbol oppose vote.svg, plus a lesser number of other vote icons.

Anyway, if the cache is taking up too much memory, the immediate fix should probably be to reduce the size limit; on pages like the example above, where the cache ought to have the most effect, a 100-file or even a 10-file cache should still give a fairly good hit rate. 1000 files is probably too much to cache for most purposes.

Part of the problem is probably that File objects can be relatively heavyweight, especially if they contain metadata. An alternative (or complementary) approach might be to only cache a limited subset of the file data, enough for normal parsing, and rely on lazy loading to get the rest when needed. But reducing the cache size is probably the best initial step.

#Comment by Ilmari Karonen (talk | contribs)   12:17, 9 July 2010

I see you already did that in the 1.16wmf4 branch in r69199. For the time being, making the same change in trunk might not be a bad idea. But I'd say the real long-term fix should be not to store "massive metadata" in File objects in the first place (or at least not including it in the cache) unless it's actually needed.

Status & tagging log