r21643 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r21643
Date: 23:05, 26 April 2007
Author: rainman
Status: old
Comment: Added a somewhat more detailed overview of the lucene-search 2.0 extension.
Modified paths:
  • /trunk/lucene-search-2.0/OVERVIEW.txt (added)

Diff

Index: trunk/lucene-search-2.0/OVERVIEW.txt
@@ -0,0 +1,225 @@

Lucene Search 2.0 Overview (by Robert Stojnic)


== Distributed architecture ==

- one indexer / many searchers

- indexers make clean snapshots of indexes, and searchers use rsync
 to obtain a local copy of the updated index

- the index can be whole (single), mainsplit (two parts, one with all
 articles in the main namespace, the other with the rest of the
 articles), or split (some fraction of documents in each subindex)

- both indexing and searching can be distributed across many hosts

- there is a global configuration file (mwsearch-global.conf) that
 lays out the global architecture and defines indexes, indexers, and
 searchers. It needs to be available to all nodes. A hypothetical
 sketch is given after this list.

- the local configuration file deals with host-specific settings

- Java RMI is used for communication; it is typically at least 5-6
 times faster than XML-RPC, and also manages persistent connections.
 Further, Lucene has built-in support for RMI (see the interface
 sketch after this list)

- Searchers watch the status of other searchers, and try to ping dead
 searchers (while at the same time taking them out of rotation while
 they are down)
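
 For illustration only, a global configuration for a mainsplit setup
 might look something like the sketch below. Section names, host
 names and paths here are invented; the real syntax is defined by the
 package and may differ:

   [Database]
   wikidb : (mainsplit)

   [Index]
   indexhost : wikidb.mainpart wikidb.restpart

   [Search-Group]
   searchhost1 : wikidb.mainpart wikidb.restpart
   searchhost2 : wikidb.mainpart wikidb.restpart

   [Index-Path]
   <default> : /usr/local/search/indexes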
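
 As a rough illustration of the RMI setup (the interface and method
 names are invented, not the actual lucene-search API), a remote
 interface for a searcher node might look like:

   import java.rmi.Remote;
   import java.rmi.RemoteException;

   public interface SearchHost extends Remote {
       // document frequencies for query terms (the IDFs needed to
       // build a global scorer across index parts)
       int[] getDocFreqs(String[] terms) throws RemoteException;

       // run a query against a local index part, returning doc ids
       int[] search(String query, int limit) throws RemoteException;

       // fetch several stored documents in a single round-trip
       String[] getDocuments(int[] docIds) throws RemoteException;

       // liveness check, used to take dead searchers out of rotation
       boolean ping() throws RemoteException;
   }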

Notes:

- mainsplit is a special case of the more general idea of having the
 database split around namespaces. It is a convenient special case
 because the database then has minimal size for most searches (given
 that most searches come from anonymous users wanting to search the
 main namespace). Currently, this is the only implemented way to
 split the database around namespaces, but it could easily be
 extended.


== Searching ==

- Distributed search (on a split index) takes 3 calls per nonlocal
 index: getting IDFs (inverse document frequencies, needed to
 construct a global scorer), the actual search, and retrieving some
 number of documents (if needed); cf. the remote interface sketch in
 the previous section

- Searchers are organized into search groups. When a search is to be
 executed, each searcher tries to use local copies of indexes, and
 for those it doesn't have a local copy of, it randomly chooses a
 remote searcher within its group that has the needed index part

- Search always returns the correct number of hits, since queries are
 either filtered (filters are cached and introduce very little
 overhead - one BitSet.get() per result; a sketch is given at the end
 of this section), or rewritten so that only those namespaces
 specified in the query are searched

- Update of the index is done in a separate thread that prepares a
 copy for rsync (makes a hard-link copy of the current index), and
 then runs rsync on it so that only differences are transferred.
 After that the index is opened, typical queries are run on it (to
 warm up the copy and load caches), the main namespace filter is
 rebuilt, and then in a synchronized block the old searcher is
 replaced with the new one. The old one is closed after 15 seconds
 (the 15 seconds are to allow threads to finish with the old index);
 a sketch of the swap follows the filter example below
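
 A minimal sketch of such a cached filter, against the Lucene
 1.9/2.0-era Filter API (Filter.bits() returning a java.util.BitSet);
 the "namespace" field name is an assumption:

   import java.io.IOException;
   import java.util.BitSet;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.index.TermDocs;
   import org.apache.lucene.search.Filter;

   public class NamespaceFilter extends Filter {
       private final String namespace;
       // built once per filter instance; the searcher builds a fresh
       // filter whenever the index is reopened
       private BitSet cached;

       public NamespaceFilter(String namespace) {
           this.namespace = namespace;
       }

       public synchronized BitSet bits(IndexReader reader) throws IOException {
           if (cached == null) {
               cached = new BitSet(reader.maxDoc());
               // mark every document belonging to the namespace
               TermDocs td = reader.termDocs(new Term("namespace", namespace));
               while (td.next())
                   cached.set(td.doc());
               td.close();
           }
           return cached; // filtering then costs one BitSet.get() per hit
       }
   }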
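
 And a sketch of the swap itself (class and method names invented):
 the warmed-up searcher is swapped in under a lock, and the old one
 is closed 15 seconds later so in-flight queries can finish:

   import org.apache.lucene.search.IndexSearcher;

   public class SearcherPool {
       private IndexSearcher current;

       public synchronized IndexSearcher get() {
           return current;
       }

       public void swapIn(String warmedIndexPath) throws Exception {
           IndexSearcher fresh = new IndexSearcher(warmedIndexPath);
           final IndexSearcher old;
           synchronized (this) {
               old = current;
               current = fresh;
           }
           if (old == null)
               return;
           new Thread() {
               public void run() {
                   try {
                       Thread.sleep(15000); // let running queries finish
                       old.close();
                   } catch (Exception e) {
                   }
               }
           }.start();
       }
   }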


== Indexing ==

- One indexer per index part. For split indexes there is one main
 indexer that maintains a logical view of the whole index

- Mainsplit and single indexes have very limited overhead. Split
 indexes have a larger overhead, due to the reporting system that
 keeps the operations atomic and makes sure they are correctly
 carried out

- Indexing is fastest if it's done in large batches. However, storing
 a large number of articles eats up heap: a 64MB heap is eaten up by
 a queue size of 30000, but in my testing environment a queue of
 5000 worked fine.

- After testing Java XML-RPC implementations it seemed to me that
 they all introduce large overhead, so I implemented a hackish HTTP
 frontend for the indexer. Article text is transferred raw in the
 POST request, and the function to be called is encoded in the URL
 (a hypothetical client sketch follows)
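
 An illustrative client for this frontend. The URL layout, port and
 parameter names are guesses; only "raw article text in the POST
 body, function name in the URL" is taken from the description:

   import java.io.OutputStream;
   import java.net.HttpURLConnection;
   import java.net.URL;
   import java.net.URLEncoder;

   public class IndexerClient {
       public static int postArticle(String host, String db,
               String title, String text) throws Exception {
           // hypothetical: function name in the path, page in the query
           URL url = new URL("http://" + host + ":8080/addArticle?db="
                   + db + "&title=" + URLEncoder.encode(title, "UTF-8"));
           HttpURLConnection conn = (HttpURLConnection) url.openConnection();
           conn.setRequestMethod("POST");
           conn.setDoOutput(true);
           OutputStream out = conn.getOutputStream();
           out.write(text.getBytes("UTF-8")); // raw text, no XML-RPC envelope
           out.close();
           return conn.getResponseCode();
       }
   }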


== Wiki parser ==

- FastWikiTokenizerEngine.java is a handmade parser for basic wiki
 syntax. It is intended to replace the slow stripWiki() function.

- Accents are stripped by default, thus no accents are ever
 indexed. AFAIK this is OK for all languages (and seems to be the
 way major search engines do it). Indexing accented words as aliases
 would probably be unnecessary overhead.

- numbers are also tokenized

- extracts categories and interwiki links (interwikis are currently
 unused)

- skips table properties, names of templates (e.g. so that a search
 for "stub" gives meaningful results), image properties, external
 link urls, and xml markup

- localization is read at startup from the Messages files (to be
 up-to-date), so that the parser recognizes localized variants of
 the Category and Image keywords. The interwiki map is read out of
 the static file lib/interwiki.map (which could be updated somehow?)


== Analyzers/Languages ==

- The generic language analyzer consists of a filter (e.g. for
 Serbian (conversion to latin, etc) and Thai (word tokenization))
 and a stemmer (for English, German, French, Esperanto, Dutch,
 Russian). Stemmed words are indexed alongside the original words
 (i.e. as aliases - positional increment 0); a sketch is given after
 this list

- Search queries use the same language analyzer, but stemmed words
 are boosted by 0.5, so that exact matches are favored

- Titles are not stemmed, to further favor exact matches and reduce
 overhead, as words from the title usually appear in the article

TODO:
- Maybe look at more languages, especially Chinese


== Query Parser ==

- Faster on complex queries than the Lucene QueryParser

- recognizes a subset of the QueryParser syntax: the AND and OR
 keywords, +/- prefixes, and phrases enclosed in ""

- introduces namespace prefixes ''namespace:query'' to limit search
 to a certain namespace: e.g. ''help:inserting pictures''. Note that
 the ''help'' prefix is valid until the end of the query or until
 some other prefix definition: e.g. ''help:editing
 project:wikipedia'' will find all pages in the help namespace
 containing ''editing'' and all pages in the project namespace
 containing ''wikipedia''.

- searching categories. The syntax is ''query category:"exact
 category name"''. It is important to note that category names are
 themselves not tokenized. Using logical operators, intersections,
 unions and differences of categories can be searched. Since the
 exact category name is needed (though case is ignored), it is maybe
 best to incorporate this somewhere on the category page, and have
 the category name put into the query by MediaWiki instead of
 manually by the user.

Note:

- namespace prefixes render the old way of picking namespaces to
 search unusable, so it should be removed from user settings, and
 users should pick only one namespace to be their default for search
 (or all namespaces). In theory, by rewriting the query it would be
 possible to stay backward compatible with the current way, but it
 would slow down searching for those users, and I wonder whether it
 is important to be able to have any arbitrary combination of
 namespaces searched by default, and how many users use this

- see WikiQueryParser.java for the adopted names of namespaces
 (all: is a special prefix for all namespaces)

Todo:

- Before the search query is passed to Lucene-Search, localized
 versions of namespace names should be replaced with the standard
 ones. This should be implemented in MediaWiki. E.g. ''srpski
 kategorija:jezici'' -> ''srpski category:jezici''. A hypothetical
 sketch follows.
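
 A hypothetical illustration of the rewrite (in MediaWiki itself this
 would be PHP; the prefix map would come from the wiki's localized
 namespace names):

   import java.util.HashMap;
   import java.util.Map;

   public class NamespacePrefixRewriter {
       private static final Map<String, String> prefixes =
               new HashMap<String, String>();
       static {
           prefixes.put("kategorija:", "category:"); // Serbian, as above
       }

       public static String canonicalize(String query) {
           String result = query;
           for (Map.Entry<String, String> e : prefixes.entrySet())
               result = result.replace(e.getKey(), e.getValue());
           return result;
       }
   }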


== Lucene patch ==

- Don't use Readers but plain Strings when possible. Java streams
 are very slow; wherever we don't need the general Reader interface,
 I replaced it with plain Strings (instead of StringReaders).

- a SearchableMul interface enables retrieval of many documents in a
 single call (to minimize network overhead); a sketch of the idea
 follows
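
 A rough sketch of the idea (the actual interface in the patch may
 differ): extend Searchable with a batched variant of doc(int), so
 that n documents cost one remote call instead of n:

   import java.io.IOException;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.search.Searchable;

   public interface SearchableMul extends Searchable {
       // batched counterpart of Searchable.doc(int)
       Document[] docs(int[] ids) throws IOException;
   }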


== Incremental update ==

- takes load off the database, gives a more up-to-date index, etc.

- however, it needs to be tested in a production environment. There
 are two scenarios (sketched in code at the end of this section):

 * don't optimize indexes, with a merge factor of 2. This means that
 additional segments will be made with new documents, and that old
 documents will be deleted when the size of the index doubles
 (roughly)

 This is good because incremental update is fast, and download of
 the index to searchers is also fast, since only new segments need
 to be transferred, and old segment pages are already in memory
 (since we use cp -lr to prepare a copy for rsync).

 However, search will be slower (according to my tests, up to 50%
 slower in the worst case), and in the worst case twice the memory
 is needed

 * optimize indexes. To optimize an index, the indexer needs twice
 the memory of the whole index (because a brand new segment is made
 from all the old segments merged)

 The rsync diff algorithm might be helpful in transferring the
 optimized index, but probably not too much.

 On the searcher, twice the memory is needed when the index is
 updated (actually probably less, since in the beginning not all of
 the index is loaded in memory, especially the part of the index
 with positional increments)

- a third scenario is to abandon incremental updates altogether, and
 just rebuild the whole database from scratch and then snapshot it
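
 A sketch of the two indexer configurations above, using the Lucene
 1.9/2.0-era IndexWriter API (the path and analyzer are
 placeholders):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.index.IndexWriter;

   public class UpdateModes {
       // scenario 1: never optimize; merge factor 2 means new segments
       // accumulate and rsync only has to ship the new ones
       static void incrementalUpdate(IndexWriter writer) throws Exception {
           writer.setMergeFactor(2);
           // writer.addDocument(...) for each queued article; no optimize()
       }

       // scenario 2: optimize after updating; merging everything into
       // one segment needs roughly twice the index size during the merge
       static void optimizedUpdate(IndexWriter writer) throws Exception {
           // writer.addDocument(...) for each queued article
           writer.optimize();
       }

       public static void main(String[] args) throws Exception {
           IndexWriter writer = new IndexWriter("/search/wikidb",
                   new StandardAnalyzer(), false);
           incrementalUpdate(writer);
           writer.close();
       }
   }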