Lucene Search 2.0 Overview (by Robert Stojnic)


== Distributed architecture ==

- one indexer / many searchers

- indexers make clean snapshots of the indexes, and searchers use
  rsync to obtain a local copy of the updated index

- an index can be whole (single), mainsplit (two parts: one with all
  articles in the main namespace, the other with the rest of the
  articles), or split (some fraction of the documents in each
  subindex)

- both indexing and searching can be distributed across many hosts

- there is a global configuration file (mwsearch-global.conf) that
  lays out the global architecture and defines indexes, indexers, and
  searchers. It needs to be available to all nodes.

- a local configuration file deals with host-specific settings

- Java RMI is used for communication; it is typically at least 5-6
  times faster than XML-RPC and also manages persistent connections.
  Further, Lucene has built-in support for RMI.
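
  A minimal sketch of publishing a local index part over RMI with
  Lucene's built-in RemoteSearchable; the index path and binding name
  are illustrative, not taken from the actual configuration:

    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RemoteSearchable;

    public class PublishSearcher {
        public static void main(String[] args) throws Exception {
            LocateRegistry.createRegistry(1099);  // default RMI registry port
            IndexSearcher local = new IndexSearcher("/search/index/enwiki");
            // export the local searcher so remote hosts can call it
            Naming.rebind("//localhost/enwiki", new RemoteSearchable(local));
        }
    }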

- Searchers watch the status of other searchers and try to ping dead
  searchers (while at the same time taking them out of rotation while
  they are down)

Notes:

- mainsplit is a special case of the more general idea of splitting
  the database around namespaces. It is a convenient special case
  because the database then has minimal size for most searches (given
  that most searches come from anonymous users who want to search the
  main namespace). Currently this is the only implemented way to
  split the database around namespaces, but it could easily be
  extended.


== Searching ==

- Distributed search (on a split index) takes 3 calls per nonlocal
  index: getting IDFs (inverse document frequencies, needed to
  construct a global scorer), the actual search, and retrieving some
  number of documents (if needed)
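
  For reference, this is the shape of a distributed search using the
  stock Lucene 2.0 RMI machinery (MultiSearcher aggregates document
  frequencies across the parts before scoring); hostnames, binding
  names, and the "contents" field are invented for illustration:

    import java.rmi.Naming;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class DistributedSearch {
        public static void main(String[] args) throws Exception {
            // look up remote index parts published as RemoteSearchable
            Searchable main = (Searchable) Naming.lookup("//host1/enwiki.main");
            Searchable rest = (Searchable) Naming.lookup("//host2/enwiki.rest");
            // MultiSearcher gathers docFreqs from all parts, so scores
            // are computed against a global IDF
            Searcher searcher = new MultiSearcher(new Searchable[] { main, rest });
            Hits hits = searcher.search(new TermQuery(new Term("contents", "wiki")));
            System.out.println(hits.length() + " hits");
        }
    }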

- Searchers are organized in search groups. When a search is to be
  executed, each searcher tries to use local copies of the indexes,
  and for those it doesn't have a local copy of, it randomly chooses
  a remote searcher within its group that has the needed index part

- Search always returns the correct number of hits, since queries are
  either filtered (filters are cached and introduce very little
  overhead - one BitSet.get() per result) or rewritten so that only
  the namespaces specified in the query are searched
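
  An illustration of what such a cached namespace filter can look
  like against the Lucene 2.0 Filter API; the field name "namespace"
  and the caching details are assumptions:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class NamespaceFilter extends Filter {
        private final String namespace;

        public NamespaceFilter(String namespace) {
            this.namespace = namespace;
        }

        // built once and cached; checking a search result then costs
        // a single BitSet.get()
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(new Term("namespace", namespace));
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
            return bits;
        }
    }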

- Update of the index is done in a separate thread that prepares a
  copy for rsync (makes a hard-link copy of the current index) and
  then runs rsync on it, so that only differences are transferred.
  After that the index is opened, typical queries are run on it (to
  warm up the copy and load caches), the main-namespace filter is
  rebuilt, and then in a synchronized block the old searcher is
  replaced with the new one. The old one is closed after 15 seconds
  (the 15s is to allow threads to finish with the old index)
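
  A sketch of that hot-swap; the 15-second delay is from the
  description above, while the class and method names are invented
  for illustration:

    import org.apache.lucene.search.IndexSearcher;

    public class SearcherPool {
        private IndexSearcher current;

        public synchronized IndexSearcher get() {
            return current;
        }

        // swap in an already warmed-up searcher, then close the old
        // one after 15 seconds so in-flight queries can finish
        public void swap(IndexSearcher warmed) {
            final IndexSearcher old;
            synchronized (this) {
                old = current;
                current = warmed;
            }
            if (old == null)
                return;
            Thread closer = new Thread(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(15000);
                        old.close();
                    } catch (Exception ignored) {
                    }
                }
            });
            closer.setDaemon(true);
            closer.start();
        }
    }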


== Indexing ==

- One indexer per index part. For split indexes there is one main
  indexer that maintains a logical view of the whole index

- Mainsplit and single indexes have very limited overhead. Split
  indexes have a larger overhead, due to the reporting system that
  keeps the operations atomic and makes sure they are correctly
  carried out

- Indexing is fastest if it's done in large batches. However, storing
  a large number of articles eats up heap: a 64MB heap is eaten up by
  a queue size of 30000, but my testing environment worked fine with
  a queue of 5000.
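
  A sketch of bounded batching along those lines; the queue limit of
  5000 comes from the numbers above, while the class shape and
  analyzer choice are placeholders:

    import java.util.concurrent.ArrayBlockingQueue;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        // bounding the queue bounds the heap used by pending articles
        private final ArrayBlockingQueue<Document> queue =
            new ArrayBlockingQueue<Document>(5000);

        public void enqueue(Document doc) throws InterruptedException {
            queue.put(doc);  // blocks while the current batch is full
        }

        public void flushTo(String indexPath) throws Exception {
            IndexWriter writer =
                new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            Document doc;
            while ((doc = queue.poll()) != null) {
                writer.addDocument(doc);
            }
            writer.close();
        }
    }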

- After testing Java XML-RPC implementations, it seemed to me that
  they all introduce large overhead, so I implemented a hackish HTTP
  frontend for the indexer. Article text is transferred raw in a POST
  request, and the function to be called is encoded in the URL.
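
  A sketch of the client side of that frontend; the host, port, and
  URL layout here are invented for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class IndexerClient {
        public static void addArticle(String db, String title, String text)
                throws Exception {
            // function name and target are encoded in the URL path
            URL url = new URL("http://indexer:8321/addArticle/" + db + "/" + title);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(text.getBytes("UTF-8"));  // raw article text as the body
            out.close();
            conn.getResponseCode();  // send the request; status is the reply
            conn.disconnect();
        }
    }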


== Wiki parser ==

- FastWikiTokenizerEngine.java is a handmade parser for basic wiki
  syntax. It is meant to replace the slow stripWiki() function.

- Accents are stripped by default, thus no accents are ever indexed.
  AFAIK this is OK for all languages (and seems to be the way major
  search engines do it). Indexing accented words as aliases would
  probably be unnecessary overhead.
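
  For comparison, one common way to strip accents in plain Java (the
  handmade tokenizer does its own stripping; this Normalizer-based
  version, which needs Java 6+, just shows the idea):

    import java.text.Normalizer;

    public class StripAccents {
        public static String strip(String word) {
            // decompose to NFD, then drop the combining marks
            return Normalizer.normalize(word, Normalizer.Form.NFD)
                             .replaceAll("\\p{M}", "");
        }
    }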

- numbers are also tokenized

- extracts categories and interwikis (interwikis are currently
  unused)

- skips table properties, names of templates (e.g. so that a search
  for "stub" gives meaningful results), image properties, external
  link URLs, and XML markup

- localization is read at startup from the Messages files (to be
  up-to-date), so that the parser recognizes localized variants of
  the Category and Image keywords. The interwiki map is read out of
  the static file lib/interwiki.map (which could be updated somehow?)


== Analyzers/Languages ==

- The generic language analyzer consists of a filter (e.g. for
  Serbian (conversion to latin, etc.) and Thai (word tokenization))
  and a stemmer (for English, German, French, Esperanto, Dutch,
  Russian). Stemmed words are indexed alongside the original words
  (i.e. as aliases - positional increment 0)
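
  A hedged sketch of emitting the stemmed form as an alias at the
  same position, against the Lucene 2.0 TokenStream API; the toy
  stem() method stands in for the real per-language stemmers:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class AliasStemFilter extends TokenFilter {
        private Token pending;  // stemmed alias waiting to be emitted

        public AliasStemFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token token = input.next();
            if (token == null)
                return null;
            String stemmed = stem(token.termText());
            if (!stemmed.equals(token.termText())) {
                pending = new Token(stemmed, token.startOffset(), token.endOffset());
                pending.setPositionIncrement(0);  // same position: an alias
            }
            return token;
        }

        private String stem(String word) {
            // stand-in for the real stemmer
            return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
        }
    }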

- Search queries use the same language analyzer, but stemmed words
  are boosted with 0.5, so that exact matches are favored
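
  In Lucene 2.0 query terms, that weighting looks roughly like this;
  the field name "contents" and the example words are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class StemBoost {
        public static Query build() {
            BooleanQuery q = new BooleanQuery();
            TermQuery exact = new TermQuery(new Term("contents", "running"));
            TermQuery stemmed = new TermQuery(new Term("contents", "run"));
            stemmed.setBoost(0.5f);  // stemmed alias counts half as much
            q.add(exact, BooleanClause.Occur.SHOULD);
            q.add(stemmed, BooleanClause.Occur.SHOULD);
            return q;
        }
    }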

- Titles are not stemmed, to favor exact matches even more and to
  reduce overhead, as words from the title usually appear in the
  article as well

TODO:

- Maybe look at more languages, especially Chinese



== Query Parser ==

- Faster with complex queries than the Lucene QueryParser

- recognizes a subset of the QueryParser syntax: the AND and OR
  keywords, + and -, and phrases enclosed in ""

- introduces namespace prefixes ''namespace:query'' to limit a search
  to a certain namespace, e.g. ''help:inserting pictures''. Note that
  the ''help'' prefix is valid until the end of the query or some
  other prefix definition: e.g. ''help:editing project:wikipedia''
  will find all pages in the help namespace containing ''editing''
  and all pages in the project namespace containing ''wikipedia''.

- searching categories. The syntax is: ''query category:"exact
  category name"''. It is important to note that category names are
  themselves not tokenized. Using logical operators, the
  intersection, union, and difference of categories can be searched.
  Since the exact category is needed (only case is unimportant), it
  is maybe best to incorporate this somewhere on the category page,
  and have the category name put into the query by MediaWiki instead
  of manually by the user.
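
  Exact, case-insensitive matching fits an untokenized keyword field;
  a sketch of how such a field might be indexed, where the field name
  "category" and the lowercasing convention are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CategoryField {
        public static void addCategory(Document doc, String name) {
            // not stored, indexed as a single unanalyzed token
            doc.add(new Field("category", name.toLowerCase(),
                              Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
    }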

Note:

- namespace prefixes render the old way of picking namespaces to
  search unusable, so it should be removed from user settings, and
  users should pick only one namespace to be their default for search
  (or all namespaces). In theory, by rewriting the query it would be
  possible to stay backward compatible with the current way, but it
  would slow down searching for those users, and I wonder if it is
  important to be able to have any combination of namespaces searched
  by default, and how many users use this

- see WikiQueryParser.java for the adopted names of namespaces
  (all: is a special prefix for all namespaces)

Todo:

- Before a search query is passed to Lucene-Search, localized
  versions of namespace names should be replaced with the standard
  ones. This should be implemented in MediaWiki. E.g. ''srpski
  kategorija:jezici'' -> ''srpski category:jezici''


== Lucene patch ==

- Don't use Readers but plain Strings when possible. Java streams are
  very slow; wherever we don't need the general Reader interface, I
  replaced it with plain Strings (instead of StringReaders).

- SearchableMul interface, which enables retrieval of many documents
  in a single call (to minimize network overhead)
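
  In spirit the interface amounts to a bulk version of
  Searchable.doc(int); the exact signature in the patch may differ:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Searchable;

    public interface SearchableMul extends Searchable {
        // one remote round trip for all requested documents,
        // instead of one doc(i) call per hit
        Document[] docs(int[] ids) throws IOException;
    }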


== Incremental update ==

- gets load off the database, gives a more up-to-date index, etc.

- however, this needs to be tested in a production environment. There
  are two scenarios:

  * don't optimize indexes, and use a merge factor of 2 (see the
    sketch at the end of this section). This means that additional
    segments will be made with new documents, and that old documents
    will be deleted when the size of the index doubles (roughly)

    This is good because incremental update is fast, and download of
    the index to searchers is also fast, since only new segments need
    to be transferred and old segment pages are already in memory
    (since we use cp -lr to prepare a copy for rsync).

    However, search will be slower (according to my tests, up to 50%
    slower in the worst case), and in the worst case twice the memory
    is needed

  * optimize indexes. To optimize an index, the indexer needs twice
    the memory of the whole index (because a brand new segment is
    made from all the old segments merged)

    The rsync diff algorithm might be helpful in transferring the
    optimized index, but probably not too much.

    On the searcher, twice the memory is needed when the index is
    updated (actually probably less, since in the beginning not all
    of the index is loaded into memory, especially the part of the
    index with positional increments)

- a third scenario is to abandon incremental updates altogether, and
  just rebuild the whole database from scratch and snapshot it then
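
  A sketch of the writer settings for the first scenario above; the
  merge factor of 2 is from the text, while the index path and
  analyzer are placeholders:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class IncrementalWriter {
        public static void append(String indexPath) throws Exception {
            // open an existing index for appending (create == false)
            IndexWriter writer =
                new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            // with merge factor 2, old segments are rewritten only
            // when the index size roughly doubles
            writer.setMergeFactor(2);
            // ... addDocument() calls for new articles go here ...
            writer.close();  // note: no writer.optimize()
        }
    }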