r24629 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r24629
Date: 18:13, 6 August 2007
Author: rainman
Status: old
Tags:
Comment:
* add release notes
* tidy README a bit
* OVERVIEW is obsolete
Modified paths:
  • /trunk/lucene-search-2/OVERVIEW.txt (deleted)
  • /trunk/lucene-search-2/README.txt (modified)
  • /trunk/lucene-search-2/RELEASE-NOTES.txt (added)

Diff

Index: trunk/lucene-search-2/OVERVIEW.txt
@@ -1,215 +0,0 @@
-
-Lucene Search 2.0 Overview (by Robert Stojnic)
-
-
-== Distributed architecture ==
-
-- one indexer / many searchers
-
-- indexers make clean snapshots of indexes and searchers use rsync to
- obtain local copy of the updated index
-
-- index can be either whole (single), mainsplit (two parts, one with
- all articles in main namespace, other with rest of article), and
- split (some fraction of documents in each subindex).
-
-- both indexing and searching can be distributed on many hosts
-
-- there is a global configuration file (mwsearch-global.conf) that
- lays out the global architecture, defines indexes, indexers,
- searchers. It needs to be available to all nodes.
-
-- local configuration file deals with host-specific stuff
-
-- Java RMI is used for communication, it is typically at least 5-6
- times faster than xmlrpc, and also manages persistent connections.
- Further, Lucene has built-in support for RMI.
-
-- Searchers watch the status of other searchers, and try to ping dead
- searchers (and at same time take them out of rotation while they are
- down)
-
-Notes:
-
-- mainsplit is a special case of more general idea of having database
- split around namespaces. It is a convenient special case because
- database then has minimal size for most searches (taking that most
- searches come from anonymous users wanting to search main
- namespace). Currently, this is the only implemented way to split the
- database around namespaces, but it could be easily extended.
-
-
-== Searching ==
-
-- Distributed search (on split index) takes 3 calls per nonlocal
- index: getting IDFs (inverse document frequencies, needed to
- construct global scorer), actual search, and retrieving some number
- of documents (if needed)
-
-- Searchers are organized in search groups. When search is to be
- executed, each searcher tries to use local copies of indexes, and
- for those it doesn't have a local copy, randomly chooses remote
- searcher within its group that has the needed index part
-
-- Search always returns correct number of hits, since queries are
- either filtered (filters are cached and introduce very little
- overhead - one BitSet.get() per result), or rewritten so that only
- those namespaces specified in the query are searched
-
-- Update of index is done in separate thread that prepares a copy for
- rsync (makes a hard-link of current index), and then runs rsync on
- it so that only differences are transfered. After that index is
- opened, typical queries run on it (to warm up the copy, load
- caches), main namespace filter is rebuilt, and then in synchronized
- block the old searcher is replaced with a new one. The old one is
- close after 15 seconds (these 15s is to allow threads to finish with
- the old index)
-
-
-
-== Indexing ==
-
-- One indexer per each index part. For split indexes there is one main
- indexer that maintains a logical view on the whole index
-
-- Mainsplit and single indexes have very limited overhead, Split
- indexes have a larger overhead, due to the reporting system that
- keeps the operations atomic and makes sure they are correctly
- carried out
-
-- Indexing is fastest if it's done in large batches. However, storing
- large number of articles eats up heap. 64MB heap is eaten up by
- queue size of 30000, but in my testing environment worked fine with
- queue of 5000.
-
-- After testing java XMLRPC implementations it seemed to me that they
- all introduce large overhead, so I implemented hackish HTTP frontend
- for indexer. Article text is transfered raw in POST request, and
- function to be called is encoded in the URL.
-
-
-
-== Wiki parser ==
-
-- FastWikiTokenizerEngine.java is a handmade parser for basic wiki
- syntax. This is to replace the slow stripWiki() function.
-
-- Accents are stripped by default, thus no accents are ever
- indexed. AFAIK, this is OK for all languages (and seems to be the
- way major search engines do it). Indexing accented words as aliases
- would be probably unnecessary overhead.
-
-- numbers are also tokenized
-
-- extracts categories and interwiki (interwikis are currently unused)
-
-- skips table properties, names of templates (e.g. so that search for
- "stub" gives meaningful results), image properties, external link
- urls, xml markup
-
-- localization is read at startup from Messages files (to be
- up-to-date), so that parser recognized localized variants of
- Category and Image keywords. Interwiki map is read out of static
- file lib/interwiki.map (which could be update somehow?)
-
-
-
-== Analyzers/Languages ==
-
-- Generic Language analyzer consists of filter (e.g. for Serbian
- (convert to latin, etc) and Thai (tokenize words)) and stemmer (for
- English, German, French, Esperanto, Dutch, Russian). Stemmed words
- are indexed alongside with the original words (i.e. as aliases -
- positional increment 0)
-
-- Search queries uses same language analyzer, but stemmed words are
- boosted with 0.5, so that exact match is favored
-
-- Titles are not stemmed, to even more favor exact matches and reduce
- overhead, as words from title usually appear in the article
-
-TODO:
-- Maybe look at more languages, especially Chinese
-
-
-
-== Query Parser ==
-
-- Faster with complex queries than Lucene QueryParser
-
-- recognizes subset of QueryParser syntax: AND, OR keywords and
- +,-. Phrases enclosed in "". Supports wilcards with * in end.
-
-- introduces namespace prefixes ''namespace:query'' to limit search to
- ceratin namespace: e.g. ''help:inserting pictures''. Note that
- ''help'' prefix is valid until the end of query of some other prefix
- definition: e.g. ''help:editing project:wikipedia'' will find all
- pages in help namespace containing ''editing'' and all pages in
- project namespace containing ''wikipedia''.
-
-- prefixes are defined in global configuration, but for generality
- (and LuceneSearch) there is also a generic way to make prefixes.
- E.g. ''[0,1,2]:query'' will search namespaces 0,1,2. This is
- convenient because it allows extended customization on the user
- side (i.ei mw extension rewrites custom labels into this syntax).
-
-- searching categories. Syntax is: ''query incategory:"exact category
- name"''. It is important to note that category names are themselves
- not tokenized. Using logical operators, intersection, union and
- difference of categories can be searched. Since exact category is
- needed (only case is not important), it is maybe best to incorporate
- this somewhere on category page, and have category name put into
- query by MediaWiki instead manually by user.
-
-Note:
-
-- namespace prefixes render the old way of picking namespaces to
- search unusable, thus it should be removed from user settings. And
- users should pick only one namespace to be their default for search
- (or all namespaces). In theory, by rewriting the query it could be
- possible to be back compatible with the current way, but it would
- slow down searching for those users, and I wonder if it is important
- to be able to to have any combination of namespace searched by
- default and how many users use this
-
-- see WikiQueryParser.java for adopted names of namespaces (all: is
- special prefix for all namespace)
-
-- Before search query is passed to Lucene-Search, localized version of
- namespace names should be replaced with standard ones. This should
- be implemented in MediaWiki. E.g. ''srpski kategorija:jezici'' ->
- ''srpski incategory:jezici''
-
-
-
-== Lucene patch ==
-
-- Don't use Readers but plain strings when possible. Java streams are
- very slow, whenever we don't need the general Reader interface, I
- replaced it with just Strings (instead of StringReaders).
-
-- SearchableMul interface, enables retrieval of many documents in
- single call (to minimize network overhead)
-
-TODO: make patch file for lucene 2.0
-
-
-
-== Incremental update ==
-
-- get load off database, more up-to-date index, etc..
-
-- Incremental updating is available via OAI-PMH interface. One indexer
- can have many incremental updaters delivering latest updates.
- Incremental updater maintains a list of status files, in which latest
- timestamp of successful updates is stored.
-
-- Snapshot is made of index at regular intervals, and it's picked up
- by searchers (via is a RMI query system) and rsynced. Indexes should
- be optimized, and need to be properly warmed up to enable smooth
- transition when new index replaces the old. Test indicate than this
- scheme will work smoothly until 1/2 of both indexes can be kept in
- memory. After that the operation becomes I/O bound, and degrades
- search performance during warmup.
-
-
\ No newline at end of file
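
Reviewer note on the deleted "Query Parser" section: the rule that a namespace prefix stays in effect until the end of the query or until the next prefix can be sketched as below. This is only an illustration under stated assumptions, not the code in WikiQueryParser.java. The class name, the method name, and the "main" default for terms that appear before any prefix are hypothetical, and quoted phrases such as incategory:"exact category name" are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups query terms by the namespace prefix that is
// in effect when each term appears. Not taken from lucene-search-2.
public class NamespacePrefixSketch {

    public static Map<String, List<String>> groupByPrefix(String query, String defaultPrefix) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String current = defaultPrefix; // assumed default for unprefixed terms
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty query
            }
            int colon = word.indexOf(':');
            String term = word;
            if (colon > 0) {
                // A new prefix replaces the previous one from here on.
                current = word.substring(0, colon).toLowerCase();
                term = word.substring(colon + 1);
            }
            if (!term.isEmpty()) {
                groups.computeIfAbsent(current, k -> new ArrayList<>()).add(term);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints {help=[editing], project=[wikipedia]}
        System.out.println(groupByPrefix("help:editing project:wikipedia", "main"));
    }
}
```

With this grouping, ''help:inserting pictures'' puts both terms under help, and the generic form ''[0,1,2]:query'' simply yields the literal key [0,1,2], which a real parser would then expand into namespace numbers.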
Index: trunk/lucene-search-2/RELEASE-NOTES.txt
@@ -0,0 +1,22 @@
+Lucene Search 2.0.2
+====================
+
+* Fix bug 10822. Convert underscores to spaces in category names.
+
+Lucene Search 2.0.1
+====================
+
+* Fix CJK tokenization - tokenize C1C2C2 -> C1C2 C2C3 and wrap
+ searches into phrase queries
+* Fix đ in Thai, didn't properly resolve into d
+
+Lucene Search 2.0.0
+====================
+
+* Initial release, almost complete rewrite of the C# version
+ New features very briefly:
+ - distribute search/indexing
+ - accentless search, stemmers for 12 languages
+ - link analysis for ranking
+ - namespace-prefixed queries
+
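
Reviewer note on the 2.0.1 CJK entry: reading the input in that note as C1C2C3, the fix appears to describe overlapping bigrams (C1C2, C2C3) that are then searched as one phrase. A minimal sketch of that idea follows. It is not the tokenizer from this commit: the class and method names are hypothetical, Latin letters stand in for CJK characters, and emitting a lone character as a single token is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of overlapping-bigram tokenization for a run of
// CJK characters, plus wrapping the result as a phrase query string.
public class CjkBigramSketch {

    public static List<String> bigrams(String run) {
        int[] cps = run.codePoints().toArray(); // code points, not UTF-16 units
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(run); // assumption: a single character becomes one token
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // characters i and i+1
        }
        return out;
    }

    public static String asPhraseQuery(String run) {
        // Quoting keeps the bigrams adjacent and in order at search time.
        return "\"" + String.join(" ", bigrams(run)) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(bigrams("abc"));       // prints [ab, bc]
        System.out.println(asPhraseQuery("abc")); // prints "ab bc"
    }
}
```

Searching the bigrams as a phrase rather than as independent terms avoids matching documents that merely contain the same pairs in unrelated positions.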
Index: trunk/lucene-search-2/README.txt
@@ -1,4 +1,4 @@
- Lucene Search 2.0: extension for MediaWiki
+ Lucene Search 2: extension for MediaWiki
 ==========================================

 Requirements:
@@ -12,20 +12,11 @@
 - Apache XMLRPC 3.0 (for XMLRPC interface)
 - Apache Ant 1.6 (for building from source, etc)

-Setup:
+Installing:

- - Edit mwsearch-global.conf and make it available at some URL
- - At each host:
- * properly setup hostname (otherwise JavaVM gets confused)
- * make and set permissions of local directory for indexes
- * edit mwsearch.conf:
- + MWConfig.global to point to URL of mwsearch-global.conf
- + MWConfig.lib to point to local library path (ie with unicode-data etc)
- + Localization.url to point to URL of latest message files from MediaWiki
- + Indexes.path - base path where you want the deamon to store the indexes,
- + Logging.logconfig - local path to log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch package has a sample log4j file you can use)
- * setup rsync daemon (see rsyncd.conf-example)
- * setup log4j logging subsystem (see mwsearch.log4j-example)
+ - Up-to-date instructions and troubleshooting can be found at:
+
+ http://www.mediawiki.org/wiki/Extension:LuceneSearch

 Running:

@@ -50,9 +41,8 @@
 table parameters, image parameters (except caption) are not indexed.

 - query parser, faster search query parsing, enables prefixes for namespaces,
- e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Can
- do category searches e.g. 'smoked category:cheeses'. Rewrites all of these
- so that stemmed present are present but add less to document score.
+ e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Rewrites
+ all of these so that stemmed present are present but add less to document score.

 - (hopefully) robust architecture, with threads pinging hosts that are down,
 and search daemons trying alternatives if host holding part of the
