Index: trunk/lucene-search-2/OVERVIEW.txt |
— | — | @@ -1,215 +0,0 @@ |
2 | | - |
3 | | -Lucene Search 2.0 Overview (by Robert Stojnic) |
4 | | - |
5 | | - |
6 | | -== Distributed architecture == |
7 | | - |
8 | | -- one indexer / many searchers |
9 | | - |
10 | | -- indexers make clean snapshots of indexes and searchers use rsync to |
11 | | - obtain local copy of the updated index |
12 | | - |
13 | | -- index can be either whole (single), mainsplit (two parts, one with |
14 | | - all articles in main namespace, other with rest of article), and |
15 | | - split (some fraction of documents in each subindex). |
16 | | - |
17 | | -- both indexing and searching can be distributed on many hosts |
18 | | - |
19 | | -- there is a global configuration file (mwsearch-global.conf) that |
20 | | - lays out the global architecture, defines indexes, indexers, |
21 | | - searchers. It needs to be available to all nodes. |
22 | | - |
23 | | -- local configuration file deals with host-specific stuff |
24 | | - |
25 | | -- Java RMI is used for communication, it is typically at least 5-6 |
26 | | - times faster than xmlrpc, and also manages persistent connections. |
27 | | - Further, Lucene has built-in support for RMI. |
28 | | - |
29 | | -- Searchers watch the status of other searchers, and try to ping dead |
30 | | - searchers (and at same time take them out of rotation while they are |
31 | | - down) |
32 | | - |
33 | | -Notes: |
34 | | - |
35 | | -- mainsplit is a special case of more general idea of having database |
36 | | - split around namespaces. It is a convenient special case because |
37 | | - database then has minimal size for most searches (taking that most |
38 | | - searches come from anonymous users wanting to search main |
39 | | - namespace). Currently, this is the only implemented way to split the |
40 | | - database around namespaces, but it could be easily extended. |
41 | | - |
42 | | - |
43 | | -== Searching == |
44 | | - |
45 | | -- Distributed search (on split index) takes 3 calls per nonlocal |
46 | | - index: getting IDFs (inverse document frequencies, needed to |
47 | | - construct global scorer), actual search, and retrieving some number |
48 | | - of documents (if needed) |
49 | | - |
50 | | -- Searchers are organized in search groups. When search is to be |
51 | | - executed, each searcher tries to use local copies of indexes, and |
52 | | - for those it doesn't have a local copy, randomly chooses remote |
53 | | - searcher within its group that has the needed index part |
54 | | - |
55 | | -- Search always returns correct number of hits, since queries are |
56 | | - either filtered (filters are cached and introduce very little |
57 | | - overhead - one BitSet.get() per result), or rewritten so that only |
58 | | - those namespaces specified in the query are searched |
59 | | - |
60 | | -- Update of index is done in separate thread that prepares a copy for |
61 | | - rsync (makes a hard-link of current index), and then runs rsync on |
62 | | - it so that only differences are transferred. After that index is |
63 | | - opened, typical queries run on it (to warm up the copy, load |
64 | | - caches), main namespace filter is rebuilt, and then in synchronized |
65 | | - block the old searcher is replaced with a new one. The old one is |
66 | | - closed after 15 seconds (the 15 s delay lets threads finish with |
67 | | - the old index) |
68 | | - |
69 | | - |
70 | | - |
71 | | -== Indexing == |
72 | | - |
73 | | -- One indexer per each index part. For split indexes there is one main |
74 | | - indexer that maintains a logical view on the whole index |
75 | | - |
76 | | -- Mainsplit and single indexes have very limited overhead, Split |
77 | | - indexes have a larger overhead, due to the reporting system that |
78 | | - keeps the operations atomic and makes sure they are correctly |
79 | | - carried out |
80 | | - |
81 | | -- Indexing is fastest if it's done in large batches. However, storing |
82 | | - large number of articles eats up heap. 64MB heap is eaten up by |
83 | | - queue size of 30000, but in my testing environment worked fine with |
84 | | - queue of 5000. |
85 | | - |
86 | | -- After testing java XMLRPC implementations it seemed to me that they |
87 | | - all introduce large overhead, so I implemented hackish HTTP frontend |
88 | | - for indexer. Article text is transferred raw in POST request, and |
89 | | - function to be called is encoded in the URL. |
90 | | - |
91 | | - |
92 | | - |
93 | | -== Wiki parser == |
94 | | - |
95 | | -- FastWikiTokenizerEngine.java is a handmade parser for basic wiki |
96 | | - syntax. This is to replace the slow stripWiki() function. |
97 | | - |
98 | | -- Accents are stripped by default, thus no accents are ever |
99 | | - indexed. AFAIK, this is OK for all languages (and seems to be the |
100 | | - way major search engines do it). Indexing accented words as aliases |
101 | | - would be probably unnecessary overhead. |
102 | | - |
103 | | -- numbers are also tokenized |
104 | | - |
105 | | -- extracts categories and interwiki (interwikis are currently unused) |
106 | | - |
107 | | -- skips table properties, names of templates (e.g. so that search for |
108 | | - "stub" gives meaningful results), image properties, external link |
109 | | - urls, xml markup |
110 | | - |
111 | | -- localization is read at startup from Messages files (to be |
112 | | - up-to-date), so that parser recognizes localized variants of |
113 | | - Category and Image keywords. Interwiki map is read out of static |
114 | | - file lib/interwiki.map (which could be updated somehow?) |
115 | | - |
116 | | - |
117 | | - |
118 | | -== Analyzers/Languages == |
119 | | - |
120 | | -- Generic Language analyzer consists of filter (e.g. for Serbian |
121 | | - (convert to latin, etc) and Thai (tokenize words)) and stemmer (for |
122 | | - English, German, French, Esperanto, Dutch, Russian). Stemmed words |
123 | | - are indexed alongside with the original words (i.e. as aliases - |
124 | | - positional increment 0) |
125 | | - |
126 | | -- Search queries use the same language analyzer, but stemmed words are |
127 | | - boosted with 0.5, so that exact match is favored |
128 | | - |
129 | | -- Titles are not stemmed, to even more favor exact matches and reduce |
130 | | - overhead, as words from title usually appear in the article |
131 | | - |
132 | | -TODO: |
133 | | -- Maybe look at more languages, especially Chinese |
134 | | - |
135 | | - |
136 | | - |
137 | | -== Query Parser == |
138 | | - |
139 | | -- Faster with complex queries than Lucene QueryParser |
140 | | - |
141 | | -- recognizes subset of QueryParser syntax: AND, OR keywords and |
142 | | - +,-. Phrases enclosed in "". Supports wildcards with * at the end. |
143 | | - |
144 | | -- introduces namespace prefixes ''namespace:query'' to limit search to |
145 | | - certain namespace: e.g. ''help:inserting pictures''. Note that |
146 | | - ''help'' prefix is valid until the end of query or some other prefix |
147 | | - definition: e.g. ''help:editing project:wikipedia'' will find all |
148 | | - pages in help namespace containing ''editing'' and all pages in |
149 | | - project namespace containing ''wikipedia''. |
150 | | - |
151 | | -- prefixes are defined in global configuration, but for generality |
152 | | - (and LuceneSearch) there is also a generic way to make prefixes. |
153 | | - E.g. ''[0,1,2]:query'' will search namespaces 0,1,2. This is |
154 | | - convenient because it allows extended customization on the user |
155 | | - side (i.e. the MW extension rewrites custom labels into this syntax). |
156 | | - |
157 | | -- searching categories. Syntax is: ''query incategory:"exact category |
158 | | - name"''. It is important to note that category names are themselves |
159 | | - not tokenized. Using logical operators, intersection, union and |
160 | | - difference of categories can be searched. Since exact category is |
161 | | - needed (only case is not important), it is maybe best to incorporate |
162 | | - this somewhere on category page, and have category name put into |
163 | | - query by MediaWiki instead of manually by the user. |
164 | | - |
165 | | -Note: |
166 | | - |
167 | | -- namespace prefixes render the old way of picking namespaces to |
168 | | - search unusable, thus it should be removed from user settings. And |
169 | | - users should pick only one namespace to be their default for search |
170 | | - (or all namespaces). In theory, by rewriting the query it could be |
171 | | - possible to be back compatible with the current way, but it would |
172 | | - slow down searching for those users, and I wonder if it is important |
173 | | - to be able to have any combination of namespace searched by |
174 | | - default and how many users use this |
175 | | - |
176 | | -- see WikiQueryParser.java for adopted names of namespaces (all: is |
177 | | - special prefix for all namespace) |
178 | | - |
179 | | -- Before search query is passed to Lucene-Search, localized version of |
180 | | - namespace names should be replaced with standard ones. This should |
181 | | - be implemented in MediaWiki. E.g. ''srpski kategorija:jezici'' -> |
182 | | - ''srpski incategory:jezici'' |
183 | | - |
184 | | - |
185 | | - |
186 | | -== Lucene patch == |
187 | | - |
188 | | -- Don't use Readers but plain strings when possible. Java streams are |
189 | | - very slow, whenever we don't need the general Reader interface, I |
190 | | - replaced it with just Strings (instead of StringReaders). |
191 | | - |
192 | | -- SearchableMul interface, enables retrieval of many documents in |
193 | | - single call (to minimize network overhead) |
194 | | - |
195 | | -TODO: make patch file for lucene 2.0 |
196 | | - |
197 | | - |
198 | | - |
199 | | -== Incremental update == |
200 | | - |
201 | | -- get load off database, more up-to-date index, etc.. |
202 | | - |
203 | | -- Incremental updating is available via OAI-PMH interface. One indexer |
204 | | - can have many incremental updaters delivering latest updates. |
205 | | - Incremental updater maintains a list of status files, in which latest |
206 | | - timestamp of successful updates is stored. |
207 | | - |
208 | | -- Snapshot is made of index at regular intervals, and it's picked up |
209 | | - by searchers (via an RMI query system) and rsynced. Indexes should |
210 | | - be optimized, and need to be properly warmed up to enable smooth |
211 | | - transition when new index replaces the old. Tests indicate that this |
212 | | - scheme will work smoothly until 1/2 of both indexes can be kept in |
213 | | - memory. After that the operation becomes I/O bound, and degrades |
214 | | - search performance during warmup. |
215 | | - |
216 | | - |
\ No newline at end of file |
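Note on the namespace-prefix rule described in the "Query Parser" section of the removed OVERVIEW.txt: a prefix stays in force until the end of the query or until the next prefix. The following is a minimal Python sketch of that scoping rule only, not the WikiQueryParser.java implementation. The function name split_by_prefix, the KNOWN_PREFIXES table and the "main" default label are assumptions made for illustration; phrases, AND/OR, +/- and incategory: are deliberately not handled.

```python
import re

# Hypothetical subset of prefixes; per the overview the real ones are
# defined in the global configuration.
KNOWN_PREFIXES = {"all", "help", "project"}

# Generic form from the overview: [0,1,2]:query
GENERIC = re.compile(r"^\[(\d+(?:,\d+)*)\]:(.*)$")


def split_by_prefix(query, default="main"):
    """Return a list of (namespace_label, [terms]) groups.

    A prefix applies to every following word until the end of the
    query or the next prefix, whichever comes first.
    """
    groups = [(default, [])]
    for word in query.split():
        label, rest = None, word
        m = GENERIC.match(word)
        if m:
            # Keep the raw namespace list as the label, e.g. "[0,1,2]".
            label, rest = "[" + m.group(1) + "]", m.group(2)
        elif ":" in word:
            head, tail = word.split(":", 1)
            if head.lower() in KNOWN_PREFIXES:
                label, rest = head.lower(), tail
        if label is not None:
            # A new prefix closes the previous group and opens a new one.
            groups.append((label, []))
        if rest:
            groups[-1][1].append(rest)
    # Drop groups that ended up with no terms (e.g. the unused default).
    return [(ns, terms) for ns, terms in groups if terms]
```

For ''help:editing project:wikipedia'' this yields one group per namespace, matching the example in the overview; ''[0,1,2]:query'' is kept as a single group labelled with the raw namespace list, and a query without any prefix falls into the default group.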
Index: trunk/lucene-search-2/RELEASE-NOTES.txt |
— | — | @@ -0,0 +1,22 @@ |
| 2 | +Lucene Search 2.0.2 |
| 3 | +==================== |
| 4 | + |
| 5 | +* Fix bug 10822. Convert underscores to spaces in category names. |
| 6 | + |
| 7 | +Lucene Search 2.0.1 |
| 8 | +==================== |
| 9 | + |
| 10 | +* Fix CJK tokenization - tokenize C1C2C3 -> C1C2 C2C3 and wrap |
| 11 | + searches into phrase queries |
| 12 | +* Fix đ in Thai, didn't properly resolve into d |
| 13 | + |
| 14 | +Lucene Search 2.0.0 |
| 15 | +==================== |
| 16 | + |
| 17 | +* Initial release, almost complete rewrite of the C# version |
| 18 | + New features very briefly: |
| 19 | + - distributed search/indexing |
| 20 | + - accentless search, stemmers for 12 languages |
| 21 | + - link analysis for ranking |
| 22 | + - namespace-prefixed queries |
| 23 | + |
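Note on the 2.0.1 entry above: it describes splitting CJK text into overlapping two-character tokens (C1C2C3 -> C1C2 C2C3) and wrapping searches into phrase queries. A minimal Python sketch of that idea follows, assuming the input is already a contiguous run of CJK characters. The function names are hypothetical and this is not the project's Java tokenizer.

```python
def cjk_bigrams(run):
    """Split a run of CJK characters into overlapping two-character tokens."""
    if len(run) < 2:
        # A single character (or empty input) cannot form a bigram.
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_phrase_query(run):
    """Quote the bigrams so they have to match as one adjacent sequence."""
    return '"' + " ".join(cjk_bigrams(run)) + '"'
```

Quoting the bigrams as a phrase keeps a three-character search from matching documents where the two bigrams occur far apart, which is presumably why the release note pairs the two changes.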
Index: trunk/lucene-search-2/README.txt |
— | — | @@ -1,4 +1,4 @@ |
2 | | - Lucene Search 2.0: extension for MediaWiki |
| 2 | + Lucene Search 2: extension for MediaWiki |
3 | 3 | ========================================== |
4 | 4 | |
5 | 5 | Requirements: |
— | — | @@ -12,20 +12,11 @@ |
13 | 13 | - Apache XMLRPC 3.0 (for XMLRPC interface) |
14 | 14 | - Apache Ant 1.6 (for building from source, etc) |
15 | 15 | |
16 | | -Setup: |
| 16 | +Installing: |
17 | 17 | |
18 | | - - Edit mwsearch-global.conf and make it available at some URL |
19 | | - - At each host: |
20 | | - * properly setup hostname (otherwise JavaVM gets confused) |
21 | | - * make and set permissions of local directory for indexes |
22 | | - * edit mwsearch.conf: |
23 | | - + MWConfig.global to point to URL of mwsearch-global.conf |
24 | | - + MWConfig.lib to point to local library path (ie with unicode-data etc) |
25 | | - + Localization.url to point to URL of latest message files from MediaWiki |
26 | | - + Indexes.path - base path where you want the daemon to store the indexes, |
27 | | - + Logging.logconfig - local path to log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch package has a sample log4j file you can use) |
28 | | - * setup rsync daemon (see rsyncd.conf-example) |
29 | | - * setup log4j logging subsystem (see mwsearch.log4j-example) |
| 18 | + - Up-to-date instructions and troubleshooting can be found at: |
| 19 | + |
| 20 | + http://www.mediawiki.org/wiki/Extension:LuceneSearch |
30 | 21 | |
31 | 22 | Running: |
32 | 23 | |
— | — | @@ -50,9 +41,8 @@ |
51 | 42 | table parameters, image parameters (except caption) are not indexed. |
52 | 43 | |
53 | 44 | - query parser, faster search query parsing, enables prefixes for namespaces, |
54 | | - e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Can |
55 | | - do category searches e.g. 'smoked category:cheeses'. Rewrites all of these |
56 | | - so that stemmed words are present but add less to document score. |
| 45 | + e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Rewrites |
| 46 | + all of these so that stemmed words are present but add less to document score. |
57 | 47 | |
58 | 48 | - (hopefully) robust architecture, with threads pinging hosts that are down, |
59 | 49 | and search daemons trying alternatives if host holding part of the |