r24629 MediaWiki - Code Review archive

Repository:	MediaWiki
Revision:	< r24628 | r24629 | r24630 >
Date:	18:13, 6 August 2007
Author:	rainman
Status:	old
Tags:
Comment:	* add release notes * tidy README a bit * OVERVIEW is obsolete
Modified paths:	/trunk/lucene-search-2/OVERVIEW.txt (deleted); /trunk/lucene-search-2/README.txt (modified); /trunk/lucene-search-2/RELEASE-NOTES.txt (added)

Diff

Index: trunk/lucene-search-2/OVERVIEW.txt
—	—	@@ -1,215 +0,0 @@
2		-
3		~~-Lucene Search 2.0 Overview (by Robert Stojnic)~~
4		-
5		-
6		~~-== Distributed architecture ==~~
7		-
8		~~-- one indexer / many searchers~~
9		-
10		~~-- indexers make clean snapshots of indexes and searchers use rsync to~~
11		~~- obtain local copy of the updated index~~
12		-
13		~~-- index can be either whole (single), mainsplit (two parts, one with~~
14		~~- all articles in main namespace, other with rest of article), and~~
15		~~- split (some fraction of documents in each subindex).~~
16		-
17		~~-- both indexing and searching can be distributed on many hosts~~
18		-
19		~~-- there is a global configuration file (mwsearch-global.conf) that~~
20		~~- lays out the global architecture, defines indexes, indexers,~~
21		~~- searchers. It needs to be available to all nodes.~~
22		-
23		~~-- local configuration file deals with host-specific stuff~~
24		-
25		~~-- Java RMI is used for communication, it is typically at least 5-6~~
26		~~- times faster than xmlrpc, and also manages persistent connections.~~
27		~~- Further, Lucene has built-in support for RMI.~~
28		-
29		~~-- Searchers watch the status of other searchers, and try to ping dead~~
30		~~- searchers (and at same time take them out of rotation while they are~~
31		~~- down)~~
32		-
33		~~-Notes:~~
34		-
35		~~-- mainsplit is a special case of more general idea of having database~~
36		~~- split around namespaces. It is a convenient special case because~~
37		~~- database then has minimal size for most searches (taking that most~~
38		~~- searches come from anonymous users wanting to search main~~
39		~~- namespace). Currently, this is the only implemented way to split the~~
40		~~- database around namespaces, but it could be easily extended.~~
41		-
42		-
43		~~-== Searching ==~~
44		-
45		~~-- Distributed search (on split index) takes 3 calls per nonlocal~~
46		~~- index: getting IDFs (inverse document frequencies, needed to~~
47		~~- construct global scorer), actual search, and retrieving some number~~
48		~~- of documents (if needed)~~
49		-
50		~~-- Searchers are organized in search groups. When search is to be~~
51		~~- executed, each searcher tries to use local copies of indexes, and~~
52		~~- for those it doesn't have a local copy, randomly chooses remote~~
53		~~- searcher within its group that has the needed index part~~
54		-
55		~~-- Search always returns correct number of hits, since queries are~~
56		~~- either filtered (filters are cached and introduce very little~~
57		~~- overhead - one BitSet.get() per result), or rewritten so that only~~
58		~~- those namespaces specified in the query are searched~~
59		-
60		~~-- Update of index is done in separate thread that prepares a copy for~~
61		~~- rsync (makes a hard-link of current index), and then runs rsync on~~
62		~~- it so that only differences are transfered. After that index is~~
63		~~- opened, typical queries run on it (to warm up the copy, load~~
64		~~- caches), main namespace filter is rebuilt, and then in synchronized~~
65		~~- block the old searcher is replaced with a new one. The old one is~~
66		~~- close after 15 seconds (these 15s is to allow threads to finish with~~
67		~~- the old index)~~
68		-
69		-
70		-
71		~~-== Indexing ==~~
72		-
73		~~-- One indexer per each index part. For split indexes there is one main~~
74		~~- indexer that maintains a logical view on the whole index~~
75		-
76		~~-- Mainsplit and single indexes have very limited overhead, Split~~
77		~~- indexes have a larger overhead, due to the reporting system that~~
78		~~- keeps the operations atomic and makes sure they are correctly~~
79		~~- carried out~~
80		-
81		~~-- Indexing is fastest if it's done in large batches. However, storing~~
82		~~- large number of articles eats up heap. 64MB heap is eaten up by~~
83		~~- queue size of 30000, but in my testing environment worked fine with~~
84		~~- queue of 5000.~~
85		-
86		~~-- After testing java XMLRPC implementations it seemed to me that they~~
87		~~- all introduce large overhead, so I implemented hackish HTTP frontend~~
88		~~- for indexer. Article text is transfered raw in POST request, and~~
89		~~- function to be called is encoded in the URL.~~
90		-
91		-
92		-
93		~~-== Wiki parser ==~~
94		-
95		~~-- FastWikiTokenizerEngine.java is a handmade parser for basic wiki~~
96		~~- syntax. This is to replace the slow stripWiki() function.~~
97		-
98		~~-- Accents are stripped by default, thus no accents are ever~~
99		~~- indexed. AFAIK, this is OK for all languages (and seems to be the~~
100		~~- way major search engines do it). Indexing accented words as aliases~~
101		~~- would be probably unnecessary overhead.~~
102		-
103		~~-- numbers are also tokenized~~
104		-
105		~~-- extracts categories and interwiki (interwikis are currently unused)~~
106		-
107		~~-- skips table properties, names of templates (e.g. so that search for~~
108		~~- "stub" gives meaningful results), image properties, external link~~
109		~~- urls, xml markup~~
110		-
111		~~-- localization is read at startup from Messages files (to be~~
112		~~- up-to-date), so that parser recognized localized variants of~~
113		~~- Category and Image keywords. Interwiki map is read out of static~~
114		~~- file lib/interwiki.map (which could be update somehow?)~~
115		-
116		-
117		-
118		~~-== Analyzers/Languages ==~~
119		-
120		~~-- Generic Language analyzer consists of filter (e.g. for Serbian~~
121		~~- (convert to latin, etc) and Thai (tokenize words)) and stemmer (for~~
122		~~- English, German, French, Esperanto, Dutch, Russian). Stemmed words~~
123		~~- are indexed alongside with the original words (i.e. as aliases -~~
124		~~- positional increment 0)~~
125		-
126		~~-- Search queries uses same language analyzer, but stemmed words are~~
127		~~- boosted with 0.5, so that exact match is favored~~
128		-
129		~~-- Titles are not stemmed, to even more favor exact matches and reduce~~
130		~~- overhead, as words from title usually appear in the article~~
131		-
132		~~-TODO:~~
133		~~-- Maybe look at more languages, especially Chinese~~
134		-
135		-
136		-
137		~~-== Query Parser ==~~
138		-
139		~~-- Faster with complex queries than Lucene QueryParser~~
140		-
141		~~-- recognizes subset of QueryParser syntax: AND, OR keywords and~~
142		~~- +,-. Phrases enclosed in "". Supports wilcards with * in end.~~
143		-
144		~~-- introduces namespace prefixes ''namespace:query'' to limit search to~~
145		~~- ceratin namespace: e.g. ''help:inserting pictures''. Note that~~
146		~~- ''help'' prefix is valid until the end of query of some other prefix~~
147		~~- definition: e.g. ''help:editing project:wikipedia'' will find all~~
148		~~- pages in help namespace containing ''editing'' and all pages in~~
149		~~- project namespace containing ''wikipedia''.~~
150		-
151		~~-- prefixes are defined in global configuration, but for generality~~
152		~~- (and LuceneSearch) there is also a generic way to make prefixes.~~
153		~~- E.g. ''[0,1,2]:query'' will search namespaces 0,1,2. This is~~
154		~~- convenient because it allows extended customization on the user~~
155		~~- side (i.ei mw extension rewrites custom labels into this syntax).~~
156		-
157		~~-- searching categories. Syntax is: ''query incategory:"exact category~~
158		~~- name"''. It is important to note that category names are themselves~~
159		~~- not tokenized. Using logical operators, intersection, union and~~
160		~~- difference of categories can be searched. Since exact category is~~
161		~~- needed (only case is not important), it is maybe best to incorporate~~
162		~~- this somewhere on category page, and have category name put into~~
163		~~- query by MediaWiki instead manually by user.~~
164		-
165		~~-Note:~~
166		-
167		~~-- namespace prefixes render the old way of picking namespaces to~~
168		~~- search unusable, thus it should be removed from user settings. And~~
169		~~- users should pick only one namespace to be their default for search~~
170		~~- (or all namespaces). In theory, by rewriting the query it could be~~
171		~~- possible to be back compatible with the current way, but it would~~
172		~~- slow down searching for those users, and I wonder if it is important~~
173		~~- to be able to to have any combination of namespace searched by~~
174		~~- default and how many users use this~~
175		-
176		~~-- see WikiQueryParser.java for adopted names of namespaces (all: is~~
177		~~- special prefix for all namespace)~~
178		-
179		~~-- Before search query is passed to Lucene-Search, localized version of~~
180		~~- namespace names should be replaced with standard ones. This should~~
181		~~- be implemented in MediaWiki. E.g. ''srpski kategorija:jezici'' ->~~
182		~~- ''srpski incategory:jezici''~~
183		-
184		-
185		-
186		~~-== Lucene patch ==~~
187		-
188		~~-- Don't use Readers but plain strings when possible. Java streams are~~
189		~~- very slow, whenever we don't need the general Reader interface, I~~
190		~~- replaced it with just Strings (instead of StringReaders).~~
191		-
192		~~-- SearchableMul interface, enables retrieval of many documents in~~
193		~~- single call (to minimize network overhead)~~
194		-
195		~~-TODO: make patch file for lucene 2.0~~
196		-
197		-
198		-
199		~~-== Incremental update ==~~
200		-
201		~~-- get load off database, more up-to-date index, etc..~~
202		-
203		~~-- Incremental updating is available via OAI-PMH interface. One indexer~~
204		~~- can have many incremental updaters delivering latest updates.~~
205		~~- Incremental updater maintains a list of status files, in which latest~~
206		~~- timestamp of successful updates is stored.~~
207		-
208		~~-- Snapshot is made of index at regular intervals, and it's picked up~~
209		~~- by searchers (via is a RMI query system) and rsynced. Indexes should~~
210		~~- be optimized, and need to be properly warmed up to enable smooth~~
211		~~- transition when new index replaces the old. Test indicate than this~~
212		~~- scheme will work smoothly until 1/2 of both indexes can be kept in~~
213		~~- memory. After that the operation becomes I/O bound, and degrades~~
214		~~- search performance during warmup.~~
215		-
216		-
\ No newline at end of file
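
Review note on the "Query Parser" section removed above: the namespace-prefix rule it describes (a prefix stays in effect until the end of the query or until the next prefix, and ''[0,1,2]:query'' is the generic form) can be illustrated with a minimal sketch. This is not the code in WikiQueryParser.java. The class name, the prefix table and the default namespace are assumptions made only for illustration, and quoted phrases, the ''all:'' prefix and ''incategory:'' are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamespacePrefixSketch {
    /** One run of query terms together with the namespaces it applies to. */
    static final class Clause {
        final List<Integer> namespaces;
        final List<String> terms = new ArrayList<>();

        Clause(List<Integer> namespaces) {
            this.namespaces = namespaces;
        }

        @Override
        public String toString() {
            return namespaces + ":" + String.join(" ", terms);
        }
    }

    // Hypothetical prefix table; the overview says the real one is defined
    // in the global configuration.
    static final Map<String, List<Integer>> PREFIXES = new HashMap<>();
    static {
        PREFIXES.put("main", Arrays.asList(0));
        PREFIXES.put("project", Arrays.asList(4));
        PREFIXES.put("help", Arrays.asList(12));
    }

    /** Returns the namespaces for a known or generic prefix, or null if it is not a prefix. */
    static List<Integer> resolvePrefix(String prefix) {
        String p = prefix.toLowerCase();
        if (PREFIXES.containsKey(p)) {
            return PREFIXES.get(p);
        }
        // Generic form such as [0,1,2]
        if (p.matches("\\[\\d+(,\\d+)*\\]")) {
            List<Integer> ns = new ArrayList<>();
            for (String n : p.substring(1, p.length() - 1).split(",")) {
                ns.add(Integer.parseInt(n));
            }
            return ns;
        }
        return null;
    }

    /** Splits a query into clauses; each prefix applies until the next prefix or the end. */
    static List<Clause> split(String query, List<Integer> defaultNamespaces) {
        List<Clause> clauses = new ArrayList<>();
        Clause current = new Clause(defaultNamespaces);
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            List<Integer> ns = colon > 0 ? resolvePrefix(token.substring(0, colon)) : null;
            if (ns != null) {
                // A new prefix closes the clause collected so far.
                if (!current.terms.isEmpty()) {
                    clauses.add(current);
                }
                current = new Clause(ns);
                token = token.substring(colon + 1);
                if (token.isEmpty()) {
                    continue;
                }
            }
            current.terms.add(token);
        }
        if (!current.terms.isEmpty()) {
            clauses.add(current);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(split("help:editing project:wikipedia", Arrays.asList(0)));
        System.out.println(split("[0,1,2]:query", Arrays.asList(0)));
    }
}
```

Tokens with an unknown prefix are kept as ordinary terms, which is one possible choice; the removed overview does not say how the actual parser treats them.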
Index: trunk/lucene-search-2/RELEASE-NOTES.txt
—	—	@@ -0,0 +1,22 @@
	2	+Lucene Search 2.0.2
	3	+====================
	4	+
	5	+* Fix bug 10822. Convert underscores to spaces in category names.
	6	+
	7	+Lucene Search 2.0.1
	8	+====================
	9	+
	10	+* Fix CJK tokenization - tokenize C1C2C2 -> C1C2 C2C3 and wrap
	11	+ searches into phrase queries
	12	+* Fix đ in Thai, didn't properly resolve into d
	13	+
	14	+Lucene Search 2.0.0
	15	+====================
	16	+
	17	+* Initial release, almost complete rewrite of the C# version
	18	+ New features very briefly:
	19	+ - distribute search/indexing
	20	+ - accentless search, stemmers for 12 languages
	21	+ - link analysis for ranking
	22	+ - namespace-prefixed queries
	23	+
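
Review note on the 2.0.1 entry above: reading "C1C2C2 -> C1C2 C2C3" as the input C1C2C3, the fix describes overlapping-bigram tokenization of CJK text, with the search side wrapping the bigrams in a phrase query. The following is a minimal sketch of that idea only. It is not the tokenizer shipped in lucene-search-2; the class and method names are hypothetical, and the input is assumed to be a run that contains only CJK characters.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {
    /** Splits a run of characters into overlapping two-character tokens. */
    static List<String> bigrams(String run) {
        // Work on code points so characters outside the BMP stay intact.
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            // A single character cannot form a bigram; keep it as its own token.
            out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    /** Builds the search-side form: the bigrams joined into one quoted phrase. */
    static String toPhraseQuery(String run) {
        return "\"" + String.join(" ", bigrams(run)) + "\"";
    }

    public static void main(String[] args) {
        // Four CJK characters written as Unicode escapes to keep the source ASCII-only.
        String run = "\u7dad\u57fa\u767e\u79d1";
        System.out.println(bigrams(run));
        System.out.println(toPhraseQuery(run));
    }
}
```

Wrapping the query in a phrase matters because the bigrams overlap: without it, a document containing the same bigrams in a different order or in different places would match as well.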
Index: trunk/lucene-search-2/README.txt
—	—	@@ -1,4 +1,4 @@
2		~~- Lucene Search 2.0: extension for MediaWiki~~
	2	+ Lucene Search 2: extension for MediaWiki
3	3	==========================================
4	4
5	5	Requirements:
—	—	@@ -12,20 +12,11 @@
13	13	- Apache XMLRPC 3.0 (for XMLRPC interface)
14	14	- Apache Ant 1.6 (for building from source, etc)
15	15
16		~~-Setup:~~
	16	+Installing:
17	17
18		~~- - Edit mwsearch-global.conf and make it available at some URL~~
19		~~- - At each host:~~
20		~~- * properly setup hostname (otherwise JavaVM gets confused)~~
21		~~- * make and set permissions of local directory for indexes~~
22		~~- * edit mwsearch.conf:~~
23		~~- + MWConfig.global to point to URL of mwsearch-global.conf~~
24		~~- + MWConfig.lib to point to local library path (ie with unicode-data etc)~~
25		~~- + Localization.url to point to URL of latest message files from MediaWiki~~
26		~~- + Indexes.path - base path where you want the deamon to store the indexes,~~
27		~~- + Logging.logconfig - local path to log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch package has a sample log4j file you can use)~~
28		~~- * setup rsync daemon (see rsyncd.conf-example)~~
29		~~- * setup log4j logging subsystem (see mwsearch.log4j-example)~~
	18	+ - Up-to-date instructions and troubleshooting can be found at:
	19	+
	20	+ http://www.mediawiki.org/wiki/Extension:LuceneSearch
30	21
31	22	Running:
32	23
—	—	@@ -50,9 +41,8 @@
51	42	table parameters, image parameters (except caption) are not indexed.
52	43
53	44	- query parser, faster search query parsing, enables prefixes for namespaces,
54		~~- e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Can~~
55		~~- do category searches e.g. 'smoked category:cheeses'. Rewrites all of these~~
56		~~- so that stemmed present are present but add less to document score.~~
	45	+ e.g. 'help:editing pages'. Prefixes are localized within MediaWiki. Rewrites
	46	+ all of these so that stemmed present are present but add less to document score.
57	47
58	48	- (hopefully) robust architecture, with threads pinging hosts that are down,
59	49	and search daemons trying alternatives if host holding part of the

Status & tagging log

15:20, 12 September 2011 Meno25 changed the status of r24629 [removed: ok added: old]