Lucene Search 2.0 Overview (by Robert Stojnic)


== Distributed architecture ==

- one indexer / many searchers

- indexers make clean snapshots of the indexes, and searchers use
  rsync to obtain a local copy of the updated index

- an index can be whole (single), mainsplit (two parts: one with all
  articles in the main namespace, the other with the rest of the
  articles), or split (some fraction of the documents in each
  subindex)

- both indexing and searching can be distributed across many hosts

- there is a global configuration file (mwsearch-global.conf) that
  lays out the global architecture and defines indexes, indexers, and
  searchers. It needs to be available to all nodes.

- a local configuration file deals with host-specific settings

- Java RMI is used for communication; it is typically at least 5-6
  times faster than XML-RPC and also manages persistent connections.
  Further, Lucene has built-in support for RMI.
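
  A minimal sketch of publishing a local index part over RMI with
  Lucene's built-in RemoteSearchable; the index path and binding name
  are illustrative, not taken from the actual configuration:

    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RemoteSearchable;

    public class PublishSearcher {
        public static void main(String[] args) throws Exception {
            LocateRegistry.createRegistry(1099);  // default RMI registry port
            IndexSearcher local = new IndexSearcher("/search/index/enwiki");
            // export the local searcher so remote hosts can call it
            Naming.rebind("//localhost/enwiki", new RemoteSearchable(local));
        }
    }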

- Searchers watch the status of other searchers and try to ping dead
  searchers (while at the same time taking them out of rotation while
  they are down)

Notes:

- mainsplit is a special case of the more general idea of splitting
  the database around namespaces. It is a convenient special case
  because the database then has minimal size for most searches (given
  that most searches come from anonymous users who want to search the
  main namespace). Currently this is the only implemented way to
  split the database around namespaces, but it could easily be
  extended.


== Searching ==

- Distributed search (on a split index) takes 3 calls per nonlocal
  index: getting IDFs (inverse document frequencies, needed to
  construct a global scorer), the actual search, and retrieving some
  number of documents (if needed)
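
  For reference, this is the shape of a distributed search using the
  stock Lucene 2.0 RMI machinery (MultiSearcher aggregates document
  frequencies across the parts before scoring); hostnames, binding
  names, and the "contents" field are invented for illustration:

    import java.rmi.Naming;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class DistributedSearch {
        public static void main(String[] args) throws Exception {
            // look up remote index parts published as RemoteSearchable
            Searchable main = (Searchable) Naming.lookup("//host1/enwiki.main");
            Searchable rest = (Searchable) Naming.lookup("//host2/enwiki.rest");
            // MultiSearcher gathers docFreqs from all parts, so scores
            // are computed against a global IDF
            Searcher searcher = new MultiSearcher(new Searchable[] { main, rest });
            Hits hits = searcher.search(new TermQuery(new Term("contents", "wiki")));
            System.out.println(hits.length() + " hits");
        }
    }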

- Searchers are organized in search groups. When a search is to be
  executed, each searcher tries to use local copies of the indexes,
  and for those it doesn't have a local copy of, it randomly chooses
  a remote searcher within its group that has the needed index part

- Search always returns the correct number of hits, since queries are
  either filtered (filters are cached and introduce very little
  overhead - one BitSet.get() per result) or rewritten so that only
  the namespaces specified in the query are searched
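
  An illustration of what such a cached namespace filter can look
  like against the Lucene 2.0 Filter API; the field name "namespace"
  and the caching details are assumptions:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class NamespaceFilter extends Filter {
        private final String namespace;

        public NamespaceFilter(String namespace) {
            this.namespace = namespace;
        }

        // built once and cached; checking a search result then costs
        // a single BitSet.get()
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(new Term("namespace", namespace));
            try {
                while (td.next()) {
                    bits.set(td.doc());
                }
            } finally {
                td.close();
            }
            return bits;
        }
    }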

- Update of the index is done in a separate thread that prepares a
  copy for rsync (makes a hard-link copy of the current index) and
  then runs rsync on it, so that only differences are transferred.
  After that the index is opened, typical queries are run on it (to
  warm up the copy and load caches), the main-namespace filter is
  rebuilt, and then in a synchronized block the old searcher is
  replaced with the new one. The old one is closed after 15 seconds
  (the 15s is to allow threads to finish with the old index)
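
  A sketch of that hot-swap; the 15-second delay is from the
  description above, while the class and method names are invented
  for illustration:

    import org.apache.lucene.search.IndexSearcher;

    public class SearcherPool {
        private IndexSearcher current;

        public synchronized IndexSearcher get() {
            return current;
        }

        // swap in an already warmed-up searcher, then close the old
        // one after 15 seconds so in-flight queries can finish
        public void swap(IndexSearcher warmed) {
            final IndexSearcher old;
            synchronized (this) {
                old = current;
                current = warmed;
            }
            if (old == null)
                return;
            Thread closer = new Thread(new Runnable() {
                public void run() {
                    try {
                        Thread.sleep(15000);
                        old.close();
                    } catch (Exception ignored) {
                    }
                }
            });
            closer.setDaemon(true);
            closer.start();
        }
    }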


== Indexing ==

- One indexer per index part. For split indexes there is one main
  indexer that maintains a logical view of the whole index

- Mainsplit and single indexes have very limited overhead. Split
  indexes have a larger overhead, due to the reporting system that
  keeps the operations atomic and makes sure they are correctly
  carried out

- Indexing is fastest if it's done in large batches. However, storing
  a large number of articles eats up heap: a 64MB heap is eaten up by
  a queue size of 30000, but my testing environment worked fine with
  a queue of 5000.
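
  A sketch of bounded batching along those lines; the queue limit of
  5000 comes from the numbers above, while the class shape and
  analyzer choice are placeholders:

    import java.util.concurrent.ArrayBlockingQueue;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        // bounding the queue bounds the heap used by pending articles
        private final ArrayBlockingQueue<Document> queue =
            new ArrayBlockingQueue<Document>(5000);

        public void enqueue(Document doc) throws InterruptedException {
            queue.put(doc);  // blocks while the current batch is full
        }

        public void flushTo(String indexPath) throws Exception {
            IndexWriter writer =
                new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            Document doc;
            while ((doc = queue.poll()) != null) {
                writer.addDocument(doc);
            }
            writer.close();
        }
    }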

- After testing Java XML-RPC implementations, it seemed to me that
  they all introduce large overhead, so I implemented a hackish HTTP
  frontend for the indexer. Article text is transferred raw in a POST
  request, and the function to be called is encoded in the URL.
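
  A sketch of the client side of that frontend; the host, port, and
  URL layout here are invented for illustration:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class IndexerClient {
        public static void addArticle(String db, String title, String text)
                throws Exception {
            // function name and target are encoded in the URL path
            URL url = new URL("http://indexer:8321/addArticle/" + db + "/" + title);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(text.getBytes("UTF-8"));  // raw article text as the body
            out.close();
            conn.getResponseCode();  // send the request; status is the reply
            conn.disconnect();
        }
    }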


== Wiki parser ==

- FastWikiTokenizerEngine.java is a handmade parser for basic wiki
  syntax. It is meant to replace the slow stripWiki() function.

- Accents are stripped by default, thus no accents are ever indexed.
  AFAIK this is OK for all languages (and seems to be the way major
  search engines do it). Indexing accented words as aliases would
  probably be unnecessary overhead.
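
  For comparison, one common way to strip accents in plain Java (the
  handmade tokenizer does its own stripping; this Normalizer-based
  version, which needs Java 6+, just shows the idea):

    import java.text.Normalizer;

    public class StripAccents {
        public static String strip(String word) {
            // decompose to NFD, then drop the combining marks
            return Normalizer.normalize(word, Normalizer.Form.NFD)
                             .replaceAll("\\p{M}", "");
        }
    }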

- numbers are also tokenized

- extracts categories and interwikis (interwikis are currently
  unused)

- skips table properties, names of templates (e.g. so that a search
  for "stub" gives meaningful results), image properties, external
  link URLs, and XML markup

- localization is read at startup from the Messages files (to be
  up-to-date), so that the parser recognizes localized variants of
  the Category and Image keywords. The interwiki map is read out of
  the static file lib/interwiki.map (which could be updated somehow?)


== Analyzers/Languages ==

- The generic language analyzer consists of a filter (e.g. for
  Serbian (conversion to latin, etc.) and Thai (word tokenization))
  and a stemmer (for English, German, French, Esperanto, Dutch,
  Russian). Stemmed words are indexed alongside the original words
  (i.e. as aliases - positional increment 0)
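
  A hedged sketch of emitting the stemmed form as an alias at the
  same position, against the Lucene 2.0 TokenStream API; the toy
  stem() method stands in for the real per-language stemmers:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class AliasStemFilter extends TokenFilter {
        private Token pending;  // stemmed alias waiting to be emitted

        public AliasStemFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token token = input.next();
            if (token == null)
                return null;
            String stemmed = stem(token.termText());
            if (!stemmed.equals(token.termText())) {
                pending = new Token(stemmed, token.startOffset(), token.endOffset());
                pending.setPositionIncrement(0);  // same position: an alias
            }
            return token;
        }

        private String stem(String word) {
            // stand-in for the real stemmer
            return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
        }
    }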

- Search queries use the same language analyzer, but stemmed words
  are boosted with 0.5, so that exact matches are favored
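
  In Lucene 2.0 query terms, that weighting looks roughly like this;
  the field name "contents" and the example words are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class StemBoost {
        public static Query build() {
            BooleanQuery q = new BooleanQuery();
            TermQuery exact = new TermQuery(new Term("contents", "running"));
            TermQuery stemmed = new TermQuery(new Term("contents", "run"));
            stemmed.setBoost(0.5f);  // stemmed alias counts half as much
            q.add(exact, BooleanClause.Occur.SHOULD);
            q.add(stemmed, BooleanClause.Occur.SHOULD);
            return q;
        }
    }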

- Titles are not stemmed, to favor exact matches even more and to
  reduce overhead, as words from the title usually appear in the
  article as well

TODO:

- Maybe look at more languages, especially Chinese



== Query Parser ==

- Faster with complex queries than the Lucene QueryParser

- recognizes a subset of the QueryParser syntax: the AND and OR
  keywords, + and -, and phrases enclosed in ""

- introduces namespace prefixes ''namespace:query'' to limit a search
  to a certain namespace, e.g. ''help:inserting pictures''. Note that
  the ''help'' prefix is valid until the end of the query or some
  other prefix definition: e.g. ''help:editing project:wikipedia''
  will find all pages in the help namespace containing ''editing''
  and all pages in the project namespace containing ''wikipedia''.

- searching categories. The syntax is: ''query category:"exact
  category name"''. It is important to note that category names are
  themselves not tokenized. Using logical operators, the
  intersection, union, and difference of categories can be searched.
  Since the exact category is needed (only case is unimportant), it
  is maybe best to incorporate this somewhere on the category page,
  and have the category name put into the query by MediaWiki instead
  of manually by the user.
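
  Exact, case-insensitive matching fits an untokenized keyword field;
  a sketch of how such a field might be indexed, where the field name
  "category" and the lowercasing convention are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CategoryField {
        public static void addCategory(Document doc, String name) {
            // not stored, indexed as a single unanalyzed token
            doc.add(new Field("category", name.toLowerCase(),
                              Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
    }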

Note:

- namespace prefixes render the old way of picking namespaces to
  search unusable, so it should be removed from user settings, and
  users should pick only one namespace to be their default for search
  (or all namespaces). In theory, by rewriting the query it would be
  possible to stay backward compatible with the current way, but it
  would slow down searching for those users, and I wonder if it is
  important to be able to have any combination of namespaces searched
  by default, and how many users use this

- see WikiQueryParser.java for the adopted names of namespaces
  (all: is a special prefix for all namespaces)

Todo:

- Before a search query is passed to Lucene-Search, localized
  versions of namespace names should be replaced with the standard
  ones. This should be implemented in MediaWiki. E.g. ''srpski
  kategorija:jezici'' -> ''srpski category:jezici''


== Lucene patch ==

- Don't use Readers but plain Strings when possible. Java streams are
  very slow; wherever we don't need the general Reader interface, I
  replaced it with plain Strings (instead of StringReaders).

- SearchableMul interface, which enables retrieval of many documents
  in a single call (to minimize network overhead)
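
  In spirit the interface amounts to a bulk version of
  Searchable.doc(int); the exact signature in the patch may differ:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Searchable;

    public interface SearchableMul extends Searchable {
        // one remote round trip for all requested documents,
        // instead of one doc(i) call per hit
        Document[] docs(int[] ids) throws IOException;
    }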


== Incremental update ==

- gets load off the database, gives a more up-to-date index, etc.

- however, this needs to be tested in a production environment. There
  are two scenarios:

  * don't optimize indexes, and use a merge factor of 2 (see the
    sketch at the end of this section). This means that additional
    segments will be made with new documents, and that old documents
    will be deleted when the size of the index doubles (roughly)

    This is good because incremental update is fast, and download of
    the index to searchers is also fast, since only new segments need
    to be transferred and old segment pages are already in memory
    (since we use cp -lr to prepare a copy for rsync).

    However, search will be slower (according to my tests, up to 50%
    slower in the worst case), and in the worst case twice the memory
    is needed

  * optimize indexes. To optimize an index, the indexer needs twice
    the memory of the whole index (because a brand new segment is
    made from all the old segments merged)

    The rsync diff algorithm might be helpful in transferring the
    optimized index, but probably not too much.

    On the searcher, twice the memory is needed when the index is
    updated (actually probably less, since in the beginning not all
    of the index is loaded into memory, especially the part of the
    index with positional increments)

- a third scenario is to abandon incremental updates altogether, and
  just rebuild the whole database from scratch and snapshot it then
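
  A sketch of the writer settings for the first scenario above; the
  merge factor of 2 is from the text, while the index path and
  analyzer are placeholders:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class IncrementalWriter {
        public static void append(String indexPath) throws Exception {
            // open an existing index for appending (create == false)
            IndexWriter writer =
                new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            // with merge factor 2, old segments are rewritten only
            // when the index size roughly doubles
            writer.setMergeFactor(2);
            // ... addDocument() calls for new articles go here ...
            writer.close();  // note: no writer.optimize()
        }
    }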