r23139 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: < r23138 | r23139 | r23140 >
Date: 21:20, 20 June 2007
Author: rainman
Status: old
Tags:
Comment:
Final tweaks:
* Merge the redirect field with keyword
* Don't add exact-accent aliases; they enlarge the index too much but
don't help much in searching. Maybe make this an option for some
languages if necessary
* Drop the dictionaries and use actual sampled query data for warmup
* Make the global config add all parts for logical db names in [Search-Group]
(see the example below)
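
For illustration, with the new logical-name expansion a [Search-Group] entry that names a logical database is treated as if every physical part were listed as well. A hypothetical lsearch-global.conf fragment (host address and db name are made up):

  [Search-Group]
  192.168.0.2 : mydb

now behaves, for a mainsplit database, like:

  [Search-Group]
  192.168.0.2 : mydb, mydb.mainpart, mydb.restpart

Split and nssplit databases expand the same way, to part1..partN and nspart1..nspartN respectively.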
Modified paths:
  • /trunk/lucene-search-2.0/README.txt (modified) (history)
  • /trunk/lucene-search-2.0/lib/dict/english.txt.gz (deleted) (history)
  • /trunk/lucene-search-2.0/lib/dict/french.txt.gz (deleted) (history)
  • /trunk/lucene-search-2.0/lib/dict/german.txt.gz (deleted) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-de.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-en.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-es.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-fr.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-it.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lib/dict/terms-pt.txt.gz (added) (history)
  • /trunk/lucene-search-2.0/lsearch-global.conf (modified) (history)
  • /trunk/lucene-search-2.0/lsearch.conf (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/Analyzers.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/CategoryAnalyzer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FastWikiTokenizerEngine.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FieldBuilder.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FieldNameFactory.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/KeywordsAnalyzer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/QueryLanguageAnalyzer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/WikiQueryParser.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/WikiTokenizer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/Benchmark.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/StreamTerms.java (added) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/WordTerms.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/GlobalConfiguration.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/IndexId.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/StartupManager.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/highlight/HighlightDaemon.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/DumpImporter.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/Importer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/SimpleIndexWriter.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/index/WikiIndexModifier.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessenger.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessengerClient.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessengerImpl.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/oai/IncrementalUpdater.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/SearchEngine.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/SearcherCache.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/UpdateThread.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/Warmup.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/EnglishAnalyzer.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/FastWikiTokenizerTest.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/GlobalConfigurationTest.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/WikiQueryParserTest.java (modified) (history)
  • /trunk/lucene-search-2.0/src/org/wikimedia/lsearch/util/Localization.java (modified) (history)
  • /trunk/lucene-search-2.0/test-data/mwsearch-global.test (modified) (history)

Diff

Index: trunk/lucene-search-2.0/test-data/mwsearch-global.test
@@ -19,14 +19,16 @@
2020 # host : db1.role, db2.role
2121 # Mulitple hosts can search multiple dbs (N-N mapping)
2222 [Search-Group]
23 -192.168.0.2 : entest, entest.mainpart
 23+192.168.0.2 : entest.mainpart
2424 192.168.0.5 : entest.mainpart, entest.restpart
2525 [Search-Group]
2626 192.168.0.4 : frtest.part1, frtest.part2
2727 192.168.0.6 : frtest.part3, detest
2828 [Search-Group]
29 -192.168.0.10 : entest, entest.mainpart
 29+192.168.0.10 :entest.mainpart
3030 192.168.0.2 : entest.restpart, rutest
 31+[Search-Group]
 32+192.168.0.1 : njawiki
3133
3234 # Index nodes
3335 # host: db1.role, db2.role
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/DumpImporter.java
@@ -87,7 +87,7 @@
8888 // nop
8989 }
9090
91 - public void closeIndex(){
 91+ public void closeIndex() throws IOException {
9292 writer.close();
9393 }
9494
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/SimpleIndexWriter.java
@@ -77,7 +77,7 @@
7878 }
7979 }
8080 writer.setSimilarity(new WikiSimilarity());
81 - int glMergeFactor = iid.getIntParam("mergeFactor",2);
 81+ int glMergeFactor = iid.getIntParam("mergeFactor",10);
8282 int glMaxBufDocs = iid.getIntParam("maxBufDocs",10);
8383 if(mergeFactor!=null)
8484 writer.setMergeFactor(mergeFactor);
@@ -122,8 +122,9 @@
123123 }
124124 }
125125
126 - /** Close and (if specified in global config) optimize indexes */
127 - public void close(){
 126+ /** Close and (if specified in global config) optimize indexes
 127+ * @throws IOException */
 128+ public void close() throws IOException{
128129 for(Entry<String,IndexWriter> en : indexes.entrySet()){
129130 IndexId iid = IndexId.get(en.getKey());
130131 IndexWriter writer = en.getValue();
@@ -137,6 +138,7 @@
138139 writer.close();
139140 } catch(IOException e){
140141 log.warn("I/O error optimizing/closing index at "+iid.getImportPath());
 142+ throw e;
141143 }
142144 }
143145 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/importer/Importer.java
@@ -139,7 +139,13 @@
140140 long end = System.currentTimeMillis();
141141
142142 log.info("Closing/optimizing index...");
143 - dp.closeIndex();
 143+ try{
 144+ dp.closeIndex();
 145+ } catch(IOException e){
 146+ e.printStackTrace();
 147+ log.fatal("Cannot close/optimize index : "+e.getMessage());
 148+ System.exit(1);
 149+ }
144150
145151 long finalEnd = System.currentTimeMillis();
146152
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/GlobalConfiguration.java
@@ -145,6 +145,27 @@
146146 System.out.println("ERROR in GlobalConfiguration: Default path for index absent. Check section [Index-Path].");
147147 return false;
148148 }
 149+ // expand logical index names on searchers
 150+ for(String host : search.keySet()){
 151+ ArrayList<String> hostsearch = search.get(host);
 152+ for(String dbname : hostsearch.toArray(new String[]{})){
 153+ Hashtable<String, Hashtable<String,String>> types = database.get(dbname);
 154+ if(types != null){ // if not null, dbrole is dbname
 155+ if(types.containsKey("mainsplit")){
 156+ hostsearch.add(dbname+".mainpart");
 157+ hostsearch.add(dbname+".restpart");
 158+ } else if(types.containsKey("split")){
 159+ int factor = Integer.parseInt(database.get(dbname).get("split").get("number"));
 160+ for(int i=1;i<=factor;i++)
 161+ hostsearch.add(dbname+".part"+i);
 162+ } else if(types.containsKey("nssplit")){
 163+ int factor = Integer.parseInt(database.get(dbname).get("nssplit").get("number"));
 164+ for(int i=1;i<=factor;i++)
 165+ hostsearch.add(dbname+".nspart"+i);
 166+ }
 167+ }
 168+ }
 169+ }
149170 // for each DB check if the corresponding parts are defined
150171 // if not, put them in with default values
151172 for(String dbname : database.keySet()){
@@ -161,6 +182,13 @@
162183 if(!types.contains(dbpart))
163184 database.get(dbname).put(dbpart,new Hashtable<String,String>());
164185 }
 186+ } else if(types.contains("nssplit")){
 187+ int factor = Integer.parseInt(database.get(dbname).get("nssplit").get("number"));
 188+ for(int i=1;i<factor+1;i++){
 189+ String dbpart = "nspart"+i;
 190+ if(!types.contains(dbpart))
 191+ database.get(dbname).put(dbpart,new Hashtable<String,String>());
 192+ }
165193 }
166194 }
167195 // check if every db.type has an indexer and searcher
@@ -196,7 +224,7 @@
197225 }
198226 }
199227 boolean searched = (getSearchHosts(dbrole).size() != 0);
200 - if(!searched && !(typeid.equals("mainsplit") || typeid.equals("split"))){
 228+ if(!searched && !(typeid.equals("mainsplit") || typeid.equals("split") || typeid.equals("nssplit"))){
201229 System.out.println("WARNING: in Global Configuration: index "+dbrole+" is not searched by any host.");
202230 }
203231 }
@@ -663,8 +691,15 @@
664692 type.equals("restpart") || type.matches("part[1-9][0-9]*")){
665693
666694 // all params are optional, if absent default will be used
667 - if(tokens.length>1)
668 - params.put("optimize",tokens[1].trim().toLowerCase());
 695+ if(tokens.length>1){
 696+ String token = tokens[1].trim().toLowerCase();
 697+ if(token.equals("true") || token.equals("false"))
 698+ params.put("optimize",token);
 699+ else{
 700+ System.err.println("Expecting true/false as second paramter of type "+type+" in database def: "+role);
 701+ System.exit(1);
 702+ }
 703+ }
669704 if(tokens.length>2)
670705 params.put("mergeFactor",tokens[2]);
671706 if(tokens.length>3)
@@ -718,8 +753,15 @@
719754 params.put("namespaces",ns);
720755
721756 // all params are optional, if absent default will be used
722 - if(tokens.length>1)
723 - params.put("optimize",tokens[1].trim().toLowerCase());
 757+ if(tokens.length>1){
 758+ String token = tokens[1].trim().toLowerCase();
 759+ if(token.equals("true") || token.equals("false"))
 760+ params.put("optimize",token);
 761+ else{
 762+ System.err.println("Expecting true/false as third paramter of type "+type+" in database def: "+role);
 763+ System.exit(1);
 764+ }
 765+ }
724766 if(tokens.length>2)
725767 params.put("mergeFactor",tokens[2]);
726768 if(tokens.length>3)
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/StartupManager.java
@@ -63,12 +63,12 @@
6464 }
6565 }
6666 if(global.isSearcher()){
 67+ // startup
 68+ (new SearchServer()).start();
6769 // warmup local indexes
6870 SearcherCache.getInstance().warmupLocalCache();
69 - // startup
70 - (new SearchServer()).start();
7171 UpdateThread.getInstance().start(); // updater for local indexes
72 - NetworkStatusThread.getInstance().start(); // network monitor
 72+ NetworkStatusThread.getInstance().start(); // network monitor
7373 }
7474
7575 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/config/IndexId.java
@@ -484,5 +484,12 @@
485485 } else
486486 return null;
487487 }
 488+
 489+ /** Return the set of namespaces which are searched by this nssplit part */
 490+ public HashSet<String> getNamespaceSet() {
 491+ return namespaceSet;
 492+ }
 493+
 494+
488495
489496 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/SearcherCache.java
@@ -165,6 +165,8 @@
166166 HashSet<IndexId> mys = global.getMySearch();
167167 for(IndexId iid : mys){
168168 try {
 169+ if(iid.isLogical())
 170+ continue;
169171 IndexSearcherMul is = getLocalSearcher(iid);
170172 Warmup.warmupIndexSearcher(is,iid,false);
171173 } catch (IOException e) {
@@ -262,6 +264,8 @@
263265 log.debug("Openning local index for "+iid);
264266 if(!iid.isMySearch())
265267 throw new IOException(iid+" is not searched by this host.");
 268+ if(iid.isLogical())
 269+ throw new IOException(iid+" will not open logical index.");
266270 try {
267271 searcher = new IndexSearcherMul(iid.getCanonicalSearchPath());
268272 searcher.setSimilarity(new WikiSimilarity());
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/UpdateThread.java
@@ -57,9 +57,14 @@
5858 }
5959 // get the new snapshots via rsync, might be lengthy
6060 for(LocalIndex li : forUpdate){
61 - log.debug("Syncing "+li.iid);
62 - rebuild(li); // rsync, update registry, cache
63 - pending.remove(li.iid.toString());
 61+ try{
 62+ log.debug("Syncing "+li.iid);
 63+ rebuild(li); // rsync, update registry, cache
 64+ pending.remove(li.iid.toString());
 65+ } catch(Exception e){
 66+ e.printStackTrace();
 67+ log.error("Error syncing "+li+" : "+e.getMessage());
 68+ }
6469 }
6570 }
6671 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/SearchEngine.java
@@ -19,6 +19,7 @@
2020 import org.apache.lucene.search.Searcher;
2121 import org.apache.lucene.search.TopDocs;
2222 import org.wikimedia.lsearch.analyzers.Analyzers;
 23+import org.wikimedia.lsearch.analyzers.FieldBuilder;
2324 import org.wikimedia.lsearch.analyzers.FieldNameFactory;
2425 import org.wikimedia.lsearch.analyzers.WikiQueryParser;
2526 import org.wikimedia.lsearch.beans.ResultSet;
@@ -92,18 +93,24 @@
9394 }
9495
9596 /** Search mainpart or restpart of the split index */
96 - public SearchResults searchPart(IndexId iid, Query q, NamespaceFilterWrapper filter, int offset, int limit, boolean explain){
 97+ public SearchResults searchPart(IndexId iid, String searchterm, Query q, NamespaceFilterWrapper filter, int offset, int limit, boolean explain){
9798 if( ! (iid.isMainsplit() || iid.isNssplit()))
9899 return null;
99 - try {
 100+ try {
100101 SearcherCache cache = SearcherCache.getInstance();
101102 IndexSearcherMul searcher;
102103 long searchStart = System.currentTimeMillis();
103104
104105 searcher = cache.getLocalSearcher(iid);
105 -
106 - Hits hits = searcher.search(q,filter);
107 - return makeSearchResults(searcher,hits,offset,limit,iid,q.toString(),q,searchStart,explain);
 106+ NamespaceFilterWrapper localfilter = filter;
 107+ if(iid.isMainsplit() && iid.isMainPart())
 108+ localfilter = null;
 109+ else if(iid.isNssplit() && !iid.isLogical() && iid.getNamespaceSet().size()==1)
 110+ localfilter = null;
 111+ if(localfilter != null)
 112+ log.info("Using local filter: "+localfilter);
 113+ Hits hits = searcher.search(q,localfilter);
 114+ return makeSearchResults(searcher,hits,offset,limit,iid,searchterm,q,searchStart,explain);
108115 } catch (IOException e) {
109116 SearchResults res = new SearchResults();
110117 res.setErrorMsg("Internal error in SearchEngine: "+e.getMessage());
@@ -121,8 +128,8 @@
122129 Analyzer analyzer = Analyzers.getSearcherAnalyzer(iid,exactCase);
123130 if(nsDefault == null || nsDefault.cardinality() == 0)
124131 nsDefault = new NamespaceFilter("0"); // default to main namespace
125 - FieldNameFactory ff = new FieldNameFactory(exactCase);
126 - WikiQueryParser parser = new WikiQueryParser(ff.contents(),nsDefault,analyzer,ff,WikiQueryParser.NamespacePolicy.IGNORE);
 132+ FieldBuilder.BuilderSet bs = new FieldBuilder(global.getLanguage(iid.getDBname()),exactCase).getBuilder(exactCase);
 133+ WikiQueryParser parser = new WikiQueryParser(bs.getFields().contents(),nsDefault,analyzer,bs,WikiQueryParser.NamespacePolicy.IGNORE);
127134 HashSet<NamespaceFilter> fields = parser.getFieldNamespaces(searchterm);
128135 NamespaceFilterWrapper nsfw = null;
129136 Query q = null;
@@ -183,7 +190,7 @@
184191 return res;
185192 }
186193 RMIMessengerClient messenger = new RMIMessengerClient();
187 - return messenger.searchPart(piid,q,nsfw,offset,limit,explain,host);
 194+ return messenger.searchPart(piid,searchterm,q,nsfw,offset,limit,explain,host);
188195 }
189196 }
190197 // normal search
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/search/Warmup.java
@@ -10,6 +10,7 @@
1111 import org.apache.lucene.search.Query;
1212 import org.apache.lucene.search.TermQuery;
1313 import org.wikimedia.lsearch.analyzers.Analyzers;
 14+import org.wikimedia.lsearch.analyzers.FieldBuilder;
1415 import org.wikimedia.lsearch.analyzers.FieldNameFactory;
1516 import org.wikimedia.lsearch.analyzers.WikiQueryParser;
1617 import org.wikimedia.lsearch.benchmark.SampleTerms;
@@ -62,9 +63,10 @@
6364
6465 /** Warmup index using some number of simple searches */
6566 protected static void warmupSearchTerms(IndexSearcherMul is, IndexId iid, int count, boolean useDelay) {
66 - FieldNameFactory fields = new FieldNameFactory();
67 - WikiQueryParser parser = new WikiQueryParser(fields.contents(),"0",Analyzers.getSearcherAnalyzer(iid,false),fields,WikiQueryParser.NamespacePolicy.IGNORE);
68 - Terms terms = getTermsForLang(global.getLanguage(iid.getDBname()));
 67+ String lang = global.getLanguage(iid.getDBname());
 68+ FieldBuilder.BuilderSet b = new FieldBuilder(lang).getBuilder();
 69+ WikiQueryParser parser = new WikiQueryParser(b.getFields().contents(),"0",Analyzers.getSearcherAnalyzer(iid,false),b,WikiQueryParser.NamespacePolicy.IGNORE);
 70+ Terms terms = getTermsForLang(lang);
6971
7072 try{
7173 for(int i=0; i < count ; i++){
@@ -88,17 +90,15 @@
8991 }
9092
9193 /** Get database of example search terms for language */
92 - protected static Terms getTermsForLang(String language) {
 94+ protected static Terms getTermsForLang(String lang) {
9395 String lib = Configuration.open().getString("MWConfig","lib","./lib");
94 - if(language.equals("en"))
 96+ if("en".equals(lang) || "de".equals(lang) || "es".equals(lang) || "fr".equals(lang) || "it".equals(lang) || "pt".equals(lang))
 97+ langTerms.put(lang,new WordTerms(lib+"/dict/terms-"+lang+".txt.gz"));
 98+ if(lang.equals("sample"))
9599 return new SampleTerms();
96 - if(language.equals("fr") && langTerms.get("fr")==null)
97 - langTerms.put("fr",new WordTerms(lib+"/dict/french.txt.gz"));
98 - if(language.equals("de") && langTerms.get("de")==null)
99 - langTerms.put("de",new WordTerms(lib+"/dict/german.txt.gz"));
100100
101 - if(langTerms.containsKey(language))
102 - return langTerms.get(language);
 101+ if(langTerms.containsKey(lang))
 102+ return langTerms.get(lang);
103103 else
104104 return langTerms.get("en");
105105 }
@@ -119,8 +119,9 @@
120120 /** Just run one complex query and rebuild the main namespace filter */
121121 public static void simpleWarmup(IndexSearcherMul is, IndexId iid){
122122 try{
123 - FieldNameFactory fields = new FieldNameFactory();
124 - WikiQueryParser parser = new WikiQueryParser(fields.contents(),"0",Analyzers.getSearcherAnalyzer(iid,false),fields,WikiQueryParser.NamespacePolicy.IGNORE);
 123+ String lang = global.getLanguage(iid.getDBname());
 124+ FieldBuilder.BuilderSet b = new FieldBuilder(lang).getBuilder();
 125+ WikiQueryParser parser = new WikiQueryParser(b.getFields().contents(),"0",Analyzers.getSearcherAnalyzer(iid,false),b,WikiQueryParser.NamespacePolicy.IGNORE);
125126 Query q = parser.parseFourPass("a OR very OR long OR title OR involving OR both OR wikipedia OR and OR pokemons",WikiQueryParser.NamespacePolicy.IGNORE,iid.getDBname());
126127 is.search(q,new NamespaceFilterWrapper(new NamespaceFilter("0")));
127128 } catch (IOException e) {
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/highlight/HighlightDaemon.java
@@ -23,6 +23,7 @@
2424 import org.apache.lucene.search.highlight.TextFragment;
2525 import org.wikimedia.lsearch.analyzers.Analyzers;
2626 import org.wikimedia.lsearch.analyzers.FastWikiTokenizerEngine;
 27+import org.wikimedia.lsearch.analyzers.FieldBuilder;
2728 import org.wikimedia.lsearch.analyzers.FieldNameFactory;
2829 import org.wikimedia.lsearch.analyzers.FilterFactory;
2930 import org.wikimedia.lsearch.analyzers.WikiQueryParser;
@@ -126,9 +127,9 @@
127128 boolean exactCase = global.exactCaseIndex(iid.getDBname());
128129 String lang = global.getLanguage(dbname);
129130 Analyzer analyzer = Analyzers.getSearcherAnalyzer(iid,exactCase);
130 - FieldNameFactory fields = new FieldNameFactory(exactCase);
131 - WikiQueryParser parser = new WikiQueryParser(fields.contents(),
132 - new NamespaceFilter("0"),analyzer,fields,WikiQueryParser.NamespacePolicy.IGNORE);
 131+ FieldBuilder.BuilderSet bs = new FieldBuilder(lang,exactCase).getBuilder(exactCase);
 132+ WikiQueryParser parser = new WikiQueryParser(bs.getFields().contents(),
 133+ new NamespaceFilter("0"),analyzer,bs,WikiQueryParser.NamespacePolicy.IGNORE);
133134 Query q = parser.parseFourPass(query,WikiQueryParser.NamespacePolicy.IGNORE,iid.getDBname());
134135 Scorer scorer = new QueryScorer(q);
135136 SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"searchmatch\">","</span>");
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/index/WikiIndexModifier.java
@@ -172,7 +172,7 @@
173173 }
174174 }
175175 writer.setSimilarity(new WikiSimilarity());
176 - int mergeFactor = iid.getIntParam("mergeFactor",2);
 176+ int mergeFactor = iid.getIntParam("mergeFactor",10);
177177 int maxBufDocs = iid.getIntParam("maxBufDocs",10);
178178 writer.setMergeFactor(mergeFactor);
179179 writer.setMaxBufferedDocs(maxBufDocs);
@@ -429,21 +429,20 @@
430430 title.setBoost(rankBoost);
431431 doc.add(title);
432432
433 - Field stemtitle = new Field(fields.stemtitle(), article.getTitle(),Field.Store.NO, Field.Index.TOKENIZED);
434 - //log.info(article.getNamespace()+":"+article.getTitle()+" has rank "+article.getRank()+" and redirect: "+((article.getRedirects()==null)? "" : article.getRedirects().size()));
435 - stemtitle.setBoost(rankBoost);
436 - doc.add(stemtitle);
 433+ if(bs.getFilters().hasStemmer()){
 434+ Field stemtitle = new Field(fields.stemtitle(), article.getTitle(),Field.Store.NO, Field.Index.TOKENIZED);
 435+ //log.info(article.getNamespace()+":"+article.getTitle()+" has rank "+article.getRank()+" and redirect: "+((article.getRedirects()==null)? "" : article.getRedirects().size()));
 436+ stemtitle.setBoost(rankBoost);
 437+ doc.add(stemtitle);
 438+ }
437439
438440 // put the best redirects as alternative titles
439441 makeAltTitles(doc,fields.alttitle(),article);
440442
441 - // add titles of redirects, generated from analyzer
442 - makeKeywordField(doc,fields.redirect(),rankBoost);
 443+ bs.setAddKeywords(checkKeywordPreconditions(article,iid));
 444+ // most significant words in the text, gets extra score, from analyzer
 445+ makeKeywordField(doc,fields.keyword(),rankBoost);
443446
444 - if(checkKeywordPreconditions(article,iid))
445 - // most significat words in the text, gets extra score, from analyzer
446 - makeKeywordField(doc,fields.keyword(),rankBoost);
447 -
448447 // the next fields are generated using wikitokenizer
449448 doc.add(new Field(fields.contents(), "",
450449 Field.Store.NO, Field.Index.TOKENIZED));
@@ -453,10 +452,6 @@
454453 // keyword.setBoost(calculateKeywordsBoost(tokenizer.getTokens().size()));
455454 }
456455 // make analyzer
457 - if(article.getTitle().equalsIgnoreCase("wiki")){
458 - int b =10;
459 - b++;
460 - }
461456 String text = article.getContents();
462457 Object[] ret = Analyzers.getIndexerAnalyzer(text,builder,article.getRedirectKeywords());
463458 perFieldAnalyzer = (PerFieldAnalyzerWrapper) ret[0];
@@ -487,7 +482,7 @@
488483 if(ranks.get(i) == 0)
489484 break; // we don't want redirects with zero links
490485 //log.info("For "+article+" alttitle"+(i+1)+" "+redirects.get(i)+" = "+ranks.get(i));
491 - Field alttitle = new Field("alttitle"+(i+1), redirects.get(i),Field.Store.NO, Field.Index.TOKENIZED);
 486+ Field alttitle = new Field(prefix+(i+1), redirects.get(i),Field.Store.NO, Field.Index.TOKENIZED);
492487 alttitle.setBoost(calculateArticleRank(ranks.get(i)));
493488 doc.add(alttitle);
494489 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FieldBuilder.java
@@ -12,9 +12,12 @@
1313 public class BuilderSet{
1414 FilterFactory filters;
1515 FieldNameFactory fields;
 16+ boolean addKeywords; // wether to add keywords from beginning of article
 17+
1618 public BuilderSet(FilterFactory filters, FieldNameFactory fields) {
1719 this.filters = filters;
1820 this.fields = fields;
 21+ this.addKeywords = false;
1922 }
2023 public FieldNameFactory getFields() {
2124 return fields;
@@ -24,11 +27,23 @@
2528 }
2629 public boolean isExactCase() {
2730 return fields.isExactCase();
28 - }
 31+ }
 32+ public boolean isAddKeywords() {
 33+ return addKeywords;
 34+ }
 35+ public void setAddKeywords(boolean addKeywords) {
 36+ this.addKeywords = addKeywords;
 37+ }
 38+
2939 }
3040
3141 protected BuilderSet[] builders = new BuilderSet[2];
3242
 43+ /** Construct case-insensitive field builder */
 44+ public FieldBuilder(String lang){
 45+ this(lang,false);
 46+ }
 47+
3348 public FieldBuilder(String lang, boolean exactCase){
3449 if(exactCase){
3550 builders = new BuilderSet[2];
@@ -49,5 +64,18 @@
5065 return builders;
5166 }
5267
 68+ /** Get the case-insensitive builder */
 69+ public BuilderSet getBuilder(){
 70+ return getBuilder(false);
 71+ }
5372
 73+ /** Get BuilderSet for exactCase value */
 74+ public BuilderSet getBuilder(boolean exactCase){
 75+ if(exactCase && builders.length > 1)
 76+ return builders[1];
 77+ else
 78+ return builders[0];
 79+ }
 80+
 81+
5482 }
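
The FieldBuilder hunk above, together with the SearchEngine, Warmup, and HighlightDaemon hunks earlier in this diff, replaces direct use of FieldNameFactory with a FieldBuilder.BuilderSet when constructing the query parser. A minimal sketch of the new pattern, assembled from this revision's callers (iid and global are assumed to be the usual IndexId and GlobalConfiguration instances):

  // sketch only: case-insensitive builder for the db's content language
  String lang = global.getLanguage(iid.getDBname());
  FieldBuilder.BuilderSet b = new FieldBuilder(lang).getBuilder();
  WikiQueryParser parser = new WikiQueryParser(
      b.getFields().contents(), "0",
      Analyzers.getSearcherAnalyzer(iid, false), b,
      WikiQueryParser.NamespacePolicy.IGNORE);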
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/WikiTokenizer.java
@@ -35,9 +35,6 @@
3636 *
3737 * @param str
3838 */
39 - public WikiTokenizer(String str, boolean exactCase){
40 - this(str,null,exactCase);
41 - }
4239
4340 public WikiTokenizer(String str, String lang, boolean exactCase){
4441 parser = new FastWikiTokenizerEngine(str,lang,exactCase);
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FastWikiTokenizerEngine.java
@@ -51,7 +51,7 @@
5252 private char cl; // lowercased character
5353 private boolean numberToken; // if the buffer holds a number token
5454 private int headings = 0; // how many headings did we see
55 - private int templateLevel = 0; // level of nestedness of templates
 55+ private int templateLevel = 0; // level of nestedness of templates
5656
5757 private int prefixLen = 0;
5858 private final char[] prefixBuf = new char[MAX_WORD_LEN];
@@ -78,7 +78,7 @@
7979 enum ParserState { WORD, LINK_BEGIN, LINK_WORDS, LINK_END, LINK_KEYWORD,
8080 LINK_FETCH, IGNORE, EXTERNAL_URL, EXTERNAL_WORDS,
8181 TEMPLATE_BEGIN, TEMPLATE_WORDS, TEMPLATE_END,
82 - TABLE_BEGIN};
 82+ TABLE_BEGIN, CATEGORY_WORDS };
8383
8484 enum FetchState { WORD, CATEGORY, INTERWIKI, KEYWORD };
8585
@@ -109,10 +109,6 @@
110110 }
111111 }
112112
113 - public FastWikiTokenizerEngine(String text, boolean exactCase){
114 - this(text,null,exactCase);
115 - }
116 -
117113 public FastWikiTokenizerEngine(String text, String lang, boolean exactCase){
118114 this.text = text.toCharArray();
119115 this.textString = text;
@@ -227,22 +223,27 @@
228224 }
229225 }
230226 // make the original buffered version
231 - Token exact;
232 - if(exactCase)
233 - exact = new Token(
234 - new String(buffer, 0, length), start, start + length);
235 - else
236 - exact = new Token(
237 - new String(buffer, 0, length).toLowerCase(), start, start + length);
238 - if(addDecomposed && decompLength!=0)
239 - exact.setType("unicode");
240 - tokens.add(exact);
 227+ // TODO: maybe do this optionally for some languages
 228+ /* if(!("de".equals(language) && aliasLength!=0)){
 229+ Token exact;
 230+ if(exactCase)
 231+ exact = new Token(
 232+ new String(buffer, 0, length), start, start + length);
 233+ else
 234+ exact = new Token(
 235+ new String(buffer, 0, length).toLowerCase(), start, start + length);
 236+ if(addDecomposed && decompLength!=0)
 237+ exact.setType("unicode");
 238+ tokens.add(exact);
 239+ } */
241240 // add decomposed token to stream
242 - if(addDecomposed && decompLength!=0){
 241+ if(decompLength!=0){
243242 Token t = new Token(
244243 new String(decompBuffer, 0, decompLength), start, start + length);
245 - t.setPositionIncrement(0);
246 - t.setType("transliteration");
 244+ /*if(!"de".equals(language)){
 245+ t.setPositionIncrement(0);
 246+ t.setType("transliteration");
 247+ } */
247248 tokens.add(t);
248249 }
249250 // add alias (if any) token to stream
@@ -434,6 +435,7 @@
435436 String prefix = "";
436437 char ignoreEnd = ' '; // end of ignore block
437438 int pipeInx = 0;
 439+ int fetchStart = -1; // start index if link fetching
438440
439441 if(tokens == null)
440442 tokens = new ArrayList<Token>();
@@ -448,7 +450,7 @@
449451 c = text[cur];
450452
451453 // actions for various parser states
452 - switch(state){
 454+ switch(state){
453455 case WORD:
454456 switch(c){
455457 case '=':
@@ -548,6 +550,7 @@
549551 cur = semicolonInx;
550552 fetch = FetchState.CATEGORY;
551553 state = ParserState.LINK_FETCH;
 554+ fetchStart = cur;
552555 continue;
553556 } else if(isInterwiki(prefix)){
554557 cur = semicolonInx;
@@ -615,7 +618,7 @@
616619
617620 if(length<buffer.length)
618621 buffer[length++] = c;
619 - continue;
 622+ continue;
620623 case LINK_END:
621624 if(c == ']'){ // good link ending
622625 state = ParserState.WORD;
@@ -628,6 +631,13 @@
629632 categories.add(new String(buffer,0,length));
630633 length = 0;
631634 fetch = FetchState.WORD;
 635+ // index category words
 636+ if(fetchStart != -1){
 637+ cur = fetchStart;
 638+ state = ParserState.CATEGORY_WORDS;
 639+ } else
 640+ System.err.print("ERROR: Inconsistent parser state, attepmted category backtrace for uninitalized fetchStart.");
 641+ fetchStart = -1;
632642 continue;
633643 case INTERWIKI:
634644 interwikis.put(prefix,
@@ -648,6 +658,22 @@
649659 continue;
650660 }
651661 continue;
 662+ case CATEGORY_WORDS:
 663+ if(c == ']'){
 664+ state = ParserState.WORD; // end of category
 665+ continue;
 666+ } else if(c == '|'){ // ignore everything up to ]
 667+ for( lookup = cur + 1 ; lookup < textLength ; lookup++ ){
 668+ if(text[lookup] == ']'){ // we know the syntax is correct since we checked it in LINK_FETCH
 669+ state = ParserState.WORD;
 670+ cur = lookup;
 671+ break;
 672+ }
 673+ }
 674+ continue;
 675+ }
 676+ addLetter();
 677+ continue;
652678 case TABLE_BEGIN:
653679 // ignore everything up to the newspace, since they are table display params
654680 while(cur < textLength && (text[cur]!='\r' && text[cur]!='\n'))
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/WikiQueryParser.java
@@ -105,6 +105,7 @@
106106 private NamespacePolicy namespacePolicy;
107107 protected NamespaceFilter defaultNamespaceFilter;
108108 protected static GlobalConfiguration global=null;
 109+ protected FieldBuilder.BuilderSet builder;
109110 protected FieldNameFactory fields;
110111
111112 /** default value for boolean queries */
@@ -130,8 +131,8 @@
131132 * @param field default field name
132133 * @param analyzer
133134 */
134 - public WikiQueryParser(String field, Analyzer analyzer, FieldNameFactory fields){
135 - this(field,(NamespaceFilter)null,analyzer,fields,NamespacePolicy.LEAVE);
 135+ public WikiQueryParser(String field, Analyzer analyzer, FieldBuilder.BuilderSet builder){
 136+ this(field,(NamespaceFilter)null,analyzer,builder,NamespacePolicy.LEAVE);
136137 }
137138
138139 /**
@@ -142,14 +143,15 @@
143144 * @param analyzer
144145 * @param nsPolicy
145146 */
146 - public WikiQueryParser(String field, String namespace, Analyzer analyzer, FieldNameFactory fields, NamespacePolicy nsPolicy){
147 - this(field,new NamespaceFilter(namespace),analyzer,fields,nsPolicy);
 147+ public WikiQueryParser(String field, String namespace, Analyzer analyzer, FieldBuilder.BuilderSet builder, NamespacePolicy nsPolicy){
 148+ this(field,new NamespaceFilter(namespace),analyzer,builder,nsPolicy);
148149 }
149150
150 - public WikiQueryParser(String field, NamespaceFilter nsfilter, Analyzer analyzer, FieldNameFactory fields, NamespacePolicy nsPolicy){
 151+ public WikiQueryParser(String field, NamespaceFilter nsfilter, Analyzer analyzer, FieldBuilder.BuilderSet builder, NamespacePolicy nsPolicy){
151152 defaultField = field;
152153 this.analyzer = analyzer;
153 - this.fields = fields;
 154+ this.builder = builder;
 155+ this.fields = builder.getFields();
154156 tokens = new ArrayList<Token>();
155157 this.namespacePolicy = nsPolicy;
156158 disableTitleAliases = true;
@@ -999,6 +1001,8 @@
10001002 } else if(q instanceof PhraseQuery){ // -> SpanNearQuery(slop=0,inOrder=true)
10011003 PhraseQuery pq = (PhraseQuery)q;
10021004 Term[] terms = pq.getTerms();
 1005+ if(terms == null || terms.length==0)
 1006+ continue;
10031007 if(terms[0].field().equals("category")){
10041008 categories.add(q);
10051009 } else{
@@ -1081,12 +1085,6 @@
10821086 defaultBoost = olfDefaultBoost;
10831087 defaultAliasBoost = ALIAS_BOOST;
10841088
1085 - BooleanQuery qs = multiplySpans(qt,0,fields.redirect(),REDIRECT_BOOST);
1086 - // merge queries
1087 - if(qs != null){
1088 - for(BooleanClause bc : qs.getClauses())
1089 - bq.add(bc);
1090 - }
10911089 if(bq.getClauses() == null || bq.getClauses().length==0)
10921090 return null;
10931091 else
@@ -1099,12 +1097,15 @@
11001098 String contentField = defaultField;
11011099 float olfDefaultBoost = defaultBoost;
11021100 defaultField = fields.title(); // now parse the title part
1103 - defaultBoost = TITLE_BOOST;
 1101+ if(ADD_STEM_TITLE && builder.getFilters().hasStemmer())
 1102+ defaultBoost = TITLE_BOOST; // we have stem titles
 1103+ else
 1104+ defaultBoost = TITLE_BOOST+STEM_TITLE_BOOST; // no stem titles, add-up boosts
11041105 defaultAliasBoost = TITLE_ALIAS_BOOST;
11051106 Query qt = parseRaw(queryText);
11061107 Query qs = null;
11071108 // stemmed title
1108 - if(ADD_STEM_TITLE){
 1109+ if(ADD_STEM_TITLE && builder.getFilters().hasStemmer()){
11091110 defaultField = fields.stemtitle();
11101111 defaultBoost = STEM_TITLE_BOOST;
11111112 defaultAliasBoost = STEM_TITLE_ALIAS_BOOST;
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/Analyzers.java
@@ -55,7 +55,7 @@
5656 PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
5757 WikiTokenizer tokenizer = null;
5858 for(FieldBuilder.BuilderSet bs : builder.getBuilders()){
59 - tokenizer = addFieldsForIndexing(perFieldAnalyzer,text,bs.getFilters(),bs.getFields(),redirects,bs.isExactCase());
 59+ tokenizer = addFieldsForIndexing(perFieldAnalyzer,text,bs.getFilters(),bs.getFields(),redirects,bs.isExactCase(),bs.isAddKeywords());
6060 }
6161 return new Object[] {perFieldAnalyzer,tokenizer};
6262 }
@@ -64,26 +64,30 @@
6565 * Add some fields to indexer's analyzer.
6666 *
6767 */
68 - public static WikiTokenizer addFieldsForIndexing(PerFieldAnalyzerWrapper perFieldAnalyzer, String text, FilterFactory filters, FieldNameFactory fields, ArrayList<String> redirects, boolean exactCase) {
 68+ public static WikiTokenizer addFieldsForIndexing(PerFieldAnalyzerWrapper perFieldAnalyzer, String text, FilterFactory filters, FieldNameFactory fields, ArrayList<String> redirects, boolean exactCase, boolean addKeywords) {
6969 // parse wiki-text to get categories
7070 WikiTokenizer tokenizer = new WikiTokenizer(text,filters.getLanguage(),exactCase);
7171 tokenizer.tokenize();
7272 ArrayList<String> categories = tokenizer.getCategories();
7373
 74+ ArrayList<String> allKeywords = new ArrayList<String>();
 75+ if(addKeywords && tokenizer.getKeywords()!=null)
 76+ allKeywords.addAll(tokenizer.getKeywords());
 77+ if(redirects!=null)
 78+ allKeywords.addAll(redirects);
 79+
7480 perFieldAnalyzer.addAnalyzer(fields.contents(),
7581 new LanguageAnalyzer(filters,tokenizer));
7682 perFieldAnalyzer.addAnalyzer("category",
77 - new CategoryAnalyzer(categories));
 83+ new CategoryAnalyzer(categories,exactCase));
7884 perFieldAnalyzer.addAnalyzer(fields.title(),
7985 getTitleAnalyzer(filters.getNoStemmerFilterFactory(),exactCase));
8086 perFieldAnalyzer.addAnalyzer(fields.stemtitle(),
8187 getTitleAnalyzer(filters,exactCase));
8288 setAltTitleAnalyzer(perFieldAnalyzer,fields.alttitle(),
8389 getTitleAnalyzer(filters.getNoStemmerFilterFactory(),exactCase));
84 - setKeywordAnalyzer(perFieldAnalyzer,fields.redirect(),
85 - new KeywordsAnalyzer(redirects,filters.getNoStemmerFilterFactory(),fields.redirect(),exactCase));
8690 setKeywordAnalyzer(perFieldAnalyzer,fields.keyword(),
87 - new KeywordsAnalyzer(tokenizer.getKeywords(),filters.getNoStemmerFilterFactory(),fields.keyword(),exactCase));
 91+ new KeywordsAnalyzer(allKeywords,filters.getNoStemmerFilterFactory(),fields.keyword(),exactCase));
8892 return tokenizer;
8993 }
9094
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/FieldNameFactory.java
@@ -46,13 +46,6 @@
4747 return "alttitle";
4848 }
4949
50 - public String redirect(){
51 - if(exactCase)
52 - return "redirect_exact";
53 - else
54 - return "redirect";
55 - }
56 -
5750 public String keyword(){
5851 if(exactCase)
5952 return "keyword_exact";
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/KeywordsAnalyzer.java
@@ -61,7 +61,7 @@
6262 keywordsBySize.add(new ArrayList<String>());
6363 // arange keywords into a list by token number
6464 for(String k : keywords){
65 - ArrayList<Token> parsed = new FastWikiTokenizerEngine(k,exactCase).parse();
 65+ ArrayList<Token> parsed = new FastWikiTokenizerEngine(k,filters.getLanguage(),exactCase).parse();
6666 if(parsed.size() == 0)
6767 continue;
6868 else if(parsed.size() < KEYWORD_LEVELS)
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/QueryLanguageAnalyzer.java
@@ -25,7 +25,7 @@
2626 */
2727 @Override
2828 public TokenStream tokenStream(String fieldName, String text) {
29 - wikitokenizer = new WikiTokenizer(text,exactCase);
 29+ wikitokenizer = new WikiTokenizer(text,filters.getLanguage(),exactCase);
3030 return super.tokenStream(fieldName,(Reader)null);
3131 }
3232
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/analyzers/CategoryAnalyzer.java
@@ -19,6 +19,7 @@
2020 protected Iterator<String> tokensIt;
2121 protected int start;
2222
 23+
2324 ArrayTokenStream(ArrayList<String> tokens){
2425 this.tokens = tokens;
2526 tokensIt = tokens.iterator();
@@ -28,7 +29,12 @@
2930 @Override
3031 public Token next() throws IOException {
3132 if(tokensIt.hasNext()){
32 - String text = tokensIt.next();
 33+ String text;
 34+ if(!exactCase)
 35+ text = tokensIt.next().toLowerCase();
 36+ else
 37+ text = tokensIt.next();
 38+
3339 Token token = new Token(text,start,start+text.length());
3440 start += text.length()+1;
3541 return token;
@@ -39,9 +45,11 @@
4046 }
4147
4248 ArrayList<String> categories;
 49+ protected boolean exactCase;
4350
44 - public CategoryAnalyzer(ArrayList<String> categories) {
 51+ public CategoryAnalyzer(ArrayList<String> categories, boolean exactCase) {
4552 this.categories = categories;
 53+ this.exactCase = exactCase;
4654 }
4755
4856 @Override
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/oai/IncrementalUpdater.java
@@ -60,14 +60,15 @@
6161
6262 /**
6363 * Syntax:
64 - * java IncrementalUpdater [-d] [-t timestamp] [-s sleep] [-f dblist] [-e dbname] [-n] dbname1 dbname2 ...
 64+ * java IncrementalUpdater [-d] [-t timestamp] [-s sleep] [-f dblist] [-e dbname] [-n] [--no-ranks] dbname1 dbname2 ...
6565 * Options:
6666 * -d - daemonize, otherwise runs only one round of updates to dbs
67 - * -s - sleep time after one cycle (default: 30000ms)
 67+ * -s - sleep time after one cycle (default: 30s)
6868 * -t - default timestamp if status file is missing (default: 2001-01-01)
6969 * -f - file to read databases from
7070 * -n - wait for notification of flush after done updating one db (default: true)
7171 * -e - exclude dbname from incremental updates (overrides -f)
 72+ * --no-ranks - don't fetch ranks
7273 *
7374 * @param args
7475 */
@@ -81,12 +82,13 @@
8283 boolean notification = true;
8384 HashSet<String> excludeList = new HashSet<String>();
8485 HashSet<String> firstPass = new HashSet<String>(); // if dbname is here, then it's our update pass
 86+ boolean fetchReferences = true;
8587 // args
8688 for(int i=0; i<args.length; i++){
8789 if(args[i].equals("-d"))
8890 daemon = true;
8991 else if(args[i].equals("-s"))
90 - sleepTime = Long.parseLong(args[++i]);
 92+ sleepTime = Long.parseLong(args[++i])*1000;
9193 else if(args[i].equals("-t"))
9294 timestamp = args[++i];
9395 else if(args[i].equals("-f"))
@@ -95,6 +97,8 @@
9698 excludeList.add(args[++i]);
9799 else if(args[i].equals("-n"))
98100 notification = true;
 101+ else if(args[i].equals("--no-ranks"))
 102+ fetchReferences = false;
99103 else if(args[i].equals("--help"))
100104 break;
101105 else if(args[i].startsWith("-")){
@@ -119,14 +123,15 @@
120124 }
121125 }
122126 if(dbnames.size() == 0){
123 - System.out.println("Syntax: java IncrementalUpdater [-d] [-s sleep] [-t timestamp] [-e dbname] [-f dblist] dbname1 dbname2 ...");
 127+ System.out.println("Syntax: java IncrementalUpdater [-d] [-s sleep] [-t timestamp] [-e dbname] [-f dblist] [-n] [--no-ranks] dbname1 dbname2 ...");
124128 System.out.println("Options:");
125129 System.out.println(" -d - daemonize, otherwise runs only one round of updates to dbs");
126 - System.out.println(" -s - sleep time after one cycle (default: "+sleepTime+"ms)");
 130+ System.out.println(" -s - sleep time in seconds after one cycle (default: "+sleepTime+"ms)");
127131 System.out.println(" -t - timestamp to start from (if status is missing default: "+timestamp+")");
128132 System.out.println(" -f - dblist file, one dbname per line");
129133 System.out.println(" -n - wait for notification of flush after done updating one db (default: "+notification+")");
130134 System.out.println(" -e - exclude dbname from incremental updates (overrides -f)");
 135+ System.out.println(" --no-ranks - don't try to fetch any article rank data");
131136 return;
132137 }
133138 // config
@@ -173,8 +178,10 @@
174179 continue;
175180 boolean hasMore = false;
176181 do{
177 - // fetch references for records
178 - fetchReferences(records,dbname);
 182+ if(fetchReferences){
 183+ // fetch references for records
 184+ fetchReferences(records,dbname);
 185+ }
179186 for(IndexUpdateRecord rec : records){
180187 Article ar = rec.getArticle();
181188 log.info("Sending "+ar+" with rank "+ar.getReferences()+" and "+ar.getRedirects().size()+" redirects: "+ar.getRedirects());
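
A hypothetical invocation using the new flag (classpath setup omitted; the dbname is a placeholder), noting that -s is now interpreted in seconds rather than milliseconds:

  java org.wikimedia.lsearch.oai.IncrementalUpdater -d -s 30 --no-ranks mydb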
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/util/Localization.java
@@ -123,10 +123,12 @@
124124 log.warn("Property Localization.url not set in config file. Localization disabled.");
125125 return false;
126126 }
 127+ if(!loc.endsWith("/"))
 128+ loc += "/";
127129 log.info("Reading localization for "+langCode);
128130 URL url;
129131 try {
130 - url = new URL(MessageFormat.format(loc,langCode));
 132+ url = new URL(MessageFormat.format(loc+"Messages{0}.php",langCode));
131133
132134 PHPParser parser = new PHPParser();
133135 String text = parser.readURL(url);
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/FastWikiTokenizerTest.java
@@ -86,7 +86,7 @@
8787 showTokens(text);
8888 text = "This are [[bean]]s and more [[bla]]njah also Großmann";
8989 showTokens(text);
90 - text = "[[Category:Blah Blah?!]], and [[:Category:Link to something]]";
 90+ text = "[[Category:Blah Blah?!]], and [[:Category:Link to something]] [[Category:Mathematics|Name]]";
9191 showTokens(text);
9292 text = "[[sr:Glavna stranica]], and [[:Category:Link to category]]";
9393 showTokens(text);
@@ -114,7 +114,7 @@
115115 for(int i=0;i<2000;i++){
116116 for(TestArticle article : articles){
117117 String text = article.content;
118 - FastWikiTokenizerEngine parser = new FastWikiTokenizerEngine(text,false);
 118+ FastWikiTokenizerEngine parser = new FastWikiTokenizerEngine(text,"en",false);
119119 parser.parse();
120120 }
121121 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/EnglishAnalyzer.java
@@ -58,6 +58,6 @@
5959 if(streams.get(fieldName) != null)
6060 return streams.get(fieldName);
6161
62 - return new AliasPorterStemFilter(new WikiTokenizer(text,false));
 62+ return new AliasPorterStemFilter(new WikiTokenizer(text,"en",false));
6363 }
6464 }
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/WikiQueryParserTest.java
@@ -10,6 +10,7 @@
1111 import org.apache.lucene.search.BooleanQuery;
1212 import org.apache.lucene.search.Query;
1313 import org.wikimedia.lsearch.analyzers.Analyzers;
 14+import org.wikimedia.lsearch.analyzers.FieldBuilder;
1415 import org.wikimedia.lsearch.analyzers.FieldNameFactory;
1516 import org.wikimedia.lsearch.analyzers.WikiQueryParser;
1617 import org.wikimedia.lsearch.analyzers.WikiQueryParser.NamespacePolicy;
@@ -37,10 +38,10 @@
3839 WikiQueryParser.ALT_TITLE_BOOST = 6;
3940 WikiQueryParser.KEYWORD_BOOST = 0.05f;
4041 WikiIndexModifier.ALT_TITLES = 3;
41 - WikiQueryParser.ADD_STEM_TITLE=false;
 42+ FieldBuilder.BuilderSet bs = new FieldBuilder("").getBuilder();
4243 FieldNameFactory ff = new FieldNameFactory();
4344 try{
44 - WikiQueryParser parser = new WikiQueryParser(ff.contents(),new SimpleAnalyzer(),ff);
 45+ WikiQueryParser parser = new WikiQueryParser(bs.getFields().contents(),new SimpleAnalyzer(),bs);
4546 Query q;
4647 HashSet<String> fields;
4748
@@ -115,11 +116,11 @@
116117 assertTrue(fields.contains("contents"));
117118
118119 // namespace policies
119 - parser = new WikiQueryParser(ff.contents(),"0",new SimpleAnalyzer(), ff, WikiQueryParser.NamespacePolicy.IGNORE);
 120+ parser = new WikiQueryParser(ff.contents(),"0",new SimpleAnalyzer(), bs, WikiQueryParser.NamespacePolicy.IGNORE);
120121 q = parser.parseRaw("help:making breakfast incategory:food");
121122 assertEquals("+contents:making +contents:breakfast +category:food",q.toString());
122123
123 - parser = new WikiQueryParser(ff.contents(),"0",new SimpleAnalyzer(), ff, WikiQueryParser.NamespacePolicy.REWRITE);
 124+ parser = new WikiQueryParser(ff.contents(),"0",new SimpleAnalyzer(), bs, WikiQueryParser.NamespacePolicy.REWRITE);
124125 q = parser.parseRaw("help:making breakfast incategory:food");
125126 assertEquals("+namespace:12 +(+contents:making +contents:breakfast +category:food)",q.toString());
126127
@@ -141,7 +142,7 @@
142143
143144 // ====== English Analyzer ========
144145
145 - parser = new WikiQueryParser(ff.contents(),"0",new EnglishAnalyzer(), ff, WikiQueryParser.NamespacePolicy.REWRITE);
 146+ parser = new WikiQueryParser(ff.contents(),"0",new EnglishAnalyzer(), bs, WikiQueryParser.NamespacePolicy.REWRITE);
146147 q = parser.parseRaw("main_talk:laziness");
147148 assertEquals("+namespace:1 +(contents:laziness contents:lazi^0.5)",q.toString());
148149
@@ -157,7 +158,7 @@
158159 q = parser.parse("(help:making something incategory:blah) OR (rest incategory:crest)");
159160 assertEquals("(+namespace:12 +(+(+(contents:making contents:make^0.5) title:making^2.0) +(+(contents:something contents:someth^0.5) title:something^2.0) +category:blah)) (+namespace:0 +(+(+contents:rest +category:crest) title:rest^2.0))",q.toString());
160161
161 - parser = new WikiQueryParser(ff.contents(),new EnglishAnalyzer(),ff);
 162+ parser = new WikiQueryParser(ff.contents(),new EnglishAnalyzer(),bs);
162163
163164 q = parser.parseRaw("laziness");
164165 assertEquals("contents:laziness contents:lazi^0.5",q.toString());
@@ -207,7 +208,10 @@
208209 // Tests with actual params :)
209210 // ==================================
210211 Analyzer analyzer = Analyzers.getSearcherAnalyzer("en");
211 - parser = new WikiQueryParser(ff.contents(),"0",analyzer,ff,NamespacePolicy.LEAVE);
 212+ bs = new FieldBuilder("en").getBuilder();
 213+ parser = new WikiQueryParser(bs.getFields().contents(),"0",analyzer,bs,NamespacePolicy.LEAVE);
 214+ WikiQueryParser.ADD_STEM_TITLE = false;
 215+ WikiQueryParser.STEM_TITLE_BOOST = 0;
212216 q = parser.parseTwoPass("beans everyone",null);
213217 assertEquals("(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5)) (+title:beans^2.0 +title:everyone^2.0)",q.toString());
214218
@@ -289,14 +293,14 @@
290294
291295 // Redirect third/forth pass tests
292296 q = parser.parseFourPass("beans",NamespacePolicy.IGNORE,true);
293 - assertEquals("(contents:beans contents:bean^0.5) title:beans^2.0 (alttitle1:beans^6.0 alttitle2:beans^6.0 alttitle3:beans^6.0 redirect1:beans^0.2 redirect2:beans^0.1 redirect3:beans^0.06666667 redirect4:beans^0.05 redirect5:beans^0.04) (keyword1:beans^0.05 keyword2:beans^0.025 keyword3:beans^0.016666668 keyword4:beans^0.0125 keyword5:beans^0.01)",q.toString());
 297+ assertEquals("(contents:beans contents:bean^0.5) title:beans^2.0 (alttitle1:beans^6.0 alttitle2:beans^6.0 alttitle3:beans^6.0) (keyword1:beans^0.05 keyword2:beans^0.025 keyword3:beans^0.016666668 keyword4:beans^0.0125 keyword5:beans^0.01)",q.toString());
294298
295299 q = parser.parseFourPass("beans everyone",NamespacePolicy.IGNORE,true);
296 - assertEquals("(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5)) (+title:beans^2.0 +title:everyone^2.0) ((+alttitle1:beans^6.0 +alttitle1:everyone^6.0) (+alttitle2:beans^6.0 +alttitle2:everyone^6.0) (+alttitle3:beans^6.0 +alttitle3:everyone^6.0) spanNear([redirect1:beans, redirect1:everyone], 100, false)^0.2 spanNear([redirect2:beans, redirect2:everyone], 100, false)^0.1 spanNear([redirect3:beans, redirect3:everyone], 100, false)^0.06666667 spanNear([redirect4:beans, redirect4:everyone], 100, false)^0.05 spanNear([redirect5:beans, redirect5:everyone], 100, false)^0.04) (spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05 spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025 spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668 spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125 spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01)",q.toString());
 300+ assertEquals("(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5)) (+title:beans^2.0 +title:everyone^2.0) ((+alttitle1:beans^6.0 +alttitle1:everyone^6.0) (+alttitle2:beans^6.0 +alttitle2:everyone^6.0) (+alttitle3:beans^6.0 +alttitle3:everyone^6.0)) (spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05 spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025 spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668 spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125 spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01)",q.toString());
297301
298302 // TODO: check if this query will be optimized by lucene (categories)
299303 q = parser.parseFourPass("beans everyone incategory:mouse",NamespacePolicy.IGNORE,true);
300 - assertEquals("(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5) +category:mouse) (+title:beans^2.0 +title:everyone^2.0 +category:mouse) ((+alttitle1:beans^6.0 +alttitle1:everyone^6.0 +category:mouse) (+alttitle2:beans^6.0 +alttitle2:everyone^6.0 +category:mouse) (+alttitle3:beans^6.0 +alttitle3:everyone^6.0 +category:mouse) (+spanNear([redirect1:beans, redirect1:everyone], 100, false)^0.2 +category:mouse) (+spanNear([redirect2:beans, redirect2:everyone], 100, false)^0.1 +category:mouse) (+spanNear([redirect3:beans, redirect3:everyone], 100, false)^0.06666667 +category:mouse) (+spanNear([redirect4:beans, redirect4:everyone], 100, false)^0.05 +category:mouse) (+spanNear([redirect5:beans, redirect5:everyone], 100, false)^0.04 +category:mouse)) ((+spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05 +category:mouse) (+spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025 +category:mouse) (+spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668 +category:mouse) (+spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125 +category:mouse) (+spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01 +category:mouse))",q.toString());
 304+ assertEquals("(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5) +category:mouse) (+title:beans^2.0 +title:everyone^2.0 +category:mouse) ((+alttitle1:beans^6.0 +alttitle1:everyone^6.0 +category:mouse) (+alttitle2:beans^6.0 +alttitle2:everyone^6.0 +category:mouse) (+alttitle3:beans^6.0 +alttitle3:everyone^6.0 +category:mouse)) ((+spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05 +category:mouse) (+spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025 +category:mouse) (+spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668 +category:mouse) (+spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125 +category:mouse) (+spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01 +category:mouse))",q.toString());
301305
302306 q = parser.parseFourPass("beans OR everyone",NamespacePolicy.IGNORE,true);
303307 assertEquals("((contents:beans contents:bean^0.5) (contents:everyone contents:everyon^0.5)) (title:beans^2.0 title:everyone^2.0) ((alttitle1:beans^6.0 alttitle1:everyone^6.0) (alttitle2:beans^6.0 alttitle2:everyone^6.0) (alttitle3:beans^6.0 alttitle3:everyone^6.0))",q.toString());
@@ -305,7 +309,7 @@
306310 assertEquals("(+(contents:beans contents:bean^0.5) -(contents:everyone)) (+title:beans^2.0 -title:everyone^2.0) ((+alttitle1:beans^6.0 -alttitle1:everyone^6.0) (+alttitle2:beans^6.0 -alttitle2:everyone^6.0) (+alttitle3:beans^6.0 -alttitle3:everyone^6.0))",q.toString());
307311
308312 q = parser.parseFourPass("[0,1,2]:beans everyone",NamespacePolicy.REWRITE,true);
309 - assertEquals("(+(namespace:0 namespace:1 namespace:2) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:0 namespace:1 namespace:2) +(+title:beans^2.0 +title:everyone^2.0)) ((+(namespace:0 namespace:1 namespace:2) +(+alttitle1:beans^6.0 +alttitle1:everyone^6.0)) (+(namespace:0 namespace:1 namespace:2) +(+alttitle2:beans^6.0 +alttitle2:everyone^6.0)) (+(namespace:0 namespace:1 namespace:2) +(+alttitle3:beans^6.0 +alttitle3:everyone^6.0)) (+(namespace:0 namespace:1 namespace:2) +spanNear([redirect1:beans, redirect1:everyone], 100, false)^0.2) (+(namespace:0 namespace:1 namespace:2) +spanNear([redirect2:beans, redirect2:everyone], 100, false)^0.1) (+(namespace:0 namespace:1 namespace:2) +spanNear([redirect3:beans, redirect3:everyone], 100, false)^0.06666667) (+(namespace:0 namespace:1 namespace:2) +spanNear([redirect4:beans, redirect4:everyone], 100, false)^0.05) (+(namespace:0 namespace:1 namespace:2) +spanNear([redirect5:beans, redirect5:everyone], 100, false)^0.04)) ((+(namespace:0 namespace:1 namespace:2) +spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01))",q.toString());
 313+ assertEquals("(+(namespace:0 namespace:1 namespace:2) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:0 namespace:1 namespace:2) +(+title:beans^2.0 +title:everyone^2.0)) ((+(namespace:0 namespace:1 namespace:2) +(+alttitle1:beans^6.0 +alttitle1:everyone^6.0)) (+(namespace:0 namespace:1 namespace:2) +(+alttitle2:beans^6.0 +alttitle2:everyone^6.0)) (+(namespace:0 namespace:1 namespace:2) +(+alttitle3:beans^6.0 +alttitle3:everyone^6.0))) ((+(namespace:0 namespace:1 namespace:2) +spanNear([keyword1:beans, keyword1:everyone], 100, false)^0.05) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword2:beans, keyword2:everyone], 100, false)^0.025) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword3:beans, keyword3:everyone], 100, false)^0.016666668) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword4:beans, keyword4:everyone], 100, false)^0.0125) (+(namespace:0 namespace:1 namespace:2) +spanNear([keyword5:beans, keyword5:everyone], 100, false)^0.01))",q.toString());
310314
311315 q = parser.parseFourPass("[0,1,2]:beans everyone [0]:mainly",NamespacePolicy.REWRITE,true);
312316 assertEquals("((+(namespace:0 namespace:1 namespace:2) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+namespace:0 +(contents:mainly contents:main^0.5))) ((+(namespace:0 namespace:1 namespace:2) +(+title:beans^2.0 +title:everyone^2.0)) (+namespace:0 +title:mainly^2.0)) (((+(namespace:0 namespace:1 namespace:2) +(+alttitle1:beans^6.0 +alttitle1:everyone^6.0)) (+namespace:0 +alttitle1:mainly^6.0)) ((+(namespace:0 namespace:1 namespace:2) +(+alttitle2:beans^6.0 +alttitle2:everyone^6.0)) (+namespace:0 +alttitle2:mainly^6.0)) ((+(namespace:0 namespace:1 namespace:2) +(+alttitle3:beans^6.0 +alttitle3:everyone^6.0)) (+namespace:0 +alttitle3:mainly^6.0)))",q.toString());
@@ -315,53 +319,61 @@
316320
317321 // alternative transliterations
318322 q = parser.parseFourPass("Something for Gödels",NamespacePolicy.IGNORE,true);
319 - assertEquals("(+(contents:something contents:someth^0.5) +contents:for +((contents:gödels contents:gödel^0.5) (contents:godels contents:godel^0.5) (contents:goedels contents:goedel^0.5))) (+title:something^2.0 +title:for^2.0 +((title:gödels^2.0 title:godels^2.0 title:goedels^2.0))) ((+alttitle1:something^6.0 +alttitle1:for^6.0 +((alttitle1:gödels^6.0 alttitle1:godels^6.0 alttitle1:goedels^6.0))) (+alttitle2:something^6.0 +alttitle2:for^6.0 +((alttitle2:gödels^6.0 alttitle2:godels^6.0 alttitle2:goedels^6.0))) (+alttitle3:something^6.0 +alttitle3:for^6.0 +((alttitle3:gödels^6.0 alttitle3:godels^6.0 alttitle3:goedels^6.0))))",q.toString());
 323+ assertEquals("(+(contents:something contents:someth^0.5) +contents:for +(+(contents:godels contents:godel^0.5) (contents:goedels contents:goedel^0.5))) (+title:something^2.0 +title:for^2.0 +(title:godels^2.0 title:goedels^2.0)) ((+alttitle1:something^6.0 +alttitle1:for^6.0 +(alttitle1:godels^6.0 alttitle1:goedels^6.0)) (+alttitle2:something^6.0 +alttitle2:for^6.0 +(alttitle2:godels^6.0 alttitle2:goedels^6.0)) (+alttitle3:something^6.0 +alttitle3:for^6.0 +(alttitle3:godels^6.0 alttitle3:goedels^6.0)))",q.toString());
320324
321325 q = parser.parseFourPass("Something for Gödel",NamespacePolicy.IGNORE,true);
322 - assertEquals("(+(contents:something contents:someth^0.5) +contents:for +((contents:gödel contents:godel contents:goedel))) (+title:something^2.0 +title:for^2.0 +((title:gödel^2.0 title:godel^2.0 title:goedel^2.0))) ((+alttitle1:something^6.0 +alttitle1:for^6.0 +((alttitle1:gödel^6.0 alttitle1:godel^6.0 alttitle1:goedel^6.0))) (+alttitle2:something^6.0 +alttitle2:for^6.0 +((alttitle2:gödel^6.0 alttitle2:godel^6.0 alttitle2:goedel^6.0))) (+alttitle3:something^6.0 +alttitle3:for^6.0 +((alttitle3:gödel^6.0 alttitle3:godel^6.0 alttitle3:goedel^6.0))))",q.toString());
 326+ assertEquals("(+(contents:something contents:someth^0.5) +contents:for +(contents:godel contents:goedel)) (+title:something^2.0 +title:for^2.0 +(title:godel^2.0 title:goedel^2.0)) ((+alttitle1:something^6.0 +alttitle1:for^6.0 +(alttitle1:godel^6.0 alttitle1:goedel^6.0)) (+alttitle2:something^6.0 +alttitle2:for^6.0 +(alttitle2:godel^6.0 alttitle2:goedel^6.0)) (+alttitle3:something^6.0 +alttitle3:for^6.0 +(alttitle3:godel^6.0 alttitle3:goedel^6.0)))",q.toString());
323327
 328+ // Backward compatibility for complex filters
 329+ analyzer = Analyzers.getSearcherAnalyzer("en");
 330+ bs = new FieldBuilder("en").getBuilder();
 331+ parser = new WikiQueryParser(bs.getFields().contents(),"0,1,4,12",analyzer,bs,NamespacePolicy.IGNORE);
 332+
 333+ q = parser.parseTwoPass("beans everyone",NamespacePolicy.REWRITE);
 334+ assertEquals("(+(namespace:0 namespace:1 namespace:4 namespace:12) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:0 namespace:1 namespace:4 namespace:12) +(+title:beans^2.0 +title:everyone^2.0))",q.toString());
 335+
 336+ q = parser.parseTwoPass("beans main:everyone",NamespacePolicy.REWRITE);
 337+ assertEquals("((+(namespace:0 namespace:1 namespace:4 namespace:12) +(contents:beans contents:bean^0.5)) (+namespace:0 +(contents:everyone contents:everyon^0.5))) ((+(namespace:0 namespace:1 namespace:4 namespace:12) +title:beans^2.0) (+namespace:0 +title:everyone^2.0))",q.toString());
 338+
 339+ q = parser.parseTwoPass("beans everyone incategory:cheeses",NamespacePolicy.REWRITE);
 340+ assertEquals("(+(namespace:0 namespace:1 namespace:4 namespace:12) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5) +category:cheeses)) (+(namespace:0 namespace:1 namespace:4 namespace:12) +(+title:beans^2.0 +title:everyone^2.0 +category:cheeses))",q.toString());
 341+
 342+ q = parser.parseTwoPass("all_talk: beans everyone",NamespacePolicy.REWRITE);
 343+ assertEquals("(+(namespace:1 namespace:3 namespace:5 namespace:7 namespace:9 namespace:11 namespace:13 namespace:15) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:1 namespace:3 namespace:5 namespace:7 namespace:9 namespace:11 namespace:13 namespace:15) +(+title:beans^2.0 +title:everyone^2.0))",q.toString());
 344+
 345+
324346 // Test field extraction
325347 HashSet<NamespaceFilter> fs = parser.getFieldNamespaces("main:something [1]:else all:oh []:nja");
326348 assertEquals(3,fs.size());
327349 assertTrue(fs.contains(new NamespaceFilter("0")));
328350 assertTrue(fs.contains(new NamespaceFilter("1")));
329351 assertTrue(fs.contains(new NamespaceFilter()));
 352+
 353+ WikiQueryParser.ADD_STEM_TITLE = true;
 354+ WikiQueryParser.STEM_TITLE_BOOST = 1;
330355
331356 // Localization tests
332357 analyzer = Analyzers.getSearcherAnalyzer("sr");
333 - parser = new WikiQueryParser(ff.contents(),"0",analyzer,ff,NamespacePolicy.LEAVE);
 358+ bs = new FieldBuilder("sr").getBuilder();
 359+ parser = new WikiQueryParser(bs.getFields().contents(),"0",analyzer,bs,NamespacePolicy.LEAVE);
334360
335361 q = parser.parseTwoPass("all:добродошли на википедију",NamespacePolicy.IGNORE);
336 - assertEquals("(+(contents:добродошли contents:dobrodosli^0.5) +(contents:на contents:na^0.5) +(contents:википедију contents:vikipediju^0.5)) (+(title:добродошли^2.0 title:dobrodosli^0.4) +(title:на^2.0 title:na^0.4) +(title:википедију^2.0 title:vikipediju^0.4))",q.toString());
 362+ assertEquals("(+(contents:добродошли contents:dobrodosli^0.5) +(contents:на contents:na^0.5) +(contents:википедију contents:vikipediju^0.5)) (+(title:добродошли^3.0 title:dobrodosli^0.6) +(title:на^3.0 title:na^0.6) +(title:википедију^3.0 title:vikipediju^0.6))",q.toString());
337363
338364 q = parser.parseTwoPass("all:dobrodošli na šđčćž",NamespacePolicy.IGNORE);
339 - assertEquals("(+(contents:dobrodošli contents:dobrodosli) +contents:na +(+contents:šdjčćž +contents:sdjccz)) (+(title:dobrodošli^2.0 title:dobrodosli^2.0) +title:na^2.0 +(+title:šdjčćž^2.0 +title:sdjccz^2.0))",q.toString());
 365+ assertEquals("(+contents:dobrodosli +contents:na +contents:sdjccz) (+title:dobrodosli^3.0 +title:na^3.0 +title:sdjccz^3.0)",q.toString());
340366
341367 analyzer = Analyzers.getSearcherAnalyzer("th");
342 - parser = new WikiQueryParser(ff.contents(),"0",analyzer,ff,NamespacePolicy.LEAVE);
 368+ bs = new FieldBuilder("th").getBuilder();
 369+ parser = new WikiQueryParser(bs.getFields().contents(),"0",analyzer,bs,NamespacePolicy.LEAVE);
343370
344371 q = parser.parseTwoPass("ภาษาไทย",NamespacePolicy.IGNORE);
345 - assertEquals("(+contents:ภาษา +contents:ไทย) (+title:ภาษา^2.0 +title:ไทย^2.0)",q.toString());
 372+ assertEquals("(+contents:ภาษา +contents:ไทย) (+title:ภาษา^3.0 +title:ไทย^3.0)",q.toString());
346373
347374 q = parser.parseTwoPass("help:ภาษาไทย",NamespacePolicy.REWRITE);
348 - assertEquals("(+namespace:12 +(+contents:ภาษา +contents:ไทย)) (+namespace:12 +(+title:ภาษา^2.0 +title:ไทย^2.0))",q.toString());
 375+ assertEquals("(+namespace:12 +(+contents:ภาษา +contents:ไทย)) (+namespace:12 +(+title:ภาษา^3.0 +title:ไทย^3.0))",q.toString());
349376
350 - // Backward compatibility for complex filters
351 - analyzer = Analyzers.getSearcherAnalyzer("en");
352 - parser = new WikiQueryParser(ff.contents(),"0,1,4,12",analyzer,ff,NamespacePolicy.IGNORE);
353377
354 - q = parser.parseTwoPass("beans everyone",NamespacePolicy.REWRITE);
355 - assertEquals("(+(namespace:0 namespace:1 namespace:4 namespace:12) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:0 namespace:1 namespace:4 namespace:12) +(+title:beans^2.0 +title:everyone^2.0))",q.toString());
356 -
357 - q = parser.parseTwoPass("beans main:everyone",NamespacePolicy.REWRITE);
358 - assertEquals("((+(namespace:0 namespace:1 namespace:4 namespace:12) +(contents:beans contents:bean^0.5)) (+namespace:0 +(contents:everyone contents:everyon^0.5))) ((+(namespace:0 namespace:1 namespace:4 namespace:12) +title:beans^2.0) (+namespace:0 +title:everyone^2.0))",q.toString());
359 -
360 - q = parser.parseTwoPass("beans everyone incategory:cheeses",NamespacePolicy.REWRITE);
361 - assertEquals("(+(namespace:0 namespace:1 namespace:4 namespace:12) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5) +category:cheeses)) (+(namespace:0 namespace:1 namespace:4 namespace:12) +(+title:beans^2.0 +title:everyone^2.0 +category:cheeses))",q.toString());
362 -
363 - q = parser.parseTwoPass("all_talk: beans everyone",NamespacePolicy.REWRITE);
364 - assertEquals("(+(namespace:1 namespace:3 namespace:5 namespace:7 namespace:9 namespace:11 namespace:13 namespace:15) +(+(contents:beans contents:bean^0.5) +(contents:everyone contents:everyon^0.5))) (+(namespace:1 namespace:3 namespace:5 namespace:7 namespace:9 namespace:11 namespace:13 namespace:15) +(+title:beans^2.0 +title:everyone^2.0))",q.toString());
365 -
366378 } catch(Exception e){
367379 e.printStackTrace();
368380 }
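A minimal sketch (not part of this revision) of the parser setup the tests above now exercise: the searcher analyzer and a FieldBuilder are created per language, and the parser is built from the builder's contents field. The BuilderSet type, the nested NamespacePolicy enum, the Lucene Analyzer/Query types and the import paths are assumptions, since their declarations lie outside these hunks.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.Query;
    import org.wikimedia.lsearch.analyzers.Analyzers;
    import org.wikimedia.lsearch.analyzers.FieldBuilder;
    import org.wikimedia.lsearch.analyzers.WikiQueryParser;
    import org.wikimedia.lsearch.analyzers.WikiQueryParser.NamespacePolicy;

    public class ParserSetupSketch {
        public static void main(String[] args) throws Exception {
            // searcher-side analyzer and field builder for English, as in the test fixture above
            Analyzer analyzer = Analyzers.getSearcherAnalyzer("en");
            FieldBuilder.BuilderSet bs = new FieldBuilder("en").getBuilder();
            // default namespace "0"; REWRITE turns namespace prefixes into
            // namespace: clauses, as the assertions above show
            WikiQueryParser parser = new WikiQueryParser(bs.getFields().contents(),
                    "0", analyzer, bs, NamespacePolicy.LEAVE);
            Query q = parser.parseTwoPass("beans everyone", NamespacePolicy.REWRITE);
            System.out.println(q);
        }
    }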
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/test/GlobalConfigurationTest.java
@@ -119,20 +119,20 @@
120120
121121 String[] ssr = (String[]) sr.toArray(new String [] {} );
122122
123 - assertEquals("entest",ssr[0]);
124 - assertEquals("entest.mainpart",ssr[1]);
125 - assertEquals("entest.restpart",ssr[2]);
126 - assertEquals("rutest",ssr[3]);
127 - assertEquals(4,ssr.length);
 123+ assertEquals("entest.mainpart",ssr[0]);
 124+ assertEquals("entest.restpart",ssr[1]);
 125+ assertEquals("rutest",ssr[2]);
 126+ assertEquals(3,ssr.length);
128127
129128 // search groups
130129 Hashtable<Integer,Hashtable<String,ArrayList<String>>> sg = testgc.getSearchGroups();
131130
132131 Hashtable<String,ArrayList<String>> g0 = sg.get(new Integer(0));
133 - assertEquals("{192.168.0.5=[entest.mainpart, entest.restpart], 192.168.0.2=[entest, entest.mainpart]}",g0.toString());
 132+ assertEquals("{192.168.0.5=[entest.mainpart, entest.restpart], 192.168.0.2=[entest.mainpart]}",g0.toString());
134133 Hashtable<String,ArrayList<String>> g1 = sg.get(new Integer(1));
135 - assertEquals("{192.168.0.6=[frtest.part3, detest], 192.168.0.4=[frtest.part1, frtest.part2]}",g1.toString());
 134+ assertEquals("{192.168.0.6=[frtest.part3, detest], 192.168.0.4=[frtest.part1, frtest.part2]}",g1.toString());
136135
 136+
137137 // index
138138 Hashtable index = testgc.getIndex();
139139 ArrayList ir = (ArrayList) index.get("192.168.0.5");
@@ -251,6 +251,7 @@
252252 assertEquals("njawiki.nspart3",njawiki.getPartByNamespace("4").toString());
253253 assertEquals("njawiki.nspart1",njawiki.getPartByNamespace("0").toString());
254254 assertEquals("njawiki.nspart2",njawiki.getPartByNamespace("12").toString());
 255+ assertEquals("[192.168.0.1]",njawiki.getSearchHosts().toString());
255256
256257 IndexId njawiki2 = IndexId.get("njawiki.nspart2");
257258 assertFalse(njawiki2.isLogical());
@@ -258,6 +259,7 @@
259260 assertTrue(njawiki2.isNssplit());
260261 assertEquals(3,njawiki2.getSplitFactor());
261262 assertEquals(2,njawiki2.getPartNum());
 263+ assertEquals("[192.168.0.1]",njawiki2.getSearchHosts().toString());
262264
263265 }
264266 }
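A small sketch (not part of this revision) of the lookup the new assertions cover: a namespace-split part resolved by name reports its part number, split factor and search hosts. It assumes the global configuration has already been loaded; that step is outside these hunks.

    import org.wikimedia.lsearch.config.IndexId;

    public class SearchHostsSketch {
        public static void main(String[] args) {
            // global configuration is assumed to be initialized elsewhere (not shown here)
            IndexId part = IndexId.get("njawiki.nspart2");
            if (!part.isLogical() && part.isNssplit()) {
                System.out.println("part " + part.getPartNum() + " of " + part.getSplitFactor()
                        + ", searched on " + part.getSearchHosts());
            }
        }
    }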
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessengerClient.java
@@ -163,11 +163,11 @@
164164 }
165165 }
166166
167 - public SearchResults searchPart(IndexId iid, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain, String host){
 167+ public SearchResults searchPart(IndexId iid, String searchterm, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain, String host){
168168 try {
169169 RMIMessenger r = messengerFromCache(host);
170170 log.debug("Calling searchPart("+iid+",("+query+"),"+offset+","+limit+") on "+host);
171 - SearchResults res = r.searchPart(iid.toString(),query,filter,offset,limit,explain);
 171+ SearchResults res = r.searchPart(iid.toString(),searchterm,query,filter,offset,limit,explain);
172172 log.debug(" \\-> got: "+res);
173173 return res;
174174 } catch (Exception e) {
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessenger.java
@@ -63,7 +63,7 @@
6464 * @param limit
6565 * @throws RemoteException
6666 */
67 - public SearchResults searchPart(String dbrole, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain) throws RemoteException;
 67+ public SearchResults searchPart(String dbrole, String searchterm, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain) throws RemoteException;
6868
6969 /**
7070 * Returns index queue size. Needed for incremental updater, so it doesn't overload the indexer.
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/interoperability/RMIMessengerImpl.java
@@ -81,9 +81,9 @@
8282 }
8383
8484 // inherit javadoc
85 - public SearchResults searchPart(String dbrole, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain) throws RemoteException {
 85+ public SearchResults searchPart(String dbrole, String searchterm, Query query, NamespaceFilterWrapper filter, int offset, int limit, boolean explain) throws RemoteException {
8686 log.debug("Received request searchMainPart("+dbrole+","+query+","+offset+","+limit+")");
87 - return new SearchEngine().searchPart(IndexId.get(dbrole),query,filter,offset,limit,explain);
 87+ return new SearchEngine().searchPart(IndexId.get(dbrole),searchterm,query,filter,offset,limit,explain);
8888 }
8989
9090 // inherit javadoc
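searchPart() now carries the raw search term alongside the parsed Query on the remote interface, its implementation and the client. A hedged sketch of a client-side call with the new parameter order; the null namespace filter and the offset/limit/explain values are illustrative only, and the import paths are assumed.

    import org.apache.lucene.search.Query;
    import org.wikimedia.lsearch.config.IndexId;
    import org.wikimedia.lsearch.interoperability.RMIMessengerClient;

    public class SearchPartSketch {
        static void callPart(RMIMessengerClient client, IndexId iid, String searchterm,
                Query query, String host) {
            // argument order as in RMIMessengerClient.searchPart() above;
            // null filter, offset 0, limit 20, no explain output (illustrative values)
            client.searchPart(iid, searchterm, query, null, 0, 20, false, host);
        }
    }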
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/StreamTerms.java
@@ -0,0 +1,52 @@
 2+package org.wikimedia.lsearch.benchmark;
 3+
 4+import java.io.BufferedReader;
 5+import java.io.FileInputStream;
 6+import java.io.IOException;
 7+import java.io.InputStreamReader;
 8+import java.util.zip.GZIPInputStream;
 9+
 10+/** Reads terms from an endless stream of terms */
 11+public class StreamTerms implements Terms {
 12+ BufferedReader in = null;
 13+ String path;
 14+
 15+ public StreamTerms(String path){
 16+ this.path = path;
 17+ open();
 18+ }
 19+
 20+ protected void open(){
 21+ try{
 22+ if(in != null)
 23+ in.close();
 24+ if(path.endsWith(".gz"))
 25+ in = new BufferedReader(
 26+ new InputStreamReader(
 27+ new GZIPInputStream(
 28+ new FileInputStream(path))));
 29+ else
 30+ in = new BufferedReader(
 31+ new InputStreamReader(
 32+ new FileInputStream(path)));
 33+ } catch(IOException e){
 34+ e.printStackTrace();
 35+ }
 36+ }
 37+
 38+ public String next() {
 39+ try {
 40+ return in.readLine();
 41+ } catch (IOException e) {
 42+ // try reopening the stream
 43+ open();
 44+ try {
 45+ return in.readLine();
 46+ } catch (IOException e1) {
 47+ e1.printStackTrace();
 48+ return null;
 49+ }
 50+ }
 51+ }
 52+
 53+}
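A short usage sketch (not part of this revision) for the new StreamTerms source; the path points at one of the sampled term files added in this commit, and the loop length is arbitrary.

    import org.wikimedia.lsearch.benchmark.StreamTerms;
    import org.wikimedia.lsearch.benchmark.Terms;

    public class StreamTermsSketch {
        public static void main(String[] args) {
            Terms terms = new StreamTerms("./lib/dict/terms-en.txt.gz");
            for (int i = 0; i < 5; i++) {
                // next() reopens the stream once on IOException and returns null
                // at end of file or if the reopened stream also fails
                String t = terms.next();
                if (t == null)
                    break;
                System.out.println(t);
            }
        }
    }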
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/WordTerms.java
@@ -7,16 +7,19 @@
88 import java.util.ArrayList;
99 import java.util.zip.GZIPInputStream;
1010
 11+import org.apache.log4j.Logger;
 12+
1113 /** Benchmark terms from a dictionary of words (word : frequency) */
1214 public class WordTerms implements Terms {
 15+ Logger log = Logger.getLogger(WordTerms.class);
1316 /** load words from file, e.g. ./test-data/words-wikilucene.ngram.gz */
1417 public static ArrayList<String> loadWordFreq(String path) throws IOException {
1518 BufferedReader in;
1619 if(path.endsWith(".gz"))
1720 in = new BufferedReader(
18 - new InputStreamReader(
19 - new GZIPInputStream(
20 - new FileInputStream(path))));
 21+ new InputStreamReader(
 22+ new GZIPInputStream(
 23+ new FileInputStream(path))));
2124 else
2225 in = new BufferedReader(
2326 new InputStreamReader(
@@ -27,13 +30,17 @@
2831 int freqSum = 0;
2932 int freq,count=0;
3033 while((line = in.readLine())!=null){
31 - String[] parts = line.split(" : ");
32 - if(parts.length > 1){
33 - freq = Integer.parseInt(parts[1]);
34 - freqSum += freq;
 34+ try{
 35+ String[] parts = line.split(" : ");
 36+ if(parts.length > 1){
 37+ freq = Integer.parseInt(parts[1]);
 38+ freqSum += freq;
 39+ }
 40+ words.add(parts[0].trim());
 41+ } catch(NumberFormatException e){
 42+ words.add(line.trim());
3543 }
3644 count++;
37 - words.add(parts[0].trim());
3845 }
3946 //System.out.println("Loaded "+count+" words with frequency sum of "+freqSum);
4047 return words;
@@ -45,6 +52,7 @@
4653 try {
4754 words = loadWordFreq(path);
4855 } catch (IOException e) {
 56+ log.error("Cannot open dictionary of search terms in "+path);
4957 e.printStackTrace();
5058 }
5159 }
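The loader above no longer assumes that every dictionary line carries a numeric frequency. A sketch (not part of this revision) of the line shapes loadWordFreq() now accepts, plus a call against one of the new sampled term files; the example lines are made up for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import org.wikimedia.lsearch.benchmark.WordTerms;

    public class WordTermsSketch {
        public static void main(String[] args) throws IOException {
            // Accepted line shapes (illustrative):
            //   beans : 1532      word with a frequency count
            //   beans             bare word, no " : " separator
            //   how : to : fly    whole line is kept when the field after " : "
            //                     is not an integer (NumberFormatException path)
            ArrayList<String> words = WordTerms.loadWordFreq("./lib/dict/terms-en.txt.gz");
            System.out.println("loaded " + words.size() + " terms");
        }
    }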
Index: trunk/lucene-search-2.0/src/org/wikimedia/lsearch/benchmark/Benchmark.java
@@ -171,15 +171,16 @@
172172 public static void main(String[] args) {
173173 String host = "127.0.0.1";
174174 int port = 8123;
175 - String database = "wikilucene";
 175+ String database = "enwiki";
176176 String verb = "search";
177 - String namespace = "main";
 177+ String namespace = "";
178178 String namespaceFilter= "0";
179179 String lang = "en-b";
180180 int runs = 5000;
181181 int threads = 10;
182 - int words = 2;
 182+ int words = 1;
183183 sample = true;
 184+ String wordfile = null;
184185 Terms terms;
185186
186187 for(int i = 0; i < args.length; i++) {
@@ -195,6 +196,8 @@
196197 runs = Integer.parseInt(args[++i]);
197198 } else if (args[i].equals("-v")) {
198199 database = args[++i];
 200+ } else if (args[i].equals("-wf")) {
 201+ wordfile = args[++i];
199202 } else if (args[i].equals("-n") || args[i].equals("-ns")) {
200203 namespace = args[++i];
201204 } else if (args[i].equals("-f") ) {
@@ -218,19 +221,17 @@
219222 " -n namespace (default: "+namespace+")\n"+
220223 " -f namespace filter (default: "+namespaceFilter+")\n"+
221224 " -l language (default: "+lang+")\n"+
222 - " -s show sample url (default: "+sample+")\n");
 225+ " -s show sample url (default: "+sample+")\n"+
 226+ " -wf <file> use file with search terms (default: none)\n");
223227 return;
224228 } else{
225229 System.out.println("Unrecognized switch: "+args[i]);
226230 return;
227231 }
228232 }
229 - if(lang.equals("en"))
230 - terms = new WordTerms("./lib/dict/english.txt.gz");
231 - else if(lang.equals("de"))
232 - terms = new WordTerms("./lib/dict/german.txt.gz");
233 - else if(lang.equals("fr"))
234 - terms = new WordTerms("./lib/dict/french.txt.gz");
 233+ if("en".equals(lang) || "de".equals(lang) || "es".equals(lang) || "fr".equals(lang) || "it".equals(lang) || "pt".equals(lang))
 234+ terms = new WordTerms("./lib/dict/terms-"+lang+".txt.gz");
 235+
235236 else if(lang.equals("sample"))
236237 terms = new SampleTerms();
237238 else
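With the per-language word lists replaced by the sampled terms-*.txt.gz files, -l now selects the matching term file for en/de/es/fr/it/pt, and the usage text advertises -wf for an explicit term file (how -wf is consumed lies outside the hunks shown here). Hypothetical invocations, classpath setup omitted:

    java org.wikimedia.lsearch.benchmark.Benchmark -l en
    java org.wikimedia.lsearch.benchmark.Benchmark -wf ./lib/dict/terms-en.txt.gz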
Index: trunk/lucene-search-2.0/lsearch-global.conf
@@ -16,13 +16,14 @@
1717 #wikilucene : (single) (language,en) (warmup,0)
1818 wikidev : (single) (language,sr)
1919 wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])
 20+wikilucene : (language,en) (warmup,10)
2021
2122 # Search groups
2223 # Index parts of a split index are always taken from the node's group
2324 # host : db1.part db2.part
2425 # Multiple hosts can search multiple dbs (N-N mapping)
2526 [Search-Group]
26 -oblak : wikilucene wikidev wikilucene.nspart1 wikilucene.nspart2 wikilucene.nspart3
 27+oblak : wikilucene wikidev
2728
2829 # Index nodes
2930 # host: db1.part db2.part
Index: trunk/lucene-search-2.0/lsearch.conf
@@ -82,12 +82,9 @@
8383 # Log, ganglia, localization
8484 ################################################
8585
86 -# URL to message files, {0} is replaced with language code, i.e. En
87 -Localization.url=file:///var/www/html/wiki-lucene/phase3/languages/messages/Messages{0}.php
 86+# URL to MediaWiki message files
 87+Localization.url=file:///var/www/html/wiki-lucene/phase3/languages/messages
8888
89 -# Pattern for OAI repo. {0} is replaced with dbname, {1} with language
90 -OAI.repo=http://localhost/wiki-lucene/phase3/index.php/Special:OAIRepository
91 -
9289 # Username/password for password authenticated OAI repo
9390 OAI.username=user
9491 OAI.password=pass
Index: trunk/lucene-search-2.0/lib/dict/french.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/english.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/german.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-en.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-en.txt.gz
___________________________________________________________________
Added: svn:mime-type
9592 + application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-pt.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-pt.txt.gz
___________________________________________________________________
Added: svn:mime-type
9693 + application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-es.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-es.txt.gz
___________________________________________________________________
Added: svn:mime-type
9794 + application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-fr.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-fr.txt.gz
___________________________________________________________________
Added: svn:mime-type
9895 + application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-de.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-de.txt.gz
___________________________________________________________________
Added: svn:mime-type
9996 + application/octet-stream
Index: trunk/lucene-search-2.0/lib/dict/terms-it.txt.gz
Cannot display: file marked as a binary type.
svn:mime-type = application/octet-stream
Property changes on: trunk/lucene-search-2.0/lib/dict/terms-it.txt.gz
___________________________________________________________________
Added: svn:mime-type
10097 + application/octet-stream
Index: trunk/lucene-search-2.0/README.txt
@@ -21,8 +21,9 @@
2222 * edit mwsearch.conf:
2323 + MWConfig.global to point to URL of mwsearch-global.conf
2424 + MWConfig.lib to point to local library path (ie with unicode-data etc)
25 - + Localization.url to point to URL pattern of latest
26 - message files from MediaWiki
 25+ + Localization.url to point to URL of latest message files from MediaWiki
 26+ + Indexes.path - base path where you want the daemon to store the indexes,
 27+ + Logging.logconfig - local path to log4j configuration file, e.g. /etc/lsearch.log4j (the lsearch package has a sample log4j file you can use)
2728 * setup rsync daemon (see rsyncd.conf-example)
2829 * setup log4j logging subsystem (see mwsearch.log4j-example)
2930
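A hypothetical lsearch.conf fragment pulling together the keys the README now mentions; every value except Localization.url (taken from the lsearch.conf hunk above) is a placeholder:

    # global configuration and local library path
    MWConfig.global=file:///etc/lsearch-global.conf
    MWConfig.lib=/usr/local/lsearch/lib
    # base path where the daemon stores its indexes
    Indexes.path=/usr/local/lsearch/indexes
    # URL to MediaWiki message files
    Localization.url=file:///var/www/html/wiki-lucene/phase3/languages/messages
    # local path to the log4j configuration file
    Logging.logconfig=/etc/lsearch.log4j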

Status & tagging log