r60743 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r60742‎ | r60743 | r60744 >
Date:19:51, 6 January 2010
Author:philip
Status:resolved (Comments)
Tags:
Comment:
1. Add conditions to stripForSearch for LuceneSearch / MWSearch.
2. Add double-width roman characters conversion support to zh, gan, and yue.
Modified paths:
  • /trunk/phase3/languages/Language.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageGan.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageJa.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageYue.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageZh.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageZh_hans.php (modified) (history)

Diff [purge]

Index: trunk/phase3/languages/Language.php
@@ -1697,12 +1697,11 @@
16981698 * @return String
16991699 */
17001700 function stripForSearch( $string ) {
1701 - global $wgDBtype;
1702 - if ( $wgDBtype != 'mysql' ) {
 1701+ global $wgDBtype, $wgSearchType;
 1702+ if ( $wgDBtype != 'mysql' or $wgSearchType == 'LuceneSearch' ) {
17031703 return $string;
17041704 }
17051705
1706 -
17071706 wfProfileIn( __METHOD__ );
17081707
17091708 // MySQL fulltext index doesn't grok utf-8, so we
Index: trunk/phase3/languages/classes/LanguageZh_hans.php
@@ -9,18 +9,32 @@
1010 }
1111
1212 function stripForSearch( $string ) {
13 - // Eventually this should be a word segmentation;
14 - // for now just treat each character as a word.
15 - //
16 - // Note we put a space on both sides to cover cases
17 - // where a number or Latin char follows a Han char.
18 - //
19 - // @todo Fixme: only do this for Han characters...
20 - $t = preg_replace(
21 - "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
22 - " $1 ", $string);
23 - $t = preg_replace( '/ +/', ' ', $t );
24 - $t = trim( $t );
25 - return parent::stripForSearch( $t );
 13+ wfProfileIn( __METHOD__ );
 14+ global $wgSearchType;
 15+
 16+ $s = $string;
 17+
 18+ // Double-width roman characters: ff00-ff5f ~= 0020-007f
 19+ $s = preg_replace( '/\xef\xbc([\x80-\xbf])/e', 'chr((ord("$1") & 0x3f) + 0x20)', $s );
 20+ $s = preg_replace( '/\xef\xbd([\x80-\x99])/e', 'chr((ord("$1") & 0x3f) + 0x60)', $s );
 21+
 22+ if ( $wgSearchType != 'LuceneSearch' ) {
 23+ // Eventually this should be a word segmentation;
 24+ // for now just treat each character as a word.
 25+ // Not for LuceneSearch, because LSearch will
 26+ // split the text to words itself.
 27+ // @todo Fixme: only do this for Han characters...
 28+ $s = preg_replace(
 29+ "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
 30+ " $1 ", $s);
 31+ $s = preg_replace( '/ +/', ' ', $s );
 32+ }
 33+
 34+ $s = trim( $s );
 35+
 36+ // Do general case folding and UTF-8 armoring
 37+ $s = parent::stripForSearch( $s );
 38+ wfProfileOut( __METHOD__ );
 39+ return $s;
2640 }
2741 }
Index: trunk/phase3/languages/classes/LanguageGan.php
@@ -139,23 +139,36 @@
140140 // word segmentation
141141 function stripForSearch( $string ) {
142142 wfProfileIn( __METHOD__ );
 143+ global $wgSearchType;
143144
144 - // eventually this should be a word segmentation
145 - // for now just treat each character as a word
146 - // @todo Fixme: only do this for Han characters...
147 - $t = preg_replace(
148 - "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
149 - " $1", $string);
 145+ // always convert to gan-hans before indexing. it should be
 146+ // better to use gan-hans for search, since conversion from
 147+ // Traditional to Simplified is less ambiguous than the
 148+ // other way around
 149+ $s = $this->mConverter->autoConvert($string, 'gan-hans');
150150
151 - //always convert to gan-hans before indexing. it should be
152 - //better to use gan-hans for search, since conversion from
153 - //Traditional to Simplified is less ambiguous than the
154 - //other way around
 151+ // Double-width roman characters: ff00-ff5f ~= 0020-007f
 152+ $s = preg_replace( '/\xef\xbc([\x80-\xbf])/e', 'chr((ord("$1") & 0x3f) + 0x20)', $s );
 153+ $s = preg_replace( '/\xef\xbd([\x80-\x99])/e', 'chr((ord("$1") & 0x3f) + 0x60)', $s );
155154
156 - $t = $this->mConverter->autoConvert($t, 'gan-hans');
157 - $t = parent::stripForSearch( $t );
 155+ if ( $wgSearchType != 'LuceneSearch' ) {
 156+ // eventually this should be a word segmentation;
 157+ // for now just treat each character as a word.
 158+ // Not for LuceneSearch, because LSearch will
 159+ // split the text to words itself.
 160+ // @todo Fixme: only do this for Han characters...
 161+ $s = preg_replace(
 162+ "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
 163+ " $1 ", $s);
 164+ $s = preg_replace( '/ +/', ' ', $s );
 165+ }
 166+
 167+ $s = trim( $s );
 168+
 169+ // Do general case folding and UTF-8 armoring
 170+ $s = parent::stripForSearch( $s );
158171 wfProfileOut( __METHOD__ );
159 - return $t;
 172+ return $s;
160173
161174 }
162175
Index: trunk/phase3/languages/classes/LanguageJa.php
@@ -11,23 +11,26 @@
1212 # need to fold cases and convert to hex
1313 $s = $string;
1414
15 - # Strip known punctuation ?
16 - #$s = preg_replace( '/\xe3\x80[\x80-\xbf]/', '', $s ); # U3000-303f
 15+ # not for LuceneSearch, because LSearch
 16+ # will split the text to words itself
 17+ if ( $wgSearchType != 'LuceneSearch' ) {
 18+ # Strip known punctuation ?
 19+ #$s = preg_replace( '/\xe3\x80[\x80-\xbf]/', '', $s ); # U3000-303f
1720
18 - # Space strings of like hiragana/katakana/kanji
19 - $hiragana = '(?:\xe3(?:\x81[\x80-\xbf]|\x82[\x80-\x9f]))'; # U3040-309f
20 - $katakana = '(?:\xe3(?:\x82[\xa0-\xbf]|\x83[\x80-\xbf]))'; # U30a0-30ff
21 - $kanji = '(?:\xe3[\x88-\xbf][\x80-\xbf]'
22 - . '|[\xe4-\xe8][\x80-\xbf]{2}'
23 - . '|\xe9[\x80-\xa5][\x80-\xbf]'
24 - . '|\xe9\xa6[\x80-\x99])';
25 - # U3200-9999 = \xe3\x88\x80-\xe9\xa6\x99
26 - $s = preg_replace( "/({$hiragana}+|{$katakana}+|{$kanji}+)/", ' $1 ', $s );
27 -
 21+ # Space strings of like hiragana/katakana/kanji
 22+ $hiragana = '(?:\xe3(?:\x81[\x80-\xbf]|\x82[\x80-\x9f]))'; # U3040-309f
 23+ $katakana = '(?:\xe3(?:\x82[\xa0-\xbf]|\x83[\x80-\xbf]))'; # U30a0-30ff
 24+ $kanji = '(?:\xe3[\x88-\xbf][\x80-\xbf]'
 25+ . '|[\xe4-\xe8][\x80-\xbf]{2}'
 26+ . '|\xe9[\x80-\xa5][\x80-\xbf]'
 27+ . '|\xe9\xa6[\x80-\x99])';
 28+ # U3200-9999 = \xe3\x88\x80-\xe9\xa6\x99
 29+ $s = preg_replace( "/({$hiragana}+|{$katakana}+|{$kanji}+)/", ' $1 ', $s );
 30+ }
2831 # Double-width roman characters: ff00-ff5f ~= 0020-007f
2932 $s = preg_replace( '/\xef\xbc([\x80-\xbf])/e', 'chr((ord("$1") & 0x3f) + 0x20)', $s );
3033 $s = preg_replace( '/\xef\xbd([\x80-\x99])/e', 'chr((ord("$1") & 0x3f) + 0x60)', $s );
31 -
 34+
3235 # Do general case folding and UTF-8 armoring
3336 return parent::stripForSearch( $s );
3437 }
Index: trunk/phase3/languages/classes/LanguageYue.php
@@ -5,17 +5,31 @@
66 class LanguageYue extends Language {
77 function stripForSearch( $string ) {
88 wfProfileIn( __METHOD__ );
 9+ global $wgSearchType;
910
10 - // eventually this should be a word segmentation
11 - // for now just treat each character as a word
12 - // @todo Fixme: only do this for Han characters...
13 - $t = preg_replace(
14 - "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
15 - " $1", $string);
 11+ $s = $string;
1612
 13+ // Double-width roman characters: ff00-ff5f ~= 0020-007f
 14+ $s = preg_replace( '/\xef\xbc([\x80-\xbf])/e', 'chr((ord("$1") & 0x3f) + 0x20)', $s );
 15+ $s = preg_replace( '/\xef\xbd([\x80-\x99])/e', 'chr((ord("$1") & 0x3f) + 0x60)', $s );
 16+
 17+ if ( $wgSearchType != 'LuceneSearch' ) {
 18+ // eventually this should be a word segmentation;
 19+ // for now just treat each character as a word.
 20+ // Not for LuceneSearch, because LSearch will
 21+ // split the text to words itself.
 22+ // @todo Fixme: only do this for Han characters...
 23+ $s = preg_replace(
 24+ "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
 25+ " $1 ", $s);
 26+ $s = preg_replace( '/ +/', ' ', $s );
 27+ }
 28+
 29+ $s = trim( $s );
 30+
1731 // Do general case folding and UTF-8 armoring
18 - $t = parent::stripForSearch( $t );
 32+ $s = parent::stripForSearch( $s );
1933 wfProfileOut( __METHOD__ );
20 - return $t;
 34+ return $s;
2135 }
2236 }
Index: trunk/phase3/languages/classes/LanguageZh.php
@@ -176,15 +176,14 @@
177177 function stripForSearch( $string ) {
178178 wfProfileIn( __METHOD__ );
179179
180 - //always convert to zh-hans before indexing. it should be
181 - //better to use zh-hans for search, since conversion from
182 - //Traditional to Simplified is less ambiguous than the
183 - //other way around
184 -
185 - $t = $this->mConverter->autoConvert( $string, 'zh-hans' );
186 - $t = parent::stripForSearch( $t );
 180+ // always convert to zh-hans before indexing. it should be
 181+ // better to use zh-hans for search, since conversion from
 182+ // Traditional to Simplified is less ambiguous than the
 183+ // other way around
 184+ $s = $this->mConverter->autoConvert( $string, 'zh-hans' );
 185+ $s = parent::stripForSearch( $s );
187186 wfProfileOut( __METHOD__ );
188 - return $t;
 187+ return $s;
189188
190189 }
191190

Follow-up revisions

RevisionCommit summaryAuthorDate
r60764follow-up r60743....philip04:50, 7 January 2010
r61856Follow up r60742, r60743, r60764, r60766, r61214, r61390. Split stripForSearc...philip15:09, 2 February 2010
r63622Reverted r60743 and related: Lucene already does some fairly complex normalis...tstarling00:34, 12 March 2010
r63623As in r63622, reverted r60743 and related: Lucene doesn't need any language-s...tstarling00:39, 12 March 2010

Comments

#Comment by MaxSem (talk | contribs)   19:54, 6 January 2010

Are those plugs in core for the extension really unavoidable?

#Comment by PhiLiP (talk | contribs)   20:14, 6 January 2010

I could change "$wgSearchType == 'LuceneSearch'" to "$wgSearchType != null", and "$wgSearchType != 'LuceneSearch'" to "$wgSearchType == null". But I think using "null" would take risks on other extensions.

#Comment by Reedy (talk | contribs)   19:56, 6 January 2010

And the use of "or"??

#Comment by PhiLiP (talk | contribs)   20:18, 6 January 2010

You mean this?

$wgDBtype != 'mysql' or $wgSearchType == 'LuceneSearch'

I made the change on r60745. Sorry to forget the follow-up...

#Comment by Bryan (talk | contribs)   20:55, 6 January 2010

Adding switches in core for extensions is not a solution. Use a method like SearchEngine::requiresStrip() or something, similarly like we do for things like Database::implicitGroupBy() and related functions

#Comment by PhiLiP (talk | contribs)   04:57, 7 January 2010

Added a new parameter to Language::stripForSearch(). So the switch would be controlled by extensions but not core. Apply on r60764 and r60765.

#Comment by PhiLiP (talk | contribs)   04:57, 7 January 2010

Added a new parameter to Language::stripForSearch(). So the switch would be controlled by extensions but not core. Apply on r60764 and r60766.

#Comment by Nikerabbit (talk | contribs)   22:26, 6 January 2010

There also seems to be a some amount of code duplication.

#Comment by PhiLiP (talk | contribs)   04:54, 7 January 2010

reduced the code duplication in r60764.

#Comment by Tim Starling (talk | contribs)   07:05, 20 January 2010

I'll comment here on the whole project (r60742, r60743, r60764, r60766, r61214).

It doesn't make sense to have a function called stripForSearch() which has a parameter called $doStrip. The name of the parameter implies that if it's set to false, the function will do nothing and there's no point in calling it. Also, boolean parameters are generally bad and make the code hard to read.

I think a good solution would be to split stripForSearch into two: SearchEngine::normalizeContent() and Language::normalizeForSearch(). I think the caller in the SearchEngine subclass should call both (if necessary), one should not call the other.

There's only two extensions (AdvancedSearch and MWSearch) and a few core files that call $wgContLang->stripForSearch(), they can all be fixed.

#Comment by PhiLiP (talk | contribs)   15:16, 2 February 2010

I split stripForSearch() into wordSegmentation() and normalizeForSearch() on r61856. And apply such changes to extensions on r61857.

Status & tagging log