r61856 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: < r61855 | r61856 | r61857 >
Date: 15:09, 2 February 2010
Author: philip
Status: resolved (Comments)
Tags:
Comment:
Follow-up to r60742, r60743, r60764, r60766, r61214, r61390. Split stripForSearch into wordSegmentation and normalizeForSearch, so that wordSegmentation can be called by search engines separately.
Modified paths:
  • /trunk/phase3/includes/Title.php (modified) (history)
  • /trunk/phase3/includes/search/SearchIBM_DB2.php (modified) (history)
  • /trunk/phase3/includes/search/SearchMySQL.php (modified) (history)
  • /trunk/phase3/includes/search/SearchOracle.php (modified) (history)
  • /trunk/phase3/includes/search/SearchSqlite.php (modified) (history)
  • /trunk/phase3/includes/search/SearchUpdate.php (modified) (history)
  • /trunk/phase3/languages/Language.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageGan.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageJa.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageYue.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageZh.php (modified) (history)
  • /trunk/phase3/languages/classes/LanguageZh_hans.php (modified) (history)

Diff

Index: trunk/phase3/includes/search/SearchMySQL.php
@@ -80,7 +80,7 @@
8181 // fulltext engine.
8282 // For Chinese this also inserts spaces between adjacent Han characters.
8383 $strippedVariants = array_map(
84 - array( $wgContLang, 'stripForSearch' ),
 84+ array( $wgContLang, 'normalizeForSearch' ),
8585 $variants );
8686
8787 // Some languages such as Chinese force all variants to a canonical
@@ -95,7 +95,7 @@
9696 $stripped = $this->normalizeText( $stripped );
9797 if( $nonQuoted && strpos( $stripped, ' ' ) !== false ) {
9898 // Hack for Chinese: we need to toss in quotes for
99 - // multiple-character phrases since stripForSearch()
 99+ // multiple-character phrases since normalizeForSearch()
100100 // added spaces between them to make word breaks.
101101 $stripped = '"' . trim( $stripped ) . '"';
102102 }
@@ -324,13 +324,16 @@
325325 global $wgContLang;
326326
327327 wfProfileIn( __METHOD__ );
 328+
 329+ // Some languages such as Chinese require word segmentation
 330+ $out = $wgContLang->wordSegmentation( $string );
328331
329332 // MySQL fulltext index doesn't grok utf-8, so we
330333 // need to fold cases and convert to hex
331334 $out = preg_replace_callback(
332335 "/([\\xc0-\\xff][\\x80-\\xbf]*)/",
333336 array( $this, 'stripForSearchCallback' ),
334 - $wgContLang->lc( $string ) );
 337+ $wgContLang->lc( $out ) );
335338
336339 // And to add insult to injury, the default indexing
337340 // ignores short words... Pad them so we can pass them
Index: trunk/phase3/includes/search/SearchOracle.php
@@ -217,7 +217,7 @@
218218
219219 private function escapeTerm($t) {
220220 global $wgContLang;
221 - $t = $wgContLang->stripForSearch($t);
 221+ $t = $wgContLang->normalizeForSearch($t);
222222 $t = isset($this->reservedWords[strtoupper($t)]) ? '{'.$t.'}' : $t;
223223 $t = preg_replace('/^"(.*)"$/', '($1)', $t);
224224 $t = preg_replace('/([-&|])/', '\\\\$1', $t);
Index: trunk/phase3/includes/search/SearchIBM_DB2.php
@@ -158,10 +158,10 @@
159159 if( is_array( $temp_terms )) {
160160 $temp_terms = array_unique( array_values( $temp_terms ));
161161 foreach( $temp_terms as $t )
162 - $q[] = $terms[1] . $wgContLang->stripForSearch( $t );
 162+ $q[] = $terms[1] . $wgContLang->normalizeForSearch( $t );
163163 }
164164 else
165 - $q[] = $terms[1] . $wgContLang->stripForSearch( $terms[2] );
 165+ $q[] = $terms[1] . $wgContLang->normalizeForSearch( $terms[2] );
166166
167167 if (!empty($terms[3])) {
168168 $regexp = preg_quote( $terms[3], '/' );
Index: trunk/phase3/includes/search/SearchSqlite.php
@@ -92,7 +92,7 @@
9393 // fulltext engine.
9494 // For Chinese this also inserts spaces between adjacent Han characters.
9595 $strippedVariants = array_map(
96 - array( $wgContLang, 'stripForSearch' ),
 96+ array( $wgContLang, 'normalizeForSearch' ),
9797 $variants );
9898
9999 // Some languages such as Chinese force all variants to a canonical
@@ -106,7 +106,7 @@
107107 foreach( $strippedVariants as $stripped ) {
108108 if( $nonQuoted && strpos( $stripped, ' ' ) !== false ) {
109109 // Hack for Chinese: we need to toss in quotes for
110 - // multiple-character phrases since stripForSearch()
 110+ // multiple-character phrases since normalizeForSearch()
111111 // added spaces between them to make word breaks.
112112 $stripped = '"' . trim( $stripped ) . '"';
113113 }
Index: trunk/phase3/includes/search/SearchUpdate.php
@@ -43,7 +43,7 @@
4444 }
4545
4646 # Language-specific strip/conversion
47 - $text = $wgContLang->stripForSearch( $this->mText );
 47+ $text = $wgContLang->normalizeForSearch( $this->mText );
4848
4949 wfProfileIn( $fname.'-regexps' );
5050 $text = preg_replace( "/<\\/?\\s*[A-Za-z][^>]*?>/",
Index: trunk/phase3/includes/Title.php
@@ -435,7 +435,7 @@
436436 global $wgContLang;
437437
438438 $lc = SearchEngine::legalSearchChars() . '&#;';
439 - $t = $wgContLang->stripForSearch( $title );
 439+ $t = $wgContLang->normalizeForSearch( $title );
440440 $t = preg_replace( "/[^{$lc}]+/", ' ', $t );
441441 $t = $wgContLang->lc( $t );
442442
Index: trunk/phase3/languages/Language.php
@@ -1686,15 +1686,26 @@
16871687 function hasWordBreaks() {
16881688 return true;
16891689 }
 1690+
 1691+ /**
 1692+ * Some languages such as Chinese require word segmentation,
 1693+ * Specify such segmentation when overridden in derived class.
 1694+ *
 1695+ * @param $string String
 1696+ * @return String
 1697+ */
 1698+ function wordSegmentation( $string ) {
 1699+ return $string;
 1700+ }
16901701
16911702 /**
1692 - * Some languages have special punctuation to strip out.
 1703+ * Some languages have special punctuation need to be normalized.
16931704 * Make such changes here.
16941705 *
16951706 * @param $string String
16961707 * @return String
16971708 */
1698 - function stripForSearch( $string, $doStrip = true ) {
 1709+ function normalizeForSearch( $string ) {
16991710 return $string;
17001711 }
17011712
@@ -1708,7 +1719,7 @@
17091720 return $string;
17101721 }
17111722
1712 - protected static function wordSegmentation( $string, $pattern ) {
 1723+ protected static function insertSpace( $string, $pattern ) {
17131724 $string = preg_replace( $pattern, " $1 ", $string );
17141725 $string = preg_replace( '/ +/', ' ', $string );
17151726 return $string;
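The renamed helper insertSpace() wraps every pattern match in spaces and then collapses runs of spaces into one. A byte-level Python sketch of the same two-step substitution (`insert_space` is a hypothetical name; the example pattern is the per-character regex used by LanguageZh_hans in this diff):

```python
import re

def insert_space(data: bytes, pattern: bytes) -> bytes:
    # Surround every match with spaces, then collapse space runs,
    # mirroring Language::insertSpace in the diff above.
    data = re.sub(pattern, rb" \1 ", data)
    return re.sub(rb" +", b" ", data)

# Treat every multi-byte UTF-8 sequence as its own "word".
reg = rb"([\xc0-\xff][\x80-\xbf]*)"
print(insert_space("中文abc".encode("utf-8"), reg).decode("utf-8"))
```

Note the PHP callers trim the result afterwards, since matches at the string edges leave a leading or trailing space.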
Index: trunk/phase3/languages/classes/LanguageZh_hans.php
@@ -7,25 +7,26 @@
88 function hasWordBreaks() {
99 return false;
1010 }
11 -
12 - function stripForSearch( $string, $doStrip = true ) {
 11+
 12+ /**
 13+ * Eventually this should be a word segmentation;
 14+ * for now just treat each character as a word.
 15+ * @todo Fixme: only do this for Han characters...
 16+ */
 17+ function wordSegmentation( $string ) {
 18+ $reg = "/([\\xc0-\\xff][\\x80-\\xbf]*)/";
 19+ $s = self::insertSpace( $string, $reg );
 20+ return $s;
 21+ }
 22+
 23+ function normalizeForSearch( $string ) {
1324 wfProfileIn( __METHOD__ );
1425
1526 // Double-width roman characters
1627 $s = self::convertDoubleWidth( $string );
17 -
18 - if ( $doStrip == true ) {
19 - // Eventually this should be a word segmentation;
20 - // for now just treat each character as a word.
21 - // @todo Fixme: only do this for Han characters...
22 - $reg = "/([\\xc0-\\xff][\\x80-\\xbf]*)/";
23 - $s = self::wordSegmentation( $s, $reg );
24 - }
25 -
2628 $s = trim( $s );
 29+ $s = parent::normalizeForSearch( $s );
2730
28 - // Do general case folding and UTF-8 armoring
29 - $s = parent::stripForSearch( $s, $doStrip );
3031 wfProfileOut( __METHOD__ );
3132 return $s;
3233 }
Index: trunk/phase3/languages/classes/LanguageJa.php
@@ -6,30 +6,29 @@
77 * @ingroup Language
88 */
99 class LanguageJa extends Language {
10 - function stripForSearch( $string, $doStrip = true ) {
 10+ function wordSegmentation( $string ) {
 11+ // Strip known punctuation ?
 12+ // $s = preg_replace( '/\xe3\x80[\x80-\xbf]/', '', $s ); # U3000-303f
1113
12 - $s = $string;
 14+ // Space strings of like hiragana/katakana/kanji
 15+ $hiragana = '(?:\xe3(?:\x81[\x80-\xbf]|\x82[\x80-\x9f]))'; # U3040-309f
 16+ $katakana = '(?:\xe3(?:\x82[\xa0-\xbf]|\x83[\x80-\xbf]))'; # U30a0-30ff
 17+ $kanji = '(?:\xe3[\x88-\xbf][\x80-\xbf]'
 18+ . '|[\xe4-\xe8][\x80-\xbf]{2}'
 19+ . '|\xe9[\x80-\xa5][\x80-\xbf]'
 20+ . '|\xe9\xa6[\x80-\x99])';
 21+ # U3200-9999 = \xe3\x88\x80-\xe9\xa6\x99
 22+ $reg = "/({$hiragana}+|{$katakana}+|{$kanji}+)/";
 23+ $s = self::insertSpace( $string, $reg );
 24+ return $s;
 25+ }
1326
14 - if ( $doStrip == true ) {
15 - // Strip known punctuation ?
16 - // $s = preg_replace( '/\xe3\x80[\x80-\xbf]/', '', $s ); # U3000-303f
17 -
18 - // Space strings of like hiragana/katakana/kanji
19 - $hiragana = '(?:\xe3(?:\x81[\x80-\xbf]|\x82[\x80-\x9f]))'; # U3040-309f
20 - $katakana = '(?:\xe3(?:\x82[\xa0-\xbf]|\x83[\x80-\xbf]))'; # U30a0-30ff
21 - $kanji = '(?:\xe3[\x88-\xbf][\x80-\xbf]'
22 - . '|[\xe4-\xe8][\x80-\xbf]{2}'
23 - . '|\xe9[\x80-\xa5][\x80-\xbf]'
24 - . '|\xe9\xa6[\x80-\x99])';
25 - # U3200-9999 = \xe3\x88\x80-\xe9\xa6\x99
26 - $reg = "/({$hiragana}+|{$katakana}+|{$kanji}+)/";
27 - $s = self::wordSegmentation( $s, $reg );
28 - }
 27+ function normalizeForSearch( $string ) {
2928 // Double-width roman characters
30 - $s = self::convertDoubleWidth( $s );
 29+ $s = self::convertDoubleWidth( $string );
3130
3231 # Do general case folding and UTF-8 armoring
33 - return parent::stripForSearch( $s, $doStrip );
 32+ return parent::normalizeForSearch( $s );
3433 }
3534
3635 # Italic is not appropriate for Japanese script
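Unlike the per-character Chinese fallback, LanguageJa::wordSegmentation groups runs of same-script characters, so a hiragana run, a katakana run, and a kanji run each become one token. A Python sketch using the exact byte classes from the diff (`insert_space` is a hypothetical stand-in for the renamed Language helper):

```python
import re

def insert_space(data: bytes, pattern: bytes) -> bytes:
    # Same two-step substitution as Language::insertSpace.
    data = re.sub(pattern, rb" \1 ", data)
    return re.sub(rb" +", b" ", data)

# UTF-8 byte classes copied from the LanguageJa diff above.
hiragana = rb"(?:\xe3(?:\x81[\x80-\xbf]|\x82[\x80-\x9f]))"  # U+3040-309F
katakana = rb"(?:\xe3(?:\x82[\xa0-\xbf]|\x83[\x80-\xbf]))"  # U+30A0-30FF
kanji = (rb"(?:\xe3[\x88-\xbf][\x80-\xbf]"
         rb"|[\xe4-\xe8][\x80-\xbf]{2}"
         rb"|\xe9[\x80-\xa5][\x80-\xbf]"
         rb"|\xe9\xa6[\x80-\x99])")
reg = rb"(" + hiragana + rb"+|" + katakana + rb"+|" + kanji + rb"+)"

# Kanji run, hiragana particle, katakana run -> three tokens.
print(insert_space("日本語のテキスト".encode("utf-8"), reg).decode("utf-8"))
```

The `+` quantifiers on each script class are what keep a run together as a single token instead of splitting it per character.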
Index: trunk/phase3/languages/classes/LanguageGan.php
@@ -135,9 +135,9 @@
136136 }
137137
138138 // word segmentation
139 - function stripForSearch( $string, $doStrip = true, $autoVariant = 'gan-hans' ) {
140 - // LanguageZh::stripForSearch
141 - return parent::stripForSearch( $string, $doStrip, $autoVariant );
 139+ function normalizeForSearch( $string, $autoVariant = 'gan-hans' ) {
 140+ // LanguageZh::normalizeForSearch
 141+ return parent::normalizeForSearch( $string, $autoVariant );
142142 }
143143
144144 function convertForSearchResult( $termsArray ) {
Index: trunk/phase3/languages/classes/LanguageZh.php
@@ -170,8 +170,23 @@
171171 "\"$1\"", $text);
172172 }
173173
174 - // word segmentation
175 - function stripForSearch( $string, $doStrip = true, $autoVariant = 'zh-hans' ) {
 174+ /**
 175+ * word segmentation
 176+ */
 177+ function wordSegmentation( $string ) {
 178+ // LanguageZh_hans::wordSegmentation
 179+ $s = parent::wordSegmentation( $string );
 180+ return $s;
 181+ }
 182+
 183+ /**
 184+ * auto convert to zh-hans and normalize special characters.
 185+ *
 186+ * @param $string String
 187+ * @param $autoVariant String, default to 'zh-hans'
 188+ * @return String
 189+ */
 190+ function normalizeForSearch( $string, $autoVariant = 'zh-hans' ) {
176191 wfProfileIn( __METHOD__ );
177192
178193 // always convert to zh-hans before indexing. it should be
@@ -179,8 +194,8 @@
180195 // Traditional to Simplified is less ambiguous than the
181196 // other way around
182197 $s = $this->mConverter->autoConvert( $string, $autoVariant );
183 - // LanguageZh_hans::stripForSearch
184 - $s = parent::stripForSearch( $s, $doStrip );
 198+ // LanguageZh_hans::normalizeForSearch
 199+ $s = parent::normalizeForSearch( $s );
185200 wfProfileOut( __METHOD__ );
186201 return $s;
187202
Index: trunk/phase3/languages/classes/LanguageYue.php
@@ -3,24 +3,29 @@
44 * @ingroup Language
55 */
66 class LanguageYue extends Language {
7 - function stripForSearch( $string, $doStrip = true ) {
 7+ function hasWordBreaks() {
 8+ return false;
 9+ }
 10+
 11+ /**
 12+ * Eventually this should be a word segmentation;
 13+ * for now just treat each character as a word.
 14+ * @todo Fixme: only do this for Han characters...
 15+ */
 16+ function wordSegmentation( $string ) {
 17+ $reg = "/([\\xc0-\\xff][\\x80-\\xbf]*)/";
 18+ $s = self::insertSpace( $string, $reg );
 19+ return $s;
 20+ }
 21+
 22+ function normalizeForSearch( $string ) {
823 wfProfileIn( __METHOD__ );
924
1025 // Double-width roman characters
1126 $s = self::convertDoubleWidth( $string );
12 -
13 - if ( $doStrip == true ) {
14 - // eventually this should be a word segmentation;
15 - // for now just treat each character as a word.
16 - // @todo Fixme: only do this for Han characters...
17 - $reg = "/([\\xc0-\\xff][\\x80-\\xbf]*)/";
18 - $s = self::wordSegmentation( $s, $reg );
19 - }
20 -
2127 $s = trim( $s );
 28+ $s = parent::normalizeForSearch( $s );
2229
23 - // Do general case folding and UTF-8 armoring
24 - $s = parent::stripForSearch( $s, $doStrip );
2530 wfProfileOut( __METHOD__ );
2631 return $s;
2732 }

Follow-up revisions

Revision | Commit summary | Author | Date
r61857 | Follow up r61856. Apply related changes in extensions. | philip | 15:09, 2 February 2010
r61859 | Follow up r61856, no need. | philip | 15:26, 2 February 2010
r63456 | follow-up r61856 — wordsegmentation shoudl be done for all search engines, ... | mah | 04:07, 9 March 2010
r63458 | follow-up r61856 — wordsegmentation should be done for all search engines, ... | mah | 04:19, 9 March 2010
r63578 | Follow-up r61856... | mah | 21:54, 10 March 2010
r63637 | Revert r61856, r63457, experimental changes not suitable for immediate releas... | tstarling | 18:22, 12 March 2010

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r60742 | Add stripForSearch in MWSearch. So we could regularize text before indexing a... | philip | 19:44, 6 January 2010
r60743 | 1. Add conditions to stripForSearch for LuceneSearch / MWSearch.... | philip | 19:51, 6 January 2010
r60764 | follow-up r60743.... | philip | 04:50, 7 January 2010
r60766 | follow-up r60742. adapt to the code changes made in r60764. | philip | 04:53, 7 January 2010
r61214 | Factored MySQL-specific munging out of Language::stripForSearch() to Database... | maxsem | 20:54, 18 January 2010
r61390 | Fixed r61214: moved MySQL munging to SearchEngine, updated calls. Can we kill... | maxsem | 20:36, 22 January 2010

Comments

#Comment by Tim Starling (talk | contribs)   07:54, 17 February 2010

OK, that's good. But I think there are some upgrade problems to work out here. In 1.15, word segmentation for Japanese and Chinese was done for all search engines and DBMSes. Here you do word segmentation only in MySQL. So anyone with a non-MySQL database from 1.15 in those languages will not be able to search for anything in 1.16, since their index will be full of segmented text and the interface will be querying for unsegmented text.

That might not be very many users, but we at least need to make some notes about this in UPGRADE and RELEASE-NOTES and on the wiki.

I also have a question about whether those other search engines can even cope with unsegmented text. Will they just fall over, in the same way as MySQL?
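The mismatch Tim describes can be shown with a toy example (all terms hypothetical): the 1.15 index holds segmented tokens, while a 1.16 non-MySQL engine would look up the raw, unsegmented query term.

```python
# What a 1.15 installation wrote to its index: per-character tokens
# produced by the old stripForSearch() segmentation.
indexed_terms = " 中 文 維 基 ".split()

# What a 1.16 non-MySQL engine would now query with: unsegmented input,
# since r61856 moves segmentation into the MySQL path only.
query = "中文"

print(query in indexed_terms)  # the exact-term lookup misses
```

The lookup fails even though both characters are individually present in the index, which is why the UPGRADE notes matter for those installations.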

#Comment by MarkAHershberger (talk | contribs)   22:07, 8 March 2010

I've tested these changes and will update UPGRADE. RELEASE-NOTES already contains a note:

* (bug 8445) Multiple-character search terms are now handled properly for Chinese

This fix reveals new bugs with search and multi-byte characters, but I don't think this fix caused them, only revealed them, so I'm marking this "ok".

#Comment by MarkAHershberger (talk | contribs)   22:22, 8 March 2010

sorry, spoke too soon.

#Comment by Bryan (talk | contribs)   13:08, 10 March 2010

Function names should begin with a verb, so wordSegmentation needs to be renamed.

#Comment by MarkAHershberger (talk | contribs)   21:59, 10 March 2010

resolved in r63578 & r63456.
