r63456 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r63455‎ | r63456 | r63457 >
Date:04:07, 9 March 2010
Author:mah
Status:reverted (Comments)
Tags:
Comment:
follow-up r61856 — wordsegmentation shoudl be done for all search engines, not just mysql
Modified paths:
  • /branches/REL1_16/phase3/includes/search/SearchEngine.php (modified) (history)
  • /branches/REL1_16/phase3/includes/search/SearchMySQL.php (modified) (history)
  • /branches/REL1_16/phase3/includes/search/SearchUpdate.php (modified) (history)
  • /branches/REL1_16/phase3/languages/classes/LanguageZh_hans.php (modified) (history)

Diff [purge]

Index: branches/REL1_16/phase3/languages/classes/LanguageZh_hans.php
@@ -25,6 +25,8 @@
2626 // Double-width roman characters
2727 $s = self::convertDoubleWidth( $string );
2828 $s = trim( $s );
 29+ $s = self::wordSegmentation( $s );
 30+ $s = self::convertDoubleWidth( $s );
2931 $s = parent::normalizeForSearch( $s );
3032
3133 wfProfileOut( __METHOD__ );
Index: branches/REL1_16/phase3/includes/search/SearchUpdate.php
@@ -37,7 +37,7 @@
3838
3939 if( $this->mText === false ) {
4040 $search->updateTitle($this->mId,
41 - Title::indexTitle( $this->mNamespace, $this->mTitle ));
 41+ $search->normalizeText( Title::indexTitle( $this->mNamespace, $this->mTitle ) ) );
4242 wfProfileOut( $fname );
4343 return;
4444 }
@@ -97,8 +97,8 @@
9898 wfRunHooks( 'SearchUpdate', array( $this->mId, $this->mNamespace, $this->mTitle, &$text ) );
9999
100100 # Perform the actual update
101 - $search->update($this->mId, Title::indexTitle( $this->mNamespace, $this->mTitle ),
102 - $text);
 101+ $search->update($this->mId, $search->normalizeText( Title::indexTitle( $this->mNamespace, $this->mTitle ) ),
 102+ $search->normalizeText( $text ) );
103103
104104 wfProfileOut( $fname );
105105 }
Index: branches/REL1_16/phase3/includes/search/SearchEngine.php
@@ -56,7 +56,10 @@
5757 * @return string
5858 */
5959 public function normalizeText( $string ) {
60 - return $string;
 60+ global $wgContLang;
 61+
 62+ // Some languages such as Chinese require word segmentation
 63+ return $wgContLang->wordSegmentation( $string );
6164 }
6265
6366 /**
Index: branches/REL1_16/phase3/includes/search/SearchMySQL.php
@@ -325,8 +325,7 @@
326326
327327 wfProfileIn( __METHOD__ );
328328
329 - // Some languages such as Chinese require word segmentation
330 - $out = $wgContLang->wordSegmentation( $string );
 329+ $out = parent::normalizeText( $string );
331330
332331 // MySQL fulltext index doesn't grok utf-8, so we
333332 // need to fold cases and convert to hex

Follow-up revisions

RevisionCommit summaryAuthorDate
r63457follow up r63456 remove extra line.mah04:14, 9 March 2010

Past revisions this follows-up on

RevisionCommit summaryAuthorDate
r61856Follow up r60742, r60743, r60764, r60766, r61214, r61390. Split stripForSearc...philip15:09, 2 February 2010

Comments

#Comment by Tim Starling (talk | contribs)   18:21, 12 March 2010

Doing word segmentation in MW is almost certainly detrimental for Lucene, and may be unnecessary for other search engines as well. Calling SearchEngine::normalizeText() in SearchUpdate::doUpdate() is not a good idea without at least removing the redundant calls to the same function in SearchMySQL::update(). Perhaps the change here was to the benefit of PostgreSQL and Oracle, but I'm sure you didn't test it in those DBMSes. We can consider this in trunk but not in a branch that's going to be released very soon.

Based on code review, I'm pretty sure that with the MWSearch patches reverted, Philip's project in general is not a harmful change compared to how it was in 1.15. That's why I'm reverting this to get the 1.16 branch back to how Philip left it.

Status & tagging log