r107135 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r107134‎ | r107135 | r107136 >
Date:05:56, 23 December 2011
Author:tstarling
Status:ok (Comments)
Tags:
Comment:
* Fix for bug 33331, r67994: be much more aggressive in splitting punctuation into "words". Re-introduce a function like the my_istext() before r67994, except it's called isLetter() now. Split up character-by-character except sequences of such "letters".
* Updated 003.phpt, I think the new behaviour is better.
* Added a comment about a possible future direction: ICU.
Modified paths:
  • /trunk/extensions/wikidiff2/debian/changelog (modified) (history)
  • /trunk/extensions/wikidiff2/tests/003.phpt (modified) (history)
  • /trunk/extensions/wikidiff2/wikidiff2.cpp (modified) (history)
  • /trunk/extensions/wikidiff2/wikidiff2.h (modified) (history)

Diff [purge]

Index: trunk/extensions/wikidiff2/wikidiff2.cpp
@@ -325,6 +325,13 @@
326326 }
327327
328328 // Split a string into words
 329+//
 330+// TODO: I think the best way to do this would be to use ICU BreakIterator
 331+// instead of libthai + DIY. Basically you'd run BreakIterators from several
 332+// different locales (en, th, ja) and merge the results, i.e. if a break occurs
 333+// in any locale at a given position, split the string. I don't know if the
 334+// quality of the Thai dictionary in ICU matches the one in libthai, we would
 335+// have to check this somehow.
329336 void Wikidiff2::explodeWords(const String & text, WordVector &words)
330337 {
331338 // Don't try to do a word-level diff on very long lines
@@ -360,10 +367,12 @@
361368 tisText += (char)thaiChar;
362369 charSizes += (char)(p - charStart);
363370
364 - if (!isSpace(ch) && lastChar && isSpace(lastChar)) {
 371+ if (isLetter(ch)) {
 372+ if (lastChar && !isLetter(lastChar)) {
 373+ breaks.insert(charIndex);
 374+ }
 375+ } else {
365376 breaks.insert(charIndex);
366 - } else if (isChineseJapanese(ch)) {
367 - breaks.insert(charIndex);
368377 }
369378 charIndex++;
370379 lastChar = ch;
Index: trunk/extensions/wikidiff2/debian/changelog
@@ -1,3 +1,9 @@
 2+php5-wikidiff2 (1.1.1-1) lucid; urgency=low
 3+
 4+ * Fixed bug 33331 (word breaking around punctuation)
 5+
 6+ -- Tim Starling <tstarling@wikimedia.org> Fri, 23 Dec 2011 15:33:40 +1100
 7+
28 php5-wikidiff2 (1.1.0-2) lucid; urgency=low
39
410 * Include a config file so the extension loads
Index: trunk/extensions/wikidiff2/tests/003.phpt
@@ -28,7 +28,7 @@
2929 </tr>
3030 <tr>
3131 <td class="diff-marker">−</td>
32 - <td class="diff-deletedline"><div><span class="diffchange diffchange-inline">!!FUZZY!!Rajaa</span></div></td>
 32+ <td class="diff-deletedline"><div><span class="diffchange diffchange-inline">!!FUZZY!!</span>Rajaa</div></td>
3333 <td class="diff-marker">+</td>
34 - <td class="diff-addedline"><div><span class="diffchange diffchange-inline">Rajaa</span></div></td>
 34+ <td class="diff-addedline"><div>Rajaa</div></td>
3535 </tr>
Index: trunk/extensions/wikidiff2/wikidiff2.h
@@ -43,7 +43,7 @@
4444 void printWordDiffSide(WordDiff &worddiff, bool added);
4545 void printTextWithDiv(const String & input);
4646 void printText(const String & input);
47 - inline bool isChineseJapanese(int ch);
 47+ inline bool isLetter(int ch);
4848 inline bool isSpace(int ch);
4949 void debugPrintWordDiff(WordDiff & worddiff);
5050
@@ -54,11 +54,23 @@
5555 void explodeLines(const String & text, StringVector &lines);
5656 };
5757
58 -bool Wikidiff2::isChineseJapanese(int ch)
 58+bool Wikidiff2::isLetter(int ch)
5959 {
60 - if (ch >= 0x3000 && ch <= 0x9fff) return true;
61 - if (ch >= 0x20000 && ch <= 0x2a000) return true;
62 - return false;
 60+ // Standard alphanumeric
 61+ if ((ch >= '0' && ch <= '9') ||
 62+ (ch == '_') ||
 63+ (ch >= 'A' && ch <= 'Z') ||
 64+ (ch >= 'a' && ch <= 'z'))
 65+ {
 66+ return true;
 67+ }
 68+ // Punctuation and control characters
 69+ if (ch < 0xc0) return false;
 70+ // Chinese, Japanese: split up character by character
 71+ if (ch >= 0x3000 && ch <= 0x9fff) return false;
 72+ if (ch >= 0x20000 && ch <= 0x2a000) return false;
 73+ // Otherwise assume it's from a language that uses spaces
 74+ return true;
6375 }
6476
6577 bool Wikidiff2::isSpace(int ch)

Past revisions this follows-up on

RevisionCommit summaryAuthorDate
r67994* Added Thai word break support. Adds a libthai dependency....tstarling13:29, 14 June 2010

Comments

#Comment by Duplicatebug (talk | contribs)   22:20, 6 January 2012

scaptrap? Needs recompile of wikidiff2 after deployment.

#Comment by Reedy (talk | contribs)   22:22, 6 January 2012

Not exactly scaptrap. We use scaptrap when pushing this revision to site (this isn't a php revision) needs config or other action

Recompiling can be done at any stage

#Comment by Duplicatebug (talk | contribs)   22:32, 6 January 2012

At the moment you are using things like 1.18wmf1 to push revisions to site. Regardless of which tag is added, this revision needs further action after deployment. Please do not wait more than 8 months before the next recompile is done (bug 27720). Thanks.

Status & tagging log