r90092 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r90091‎ | r90092 | r90093 >
Date:21:42, 14 June 2011
Author:wikinaut
Status:resolved (Comments)
Tags:
Comment:
fix for bug29371 . regex wordwrap with UTF8: do not use \b metacharacter. The problem is that JavaScript recognizes word boundaries only before/after ASCII letters (and numbers/underscore)
Modified paths:
  • /trunk/phase3/resources/jquery/jquery.autoEllipsis.js (modified) (history)
  • /trunk/phase3/resources/jquery/jquery.highlightText.js (modified) (history)
  • /trunk/phase3/resources/jquery/jquery.suggestions.js (modified) (history)

Diff [purge]

Index: trunk/phase3/resources/jquery/jquery.suggestions.js
@@ -229,7 +229,7 @@
230230 } else {
231231 result = selected.prev();
232232 if ( selected.length == 0 ) {
233 - // we are at the begginning, so lets jump to the last item
 233+ // we are at the beginning, so lets jump to the last item
234234 if ( context.data.$container.find( '.suggestions-special' ).html() != "" ) {
235235 result = context.data.$container.find( '.suggestions-special' );
236236 } else {
Index: trunk/phase3/resources/jquery/jquery.autoEllipsis.js
@@ -5,7 +5,7 @@
66
77 // Cache ellipsed substrings for every string-width-position combination
88 var cache = { };
9 -// Use a seperate cache when match highlighting is enabled
 9+// Use a separate cache when match highlighting is enabled
1010 var matchTextCache = { };
1111
1212 $.fn.autoEllipsis = function( options ) {
Index: trunk/phase3/resources/jquery/jquery.highlightText.js
@@ -21,9 +21,10 @@
2222 // if this is a text node
2323 if ( node.nodeType == 3 ) {
2424 // TODO - need to be smarter about the character matching here.
25 - // non latin characters can make regex think a new word has begun.
26 - // look for an occurence of our pattern and store the starting position
27 - var pos = node.data.search( new RegExp( "\\b" + $.escapeRE( pat ), "i" ) );
 25+ // non latin characters can make regex think a new word has begun: do not use \b
 26+ // http://stackoverflow.com/questions/3787072/regex-wordwrap-with-utf8-characters-in-js
 27+ // look for an occurence of our pattern and store the starting position
 28+ var pos = node.data.search( new RegExp( "(^|\\s)" + $.escapeRE( pat ), "i" ) );
2829 if ( pos >= 0 ) {
2930 // create the span wrapper for the matched text
3031 var spannode = document.createElement( 'span' );

Follow-up revisions

RevisionCommit summaryAuthorDate
r94609tests for jquery.highlightText...hashar08:04, 16 August 2011
r94702Fix a regression in search highlighting from r90092...brion22:42, 16 August 2011
r95262follow up to r94807 : more test cases incl. hebrew RTL and Japanese, the test...wikinaut22:15, 22 August 2011
r96622MFT r94702: fix for a jquery.highlightText regression in r90092brion22:26, 8 September 2011
r96623MFT r90092, r94702 - jquery.highlightText.js fixes for non-ASCII chars in Vec...brion22:27, 8 September 2011
r96625MFT r90092, r94702 - jquery.highlightText.js fixes for non-ASCII chars in Vec...brion22:28, 8 September 2011

Comments

#Comment by Wikinaut (talk | contribs)   21:44, 14 June 2011

fix for bug29371 for cases where the needle begins with a non-ASCII like "Ö" when it should highlight the "Ö" in "Österreich"

#Comment by Brion VIBBER (talk | contribs)   21:51, 14 June 2011

Just a warning; this will fail to match on cases where the boundary was ASCII punctuation, though I don't know whether or not there are cases where that currently matters.

Needs qunit test cases.

#Comment by Wikinaut (talk | contribs)   21:59, 14 June 2011
#Comment by Wikinaut (talk | contribs)   00:23, 24 June 2011

lowered from fixme to new, because until now I could not detect any problem.

#Comment by 😂 (talk | contribs)   01:40, 2 August 2011

Putting back to fixme, definitely needs tests as Brion said above.

#Comment by Wikinaut (talk | contribs)   06:44, 16 August 2011

Hello all, in order to get this fixed, I need test cases. Test cases needed!

I only can say, it *does* reliably work for me with non-ASCIIs on several installations, and I tend to st status to "ok". Please, dear Code Reviewer, give me hints how to test the "border" case.

#Comment by Wikinaut (talk | contribs)   06:44, 16 August 2011

Urgently: test cases needed. For me, it works.

#Comment by Hashar (talk | contribs)   08:05, 16 August 2011

Basic framework in r94609 . You should be able to add more tests easily from there.

#Comment by Brion VIBBER (talk | contribs)   22:43, 16 August 2011

r94702 fixes a regression found in a test case that Hashar had commented out in r94609, enabled since r94681.

#Comment by Wikinaut (talk | contribs)   00:30, 17 August 2011

tested on my machine, looks, as if this is fixed now. in other words: thank you both

#Comment by Wikinaut (talk | contribs)   00:35, 17 August 2011

status is still "fixme". should it now become "resolved", or "ok" ?

#Comment by Brion VIBBER (talk | contribs)   00:52, 17 August 2011

One more case I'm concerned about is going to be regressions on punctuation that appears before a word:

  • "Goat-horned epidendrum", match on "goat horn" turns up "<span class="highlight">Goat</span>-<span class="highlight">horn</span>ed epidendrum" in current deployment. Current regex I think won't match on the "horn".

If that's cleared up then I'm probably ok marking it resolved -- you should only set this one back to 'new' though to remind others to check, I don't think the system will let you resolve it yourself.


It's also worth testing texts that might be more complicated to match entirely, like:

  • similar cases with non-ASCII punctuation chars, such as em-dash, German or French quotes, or CJK brackets/quotes/etc
  • text without spaces (Japanese, Chinese, Thai text?)

Those however probably already don't really work well, so they're not going to be regressions from the previous code. :) So don't worry about adding/fixing those yet.

#Comment by Hashar (talk | contribs)   06:50, 23 August 2011

> French quotes

That is the practical name for american people. European people are more in poetry than in explicit description, so those quotes are named : "guillemets". (was just to tease you as an american :b)

Wikinaut, you will probably have to copy paste the french guillemets from there: « » French also use extensively the apostrophe. Most key set will use a single quote ' (0x27), but some layouts output a single guillemet ’ (0x92 or ’). Example using all of that: « L'oiseau est sur l’île » (single guillemet last).


Note: The single guillemet is really what we want in French, unfortunately, since we lost the computer war, we rely on the practical american single quote nowaday :-)

#Comment by Wikinaut (talk | contribs)   07:00, 23 August 2011

Yes, but first let's test and concentrate to check highlighting for what makes the characters, and characters with accents and combining characters with diacritical marks (including combining accents) http://en.wikipedia.org/wiki/Combining_character .

BTW, highlight works it works for « Hashar » . Will be added as test case.

#Comment by Wikinaut (talk | contribs)   22:53, 22 August 2011

Set to new, as the qunit tests show it working, incl. Japanese, and also RTL language hebrew.

Please see follow up r95262 .

#Comment by Wikinaut (talk | contribs)   05:53, 24 August 2011

Also this can be now please be set to "resolved", too; has numerous follow-ups especially follow-up r94702 and test case submissions.

#Comment by Brion VIBBER (talk | contribs)   22:29, 8 September 2011

This & the followup r94702 are now merged to 1.17wmf1, rel1.17, and rel1.18 (which already had this one).

Marked resolved!

Status & tagging log