r83716 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r83716
Date:20:50, 11 March 2011
Author:simetrical
Status:ok (Comments)
Tags:
Comment:
Normalize named entities to numeric

We should never output named entities other than the ones in XML
(&lt; &gt; &amp; &quot;), because that breaks well-formedness unless
we have a DTD in the doctype, which we don't in HTML5 mode.

I stuck with outputting numeric entities here instead of UTF-8 because
some characters are hard to read in UTF-8 (e.g., the non-breaking space, U+00A0). Maybe it
would be nicer if we decoded to UTF-8 except for whitespace and control
characters, or something like that, but it's a detail.
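
The well-formedness point can be checked with a small sketch. It assumes PHP's DOM extension (libxml) is available; `isWellFormedFragment` is a hypothetical helper written for this illustration, not a MediaWiki function.

```php
<?php
// Hypothetical helper, not part of MediaWiki: wraps a fragment in a root
// element and asks libxml (via PHP's DOM extension) to parse it as XML
// with no DTD, which is the situation under an HTML5 doctype.
function isWellFormedFragment( string $fragment ): bool {
	// Collect parse errors internally instead of emitting PHP warnings.
	$previous = libxml_use_internal_errors( true );
	$doc = new DOMDocument();
	$ok = $doc->loadXML( "<root>$fragment</root>" );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );
	return $ok;
}

var_dump( isWellFormedFragment( '&nbsp; Main &nbsp; Page' ) ); // bool(false): nbsp is undeclared
var_dump( isWellFormedFragment( '&#160; Main &#160; Page' ) ); // bool(true)
var_dump( isWellFormedFragment( '&lt; &gt; &amp; &quot;' ) );  // bool(true): predefined in XML
```

Numeric references need no declaration, so they parse without a DTD, while `&nbsp;` only resolves when a DTD declares it.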

I'll backport to 1.17 and add RELEASE-NOTES there, which is why I added
the line to HISTORY instead of RELEASE-NOTES.
Modified paths:
  • /trunk/phase3/HISTORY (modified)
  • /trunk/phase3/includes/Sanitizer.php (modified)
  • /trunk/phase3/tests/parser/parserTests.txt (modified)

Diff

Index: trunk/phase3/HISTORY
@@ -455,6 +455,7 @@
 * (bug 20244) Installer does not validate SQLite database directory for stable path
 * (bug 1379) Installer directory conflicts with some hosts' configuration panel.
 * (bug 12070) After Installation MySQL was blocked
+* Fix XML well-formedness on a few pages when $wgHtml5 is true (the default)
 
 === API changes in 1.17 ===
 * (bug 22738) Allow filtering by action type on query=logevent.
Index: trunk/phase3/tests/parser/parserTests.txt
@@ -1264,7 +1264,7 @@
 <caption>Multiplication table
 </caption>
 <tr>
-<th> &times; </th>
+<th> &#215; </th>
 <th> 1 </th>
 <th> 2 </th>
 <th> 3
@@ -1351,7 +1351,7 @@
 !! result
 <table border="1">
 <tr>
-<td> &alpha;
+<td> &#945;
 </td>
 <td>
 <table bgcolor="#ABCDEF" border="2">
@@ -1730,7 +1730,7 @@
 !! input
 [[&nbsp; Main &nbsp; Page &nbsp;]]
 !! result
-<p><a href="https://www.mediawiki.org/wiki/Main_Page" title="Main Page">&nbsp; Main &nbsp; Page &nbsp;</a>
+<p><a href="https://www.mediawiki.org/wiki/Main_Page" title="Main Page">&#160; Main &#160; Page &#160;</a>
 </p>
 !!end
 
Index: trunk/phase3/includes/Sanitizer.php
@@ -1093,7 +1093,8 @@
 	 * for XML and XHTML specifically. Any stray bits will be
 	 * &amp;-escaped to result in a valid text fragment.
 	 *
-	 * a. any named char refs must be known in XHTML
+	 * a. named char refs can only be &lt; &gt; &amp; &quot;, others are
+	 *   numericized (this way we're well-formed even without a DTD)
 	 * b. any numeric char refs must be legal chars, not invalid or forbidden
 	 * c. use &#x, not &#X
 	 * d. fix or reject non-valid attributes
@@ -1130,9 +1131,10 @@
 
 	/**
 	 * If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD,
-	 * return the named entity reference as is. If the entity is a
-	 * MediaWiki-specific alias, returns the HTML equivalent. Otherwise,
-	 * returns HTML-escaped text of pseudo-entity source (eg &amp;foo;)
+	 * return the equivalent numeric entity reference (except for the core &lt;
+	 * &gt; &amp; &quot;). If the entity is a MediaWiki-specific alias, returns
+	 * the HTML equivalent. Otherwise, returns HTML-escaped text of
+	 * pseudo-entity source (eg &amp;foo;)
 	 *
 	 * @param $name String
 	 * @return String
@@ -1141,8 +1143,11 @@
 		global $wgHtmlEntities, $wgHtmlEntityAliases;
 		if ( isset( $wgHtmlEntityAliases[$name] ) ) {
 			return "&{$wgHtmlEntityAliases[$name]};";
-		} elseif( isset( $wgHtmlEntities[$name] ) ) {
+		} elseif ( in_array( $name,
+		array( 'lt', 'gt', 'amp', 'quot' ) ) ) {
 			return "&$name;";
+		} elseif ( isset( $wgHtmlEntities[$name] ) ) {
+			return "&#{$wgHtmlEntities[$name]};";
 		} else {
 			return "&amp;$name;";
 		}
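
The branch order in the hunk above can be tried in isolation with the following sketch. It is not the code from Sanitizer.php: `normalizeEntitySketch` and its three-entry table are hypothetical stand-ins for the real function and `$wgHtmlEntities`, and the alias branch is left out for brevity. The code points are the ones that appear in the updated parser tests.

```php
<?php
// Hypothetical stand-alone sketch of the lookup order in the diff above.
function normalizeEntitySketch( string $name ): string {
	// Assumed, trimmed-down stand-in for $wgHtmlEntities (name => code point).
	$entities = [ 'nbsp' => 160, 'times' => 215, 'alpha' => 945 ];

	if ( in_array( $name, [ 'lt', 'gt', 'amp', 'quot' ], true ) ) {
		// The four core entities stay named.
		return "&$name;";
	} elseif ( isset( $entities[$name] ) ) {
		// Any other known named entity becomes a decimal numeric reference.
		return "&#{$entities[$name]};";
	}
	// Unknown names are escaped so they show up as literal text.
	return "&amp;$name;";
}

echo normalizeEntitySketch( 'times' ), "\n"; // &#215;
echo normalizeEntitySketch( 'alpha' ), "\n"; // &#945;
echo normalizeEntitySketch( 'amp' ), "\n";   // &amp;
echo normalizeEntitySketch( 'foo' ), "\n";   // &amp;foo;
```

In the sketch, as in the diff, the check for the four core names has to come before the table lookup. Otherwise `&amp;` and the others would be turned into numeric references as well.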

Follow-up revisions

Revision | Commit summary | Author | Date
r83717 | Backport r83716 "Normalize named entities to numeric"... | simetrical | 20:54, 11 March 2011
r83803 | (follow-up r83716) fix parser test for the entities being decoded into numeri... | bawolff | 00:34, 13 March 2011
r83932 | 1.17wmf1: MFT r78990, r79844, r81548, r82022, r82193, r83061, r83067, r83583,... | catrope | 17:59, 14 March 2011

Comments

Comment by Bawolff, 20:30, 12 March 2011

You missed a parser test (the "text with character entity: eacute..." one).

Comment by Simetrical, 00:21, 13 March 2011

Probably because the parser tests crash for me with an error about a missing table:

This is MediaWiki version 1.18alpha.

Reading tests from "tests/parser/parserTests.txt"...
A database error has occurred.  Did you forget to run maintenance/update.php after upgrading?  See: [http://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script http://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script]
Query: SELECT  value,exptime  FROM `parsertest_objectcache`  WHERE keyname = 'wikidb-parsertest_:message-profiling'  LIMIT 1  
Function: SqlBagOStuff::get
Error: 1146 Table 'wikidb.parsertest_objectcache' doesn't exist (localhost)

I fixed the errors I could see, but I can't figure out the correct expected result if the tests don't run.

Comment by Bawolff, 00:36, 13 March 2011

ok. Fixed the parser test in r83803.
