r83717 MediaWiki - Code Review archive

Repository: MediaWiki
Revision: r83717 (previous: r83716, next: r83718)
Date: 20:54, 11 March 2011
Author: simetrical
Status: ok
Tags:
Comment:
Backport r83716 "Normalize named entities to numeric"

Didn't bother merging the parser tests, so those are only in trunk.
Original commit message:

We should never be outputting named entities other than the ones in XML,
&lt; &gt; &amp; &quot;, because that will break well-formedness unless
we have a DTD in the doctype, which we don't in HTML5 mode.
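
The well-formedness point above can be checked with a small sketch. It assumes PHP's DOM extension (libxml2) is available; `isWellFormedXml` is a hypothetical helper written for this illustration and is not part of MediaWiki.

```php
<?php
// Hypothetical helper (not MediaWiki code): reports whether a text fragment
// parses as well-formed XML when no DTD is present to declare extra entities.
function isWellFormedXml( string $fragment ): bool {
	// Collect libxml errors internally instead of emitting PHP warnings.
	$previous = libxml_use_internal_errors( true );
	$doc = new DOMDocument();
	$ok = $doc->loadXML( "<root>$fragment</root>" );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );
	return $ok;
}

// A named HTML entity with no DTD to declare it.
var_dump( isWellFormedXml( 'a&nbsp;b' ) );
// The same character as a numeric reference, which needs no DTD.
var_dump( isWellFormedXml( 'a&#160;b' ) );
// Entities predefined by XML itself.
var_dump( isWellFormedXml( '&lt; &gt; &amp; &quot;' ) );
```

The first call should be rejected by the parser because `nbsp` is undeclared, while the numeric and predefined forms should parse. That is the motivation for numericizing everything outside the core set.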

I stuck with outputting numeric entities here instead of UTF-8 because
some characters are hard to read in UTF-8 (e.g., &#160;). Maybe it
would be nicer if we decoded to UTF-8 except for whitespace and control
characters, or something like that, but it's a detail.
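
The alternative floated in this paragraph (decode to UTF-8 except for whitespace and control characters) was not implemented in this revision. A rough sketch of what it might look like follows, assuming the mbstring extension and PHP 7.2 or later for `mb_chr`; `renderCodePoint` is a hypothetical name, and the choice of Unicode categories is an assumption.

```php
<?php
// Hypothetical sketch, not MediaWiki code: emit a code point as literal UTF-8
// unless it is hard to read in source (whitespace, control characters) or is
// XML-significant, in which case keep a numeric character reference.
function renderCodePoint( int $codePoint ): string {
	$char = mb_chr( $codePoint, 'UTF-8' );
	if ( $char === false ) {
		// Not a valid code point; leave it as a numeric reference.
		return "&#$codePoint;";
	}
	// \p{Z} covers separators (including U+00A0); \p{C} covers control,
	// format, and other non-printing categories.
	if ( preg_match( '/^[\p{Z}\p{C}]$/u', $char ) ) {
		return "&#$codePoint;";
	}
	// Never emit markup-significant characters literally.
	if ( in_array( $char, [ '<', '>', '&', '"' ], true ) ) {
		return "&#$codePoint;";
	}
	return $char;
}

echo renderCodePoint( 0xE9 ), "\n"; // a printable letter, emitted as UTF-8
echo renderCodePoint( 0xA0 ), "\n"; // non-breaking space, kept numeric
```

Whether this trade-off is worth the extra logic is left open, as the commit message itself says.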

I'll backport to 1.17 and add RELEASE-NOTES there, which is why I added
the line to HISTORY instead of RELEASE-NOTES.
Modified paths:
  • /branches/REL1_17/phase3/RELEASE-NOTES (modified)
  • /branches/REL1_17/phase3/includes/Sanitizer.php (modified)

Diff

Index: branches/REL1_17/phase3/includes/Sanitizer.php
@@ -1107,7 +1107,8 @@
 	 * for XML and XHTML specifically. Any stray bits will be
 	 * &amp;-escaped to result in a valid text fragment.
 	 *
-	 * a. any named char refs must be known in XHTML
+	 * a. named char refs can only be &lt; &gt; &amp; &quot;, others are
+	 *   numericized (this way we're well-formed even without a DTD)
 	 * b. any numeric char refs must be legal chars, not invalid or forbidden
 	 * c. use &#x, not &#X
 	 * d. fix or reject non-valid attributes
@@ -1146,9 +1147,10 @@
 
 	/**
 	 * If the named entity is defined in the HTML 4.0/XHTML 1.0 DTD,
-	 * return the named entity reference as is. If the entity is a
-	 * MediaWiki-specific alias, returns the HTML equivalent. Otherwise,
-	 * returns HTML-escaped text of pseudo-entity source (eg &amp;foo;)
+	 * return the equivalent numeric entity reference (except for the core &lt;
+	 * &gt; &amp; &quot;). If the entity is a MediaWiki-specific alias, returns
+	 * the HTML equivalent. Otherwise, returns HTML-escaped text of
+	 * pseudo-entity source (eg &amp;foo;)
 	 *
 	 * @param $name String
 	 * @return String
@@ -1157,8 +1159,11 @@
 		global $wgHtmlEntities, $wgHtmlEntityAliases;
 		if ( isset( $wgHtmlEntityAliases[$name] ) ) {
 			return "&{$wgHtmlEntityAliases[$name]};";
-		} elseif( isset( $wgHtmlEntities[$name] ) ) {
+		} elseif ( in_array( $name,
+			array( 'lt', 'gt', 'amp', 'quot' ) ) ) {
 			return "&$name;";
+		} elseif ( isset( $wgHtmlEntities[$name] ) ) {
+			return "&#{$wgHtmlEntities[$name]};";
 		} else {
 			return "&amp;$name;";
 		}
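
The branch order in the last hunk can be exercised in isolation with a standalone sketch. The tables below are tiny stand-ins for `$wgHtmlEntities` and `$wgHtmlEntityAliases`; their contents, the alias name `myalias`, and the function name `normalizeEntitySketch` are assumptions for illustration and are not taken from Sanitizer.php.

```php
<?php
// Illustrative stand-ins for the real (much larger) global tables.
$entities = [ 'amp' => 38, 'lt' => 60, 'gt' => 62, 'quot' => 34, 'nbsp' => 160, 'eacute' => 233 ];
$aliases = [ 'myalias' => 'rlm' ]; // hypothetical alias name

// Mirrors the branch order of the patched code, with the tables passed in
// as parameters instead of read from globals.
function normalizeEntitySketch( string $name, array $entities, array $aliases ): string {
	if ( isset( $aliases[$name] ) ) {
		// Alias: map to the canonical entity name.
		return "&{$aliases[$name]};";
	} elseif ( in_array( $name, [ 'lt', 'gt', 'amp', 'quot' ], true ) ) {
		// Core XML entities stay named.
		return "&$name;";
	} elseif ( isset( $entities[$name] ) ) {
		// Any other known entity becomes a numeric reference.
		return "&#{$entities[$name]};";
	} else {
		// Unknown name: escape the ampersand so the text shows literally.
		return "&amp;$name;";
	}
}

foreach ( [ 'nbsp', 'eacute', 'amp', 'foo', 'myalias' ] as $name ) {
	echo $name, ' => ', normalizeEntitySketch( $name, $entities, $aliases ), "\n";
}
```

The check for the four core names has to come before the general table lookup, because those names also appear in the entity table and would otherwise be numericized too.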
Index: branches/REL1_17/phase3/RELEASE-NOTES
@@ -498,6 +498,7 @@
 * (bug 1379) Installer directory conflicts with some hosts' configuration panel.
 * (bug 27781) Installer does not warn about 5.1.x. Added a compatibility function
   for array_key_exists().
+* Fix XML well-formedness on a few pages when $wgHtml5 is true (the default)
 
 === API changes in 1.17 ===
 * BREAKING CHANGE: action=patrol now requires POST

Past revisions this follows-up on

Revision | Commit summary | Author | Date
r83716 | Normalize named entities to numeric... | simetrical | 20:50, 11 March 2011

Status & tagging log