Remove most named character references from output
This was basically just
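find -iname '*.php' -exec sed -i 's/&nbsp;/\&#160;/g;s/&mdash;/―/g;s/&bull;/•/g;s/&aacute;/á/g;s/&acute;/´/g;s/&agrave;/à/g;s/&auml;/ä/g;s/&copy;/©/g;s/&darr;/↓/g;s/&deg;/°/g;s/&eacute;/é/g;s/&egrave;/è/g;s/&euro;/€/g;s/&hellip;/…/g;s/&iacute;/í/g;s/&igrave;/ì/g;s/&larr;/←/g;s/&middot;/·/g;s/&minus;/−/g;s/&ndash;/–/g;s/&oacute;/ó/g;s/&ocirc;/ô/g;s/&ograve;/ò/g;s/&otilde;/õ/g;s/&ouml;/ö/g;s/&pound;/£/g;s/&prime;/′/g;s/&Prime;/″/g;s/&raquo;/»/g;s/&rarr;/→/g;s/&uacute;/ú/g;s/&uarr;/↑/g;s/&yen;/¥/g' {} +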
followed by reading over every single line of the resulting diff and
fixing a whole bunch of false positives. The reason for this change is
given in
<http://lists.wikimedia.org/pipermail/wikitech-l/2010-April/047617.html>.
I cleared it with Tim and Brion on IRC before committing. It might
cause a few problems, but I tried to be careful; please report any
issues.
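For anyone who wants to try the approach on a small scale first, here is
a minimal sketch, assuming GNU sed and GNU grep. The function names are
hypothetical and are not part of this commit; only three of the entities
are covered. Note that `&` in a sed replacement means "the matched text",
so a literal ampersand has to be written as `\&`.

```shell
# Hypothetical helper: convert a few named entities read from stdin.
# &nbsp; becomes the numeric reference &#160; (not a raw UTF-8 character);
# the others become the literal characters.
replace_entities() {
    sed 's/&nbsp;/\&#160;/g; s/&copy;/©/g; s/&rarr;/→/g'
}

# Hypothetical helper: count the named entities still present in the
# *.php files under a directory (default: current directory), most
# frequent first, so the leftovers can be reviewed by hand.
list_named_entities() {
    grep -rhoE --include='*.php' '&[A-Za-z]+[0-9]*;' "${1:-.}" \
        | sort | uniq -c | sort -rn
}

# Example: preview the substitution on one line of input.
printf '%s\n' 'Foo&nbsp;bar &copy; 2010 &rarr; next' | replace_entities
```

Running list_named_entities before and after the substitution gives a
rough idea of what is left, but it is no substitute for reading the diff:
a blind replacement cannot tell output HTML from strings that are
supposed to contain the entity name.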
I skipped all messages files. I plan to make a follow-up commit that
alters wfMsgExt() with 'escapenoentities' to sanitize all the entities.
That way, the only messages that will be problems will be ones that
output raw HTML, and we want to get rid of those anyway.
For now, I skipped ‌    . I'll catch these at another
pass. As with &nbsp;, I won't replace these with the actual UTF-8 (too
confusing), but these are pretty few, so maybe I'll be able to determine
by inspection that I don't have to replace them by hard-to-remember
numeric codes.
Also, to everyone who uses non-breaking spaces when they could use a
normal space, or nothing at all, or CSS padding: I hate you. Die.