r9911 pywikipedia - Code Review archive

Repository:pywikipedia
Revision:r9911
Date:14:24, 19 February 2012
Author:drtrigon
Status:old
Tags:
Comment:
Adding capabilities of DrTrigonBot 'textlib' script; 'removeHTMLParts'
(this is a follow-up or bug fix for r9902 also)
Modified paths:
  • /trunk/pywikipedia/pywikibot/textlib.py (modified)

Diff

Index: trunk/pywikipedia/pywikibot/textlib.py
@@ -16,6 +16,7 @@

 import wikipedia as pywikibot
 import re
+from HTMLParser import HTMLParser

 def unescape(s):
     """Replace escaped HTML-special characters by their originals"""
@@ -219,6 +220,40 @@
     return toRemoveR.sub('', text)


+def removeHTMLParts(text, keeptags = ['tt', 'nowiki', 'small', 'sup']):
+    """
+    Return text without portions where HTML markup is disabled
+
+    Parts that can/will be removed are --
+    * HTML and all wiki tags
+
+    The exact set of parts which should NOT be removed can be passed as the
+    'keeptags' parameter, which defaults to ['tt', 'nowiki', 'small', 'sup'].
+    """
+    # try to merge with 'removeDisabledParts()' above into one generic function
+
+    # thanks to http://www.hellboundhackers.org/articles/841-using-python-39;s-htmlparser-class.html
+    parser = _GetDataHTML()
+    parser.keeptags = keeptags
+    parser.feed(text)
+    parser.close()
+    return parser.textdata
+
+# thanks to http://docs.python.org/library/htmlparser.html
+class _GetDataHTML(HTMLParser):
+    textdata = u''
+    keeptags = []
+
+    def handle_data(self, data):
+        self.textdata += data
+
+    def handle_starttag(self, tag, attrs):
+        if tag in self.keeptags: self.textdata += u"<%s>" % tag
+
+    def handle_endtag(self, tag):
+        if tag in self.keeptags: self.textdata += u"</%s>" % tag
+
+
 def isDisabled(text, index, tags = ['*']):
     """
     Return True if text[index] is disabled, e.g. by a comment or by nowiki tags.

Past revisions this follows up on

Revision | Commit summary | Author | Date
r9902 | Adding capabilities of DrTrigonBot 'wikipedia' script; 'getParsedString' and ... | drtrigon | 11:34, 17 February 2012

Status & tagging log