Commit 43851a2

bpo-36673: Implement comment/PI parsing support for the TreeBuilder in ElementTree. (#12883)
* bpo-36673: Implement comment/PI parsing support for the TreeBuilder in ElementTree.
* bpo-36673: Rewrite the comment/PI factory handling for the TreeBuilder in "_elementtree" to make it use the same factories as the ElementTree module, and to make it explicit when the comments/PIs are inserted into the tree and when they are not (which is the default).
1 parent 3d37ea2 commit 43851a2

File tree

6 files changed

+630
-54
lines changed

Doc/library/xml.etree.elementtree.rst

Lines changed: 53 additions & 12 deletions
@@ -523,8 +523,9 @@ Functions
523523
Parses an XML section into an element tree incrementally, and reports what's
524524
going on to the user. *source* is a filename or :term:`file object`
525525
containing XML data. *events* is a sequence of events to report back. The
526-
supported events are the strings ``"start"``, ``"end"``, ``"start-ns"`` and
527-
``"end-ns"`` (the "ns" events are used to get detailed namespace
526+
supported events are the strings ``"start"``, ``"end"``, ``"comment"``,
527+
``"pi"``, ``"start-ns"`` and ``"end-ns"``
528+
(the "ns" events are used to get detailed namespace
528529
information). If *events* is omitted, only ``"end"`` events are reported.
529530
*parser* is an optional parser instance. If not given, the standard
530531
:class:`XMLParser` parser is used. *parser* must be a subclass of
@@ -549,6 +550,10 @@ Functions
549550
.. deprecated:: 3.4
550551
The *parser* argument.
551552

553+
.. versionchanged:: 3.8
554+
The ``comment`` and ``pi`` events were added.
555+
556+
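A minimal sketch of the new ``"comment"`` and ``"pi"`` events as seen through ``iterparse()``, assuming Python 3.8 or later. The sample document and the ``seen`` list are made up for illustration and are not part of the commit.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the commit.
xml_data = b"<root><!-- note --><?target some data?><child/></root>"

seen = []
for event, elem in ET.iterparse(
        io.BytesIO(xml_data), events=("start", "end", "comment", "pi")):
    # Comments and PIs carry their content in .text; their .tag is the
    # factory function (ET.Comment / ET.PI) rather than a string.
    label = elem.text if event in ("comment", "pi") else elem.tag
    seen.append((event, label))

for item in seen:
    print(item)
```

Because only ``"end"`` events are reported when *events* is omitted, the two new event names have to be requested explicitly.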
552557
.. function:: parse(source, parser=None)
553558

554559
Parses an XML section into an element tree. *source* is a filename or file
@@ -1021,14 +1026,24 @@ TreeBuilder Objects
10211026
^^^^^^^^^^^^^^^^^^^
10221027

10231028

1024-
.. class:: TreeBuilder(element_factory=None)
1029+
.. class:: TreeBuilder(element_factory=None, *, comment_factory=None, \
1030+
pi_factory=None, insert_comments=False, insert_pis=False)
10251031

10261032
Generic element structure builder. This builder converts a sequence of
1027-
start, data, and end method calls to a well-formed element structure. You
1028-
can use this class to build an element structure using a custom XML parser,
1029-
or a parser for some other XML-like format. *element_factory*, when given,
1030-
must be a callable accepting two positional arguments: a tag and
1031-
a dict of attributes. It is expected to return a new element instance.
1033+
start, data, end, comment and pi method calls to a well-formed element
1034+
structure. You can use this class to build an element structure using
1035+
a custom XML parser, or a parser for some other XML-like format.
1036+
1037+
*element_factory*, when given, must be a callable accepting two positional
1038+
arguments: a tag and a dict of attributes. It is expected to return a new
1039+
element instance.
1040+
1041+
The *comment_factory* and *pi_factory* functions, when given, should behave
1042+
like the :func:`Comment` and :func:`ProcessingInstruction` functions to
1043+
create comments and processing instructions. When not given, the default
1044+
factories will be used. When *insert_comments* and/or *insert_pis* is true,
1045+
comments/pis will be inserted into the tree if they appear within the root
1046+
element (but not outside of it).
10321047

10331048
.. method:: close()
10341049

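The effect of *insert_comments* and *insert_pis* can be sketched as follows, assuming Python 3.8 or later. The input string is a hypothetical example; the leading comment sits outside the root element and is therefore expected to be left out of the tree.

```python
import xml.etree.ElementTree as ET

# Ask the builder to keep comments and PIs that appear inside the root.
builder = ET.TreeBuilder(insert_comments=True, insert_pis=True)
parser = ET.XMLParser(target=builder)

# Hypothetical input: one comment outside the root, one inside it.
parser.feed("<!-- outside --><root><!-- inside --><?pi data?><a/></root>")
root = parser.close()

for child in root:
    print(child.tag, repr(child.text))
```

With the default ``insert_comments=False`` and ``insert_pis=False``, the same input should yield a root whose only child is ``a``.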
@@ -1054,6 +1069,22 @@ TreeBuilder Objects
10541069
containing element attributes. Returns the opened element.
10551070

10561071

1072+
.. method:: comment(text)
1073+
1074+
Creates a comment with the given *text*. If ``insert_comments`` is true,
1075+
this will also add it to the tree.
1076+
1077+
.. versionadded:: 3.8
1078+
1079+
1080+
.. method:: pi(target, text)
1081+
1082+
Creates a processing instruction with the given *target* name and *text*. If
1083+
``insert_pis`` is true, this will also add it to the tree.
1084+
1085+
.. versionadded:: 3.8
1086+
1087+
10571088
In addition, a custom :class:`TreeBuilder` object can provide the
10581089
following method:
10591090

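A small sketch of calling ``comment()`` and ``pi()`` directly with a custom factory, assuming Python 3.8 or later. ``recording_comment_factory`` and ``recorded`` are hypothetical names introduced here; the factory simply wraps :func:`Comment` and records what it was given.

```python
import xml.etree.ElementTree as ET

recorded = []

def recording_comment_factory(text):
    # Hypothetical factory: behaves like ET.Comment but logs each call.
    recorded.append(text)
    return ET.Comment(text)

builder = ET.TreeBuilder(comment_factory=recording_comment_factory)

# Neither call inserts anything into a tree, since insert_comments and
# insert_pis keep their default value of False.
comment = builder.comment("hello")
pi = builder.pi("target", "data")

print(recorded, comment.text, pi.text)
```

The return value of each method is whatever the corresponding factory returns, so a factory is free to return something other than an element.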
@@ -1150,9 +1181,9 @@ XMLPullParser Objects
11501181
callback target, :class:`XMLPullParser` collects an internal list of parsing
11511182
events and lets the user read from it. *events* is a sequence of events to
11521183
report back. The supported events are the strings ``"start"``, ``"end"``,
1153-
``"start-ns"`` and ``"end-ns"`` (the "ns" events are used to get detailed
1154-
namespace information). If *events* is omitted, only ``"end"`` events are
1155-
reported.
1184+
``"comment"``, ``"pi"``, ``"start-ns"`` and ``"end-ns"`` (the "ns" events
1185+
are used to get detailed namespace information). If *events* is omitted,
1186+
only ``"end"`` events are reported.
11561187

11571188
.. method:: feed(data)
11581189

@@ -1171,7 +1202,13 @@ XMLPullParser Objects
11711202
data fed to the
11721203
parser. The iterator yields ``(event, elem)`` pairs, where *event* is a
11731204
string representing the type of event (e.g. ``"end"``) and *elem* is the
1174-
encountered :class:`Element` object.
1205+
encountered :class:`Element` object, or other context value as follows.
1206+
1207+
* ``start``, ``end``: the current Element.
1208+
* ``comment``, ``pi``: the current comment / processing instruction
1209+
* ``start-ns``: a tuple ``(prefix, uri)`` naming the declared namespace
1210+
mapping.
1211+
* ``end-ns``: :const:`None` (this may change in a future version)
11751212

11761213
Events provided in a previous call to :meth:`read_events` will not be
11771214
yielded again. Events are consumed from the internal queue only when
@@ -1191,6 +1228,10 @@ XMLPullParser Objects
11911228

11921229
.. versionadded:: 3.4
11931230

1231+
.. versionchanged:: 3.8
1232+
The ``comment`` and ``pi`` events were added.
1233+
1234+
11941235
Exceptions
11951236
^^^^^^^^^^
11961237

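The different values yielded by ``read_events()`` can be sketched as follows, assuming Python 3.8 or later. The document and the ``urn:example`` namespace URI are hypothetical.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(
    events=("start", "end", "comment", "pi", "start-ns", "end-ns"))

# Hypothetical document with one namespace declaration.
parser.feed('<root xmlns:x="urn:example"><!-- c --><?t d?></root>')

collected = list(parser.read_events())
for event, value in collected:
    if event in ("start", "end"):
        print(event, value.tag)      # the current Element
    elif event in ("comment", "pi"):
        print(event, value.text)     # the comment / processing instruction
    else:
        print(event, value)          # (prefix, uri) tuple, or None for end-ns
```

Events already returned by one ``read_events()`` call are not yielded again, so the list is collected once and reused.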
Lib/test/test_xml_etree.py

Lines changed: 87 additions & 3 deletions
@@ -1194,6 +1194,12 @@ def _feed(self, parser, data, chunk_size=None):
11941194
for i in range(0, len(data), chunk_size):
11951195
parser.feed(data[i:i+chunk_size])
11961196

1197+
def assert_events(self, parser, expected):
1198+
self.assertEqual(
1199+
[(event, (elem.tag, elem.text))
1200+
for event, elem in parser.read_events()],
1201+
expected)
1202+
11971203
def assert_event_tags(self, parser, expected):
11981204
events = parser.read_events()
11991205
self.assertEqual([(action, elem.tag) for action, elem in events],
@@ -1276,8 +1282,10 @@ def test_events(self):
12761282
self.assert_event_tags(parser, [])
12771283

12781284
parser = ET.XMLPullParser(events=('start', 'end'))
1279-
self._feed(parser, "<!-- comment -->\n")
1280-
self.assert_event_tags(parser, [])
1285+
self._feed(parser, "<!-- text here -->\n")
1286+
self.assert_events(parser, [])
1287+
1288+
parser = ET.XMLPullParser(events=('start', 'end'))
12811289
self._feed(parser, "<root>\n")
12821290
self.assert_event_tags(parser, [('start', 'root')])
12831291
self._feed(parser, "<element key='value'>text</element")
@@ -1314,6 +1322,33 @@ def test_events(self):
13141322
self._feed(parser, "</root>")
13151323
self.assertIsNone(parser.close())
13161324

1325+
def test_events_comment(self):
1326+
parser = ET.XMLPullParser(events=('start', 'comment', 'end'))
1327+
self._feed(parser, "<!-- text here -->\n")
1328+
self.assert_events(parser, [('comment', (ET.Comment, ' text here '))])
1329+
self._feed(parser, "<!-- more text here -->\n")
1330+
self.assert_events(parser, [('comment', (ET.Comment, ' more text here '))])
1331+
self._feed(parser, "<root-tag>text")
1332+
self.assert_event_tags(parser, [('start', 'root-tag')])
1333+
self._feed(parser, "<!-- inner comment-->\n")
1334+
self.assert_events(parser, [('comment', (ET.Comment, ' inner comment'))])
1335+
self._feed(parser, "</root-tag>\n")
1336+
self.assert_event_tags(parser, [('end', 'root-tag')])
1337+
self._feed(parser, "<!-- outer comment -->\n")
1338+
self.assert_events(parser, [('comment', (ET.Comment, ' outer comment '))])
1339+
1340+
parser = ET.XMLPullParser(events=('comment',))
1341+
self._feed(parser, "<!-- text here -->\n")
1342+
self.assert_events(parser, [('comment', (ET.Comment, ' text here '))])
1343+
1344+
def test_events_pi(self):
1345+
parser = ET.XMLPullParser(events=('start', 'pi', 'end'))
1346+
self._feed(parser, "<?pitarget?>\n")
1347+
self.assert_events(parser, [('pi', (ET.PI, 'pitarget'))])
1348+
parser = ET.XMLPullParser(events=('pi',))
1349+
self._feed(parser, "<?pitarget some text ?>\n")
1350+
self.assert_events(parser, [('pi', (ET.PI, 'pitarget some text '))])
1351+
13171352
def test_events_sequence(self):
13181353
# Test that events can be some sequence that's not just a tuple or list
13191354
eventset = {'end', 'start'}
@@ -1333,7 +1368,6 @@ def __next__(self):
13331368
self._feed(parser, "<foo>bar</foo>")
13341369
self.assert_event_tags(parser, [('start', 'foo'), ('end', 'foo')])
13351370

1336-
13371371
def test_unknown_event(self):
13381372
with self.assertRaises(ValueError):
13391373
ET.XMLPullParser(events=('start', 'end', 'bogus'))
@@ -2741,6 +2775,33 @@ class DummyBuilder(BaseDummyBuilder):
27412775
parser.feed(self.sample1)
27422776
self.assertIsNone(parser.close())
27432777

2778+
def test_treebuilder_comment(self):
2779+
b = ET.TreeBuilder()
2780+
self.assertEqual(b.comment('ctext').tag, ET.Comment)
2781+
self.assertEqual(b.comment('ctext').text, 'ctext')
2782+
2783+
b = ET.TreeBuilder(comment_factory=ET.Comment)
2784+
self.assertEqual(b.comment('ctext').tag, ET.Comment)
2785+
self.assertEqual(b.comment('ctext').text, 'ctext')
2786+
2787+
b = ET.TreeBuilder(comment_factory=len)
2788+
self.assertEqual(b.comment('ctext'), len('ctext'))
2789+
2790+
def test_treebuilder_pi(self):
2791+
b = ET.TreeBuilder()
2792+
self.assertEqual(b.pi('target', None).tag, ET.PI)
2793+
self.assertEqual(b.pi('target', None).text, 'target')
2794+
2795+
b = ET.TreeBuilder(pi_factory=ET.PI)
2796+
self.assertEqual(b.pi('target').tag, ET.PI)
2797+
self.assertEqual(b.pi('target').text, "target")
2798+
self.assertEqual(b.pi('pitarget', ' text ').tag, ET.PI)
2799+
self.assertEqual(b.pi('pitarget', ' text ').text, "pitarget text ")
2800+
2801+
b = ET.TreeBuilder(pi_factory=lambda target, text: (len(target), text))
2802+
self.assertEqual(b.pi('target'), (len('target'), None))
2803+
self.assertEqual(b.pi('pitarget', ' text '), (len('pitarget'), ' text '))
2804+
27442805
def test_treebuilder_elementfactory_none(self):
27452806
parser = ET.XMLParser(target=ET.TreeBuilder(element_factory=None))
27462807
parser.feed(self.sample1)
@@ -2761,6 +2822,21 @@ def foobar(self, x):
27612822
e = parser.close()
27622823
self._check_sample1_element(e)
27632824

2825+
def test_subclass_comment_pi(self):
2826+
class MyTreeBuilder(ET.TreeBuilder):
2827+
def foobar(self, x):
2828+
return x * 2
2829+
2830+
tb = MyTreeBuilder(comment_factory=ET.Comment, pi_factory=ET.PI)
2831+
self.assertEqual(tb.foobar(10), 20)
2832+
2833+
parser = ET.XMLParser(target=tb)
2834+
parser.feed(self.sample1)
2835+
parser.feed('<!-- a comment--><?and a pi?>')
2836+
2837+
e = parser.close()
2838+
self._check_sample1_element(e)
2839+
27642840
def test_element_factory(self):
27652841
lst = []
27662842
def myfactory(tag, attrib):
@@ -3418,6 +3494,12 @@ def test_main(module=None):
34183494
# Copy the path cache (should be empty)
34193495
path_cache = ElementPath._cache
34203496
ElementPath._cache = path_cache.copy()
3497+
# Align the Comment/PI factories.
3498+
if hasattr(ET, '_set_factories'):
3499+
old_factories = ET._set_factories(ET.Comment, ET.PI)
3500+
else:
3501+
old_factories = None
3502+
34213503
try:
34223504
support.run_unittest(*test_classes)
34233505
finally:
@@ -3426,6 +3508,8 @@ def test_main(module=None):
34263508
nsmap.clear()
34273509
nsmap.update(nsmap_copy)
34283510
ElementPath._cache = path_cache
3511+
if old_factories is not None:
3512+
ET._set_factories(*old_factories)
34293513
# don't interfere with subsequent tests
34303514
ET = pyET = None
34313515

Lib/xml/etree/ElementTree.py

Lines changed: 63 additions & 4 deletions
@@ -1374,21 +1374,39 @@ class TreeBuilder:
13741374
*element_factory* is an optional element factory which is called
13751375
to create new Element instances, as necessary.
13761376
1377+
*comment_factory* is a factory to create comments to be used instead of
1378+
the standard factory. If *insert_comments* is false (the default),
1379+
comments will not be inserted into the tree.
1380+
1381+
*pi_factory* is a factory to create processing instructions to be used
1382+
instead of the standard factory. If *insert_pis* is false (the default),
1383+
processing instructions will not be inserted into the tree.
13771384
"""
1378-
def __init__(self, element_factory=None):
1385+
def __init__(self, element_factory=None, *,
1386+
comment_factory=None, pi_factory=None,
1387+
insert_comments=False, insert_pis=False):
13791388
self._data = [] # data collector
13801389
self._elem = [] # element stack
13811390
self._last = None # last element
1391+
self._root = None # root element
13821392
self._tail = None # true if we're after an end tag
1393+
if comment_factory is None:
1394+
comment_factory = Comment
1395+
self._comment_factory = comment_factory
1396+
self.insert_comments = insert_comments
1397+
if pi_factory is None:
1398+
pi_factory = ProcessingInstruction
1399+
self._pi_factory = pi_factory
1400+
self.insert_pis = insert_pis
13831401
if element_factory is None:
13841402
element_factory = Element
13851403
self._factory = element_factory
13861404

13871405
def close(self):
13881406
"""Flush builder buffers and return toplevel document Element."""
13891407
assert len(self._elem) == 0, "missing end tags"
1390-
assert self._last is not None, "missing toplevel element"
1391-
return self._last
1408+
assert self._root is not None, "missing toplevel element"
1409+
return self._root
13921410

13931411
def _flush(self):
13941412
if self._data:
@@ -1417,6 +1435,8 @@ def start(self, tag, attrs):
14171435
self._last = elem = self._factory(tag, attrs)
14181436
if self._elem:
14191437
self._elem[-1].append(elem)
1438+
elif self._root is None:
1439+
self._root = elem
14201440
self._elem.append(elem)
14211441
self._tail = 0
14221442
return elem
@@ -1435,6 +1455,33 @@ def end(self, tag):
14351455
self._tail = 1
14361456
return self._last
14371457

1458+
def comment(self, text):
1459+
"""Create a comment using the comment_factory.
1460+
1461+
*text* is the text of the comment.
1462+
"""
1463+
return self._handle_single(
1464+
self._comment_factory, self.insert_comments, text)
1465+
1466+
def pi(self, target, text=None):
1467+
"""Create a processing instruction using the pi_factory.
1468+
1469+
*target* is the target name of the processing instruction.
1470+
*text* is the data of the processing instruction, or ''.
1471+
"""
1472+
return self._handle_single(
1473+
self._pi_factory, self.insert_pis, target, text)
1474+
1475+
def _handle_single(self, factory, insert, *args):
1476+
elem = factory(*args)
1477+
if insert:
1478+
self._flush()
1479+
self._last = elem
1480+
if self._elem:
1481+
self._elem[-1].append(elem)
1482+
self._tail = 1
1483+
return elem
1484+
14381485

14391486
# also see ElementTree and TreeBuilder
14401487
class XMLParser:
@@ -1519,6 +1566,15 @@ def handler(prefix, uri, event=event_name, append=append):
15191566
def handler(prefix, event=event_name, append=append):
15201567
append((event, None))
15211568
parser.EndNamespaceDeclHandler = handler
1569+
elif event_name == 'comment':
1570+
def handler(text, event=event_name, append=append, self=self):
1571+
append((event, self.target.comment(text)))
1572+
parser.CommentHandler = handler
1573+
elif event_name == 'pi':
1574+
def handler(pi_target, data, event=event_name, append=append,
1575+
self=self):
1576+
append((event, self.target.pi(pi_target, data)))
1577+
parser.ProcessingInstructionHandler = handler
15221578
else:
15231579
raise ValueError("unknown event %r" % event_name)
15241580

@@ -1640,7 +1696,10 @@ def close(self):
16401696
# (see tests)
16411697
_Element_Py = Element
16421698

1643-
# Element, SubElement, ParseError, TreeBuilder, XMLParser
1699+
# Element, SubElement, ParseError, TreeBuilder, XMLParser, _set_factories
16441700
from _elementtree import *
1701+
from _elementtree import _set_factories
16451702
except ImportError:
16461703
pass
1704+
else:
1705+
_set_factories(Comment, ProcessingInstruction)
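The event handlers above forward to ``self.target.comment(...)`` and ``self.target.pi(...)``. As a rough illustration of that dispatch, the sketch below passes a custom target to ``XMLParser`` instead of a ``TreeBuilder``. ``CollectingTarget`` is a hypothetical class written for this example, and the sketch assumes the parser calls ``comment`` and ``pi`` on any target that defines them.

```python
import xml.etree.ElementTree as ET

class CollectingTarget:
    """Hypothetical parser target that only records comments and PIs."""

    def __init__(self):
        self.comments = []
        self.pis = []

    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def comment(self, text):
        self.comments.append(text)

    def pi(self, target, data):
        self.pis.append((target, data))

    def close(self):
        return self.comments, self.pis

parser = ET.XMLParser(target=CollectingTarget())
parser.feed("<root><!--a--><?t d?></root>")
comments, pis = parser.close()
print(comments, pis)
```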
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
The TreeBuilder and XMLPullParser in xml.etree.ElementTree gained support
2+
for parsing comments and processing instructions.
3+
Patch by Stefan Behnel.
