Commit ab7341d

bpo-36673: Implement comment/PI parsing support for the TreeBuilder in ElementTree.
1 parent 9d062d6 commit ab7341d

6 files changed, +452 -53 lines changed

Doc/library/xml.etree.elementtree.rst

Lines changed: 47 additions & 9 deletions
@@ -523,8 +523,9 @@ Functions
    Parses an XML section into an element tree incrementally, and reports what's
    going on to the user. *source* is a filename or :term:`file object`
    containing XML data. *events* is a sequence of events to report back. The
-   supported events are the strings ``"start"``, ``"end"``, ``"start-ns"`` and
-   ``"end-ns"`` (the "ns" events are used to get detailed namespace
+   supported events are the strings ``"start"``, ``"end"``, ``"comment"``,
+   ``"pi"``, ``"start-ns"`` and ``"end-ns"``
+   (the "ns" events are used to get detailed namespace
    information). If *events* is omitted, only ``"end"`` events are reported.
    *parser* is an optional parser instance. If not given, the standard
    :class:`XMLParser` parser is used. *parser* must be a subclass of
@@ -549,6 +550,10 @@ Functions
    .. deprecated:: 3.4
       The *parser* argument.

+   .. versionchanged:: 3.8
+      The ``comment`` and ``pi`` events were added.
+
+
 .. function:: parse(source, parser=None)

    Parses an XML section into an element tree. *source* is a filename or file
@@ -1021,14 +1026,24 @@ TreeBuilder Objects
 ^^^^^^^^^^^^^^^^^^^


-.. class:: TreeBuilder(element_factory=None)
+.. class:: TreeBuilder(element_factory=None, comment_factory=None, \
+                       pi_factory=None)

    Generic element structure builder. This builder converts a sequence of
    start, data, and end method calls to a well-formed element structure. You
    can use this class to build an element structure using a custom XML parser,
-   or a parser for some other XML-like format. *element_factory*, when given,
-   must be a callable accepting two positional arguments: a tag and
-   a dict of attributes. It is expected to return a new element instance.
+   or a parser for some other XML-like format.
+
+   *element_factory*, when given, must be a callable accepting two positional
+   arguments: a tag and a dict of attributes. It is expected to return a new
+   element instance.
+
+   The *comment_factory* and *pi_factory* functions, when given, should behave
+   like the :func:`Comment` and :func:`ProcessingInstruction` functions to
+   create comments and processing instructions. When not given, no comments
+   or processing instructions will be created. Note that these objects will
+   not currently be appended to the tree when they appear outside of the root
+   element.

    .. method:: close()

@@ -1053,6 +1068,21 @@ TreeBuilder Objects
       Opens a new element. *tag* is the element name. *attrs* is a dictionary
       containing element attributes. Returns the opened element.

+   .. method:: comment(text)
+
+      Adds a comment with the given *text*. If *comment_factory* is
+      :const:`None`, this will just return the text.
+
+      .. versionadded:: 3.8
+
+   .. method:: pi(target, text)
+
+      Adds a processing instruction with the given *target* name and *text*.
+      If *pi_factory* is :const:`None`, this will return a ``(target, text)``
+      tuple.
+
+      .. versionadded:: 3.8
+

    In addition, a custom :class:`TreeBuilder` object can provide the
    following method:
@@ -1150,9 +1180,9 @@ XMLPullParser Objects
    callback target, :class:`XMLPullParser` collects an internal list of parsing
    events and lets the user read from it. *events* is a sequence of events to
    report back. The supported events are the strings ``"start"``, ``"end"``,
-   ``"start-ns"`` and ``"end-ns"`` (the "ns" events are used to get detailed
-   namespace information). If *events* is omitted, only ``"end"`` events are
-   reported.
+   ``"comment"``, ``"pi"``, ``"start-ns"`` and ``"end-ns"`` (the "ns" events
+   are used to get detailed namespace information). If *events* is omitted,
+   only ``"end"`` events are reported.

    .. method:: feed(data)

@@ -1172,6 +1202,10 @@ XMLPullParser Objects
       parser. The iterator yields ``(event, elem)`` pairs, where *event* is a
       string representing the type of event (e.g. ``"end"``) and *elem* is the
       encountered :class:`Element` object.
+      For ``start-ns`` events, the ``elem`` is a tuple ``(prefix, uri)`` naming
+      the declared namespace mapping. For ``end-ns`` events, the ``elem`` is
+      :const:`None`. For ``comment`` events, the second value is the comment
+      text and for ``pi`` events a tuple ``(target, text)``.

       Events provided in a previous call to :meth:`read_events` will not be
       yielded again. Events are consumed from the internal queue only when
@@ -1191,6 +1225,10 @@ XMLPullParser Objects

    .. versionadded:: 3.4

+   .. versionchanged:: 3.8
+      The ``comment`` and ``pi`` events were added.
+
+
 Exceptions
 ^^^^^^^^^^

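The factory hooks documented above can be exercised directly, without a parser. The following is a minimal sketch, assuming a Python version whose `TreeBuilder` accepts `comment_factory` and `pi_factory` as keyword arguments; `pi_as_pair` is a hypothetical custom factory introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Built-in factories: the builder returns whatever the factory creates.
builder = ET.TreeBuilder(comment_factory=ET.Comment, pi_factory=ET.PI)
comment = builder.comment(" a note ")
pi = builder.pi("xml-stylesheet", 'href="style.css"')

# Hypothetical custom factory: any callable taking (target, text) can be used.
def pi_as_pair(target, text):
    return (target.upper(), text)

custom = ET.TreeBuilder(pi_factory=pi_as_pair)
pair = custom.pi("target", "data")

print(comment.text, pi.text, pair)
```

Passing the factories explicitly keeps the example independent of whatever default the builder applies when a factory is omitted.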
Lib/test/test_xml_etree.py

Lines changed: 75 additions & 2 deletions
@@ -1193,6 +1193,9 @@ def _feed(self, parser, data, chunk_size=None):
             for i in range(0, len(data), chunk_size):
                 parser.feed(data[i:i+chunk_size])

+    def assert_events(self, parser, expected):
+        self.assertEqual(list(parser.read_events()), expected)
+
     def assert_event_tags(self, parser, expected):
         events = parser.read_events()
         self.assertEqual([(action, elem.tag) for action, elem in events],
@@ -1275,8 +1278,10 @@ def test_events(self):
         self.assert_event_tags(parser, [])

         parser = ET.XMLPullParser(events=('start', 'end'))
-        self._feed(parser, "<!-- comment -->\n")
-        self.assert_event_tags(parser, [])
+        self._feed(parser, "<!-- text here -->\n")
+        self.assert_events(parser, [])
+
+        parser = ET.XMLPullParser(events=('start', 'end'))
         self._feed(parser, "<root>\n")
         self.assert_event_tags(parser, [('start', 'root')])
         self._feed(parser, "<element key='value'>text</element")
@@ -1313,6 +1318,34 @@ def test_events(self):
         self._feed(parser, "</root>")
         self.assertIsNone(parser.close())

+    def test_events_comment(self):
+        parser = ET.XMLPullParser(events=('start', 'comment', 'end'))
+        self._feed(parser, "<!-- text here -->\n")
+        self.assert_events(parser, [('comment', ' text here ')])
+        self._feed(parser, "<!-- more text here -->\n")
+        self.assert_events(parser, [('comment', ' more text here ')])
+        self._feed(parser, "<root-tag>text")
+        self.assert_event_tags(parser, [('start', 'root-tag')])
+        self._feed(parser, "<!-- inner comment-->\n")
+        self.assert_events(parser, [('comment', ' inner comment')])
+        self._feed(parser, "</root-tag>\n")
+        self.assert_event_tags(parser, [('end', 'root-tag')])
+        self._feed(parser, "<!-- outer comment -->\n")
+        self.assert_events(parser, [('comment', ' outer comment ')])
+
+        parser = ET.XMLPullParser(events=('comment',))
+        self._feed(parser, "<!-- text here -->\n")
+        self.assert_events(parser, [('comment', ' text here ')])
+
+    def test_events_pi(self):
+        parser = ET.XMLPullParser(events=('start', 'pi', 'end'))
+        self._feed(parser, "<?pitarget?>\n")
+        self.assert_events(parser, [('pi', ('pitarget', ''))])
+        parser = ET.XMLPullParser(events=('pi',))
+        self._feed(parser, "<?pitarget some text ?>\n")
+        self.assert_events(parser, [('pi', ('pitarget', 'some text '))])
+
+
     def test_events_sequence(self):
         # Test that events can be some sequence that's not just a tuple or list
         eventset = {'end', 'start'}
@@ -2658,6 +2691,31 @@ class DummyBuilder(BaseDummyBuilder):
         parser.feed(self.sample1)
         self.assertIsNone(parser.close())

+    def test_treebuilder_comment(self):
+        b = ET.TreeBuilder()
+        self.assertEqual(b.comment('ctext'), 'ctext')
+
+        b = ET.TreeBuilder(comment_factory=ET.Comment)
+        self.assertEqual(b.comment('ctext').tag, ET.Comment)
+        self.assertEqual(b.comment('ctext').text, 'ctext')
+
+        b = ET.TreeBuilder(comment_factory=len)
+        self.assertEqual(b.comment('ctext'), len('ctext'))
+
+    def test_treebuilder_pi(self):
+        b = ET.TreeBuilder()
+        self.assertEqual(b.pi('target', None), ('target', None))
+
+        b = ET.TreeBuilder(pi_factory=ET.PI)
+        self.assertEqual(b.pi('target').tag, ET.PI)
+        self.assertEqual(b.pi('target').text, "target")
+        self.assertEqual(b.pi('pitarget', ' text ').tag, ET.PI)
+        self.assertEqual(b.pi('pitarget', ' text ').text, "pitarget  text ")
+
+        b = ET.TreeBuilder(pi_factory=lambda target, text: (len(target), text))
+        self.assertEqual(b.pi('target'), (len('target'), None))
+        self.assertEqual(b.pi('pitarget', ' text '), (len('pitarget'), ' text '))
+
     def test_treebuilder_elementfactory_none(self):
         parser = ET.XMLParser(target=ET.TreeBuilder(element_factory=None))
         parser.feed(self.sample1)
@@ -2678,6 +2736,21 @@ def foobar(self, x):
         e = parser.close()
         self._check_sample1_element(e)

+    def test_subclass_comment_pi(self):
+        class MyTreeBuilder(ET.TreeBuilder):
+            def foobar(self, x):
+                return x * 2
+
+        tb = MyTreeBuilder(comment_factory=ET.Comment, pi_factory=ET.PI)
+        self.assertEqual(tb.foobar(10), 20)
+
+        parser = ET.XMLParser(target=tb)
+        parser.feed(self.sample1)
+        parser.feed('<!-- a comment--><?and a pi?>')
+
+        e = parser.close()
+        self._check_sample1_element(e)
+
     def test_element_factory(self):
         lst = []
         def myfactory(tag, attrib):
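
The pull-parser tests above can be condensed into a small standalone script. This is a sketch rather than part of the commit: `payload_text` is a hypothetical helper, and it deliberately accepts either a plain value (the comment text or a `(target, text)` tuple, as documented in this diff) or an element-like object, since the payload shape may differ between Python releases.

```python
import xml.etree.ElementTree as ET

def payload_text(payload):
    """Normalise a 'comment' or 'pi' event payload to a plain string."""
    if isinstance(payload, str):      # plain comment text
        return payload
    if isinstance(payload, tuple):    # (target, text) pair for a PI
        target, text = payload
        return f"{target} {text}" if text else target
    return payload.text               # element-like object built by a factory

parser = ET.XMLPullParser(events=("start", "comment", "pi", "end"))
parser.feed("<root><!-- note --><?target data?><child/></root>")

seen = []
for event, payload in parser.read_events():
    if event in ("start", "end"):
        seen.append((event, payload.tag))
    else:
        seen.append((event, payload_text(payload)))

print(seen)
```

Requesting `"comment"` and `"pi"` explicitly matters: as the first test hunk shows, a parser created with only `('start', 'end')` reports nothing for a comment.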

Lib/xml/etree/ElementTree.py

Lines changed: 57 additions & 3 deletions
@@ -1374,21 +1374,31 @@ class TreeBuilder:
     *element_factory* is an optional element factory which is called
     to create new Element instances, as necessary.

+    *comment_factory* is a factory to create comments. If not provided,
+    comments will not be inserted into the tree and "comment" pull parser
+    events will only return the plain text.
+
+    *pi_factory* is a factory to create processing instructions. If not
+    provided, PIs will not be inserted into the tree and "pi" pull parser
+    events will only return a (target, text) tuple.
     """
-    def __init__(self, element_factory=None):
+    def __init__(self, element_factory=None, comment_factory=None, pi_factory=None):
         self._data = [] # data collector
         self._elem = [] # element stack
         self._last = None # last element
+        self._root = None # root element
         self._tail = None # true if we're after an end tag
+        self._comment_factory = comment_factory
+        self._pi_factory = pi_factory
         if element_factory is None:
             element_factory = Element
         self._factory = element_factory

     def close(self):
         """Flush builder buffers and return toplevel document Element."""
         assert len(self._elem) == 0, "missing end tags"
-        assert self._last is not None, "missing toplevel element"
-        return self._last
+        assert self._root is not None, "missing toplevel element"
+        return self._root

     def _flush(self):
         if self._data:
@@ -1417,6 +1427,8 @@ def start(self, tag, attrs):
         self._last = elem = self._factory(tag, attrs)
         if self._elem:
             self._elem[-1].append(elem)
+        elif self._root is None:
+            self._root = elem
         self._elem.append(elem)
         self._tail = 0
         return elem
@@ -1435,6 +1447,39 @@ def end(self, tag):
         self._tail = 1
         return self._last

+    def comment(self, text):
+        """Create a comment using the comment_factory.
+
+        If no factory is provided, comments are ignored
+        and the text returned as is.
+
+        *text* is the text of the comment.
+        """
+        if self._comment_factory is None:
+            return text
+        return self._handle_single(self._comment_factory, text)
+
+    def pi(self, target, text=None):
+        """Create a processing instruction using the pi_factory.
+
+        If no factory is provided, PIs are ignored and a (target, text)
+        tuple is returned.
+
+        *target* is the target name of the processing instruction.
+        *text* is the data of the processing instruction, or ''.
+        """
+        if self._pi_factory is None:
+            return (target, text)
+        return self._handle_single(self._pi_factory, target, text)
+
+    def _handle_single(self, factory, *args):
+        self._flush()
+        self._last = elem = factory(*args)
+        if self._elem:
+            self._elem[-1].append(elem)
+        self._tail = 1
+        return elem
+

 # also see ElementTree and TreeBuilder
 class XMLParser:
@@ -1519,6 +1564,15 @@ def handler(prefix, uri, event=event_name, append=append):
                 def handler(prefix, event=event_name, append=append):
                     append((event, None))
                 parser.EndNamespaceDeclHandler = handler
+            elif event_name == 'comment':
+                def handler(text, event=event_name, append=append, self=self):
+                    append((event, self.target.comment(text)))
+                parser.CommentHandler = handler
+            elif event_name == 'pi':
+                def handler(pi_target, data, event=event_name, append=append,
+                            self=self):
+                    append((event, self.target.pi(pi_target, data)))
+                parser.ProcessingInstructionHandler = handler
             else:
                 raise ValueError("unknown event %r" % event_name)

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+The TreeBuilder and XMLPullParser in xml.etree.ElementTree gained support
+for parsing comments and processing instructions.
+Patch by Stefan Behnel.
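
The reason `close()` switches from `self._last` to `self._root` in this commit can be seen in isolation: once comments and PIs pass through the builder, the most recently created node is no longer guaranteed to be the document element. The following is an illustrative sketch only. `MiniBuilder` is a hypothetical, stripped-down mirror of the patched bookkeeping (text and tail handling are omitted), not the library class.

```python
import xml.etree.ElementTree as ET

class MiniBuilder:
    """Hypothetical, reduced mirror of the patched TreeBuilder bookkeeping."""

    def __init__(self, comment_factory=None):
        self._elem = []      # stack of currently open elements
        self._last = None    # most recently created node
        self._root = None    # first top-level element seen
        self._comment_factory = comment_factory

    def start(self, tag, attrs):
        self._last = elem = ET.Element(tag, attrs)
        if self._elem:
            self._elem[-1].append(elem)
        elif self._root is None:
            self._root = elem            # remember the document element once
        self._elem.append(elem)
        return elem

    def end(self, tag):
        self._last = self._elem.pop()
        return self._last

    def comment(self, text):
        if self._comment_factory is None:
            return text                  # no factory: hand the text back unchanged
        self._last = elem = self._comment_factory(text)
        if self._elem:                   # attached only while an element is open
            self._elem[-1].append(elem)
        return elem

    def close(self):
        assert not self._elem, "missing end tags"
        assert self._root is not None, "missing toplevel element"
        return self._root

builder = MiniBuilder(comment_factory=ET.Comment)
builder.start("root", {})
builder.comment(" inside ")
builder.end("root")
trailing = builder.comment(" after root ")   # created, but not attached
root = builder.close()
print(root.tag, len(root))
```

In this sketch, returning `_last` from `close()` would hand back the trailing comment, which is why the root is tracked separately.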
