Skip to content

bpo-36673: Implement comment/PI parsing support for the TreeBuilder in ElementTree. #12883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 1, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 53 additions & 12 deletions Doc/library/xml.etree.elementtree.rst
Original file line number Diff line number Diff line change
Expand Up @@ -523,8 +523,9 @@ Functions
Parses an XML section into an element tree incrementally, and reports what's
going on to the user. *source* is a filename or :term:`file object`
containing XML data. *events* is a sequence of events to report back. The
supported events are the strings ``"start"``, ``"end"``, ``"start-ns"`` and
``"end-ns"`` (the "ns" events are used to get detailed namespace
supported events are the strings ``"start"``, ``"end"``, ``"comment"``,
``"pi"``, ``"start-ns"`` and ``"end-ns"``
(the "ns" events are used to get detailed namespace
information). If *events* is omitted, only ``"end"`` events are reported.
*parser* is an optional parser instance. If not given, the standard
:class:`XMLParser` parser is used. *parser* must be a subclass of
Expand All @@ -549,6 +550,10 @@ Functions
.. deprecated:: 3.4
The *parser* argument.

.. versionchanged:: 3.8
The ``comment`` and ``pi`` events were added.


.. function:: parse(source, parser=None)

Parses an XML section into an element tree. *source* is a filename or file
Expand Down Expand Up @@ -1021,14 +1026,24 @@ TreeBuilder Objects
^^^^^^^^^^^^^^^^^^^


.. class:: TreeBuilder(element_factory=None)
.. class:: TreeBuilder(element_factory=None, *, comment_factory=None, \
pi_factory=None, insert_comments=False, insert_pis=False)

Generic element structure builder. This builder converts a sequence of
start, data, and end method calls to a well-formed element structure. You
can use this class to build an element structure using a custom XML parser,
or a parser for some other XML-like format. *element_factory*, when given,
must be a callable accepting two positional arguments: a tag and
a dict of attributes. It is expected to return a new element instance.
start, data, end, comment and pi method calls to a well-formed element
structure. You can use this class to build an element structure using
a custom XML parser, or a parser for some other XML-like format.

*element_factory*, when given, must be a callable accepting two positional
arguments: a tag and a dict of attributes. It is expected to return a new
element instance.

The *comment_factory* and *pi_factory* functions, when given, should behave
like the :func:`Comment` and :func:`ProcessingInstruction` functions to
create comments and processing instructions. When not given, the default
factories will be used. When *insert_comments* and/or *insert_pis* is true,
comments/pis will be inserted into the tree if they appear within the root
element (but not outside of it).

.. method:: close()

Expand All @@ -1054,6 +1069,22 @@ TreeBuilder Objects
containing element attributes. Returns the opened element.


.. method:: comment(text)

Creates a comment with the given *text*. If ``insert_comments`` is true,
this will also add it to the tree.

.. versionadded:: 3.8


.. method:: pi(target, text)

Creates a comment with the given *target* name and *text*. If
``insert_pis`` is true, this will also add it to the tree.

.. versionadded:: 3.8


In addition, a custom :class:`TreeBuilder` object can provide the
following method:

Expand Down Expand Up @@ -1150,9 +1181,9 @@ XMLPullParser Objects
callback target, :class:`XMLPullParser` collects an internal list of parsing
events and lets the user read from it. *events* is a sequence of events to
report back. The supported events are the strings ``"start"``, ``"end"``,
``"start-ns"`` and ``"end-ns"`` (the "ns" events are used to get detailed
namespace information). If *events* is omitted, only ``"end"`` events are
reported.
``"comment"``, ``"pi"``, ``"start-ns"`` and ``"end-ns"`` (the "ns" events
are used to get detailed namespace information). If *events* is omitted,
only ``"end"`` events are reported.

.. method:: feed(data)

Expand All @@ -1171,7 +1202,13 @@ XMLPullParser Objects
data fed to the
parser. The iterator yields ``(event, elem)`` pairs, where *event* is a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename elem to data or payload or whatever (since it is not always an element) and document its meaning uniformly for all events (maybe using an unordered list)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it's not ideal, but with the change that I'm currently working up, it's going to be an Element-like object (Element, Comment, PI) most of the time. For many use cases, it's really only an only Element, especially in the default end-only-events case. Also, elem, i.e. element, is a fairly generic term, even though its meaning is tighter in the context of ElementTree.

I've also thought of using a more structured way of presenting it, and I think I'll do that.

string representing the type of event (e.g. ``"end"``) and *elem* is the
encountered :class:`Element` object.
encountered :class:`Element` object, or other context value as follows.

* ``start``, ``end``: the current Element.
* ``comment``, ``pi``: the current comment / processing instruction
* ``start-ns``: a tuple ``(prefix, uri)`` naming the declared namespace
mapping.
* ``end-ns``: :const:`None` (this may change in a future version)

Events provided in a previous call to :meth:`read_events` will not be
yielded again. Events are consumed from the internal queue only when
Expand All @@ -1191,6 +1228,10 @@ XMLPullParser Objects

.. versionadded:: 3.4

.. versionchanged:: 3.8
The ``comment`` and ``pi`` events were added.


Exceptions
^^^^^^^^^^

Expand Down
90 changes: 87 additions & 3 deletions Lib/test/test_xml_etree.py
Original file line number Diff line number Diff line change
Expand Up @@ -1193,6 +1193,12 @@ def _feed(self, parser, data, chunk_size=None):
for i in range(0, len(data), chunk_size):
parser.feed(data[i:i+chunk_size])

def assert_events(self, parser, expected):
self.assertEqual(
[(event, (elem.tag, elem.text))
for event, elem in parser.read_events()],
expected)

def assert_event_tags(self, parser, expected):
events = parser.read_events()
self.assertEqual([(action, elem.tag) for action, elem in events],
Expand Down Expand Up @@ -1275,8 +1281,10 @@ def test_events(self):
self.assert_event_tags(parser, [])

parser = ET.XMLPullParser(events=('start', 'end'))
self._feed(parser, "<!-- comment -->\n")
self.assert_event_tags(parser, [])
self._feed(parser, "<!-- text here -->\n")
self.assert_events(parser, [])

parser = ET.XMLPullParser(events=('start', 'end'))
self._feed(parser, "<root>\n")
self.assert_event_tags(parser, [('start', 'root')])
self._feed(parser, "<element key='value'>text</element")
Expand Down Expand Up @@ -1313,6 +1321,33 @@ def test_events(self):
self._feed(parser, "</root>")
self.assertIsNone(parser.close())

def test_events_comment(self):
parser = ET.XMLPullParser(events=('start', 'comment', 'end'))
self._feed(parser, "<!-- text here -->\n")
self.assert_events(parser, [('comment', (ET.Comment, ' text here '))])
self._feed(parser, "<!-- more text here -->\n")
self.assert_events(parser, [('comment', (ET.Comment, ' more text here '))])
self._feed(parser, "<root-tag>text")
self.assert_event_tags(parser, [('start', 'root-tag')])
self._feed(parser, "<!-- inner comment-->\n")
self.assert_events(parser, [('comment', (ET.Comment, ' inner comment'))])
self._feed(parser, "</root-tag>\n")
self.assert_event_tags(parser, [('end', 'root-tag')])
self._feed(parser, "<!-- outer comment -->\n")
self.assert_events(parser, [('comment', (ET.Comment, ' outer comment '))])

parser = ET.XMLPullParser(events=('comment',))
self._feed(parser, "<!-- text here -->\n")
self.assert_events(parser, [('comment', (ET.Comment, ' text here '))])

def test_events_pi(self):
parser = ET.XMLPullParser(events=('start', 'pi', 'end'))
self._feed(parser, "<?pitarget?>\n")
self.assert_events(parser, [('pi', (ET.PI, 'pitarget'))])
parser = ET.XMLPullParser(events=('pi',))
self._feed(parser, "<?pitarget some text ?>\n")
self.assert_events(parser, [('pi', (ET.PI, 'pitarget some text '))])

def test_events_sequence(self):
# Test that events can be some sequence that's not just a tuple or list
eventset = {'end', 'start'}
Expand All @@ -1332,7 +1367,6 @@ def __next__(self):
self._feed(parser, "<foo>bar</foo>")
self.assert_event_tags(parser, [('start', 'foo'), ('end', 'foo')])


def test_unknown_event(self):
with self.assertRaises(ValueError):
ET.XMLPullParser(events=('start', 'end', 'bogus'))
Expand Down Expand Up @@ -2658,6 +2692,33 @@ class DummyBuilder(BaseDummyBuilder):
parser.feed(self.sample1)
self.assertIsNone(parser.close())

def test_treebuilder_comment(self):
b = ET.TreeBuilder()
self.assertEqual(b.comment('ctext').tag, ET.Comment)
self.assertEqual(b.comment('ctext').text, 'ctext')

b = ET.TreeBuilder(comment_factory=ET.Comment)
self.assertEqual(b.comment('ctext').tag, ET.Comment)
self.assertEqual(b.comment('ctext').text, 'ctext')

b = ET.TreeBuilder(comment_factory=len)
self.assertEqual(b.comment('ctext'), len('ctext'))

def test_treebuilder_pi(self):
b = ET.TreeBuilder()
self.assertEqual(b.pi('target', None).tag, ET.PI)
self.assertEqual(b.pi('target', None).text, 'target')

b = ET.TreeBuilder(pi_factory=ET.PI)
self.assertEqual(b.pi('target').tag, ET.PI)
self.assertEqual(b.pi('target').text, "target")
self.assertEqual(b.pi('pitarget', ' text ').tag, ET.PI)
self.assertEqual(b.pi('pitarget', ' text ').text, "pitarget text ")

b = ET.TreeBuilder(pi_factory=lambda target, text: (len(target), text))
self.assertEqual(b.pi('target'), (len('target'), None))
self.assertEqual(b.pi('pitarget', ' text '), (len('pitarget'), ' text '))

def test_treebuilder_elementfactory_none(self):
parser = ET.XMLParser(target=ET.TreeBuilder(element_factory=None))
parser.feed(self.sample1)
Expand All @@ -2678,6 +2739,21 @@ def foobar(self, x):
e = parser.close()
self._check_sample1_element(e)

def test_subclass_comment_pi(self):
class MyTreeBuilder(ET.TreeBuilder):
def foobar(self, x):
return x * 2

tb = MyTreeBuilder(comment_factory=ET.Comment, pi_factory=ET.PI)
self.assertEqual(tb.foobar(10), 20)

parser = ET.XMLParser(target=tb)
parser.feed(self.sample1)
parser.feed('<!-- a comment--><?and a pi?>')

e = parser.close()
self._check_sample1_element(e)

def test_element_factory(self):
lst = []
def myfactory(tag, attrib):
Expand Down Expand Up @@ -3335,6 +3411,12 @@ def test_main(module=None):
# Copy the path cache (should be empty)
path_cache = ElementPath._cache
ElementPath._cache = path_cache.copy()
# Align the Comment/PI factories.
if hasattr(ET, '_set_factories'):
old_factories = ET._set_factories(ET.Comment, ET.PI)
else:
old_factories = None

try:
support.run_unittest(*test_classes)
finally:
Expand All @@ -3343,6 +3425,8 @@ def test_main(module=None):
nsmap.clear()
nsmap.update(nsmap_copy)
ElementPath._cache = path_cache
if old_factories is not None:
ET._set_factories(*old_factories)
# don't interfere with subsequent tests
ET = pyET = None

Expand Down
67 changes: 63 additions & 4 deletions Lib/xml/etree/ElementTree.py
Original file line number Diff line number Diff line change
Expand Up @@ -1374,21 +1374,39 @@ class TreeBuilder:
*element_factory* is an optional element factory which is called
to create new Element instances, as necessary.

*comment_factory* is a factory to create comments to be used instead of
the standard factory. If *insert_comments* is false (the default),
comments will not be inserted into the tree.

*pi_factory* is a factory to create processing instructions to be used
instead of the standard factory. If *insert_pis* is false (the default),
processing instructions will not be inserted into the tree.
"""
def __init__(self, element_factory=None):
def __init__(self, element_factory=None, *,
comment_factory=None, pi_factory=None,
insert_comments=False, insert_pis=False):
self._data = [] # data collector
self._elem = [] # element stack
self._last = None # last element
self._root = None # root element
self._tail = None # true if we're after an end tag
if comment_factory is None:
comment_factory = Comment
self._comment_factory = comment_factory
self.insert_comments = insert_comments
if pi_factory is None:
pi_factory = ProcessingInstruction
self._pi_factory = pi_factory
self.insert_pis = insert_pis
if element_factory is None:
element_factory = Element
self._factory = element_factory

def close(self):
"""Flush builder buffers and return toplevel document Element."""
assert len(self._elem) == 0, "missing end tags"
assert self._last is not None, "missing toplevel element"
return self._last
assert self._root is not None, "missing toplevel element"
return self._root

def _flush(self):
if self._data:
Expand Down Expand Up @@ -1417,6 +1435,8 @@ def start(self, tag, attrs):
self._last = elem = self._factory(tag, attrs)
if self._elem:
self._elem[-1].append(elem)
elif self._root is None:
self._root = elem
self._elem.append(elem)
self._tail = 0
return elem
Expand All @@ -1435,6 +1455,33 @@ def end(self, tag):
self._tail = 1
return self._last

def comment(self, text):
"""Create a comment using the comment_factory.

*text* is the text of the comment.
"""
return self._handle_single(
self._comment_factory, self.insert_comments, text)

def pi(self, target, text=None):
"""Create a processing instruction using the pi_factory.

*target* is the target name of the processing instruction.
*text* is the data of the processing instruction, or ''.
"""
return self._handle_single(
self._pi_factory, self.insert_pis, target, text)

def _handle_single(self, factory, insert, *args):
elem = factory(*args)
if insert:
self._flush()
self._last = elem
if self._elem:
self._elem[-1].append(elem)
self._tail = 1
return elem


# also see ElementTree and TreeBuilder
class XMLParser:
Expand Down Expand Up @@ -1519,6 +1566,15 @@ def handler(prefix, uri, event=event_name, append=append):
def handler(prefix, event=event_name, append=append):
append((event, None))
parser.EndNamespaceDeclHandler = handler
elif event_name == 'comment':
def handler(text, event=event_name, append=append, self=self):
append((event, self.target.comment(text)))
parser.CommentHandler = handler
elif event_name == 'pi':
def handler(pi_target, data, event=event_name, append=append,
self=self):
append((event, self.target.pi(pi_target, data)))
parser.ProcessingInstructionHandler = handler
else:
raise ValueError("unknown event %r" % event_name)

Expand Down Expand Up @@ -1640,7 +1696,10 @@ def close(self):
# (see tests)
_Element_Py = Element

# Element, SubElement, ParseError, TreeBuilder, XMLParser
# Element, SubElement, ParseError, TreeBuilder, XMLParser, _set_factories
from _elementtree import *
from _elementtree import _set_factories
except ImportError:
pass
else:
_set_factories(Comment, ProcessingInstruction)
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
The TreeBuilder and XMLPullParser in xml.etree.ElementTree gained support
for parsing comments and processing instructions.
Patch by Stefan Behnel.
Loading