Providing Input to XMLUnit

Stefan Bodewig edited this page Apr 10, 2023 · 16 revisions

Source and ISource

All core parts of XMLUnit use a single abstraction for "pieces of XML" they are supposed to work on. For Java this is javax.xml.transform.Source and for .NET we've created Org.XmlUnit.ISource which basically adds a wrapper around an XmlReader.

For Java, many implementations of this interface are part of the Java class library. For .NET we've added the corresponding classes:

  • ReaderSource - just wraps an existing XmlReader
  • DOMSource - creates a Source from an XmlNode
  • StreamSource - creates a Source from a TextReader, Stream or a string holding a URI
  • LinqSource - creates a Source from an XNode

At the time of this writing there is no XML-Serialization based equivalent of JAXBSource for .NET.
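
On the Java side no XMLUnit class is needed to obtain a Source. The following is a minimal sketch that uses only JDK classes; the class name SourceExamples and the sample document are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SourceExamples {
    // Hypothetical sample document used only for this illustration.
    static final String XML = "<element>foo</element>";

    // A Source backed by character data.
    static Source fromString() {
        return new StreamSource(new StringReader(XML));
    }

    // A Source backed by an in-memory DOM tree.
    static Source fromDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return new DOMSource(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString().getClass().getName());
        System.out.println(fromDom().getClass().getName());
    }
}
```

Either object can be handed to any API that expects a javax.xml.transform.Source.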

To make it easier to create instances of Source or ISource there is a builder that provides a fluent API.

CommentLessSource

CommentLessSource is a decorator of a different source and provides XML that consists of the original source's content with all comments removed.

Use this wrapper if you want XMLUnit to ignore comments.

This class is used under the covers if you tell DiffBuilder to ignore comments.

WhitespaceStrippedSource

WhitespaceStrippedSource is a decorator of a different source that removes all empty text nodes and trims the remaining text nodes.

If you only want to remove all "element content whitespace", i.e. text content between XML elements that is just an artifact of "pretty printing" XML then you should use ElementContentWhitespaceStrippedSource instead.

Examples

Empty text nodes are removed:

<element>
</element>

becomes

<element></element>

Text Nodes are stripped:

<element>
  foo
</element>

becomes

<element>foo</element>

If the XML content has been created in memory rather than deserialized from an external source, it could contain adjacent Text nodes so that

<element>
  foo
  bar
</element>

could become

<element>foobar</element>

or

<element>
foo
bar
</element>

depending on how the document has been structured. To get more control, normalize the input (using Document.normalize() or XmlDocument.Normalize()) before wrapping it in a WhitespaceStrippedSource, or add a NormalizedSource wrapper.
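
To see how adjacent Text nodes come about, here is a small sketch that uses only the JDK's DOM API. The class name AdjacentTextNodes and the node contents are assumptions made for this example. It builds an element with two Text children and then merges them with Document.normalize().

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AdjacentTextNodes {
    // Builds <element> with two adjacent Text children, as code that
    // assembles a DOM tree programmatically might do.
    static Element build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element element = doc.createElement("element");
            doc.appendChild(element);
            element.appendChild(doc.createTextNode("\n  foo"));
            element.appendChild(doc.createTextNode("\n  bar\n"));
            return element;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element element = build();
        // Number of child nodes before normalization (two Text nodes).
        System.out.println(element.getChildNodes().getLength());
        // Merge adjacent Text nodes throughout the document.
        element.getOwnerDocument().normalize();
        // Number of child nodes after normalization (one merged Text node).
        System.out.println(element.getChildNodes().getLength());
    }
}
```

Once the two nodes are merged, trimming applies to the combined text as a whole rather than to each fragment separately.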

XmlWhitespaceStrippedSource

WhitespaceStrippedSource uses String.trim/Trim to "trim" text nodes. For .NET this "trims" all characters considered whitespace by the Char class, which is not the same definition XML uses. For XML only line breaks, tabs and spaces are whitespace. If you rely on XML's definition of whitespace, WhitespaceStrippedSource may remove more than you want. Therefore XMLUnit.NET 2.10.0 introduced a new type XmlWhitespaceStrippedSource that behaves like WhitespaceStrippedSource but uses XML's interpretation of whitespace.

For Java the trim method only removes characters that are also considered whitespace by XML so there is no need for an XmlWhitespaceStrippedSource.
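
The Java behaviour can be checked with a few lines of plain JDK code. In this sketch the class name TrimDemo and the sample strings are made up. The no-break space U+00A0 stands in for a character that is not XML whitespace.

```java
class TrimDemo {
    // U+00A0 (no-break space) is not one of XML's whitespace characters.
    static final String NBSP_PADDED = "\u00A0foo\u00A0";

    // Space, tab, carriage return and line feed are XML whitespace.
    static final String XML_WS_PADDED = " \t\r\nfoo\n ";

    public static void main(String[] args) {
        // All leading and trailing XML whitespace is removed.
        System.out.println(XML_WS_PADDED.trim().equals("foo"));
        // The no-break spaces survive: String.trim only strips characters
        // whose code point is at most U+0020.
        System.out.println(NBSP_PADDED.trim().equals(NBSP_PADDED));
    }
}
```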

WhitespaceNormalizedSource

WhitespaceNormalizedSource is a decorator of a different source that replaces all whitespace characters found in Text nodes with Space characters and collapses consecutive whitespace characters into a single Space.

Examples

<element>a

    b
</element>

becomes

<element>a b </element>
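
The text transformation in this example can be approximated with a single regular expression. This is only a sketch of the effect on one Text node's content, not XMLUnit's implementation; the class and method names are hypothetical.

```java
class WhitespaceNormalization {
    // Replaces every run of XML whitespace (space, tab, CR, LF)
    // with a single space character.
    static String normalize(String text) {
        return text.replaceAll("[ \\t\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        // Text content of the <element> in the example above.
        String content = "a\n\n    b\n";
        System.out.println("[" + normalize(content) + "]");
    }
}
```

Note that the trailing line break is collapsed to a space rather than removed, which matches the trailing space in the example result.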

NormalizedSource

NormalizedSource performs XML normalization on the wrapped document. This means adjacent text nodes are merged into single nodes and empty Text nodes are removed (recursively). For Java, when wrapping a Document rather than a Node, additional normalizations may be performed. See XmlNode.Normalize for .NET and Node#normalize as well as Document#normalizeDocument for Java.

When reading documents a parser usually puts the document into normalized form anyway. You will only need to perform XML normalization on DOM trees you have created programmatically.

ElementContentWhitespaceStrippedSource

ElementContentWhitespaceStrippedSource is a decorator of a different source that removes all text nodes solely consisting of whitespace.

The main use of this decorator is to remove all "element content whitespace", i.e. text content between XML elements that is just an artifact of "pretty printing" XML.

This class has been added with XMLUnit 2.6.0.

Examples

Empty text nodes are removed:

<element>
</element>

becomes

<element></element>

Text Nodes are not stripped:

<element>
  foo
</element>

remains

<element>
  foo
</element>
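
The two examples can be reproduced with a rough JDK-only sketch that removes whitespace-only Text nodes from a DOM tree. It illustrates the effect described here and is not the code XMLUnit uses; the class ElementContentWhitespace and its methods are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementContentWhitespace {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively removes Text children that contain nothing but whitespace.
    // Text nodes with other content are left exactly as they are.
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<element>\n</element>");
        strip(doc);
        System.out.println(doc.getDocumentElement().hasChildNodes());
    }
}
```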

InputBuilder

The helper class Input provides factory methods that return an Input.Builder, which in turn creates Source instances.

Source source = Input.fromFile("file:/..../test.xml").build();

or with XSL transformations:

Source source = Input.byTransforming(Input.fromFile("file:/..../test.xml"))
		.withStylesheet(Input.fromFile("file:/..../test.xsl"))
		.build();

In .NET the code examples are very similar; see the API documentation:
Java: http://www.xmlunit.org/api/java/master/org/xmlunit/builder/Input.html
.NET: http://www.xmlunit.org/api/net/master/Org.XmlUnit.Builder/Input.html

Input.from(Object)

A special case is the helper method Input.from(Object). This generic method creates a Builder instance depending on the type of the given Object:

| Java type | .NET type | Description |
| --- | --- | --- |
| org.xmlunit.builder.Input.Builder | Org.XmlUnit.Builder.Input.IBuilder | Builder to create an XML-Source. |
| javax.xml.transform.Source | Org.XmlUnit.ISource | XML-Source |
| org.w3c.dom.Document | System.Xml.XmlDocument | DOM Document |
| org.w3c.dom.Node | System.Xml.XmlNode | DOM Node |
| - | System.Xml.Linq.XDocument | Linq Document |
| - | System.Xml.Linq.XNode | Linq Node |
| byte[] | byte[] | byte[] which is an XML-Content. |
| String | string | String which is an XML-Content. |
| java.io.File | - | File which contains XML. |
| java.net.URL | - | URL to an XML |
| java.net.URI | System.Uri | URI to an XML |
| java.io.InputStream | System.IO.Stream | Stream from an XML. |
| java.nio.channels.ReadableByteChannel | System.IO.TextReader | ReadableByteChannel or TextReader of an XML |
| A Jaxb Object | - | Object which can be transformed to XML by javax.xml.bind.JAXB.marshal(...) |

This method simplifies the API of DiffBuilder and CompareMatcher which can accept nearly any Object as input to generate a valid Source.

XXE Prevention

Whenever you parse XML there is the danger of being vulnerable to XML External Entity Processing - XXE for short.

XMLUnit for Java

When passing input to XMLUnit the input is transformed to a DOM document with the help of a DocumentBuilder most of the time. Prior to XMLUnit for Java 2.6.0 the DocumentBuilder used by default was not configured to prevent XXE, as Java's defaults are vulnerable. Starting with XMLUnit 2.6.0 the default DocumentBuilder is configured according to OWASP's XXE Prevention Cheat Sheet.

This means if you want to protect yourself against XXE and you use a version of XMLUnit prior to 2.6.0 you have to explicitly set a DocumentBuilderFactory that is configured properly. Likewise if you rely on DTD loading or expansion of external entities you must provide an explicit DocumentBuilderFactory when using XMLUnit 2.6.0 or later.
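
As an illustration of what "configured properly" can look like, the sketch below hardens a DocumentBuilderFactory using plain JDK calls in the spirit of OWASP's recommendations. The class name XxeSafeFactory and the exact selection of settings are assumptions made for this example, and how the factory is then handed to XMLUnit is not shown here.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XxeSafeFactory {
    // Creates a factory that refuses documents containing a DOCTYPE,
    // which rules out external entities altogether.
    static DocumentBuilderFactory create() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            dbf.setXIncludeAware(false);
            dbf.setExpandEntityReferences(false);
            return dbf;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if parsing a document with a DOCTYPE fails.
    static boolean rejectsDoctype() {
        String xml =
            "<!DOCTYPE element [<!ENTITY e \"x\">]><element>&e;</element>";
        try {
            create().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXException e) {
            return true;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectsDoctype());
    }
}
```

A factory configured this way rejects every document that declares a DTD, so it is not suitable if you rely on DTD loading or entity expansion.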

If you use the legacy module, XXE prevention is disabled by default. Starting with XMLUnit 2.6.0 the XMLUnit class has a new setEnableXXEPrevention method that can be used to enable it.

XMLUnit.NET

When using .NET 4.5.2 or newer the default settings used by XMLUnit.NET have always been safe according to OWASP's XXE Prevention Cheat Sheet. Prior to XMLUnit.NET 2.6.0 there were a few places where XmlDocument was used without explicitly disabling the XmlResolver, which means these places were vulnerable.

If you rely on XmlDocument loading external entities you will need to provide an XmlResolver of your own starting with XMLUnit.NET 2.6.0.
