Jackson parsing fails when using an UTF-8 character for link masking #705

Closed
odrotbohm opened this issue Nov 20, 2020 · 4 comments

@odrotbohm
Member

odrotbohm commented Nov 20, 2020

When, for example, the UTF-8 ellipsis character (…) is passed into maskLinks(…), the execution fails on Windows.

org.springframework.restdocs.snippet.ModelCreationException:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x85

I haven't been able to obtain a more complete stack trace yet but will investigate further.

Update: I managed to obtain more of the stack trace:

at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1851)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:707)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3601)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3597)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2540)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2466)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:297)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:670)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:869)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:652)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:869)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:652)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:540)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:377)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:29)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4526)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3529)
    at org.springframework.restdocs.hypermedia.AbstractJsonLinkExtractor.extractLinks(AbstractJsonLinkExtractor.java:39)
    at org.springframework.restdocs.hypermedia.ContentTypeLinkExtractor.extractLinks(ContentTypeLinkExtractor.java:52)
    at org.springframework.restdocs.hypermedia.LinksSnippet.createModel(LinksSnippet.java:122)

Also, tweaking AbstractJsonLinkExtractor.extractLinks(…) to use ….getContentAsString() rather than ….getContent() seems to work around the error, as it replaces the mangled characters with the replacement character (?). That is, it looks like something is platform-encoding the content byte array.
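
This is only a guess at why the String-based accessor hides the problem. A minimal sketch, assuming the content byte array was produced with the platform default charset (windows-1252, where U+2026 becomes the single byte 0x85) and is then decoded leniently as UTF-8:

import java.nio.charset.StandardCharsets;

public class ReplacementCharacterDemo {

    public static void main(String[] args) {
        // Hypothetical stand-in for the mangled response content: the ellipsis
        // encoded with windows-1252 is the single byte 0x85, which is not valid UTF-8.
        byte[] platformEncoded = { (byte) 0x85 };
        // A lenient String decode does not throw; the malformed byte is substituted
        // with U+FFFD, the replacement character, so the parse error disappears
        // while the masked content is still mangled.
        String decoded = new String(platformEncoded, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0)); // 65533 (U+FFFD)
    }
}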

odrotbohm added a commit to quarano/quarano-application that referenced this issue Nov 20, 2020
The … character seems to break the JSON parsing on Windows, as reported in Spring REST Docs [0]. We temporarily move off the dedicated character in favor of three dots.

[0] spring-projects/spring-restdocs#705.
@wilkinsona
Member

Some experimentation suggests that this isn't a REST Docs problem and that the … character is corrupted before it gets to REST Docs. I think the problem occurs when you have UTF-8 encoded source that's compiled and then run on a JVM using a different encoding (most likely CP1252 on Windows).

The following UTF-8-encoded source appears to be sufficient to reproduce the problem:

System.out.println("…");
System.out.println("\u2026");

When run on a JVM using CP1252 as its default encoding, the following output is produced:

�
�

Switching the JVM's default encoding to UTF-8 produces the expected output:

…
…

@odrotbohm Can you try your app with the JVM running the tests configured to use UTF-8 as its default encoding (-Dfile.encoding=UTF-8)?
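
For convenience, a self-contained variant of that snippet (a sketch; the class name is arbitrary, and Charset.defaultCharset() is added to show which encoding is in effect):

public class EllipsisRepro {

    public static void main(String[] args) {
        // Prints the JVM's default charset (e.g. windows-1252 or UTF-8). As described
        // above, the two lines below only come out as the ellipsis when this reports
        // UTF-8, e.g. when the test JVM is started with -Dfile.encoding=UTF-8.
        System.out.println(java.nio.charset.Charset.defaultCharset());
        System.out.println("…");
        System.out.println("\u2026");
    }
}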

@odrotbohm
Member Author

Hm, to me it looks like PatternReplacingContentModifier is calling String.getBytes() on the replacement result, which uses the system/JVM default encoding and thus, on Windows, produces an ISO-8859-1-encoded byte array. Replacing that call with ….getBytes(StandardCharsets.UTF_8) produces a UTF-8-encoded byte array in the first place and avoids the exception. Ideally, I guess, the charset provided by the response should be preferred, with UTF-8 as the fallback in case none is given.
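
A minimal sketch of that mismatch, assuming the default charset is CP1252/windows-1252 as mentioned above (where U+2026 encodes to the single 0x85 byte that Jackson reports):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {

    public static void main(String[] args) {
        String masked = "\u2026"; // the ellipsis used for link masking
        // What String.getBytes() effectively does on such a JVM: a single byte, 0x85,
        // which Jackson's UTF-8 parser then rejects as an invalid start byte.
        byte[] platformBytes = masked.getBytes(Charset.forName("windows-1252"));
        System.out.printf("windows-1252: %02x%n", platformBytes[0]);
        // Encoding explicitly with UTF-8 yields the valid three-byte sequence E2 80 A6.
        byte[] utf8Bytes = masked.getBytes(StandardCharsets.UTF_8);
        System.out.printf("UTF-8: %02x %02x %02x%n", utf8Bytes[0], utf8Bytes[1], utf8Bytes[2]);
    }
}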

@wilkinsona
Member

wilkinsona commented Jan 8, 2021

I've spent some time in a Windows VM and I think I have a more complete understanding of the problem.

Replacing that call with ….getBytes(StandardCharsets.UTF_8) produces a UTF-8-encoded byte array in the first place and avoids the exception

Yeah, changes along these lines do appear to fix the problem:

@@ -16,6 +16,8 @@
 
 package org.springframework.restdocs.operation.preprocess;
 
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
@@ -47,13 +49,8 @@ class PatternReplacingContentModifier implements ContentModifier {
 
        @Override
        public byte[] modifyContent(byte[] content, MediaType contentType) {
-               String original;
-               if (contentType != null && contentType.getCharset() != null) {
-                       original = new String(content, contentType.getCharset());
-               }
-               else {
-                       original = new String(content);
-               }
+               Charset charset = (contentType != null && contentType.getCharset() != null) ? contentType.getCharset() : StandardCharsets.UTF_8;
+               String original = new String(content, charset);
                Matcher matcher = this.pattern.matcher(original);
                StringBuilder builder = new StringBuilder();
                int previous = 0;
@@ -73,7 +70,7 @@ class PatternReplacingContentModifier implements ContentModifier {
                if (previous < original.length()) {
                        builder.append(original.substring(previous));
                }
-               return builder.toString().getBytes();
+               return builder.toString().getBytes(charset);
        }
 
 }

That said, I'm not sure this is the right change to make. Switching the fallback from the JVM's default encoding to UTF-8 will fix this problem, but it may create others where UTF-8 isn't the right encoding to use, and I'm concerned about such a change breaking people's tests in a maintenance release.

To reduce the scope of the change but still address the problem, I think it may be better for LinkMaskingContentModifier to configure its PatternReplacingContentModifier to use UTF-8 as the fallback charset. This seems reasonable as LinkMaskingContentModifier already assumes JSON content and JSON is always UTF-8 encoded.
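
A rough, hypothetical sketch of that narrower change (the fallback-charset constructor parameter below is not the existing API; names are illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

import org.springframework.http.MediaType;
import org.springframework.restdocs.operation.preprocess.ContentModifier;

// Hypothetical sketch: the fallback charset is supplied by the caller instead of the
// JVM default, and LinkMaskingContentModifier would pass StandardCharsets.UTF_8 while
// other users of the modifier could keep the current behaviour.
class PatternReplacingContentModifierSketch implements ContentModifier {

    private final Pattern pattern;

    private final String replacement;

    private final Charset fallbackCharset;

    PatternReplacingContentModifierSketch(Pattern pattern, String replacement, Charset fallbackCharset) {
        this.pattern = pattern;
        this.replacement = replacement;
        this.fallbackCharset = fallbackCharset;
    }

    @Override
    public byte[] modifyContent(byte[] content, MediaType contentType) {
        Charset charset = (contentType != null && contentType.getCharset() != null)
                ? contentType.getCharset() : this.fallbackCharset;
        String original = new String(content, charset);
        // Simplified replacement; the real modifier walks the matches and rebuilds the string.
        String modified = this.pattern.matcher(original).replaceAll(this.replacement);
        return modified.getBytes(charset);
    }

}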

wilkinsona added the type: bug label and removed the status: waiting-for-triage label Jan 8, 2021
wilkinsona changed the title from "Jackson parsing breaking when using an UTF-8 character for link masking" to "Jackson parsing fails when using an UTF-8 character for link masking" Jan 8, 2021
wilkinsona added this to the 2.0.6.RELEASE milestone Jan 8, 2021
@odrotbohm
Member Author

Lovely, thanks, Andy! 🙇
