Jackson parsing fails when using an UTF-8 character for link masking #705

Closed
odrotbohm opened this issue Nov 20, 2020 · 4 comments

@odrotbohm
Member

odrotbohm commented Nov 20, 2020

When, for example, the UTF-8 ellipsis character (…) is passed into maskLinks(…), the execution fails on Windows.

org.springframework.restdocs.snippet.ModelCreationException:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x85

I haven't been able to obtain a more complete stack trace yet but will investigate further.

Update: I managed to obtain more of the stack trace:

at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1851)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:707)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3601)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3597)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2540)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2466)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:297)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:670)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:869)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:652)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.mapObject(UntypedObjectDeserializer.java:869)
    at com.fasterxml.jackson.databind.deser.std.UntypedObjectDeserializer$Vanilla.deserialize(UntypedObjectDeserializer.java:652)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringKeyMap(MapDeserializer.java:540)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:377)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:29)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4526)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3529)
    at org.springframework.restdocs.hypermedia.AbstractJsonLinkExtractor.extractLinks(AbstractJsonLinkExtractor.java:39)
    at org.springframework.restdocs.hypermedia.ContentTypeLinkExtractor.extractLinks(ContentTypeLinkExtractor.java:52)
    at org.springframework.restdocs.hypermedia.LinksSnippet.createModel(LinksSnippet.java:122)

Also, tweaking AbstractJsonLinkExtractor.extractLinks(…) to use ….getContentAsString() rather than ….getContent() seems to work around the error, as it replaces the mangled characters with the replacement character (?). That is, it looks like something is platform-encoding the content byte array.
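
This is only a guess at why the String-based accessor hides the problem. A minimal sketch, assuming the content byte array was produced with the platform default charset (windows-1252, where U+2026 becomes the single byte 0x85) and is then decoded leniently as UTF-8:

import java.nio.charset.StandardCharsets;

public class ReplacementCharacterDemo {

    public static void main(String[] args) {
        // Hypothetical stand-in for the mangled response content: the ellipsis
        // encoded with windows-1252 is the single byte 0x85, which is not valid UTF-8.
        byte[] platformEncoded = { (byte) 0x85 };
        // A lenient String decode does not throw; the malformed byte is substituted
        // with U+FFFD, the replacement character, so the parse error disappears
        // while the masked content is still mangled.
        String decoded = new String(platformEncoded, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0)); // 65533 (U+FFFD)
    }
}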

odrotbohm added a commit to quarano/quarano-application that referenced this issue Nov 20, 2020
The … character seems to break the JSON parsing on Windows, as reported in Spring REST Docs [0]. We temporarily move off the dedicated character in favor of three dots.

[0] spring-projects/spring-restdocs#705.
@wilkinsona
Member

Some experimentation suggests that this isn't a REST Docs problem and that the … character is corrupted before it gets to REST Docs. I think the problem occurs when you have UTF-8 encoded source that's compiled and then run on a JVM using a different encoding (most likely CP1252 on Windows).

The following UTF-8-encoded source appears to be sufficient to reproduce the problem:

System.out.println("…");
System.out.println("\u2026");

When run on a JVM using CP1252 as its default encoding, the following output is produced:

�
�

Switching the JVM's default encoding to UTF-8 produces the expected output:

…
…

@odrotbohm Can you try your app with the JVM running the tests configured to use UTF-8 as its default encoding (-Dfile.encoding=UTF-8)?
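
For convenience, a self-contained variant of that snippet (a sketch; the class name is arbitrary, and Charset.defaultCharset() is added to show which encoding is in effect):

public class EllipsisRepro {

    public static void main(String[] args) {
        // Prints the JVM's default charset (e.g. windows-1252 or UTF-8). As described
        // above, the two lines below only come out as the ellipsis when this reports
        // UTF-8, e.g. when the test JVM is started with -Dfile.encoding=UTF-8.
        System.out.println(java.nio.charset.Charset.defaultCharset());
        System.out.println("…");
        System.out.println("\u2026");
    }
}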

@odrotbohm
Member Author

Hm, to me it looks like PatternReplacingContentModifier is calling String.getBytes() on the replacement result, which uses the system/JVM default encoding and thus, on Windows, produces an ISO-8859-1-encoded byte array. Replacing that call with ….getBytes(StandardCharsets.UTF_8) produces a UTF-8-encoded byte array in the first place and avoids the exception. Ideally, I guess, the charset provided by the response should be preferred, with UTF-8 as the fallback in case none is given.
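
A minimal sketch of that mismatch, assuming the default charset is CP1252/windows-1252 as mentioned above (where U+2026 encodes to the single 0x85 byte that Jackson reports):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {

    public static void main(String[] args) {
        String masked = "\u2026"; // the ellipsis used for link masking
        // What String.getBytes() effectively does on such a JVM: a single byte, 0x85,
        // which Jackson's UTF-8 parser then rejects as an invalid start byte.
        byte[] platformBytes = masked.getBytes(Charset.forName("windows-1252"));
        System.out.printf("windows-1252: %02x%n", platformBytes[0]);
        // Encoding explicitly with UTF-8 yields the valid three-byte sequence E2 80 A6.
        byte[] utf8Bytes = masked.getBytes(StandardCharsets.UTF_8);
        System.out.printf("UTF-8: %02x %02x %02x%n", utf8Bytes[0], utf8Bytes[1], utf8Bytes[2]);
    }
}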

@wilkinsona
Member

wilkinsona commented Jan 8, 2021

I've spent some time in a Windows VM and I think I have a more complete understanding of the problem.

Replacing that call with ….getBytes(StandardCharsets.UTF_8) produces a UTF-8-encoded byte array in the first place and avoids the exception

Yeah, changes along these lines do appear to fix the problem:

@@ -16,6 +16,8 @@
 
 package org.springframework.restdocs.operation.preprocess;
 
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
@@ -47,13 +49,8 @@ class PatternReplacingContentModifier implements ContentModifier {
 
        @Override
        public byte[] modifyContent(byte[] content, MediaType contentType) {
-               String original;
-               if (contentType != null && contentType.getCharset() != null) {
-                       original = new String(content, contentType.getCharset());
-               }
-               else {
-                       original = new String(content);
-               }
+               Charset charset = (contentType != null && contentType.getCharset() != null) ? contentType.getCharset() : StandardCharsets.UTF_8;
+               String original = new String(content, charset);
                Matcher matcher = this.pattern.matcher(original);
                StringBuilder builder = new StringBuilder();
                int previous = 0;
@@ -73,7 +70,7 @@ class PatternReplacingContentModifier implements ContentModifier {
                if (previous < original.length()) {
                        builder.append(original.substring(previous));
                }
-               return builder.toString().getBytes();
+               return builder.toString().getBytes(charset);
        }
 
 }

That said, I'm not sure this is the right change to make. Switching the fallback from the JVM's default encoding to UTF-8 will fix this problem, but it may create others where UTF-8 isn't the right encoding to use, and I'm concerned about such a change breaking people's tests in a maintenance release.

To reduce the scope of the change but still address the problem, I think it may be better for LinkMaskingContentModifier to configure its PatternReplacingContentModifier to use UTF-8 as the fallback charset. This seems reasonable as LinkMaskingContentModifier already assumes JSON content and JSON is always UTF-8 encoded.
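
A rough, hypothetical sketch of that narrower change (the fallback-charset constructor parameter below is not the existing API; names are illustrative):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

import org.springframework.http.MediaType;
import org.springframework.restdocs.operation.preprocess.ContentModifier;

// Hypothetical sketch: the fallback charset is supplied by the caller instead of the
// JVM default, and LinkMaskingContentModifier would pass StandardCharsets.UTF_8 while
// other users of the modifier could keep the current behaviour.
class PatternReplacingContentModifierSketch implements ContentModifier {

    private final Pattern pattern;

    private final String replacement;

    private final Charset fallbackCharset;

    PatternReplacingContentModifierSketch(Pattern pattern, String replacement, Charset fallbackCharset) {
        this.pattern = pattern;
        this.replacement = replacement;
        this.fallbackCharset = fallbackCharset;
    }

    @Override
    public byte[] modifyContent(byte[] content, MediaType contentType) {
        Charset charset = (contentType != null && contentType.getCharset() != null)
                ? contentType.getCharset() : this.fallbackCharset;
        String original = new String(content, charset);
        // Simplified replacement; the real modifier walks the matches and rebuilds the string.
        String modified = this.pattern.matcher(original).replaceAll(this.replacement);
        return modified.getBytes(charset);
    }

}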

wilkinsona added the type: bug label and removed the status: waiting-for-triage label Jan 8, 2021
wilkinsona changed the title from "Jackson parsing breaking when using an UTF-8 character for link masking" to "Jackson parsing fails when using an UTF-8 character for link masking" Jan 8, 2021
wilkinsona added this to the 2.0.6.RELEASE milestone Jan 8, 2021
@odrotbohm
Member Author

Lovely, thanks, Andy! 🙇
