Skip to content

Fix an issue with loading bundles with multi-byte character #2820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jul 14, 2021

Conversation

wu-hui
Copy link
Contributor

@wu-hui wu-hui commented Jul 12, 2021

Both BundleReader and the testing BundlerBuilder should be counting UTF-8 bytes, instead of Java char count.

This is a fix for: #2805

@googlebot googlebot added the cla: yes Override cla label Jul 12, 2021
@google-oss-bot
Copy link
Contributor

google-oss-bot commented Jul 12, 2021

Coverage Report

Affected SDKs

  • firebase-firestore

    SDK overall coverage changed from 47.18% (4f1cdc8) to 47.10% (559e5f5e) by -0.08%.

    Filename Base (4f1cdc8) Head (559e5f5e) Diff
    AsyncQueue.java 78.39% 76.88% -1.51%
    BundleReader.java 94.87% 95.29% +0.42%
    ByteBufferInputStream.java 90.91% 83.33% -7.58%
    FirestoreClient.java 34.96% 30.08% -4.88%
    LruGarbageCollector.java 93.46% 84.11% -9.35%

Test Logs

Notes

HTML coverage reports can be produced locally with ./gradlew <product>:checkCoverage.
Report files are located at <product-build-dir>/reports/jacoco/.

Head commit (559e5f5e) is created by Prow via merging commits: 4f1cdc8 7daf5cb.

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Jul 12, 2021

Binary Size Report

Affected SDKs

  • firebase-firestore

    Type Base (4f1cdc8) Head (559e5f5e) Diff
    aar 1.03 MB 1.03 MB +237 B (+0.0%)
    apk (release) 3.19 MB 3.19 MB +388 B (+0.0%)

Test Logs

Notes

Head commit (559e5f5e) is created by Prow via merging commits: 4f1cdc8 7daf5cb.

@@ -153,29 +152,50 @@ private int indexOfOpenBracket() {
}
}

private class ReadJsonResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by comment: this class should be static, and probably final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

buffer.position(buffer.position() + read);
byte[] bytes = new byte[read];
buffer.get(bytes);
json.append(new String(bytes, charset));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by comment: This is dangerous because bytes could end in the middle of a multi-byte character. You could possibly use a class like java.nio.charset.CharsetDecoder to handle this properly.

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change doesn't fix the issue. For example, consider the situation where the last byte of bytes specified to new String(bytes, charset) is the first byte of a multi-byte character, and the second byte of that character would be the first byte of the bytes in the next iteration; this would probably either fail decoding with an exception or would produce garbage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bytes the bundle tells the SDK that it is a UTF8 string. If what you described happens, it means the bundle is invalid, and an exception is exactly what we want here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the code reads the bundle's bytes in "chunks". Wouldn't there be arbitrary breaks in the "chunks" such that a multibyte character could span two chunks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This is problematic..I pushed a new commit to fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now, what if bytesToRead > BUFFER_CAPACITY? IIUC, with the updated code, if the buffer fills up then pullMoreData() will return false and readJson() will incorrectly throw IllegalArgumentException.

I have another idea. What if you go back to the original implementation but change StringBuilder to ByteArrayOutputStream. You can then accumulate the bytes in the ByteArrayOutputStream and, once all bytes are read, you can decode the entire byte sequence, avoiding the multi-byte-character-spanning-a-chunk issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is better.

buffer.position(buffer.position() + read);
byte[] bytes = new byte[read];
buffer.get(bytes);
json.append(new String(bytes, charset));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change doesn't fix the issue. For example, consider the situation where the last byte of bytes specified to new String(bytes, charset) is the first byte of a multi-byte character, and the second byte of that character would be the first byte of the bytes in the next iteration; this would probably either fail decoding with an exception or would produce garbage.

@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException {
*/
private boolean pullMoreData() throws IOException {
buffer.compact();
int read = dataReader.read(buffer);
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

available() is allowed to return 0 even before the end of the input stream. I think you should use buffer.remaining() if available() returns 0; otherwise, the read operation could erroneously fail as if it had prematurely reached the end of input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test that hits this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reply to Sebastian: I am not sure how to do this without mocking, available() is stated to be an estimate, which is pretty accurate for most implementations.

@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException {
*/
private boolean pullMoreData() throws IOException {
buffer.compact();
int read = dataReader.read(buffer);
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining());
byte[] bytes = new byte[bytesToRead];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider reading directly into the ByteBuffer's backing array, instead of allocating a new byte array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Consider reading/Please read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I feel better to avoid performance optimization without proof. The end result is some byte counting, which is bad for readability.

The other place we do this is acceptable, because there is no good alternative..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are doubling the amount of memory consumed here. Gil pushed back heavily against this in the original PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is premature..but done anyways.

buffer.get(c);
return new String(c);
return charset.decode(ByteBuffer.wrap(c)).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readLengthPrefix() could return the byte count as well as the String, to avoid forcing the caller to re-encode the String into bytes. As with the other suggestion, this could be deferred to a follow-up PR if you prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that indexOfOpenBracket returns the char-index of the first bracket. Using a byte buffer here is not correct. but likely won't make a difference since this part of the input should be ASCII.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

int read = Math.min(remaining, buffer.remaining());
json.append(buffer, 0, read);
jsonBytes.write(buffer.array(), buffer.position(), read);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add buffer.arrayOffset() to buffer.position().

https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#arrayOffset()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More over, buffer is a CharBuffer. CharBuffer deals with UTF-8 direcltly:

            String unicode = "\uD83D\uDE0A";
            byte[] b = unicode.getBytes();
            CharBuffer buffer = CharBuffer.allocate(10);
            new InputStreamReader(new ByteArrayInputStream(b)).read(buffer);
            StringBuilder result = new StringBuilder();
            buffer.flip();
            System.out.println(buffer.remaining());
            result.append(buffer);
            System.out.println(result);

Prints:

2
😊

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI buffer was changed from CharBuffer to ByteBuffer in one of the recent commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, but the existing code uses it throughout and creates the initial input stream using UTF-8. It should work (and the test passes), so I am a bit concerned that the bug is somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the tests pass because the test bundle builder also count characters, not bytes.

buffer.get(c);
return new String(c);
return charset.decode(ByteBuffer.wrap(c)).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that indexOfOpenBracket returns the char-index of the first bracket. Using a byte buffer here is not correct. but likely won't make a difference since this part of the input should be ASCII.

int read = Math.min(remaining, buffer.remaining());
json.append(buffer, 0, read);
jsonBytes.write(buffer.array(), buffer.position(), read);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More over, buffer is a CharBuffer. CharBuffer deals with UTF-8 direcltly:

            String unicode = "\uD83D\uDE0A";
            byte[] b = unicode.getBytes();
            CharBuffer buffer = CharBuffer.allocate(10);
            new InputStreamReader(new ByteArrayInputStream(b)).read(buffer);
            StringBuilder result = new StringBuilder();
            buffer.flip();
            System.out.println(buffer.remaining());
            result.append(buffer);
            System.out.println(result);

Prints:

2
😊

@@ -71,7 +72,7 @@
+ "{\"seconds\":1000,\"nanos\":9999},\"exists\":true}}",
"{\"document\":{\"name\":\"projects/{projectId}/databases/(default)/documents/coll-1/b\","
+ "\"createTime\":{\"seconds\":1,\"nanos\":9},\"updateTime\":{\"seconds\":1,"
+ "\"nanos\":9},\"fields\":{\"k\":{\"stringValue\":\"b\"},\"bar\":"
+ "\"nanos\":9},\"fields\":{\"k\":{\"stringValue\":\"\uD83D\uDE0A\"},\"bar\":"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this test passes against master as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the bundles built here also count chars, not bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other change in this test, it uses byte count, which fails the tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just saw this flaw in my logic. So the bug is that we try to read more data than available.

@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException {
*/
private boolean pullMoreData() throws IOException {
buffer.compact();
int read = dataReader.read(buffer);
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test that hits this issue.

@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException {
*/
private boolean pullMoreData() throws IOException {
buffer.compact();
int read = dataReader.read(buffer);
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining());
byte[] bytes = new byte[bytesToRead];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Consider reading/Please read

Copy link
Contributor

@dconeybe dconeybe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve, but please ensure that Sebastian's comments are resolved.

@dconeybe dconeybe assigned wu-hui and unassigned dconeybe Jul 13, 2021
long bytesRead;

public BundleReader(BundleSerializer serializer, InputStream data) {
this.serializer = serializer;
dataReader = new InputStreamReader(data, charset);
buffer = CharBuffer.allocate(BUFFER_CAPACITY);
bundleInputStream = data;
Copy link
Contributor

@schmidt-sebastian schmidt-sebastian Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename the argument to bundleInputStream as well? We could also drop the prefix similar to serializer.

Optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -133,9 +135,9 @@ private BundleElement readNextElement() throws IOException, JSONException {
throw abort("Reached the end of bundle when a length string is expected.");
}

char[] c = new char[nextOpenBracket];
byte[] c = new byte[nextOpenBracket];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/c/b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 196 to 202
int available = bundleInputStream.available();
int bytesToRead = Math.min(available, buffer.remaining());
// `available` is an estimation, we still try to read if the estimation is 0, to move things
// forward.
if (available == 0) {
bytesToRead = buffer.remaining();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always read buffer.remaining?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since we are reading into the buffer directly now!

@google-oss-bot
Copy link
Contributor

@wu-hui: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
smoke-tests 7daf5cb link /test smoke-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@schmidt-sebastian schmidt-sebastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wu-hui wu-hui merged commit 76b029c into master Jul 14, 2021
@wu-hui wu-hui deleted the wuandy/BundleUnicodeFix branch July 14, 2021 01:52
@firebase firebase locked and limited conversation to collaborators Aug 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cla: yes Override cla size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants