-
Notifications
You must be signed in to change notification settings - Fork 626
Fix an issue with loading bundles with multi-byte character #2820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # firebase-firestore/CHANGELOG.md
Coverage ReportAffected SDKs
Test Logs
NotesHTML coverage reports can be produced locally with Head commit (559e5f5e) is created by Prow via merging commits: 4f1cdc8 7daf5cb. |
Binary Size ReportAffected SDKs
Test Logs
NotesHead commit (559e5f5e) is created by Prow via merging commits: 4f1cdc8 7daf5cb. |
@@ -153,29 +152,50 @@ private int indexOfOpenBracket() { | |||
} | |||
} | |||
|
|||
private class ReadJsonResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by comment: this class should be static
, and probably final
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
buffer.position(buffer.position() + read); | ||
byte[] bytes = new byte[read]; | ||
buffer.get(bytes); | ||
json.append(new String(bytes, charset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by comment: This is dangerous because bytes
could end in the middle of a multi-byte character. You could possibly use a class like java.nio.charset.CharsetDecoder
to handle this properly.
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change doesn't fix the issue. For example, consider the situation where the last byte of bytes
specified to new String(bytes, charset)
is the first byte of a multi-byte character, and the second byte of that character would be the first byte of the bytes
in the next iteration; this would probably either fail decoding with an exception or would produce garbage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the bytes the bundle tells the SDK that it is a UTF8 string. If what you described happens, it means the bundle is invalid, and an exception is exactly what we want here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the code reads the bundle's bytes in "chunks". Wouldn't there be arbitrary breaks in the "chunks" such that a multibyte character could span two chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. This is problematic..I pushed a new commit to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now, what if bytesToRead
> BUFFER_CAPACITY
? IIUC, with the updated code, if the buffer fills up then pullMoreData()
will return false
and readJson()
will incorrectly throw IllegalArgumentException
.
I have another idea. What if you go back to the original implementation but change StringBuilder
to ByteArrayOutputStream
. You can then accumulate the bytes in the ByteArrayOutputStream
and, once all bytes are read, you can decode the entire byte sequence, avoiding the multi-byte-character-spanning-a-chunk issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is better.
buffer.position(buffer.position() + read); | ||
byte[] bytes = new byte[read]; | ||
buffer.get(bytes); | ||
json.append(new String(bytes, charset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change doesn't fix the issue. For example, consider the situation where the last byte of bytes
specified to new String(bytes, charset)
is the first byte of a multi-byte character, and the second byte of that character would be the first byte of the bytes
in the next iteration; this would probably either fail decoding with an exception or would produce garbage.
firebase-firestore/src/main/java/com/google/firebase/firestore/bundle/BundleReader.java
Outdated
Show resolved
Hide resolved
firebase-firestore/src/androidTest/java/com/google/firebase/firestore/BundleTest.java
Outdated
Show resolved
Hide resolved
@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException { | |||
*/ | |||
private boolean pullMoreData() throws IOException { | |||
buffer.compact(); | |||
int read = dataReader.read(buffer); | |||
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
available()
is allowed to return 0
even before the end of the input stream. I think you should use buffer.remaining()
if available()
returns 0
; otherwise, the read operation could erroneously fail as if it had prematurely reached the end of input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test that hits this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reply to Sebastian: I am not sure how to do this without mocking, available()
is stated to be an estimate, which is pretty accurate for most implementations.
@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException { | |||
*/ | |||
private boolean pullMoreData() throws IOException { | |||
buffer.compact(); | |||
int read = dataReader.read(buffer); | |||
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining()); | |||
byte[] bytes = new byte[bytesToRead]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider reading directly into the ByteBuffer
's backing array, instead of allocating a new byte
array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Consider reading/Please read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I feel better to avoid performance optimization without proof. The end result is some byte counting, which is bad for readability.
The other place we do this is acceptable, because there is no good alternative..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are doubling the amount of memory consumed here. Gil pushed back heavily against this in the original PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is premature..but done anyways.
firebase-firestore/src/main/java/com/google/firebase/firestore/bundle/BundleReader.java
Outdated
Show resolved
Hide resolved
buffer.get(c); | ||
return new String(c); | ||
return charset.decode(ByteBuffer.wrap(c)).toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readLengthPrefix()
could return the byte count as well as the String, to avoid forcing the caller to re-encode the String into bytes. As with the other suggestion, this could be deferred to a follow-up PR if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that indexOfOpenBracket
returns the char-index of the first bracket. Using a byte buffer here is not correct. but likely won't make a difference since this part of the input should be ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
int read = Math.min(remaining, buffer.remaining()); | ||
json.append(buffer, 0, read); | ||
jsonBytes.write(buffer.array(), buffer.position(), read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add buffer.arrayOffset() to buffer.position().
https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#arrayOffset()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More over, buffer
is a CharBuffer. CharBuffer deals with UTF-8 direcltly:
String unicode = "\uD83D\uDE0A";
byte[] b = unicode.getBytes();
CharBuffer buffer = CharBuffer.allocate(10);
new InputStreamReader(new ByteArrayInputStream(b)).read(buffer);
StringBuilder result = new StringBuilder();
buffer.flip();
System.out.println(buffer.remaining());
result.append(buffer);
System.out.println(result);
Prints:
2
😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI buffer
was changed from CharBuffer
to ByteBuffer
in one of the recent commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, but the existing code uses it throughout and creates the initial input stream using UTF-8. It should work (and the test passes), so I am a bit concerned that the bug is somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the tests pass because the test bundle builder also count characters, not bytes.
firebase-firestore/src/main/java/com/google/firebase/firestore/bundle/BundleReader.java
Outdated
Show resolved
Hide resolved
firebase-firestore/src/main/java/com/google/firebase/firestore/bundle/BundleReader.java
Outdated
Show resolved
Hide resolved
buffer.get(c); | ||
return new String(c); | ||
return charset.decode(ByteBuffer.wrap(c)).toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that indexOfOpenBracket
returns the char-index of the first bracket. Using a byte buffer here is not correct. but likely won't make a difference since this part of the input should be ASCII.
int read = Math.min(remaining, buffer.remaining()); | ||
json.append(buffer, 0, read); | ||
jsonBytes.write(buffer.array(), buffer.position(), read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More over, buffer
is a CharBuffer. CharBuffer deals with UTF-8 direcltly:
String unicode = "\uD83D\uDE0A";
byte[] b = unicode.getBytes();
CharBuffer buffer = CharBuffer.allocate(10);
new InputStreamReader(new ByteArrayInputStream(b)).read(buffer);
StringBuilder result = new StringBuilder();
buffer.flip();
System.out.println(buffer.remaining());
result.append(buffer);
System.out.println(result);
Prints:
2
😊
@@ -71,7 +72,7 @@ | |||
+ "{\"seconds\":1000,\"nanos\":9999},\"exists\":true}}", | |||
"{\"document\":{\"name\":\"projects/{projectId}/databases/(default)/documents/coll-1/b\"," | |||
+ "\"createTime\":{\"seconds\":1,\"nanos\":9},\"updateTime\":{\"seconds\":1," | |||
+ "\"nanos\":9},\"fields\":{\"k\":{\"stringValue\":\"b\"},\"bar\":" | |||
+ "\"nanos\":9},\"fields\":{\"k\":{\"stringValue\":\"\uD83D\uDE0A\"},\"bar\":" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this test passes against master
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the bundles built here also count chars, not bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other change in this test, it uses byte count, which fails the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just saw this flaw in my logic. So the bug is that we try to read more data than available.
@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException { | |||
*/ | |||
private boolean pullMoreData() throws IOException { | |||
buffer.compact(); | |||
int read = dataReader.read(buffer); | |||
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test that hits this issue.
@@ -185,9 +186,12 @@ private String readJsonString(int length) throws IOException { | |||
*/ | |||
private boolean pullMoreData() throws IOException { | |||
buffer.compact(); | |||
int read = dataReader.read(buffer); | |||
int bytesToRead = Math.min(bundleInputStream.available(), buffer.remaining()); | |||
byte[] bytes = new byte[bytesToRead]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Consider reading/Please read
firebase-firestore/src/main/java/com/google/firebase/firestore/bundle/BundleReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve, but please ensure that Sebastian's comments are resolved.
long bytesRead; | ||
|
||
public BundleReader(BundleSerializer serializer, InputStream data) { | ||
this.serializer = serializer; | ||
dataReader = new InputStreamReader(data, charset); | ||
buffer = CharBuffer.allocate(BUFFER_CAPACITY); | ||
bundleInputStream = data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename the argument to bundleInputStream
as well? We could also drop the prefix similar to serializer
.
Optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -133,9 +135,9 @@ private BundleElement readNextElement() throws IOException, JSONException { | |||
throw abort("Reached the end of bundle when a length string is expected."); | |||
} | |||
|
|||
char[] c = new char[nextOpenBracket]; | |||
byte[] c = new byte[nextOpenBracket]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: s/c/b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
int available = bundleInputStream.available(); | ||
int bytesToRead = Math.min(available, buffer.remaining()); | ||
// `available` is an estimation, we still try to read if the estimation is 0, to move things | ||
// forward. | ||
if (available == 0) { | ||
bytesToRead = buffer.remaining(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we always read buffer.remaining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since we are reading into the buffer directly now!
@wu-hui: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Both
BundleReader
and the testingBundlerBuilder
should be counting UTF-8 bytes, instead of Javachar
count.This is a fix for: #2805