-
Notifications
You must be signed in to change notification settings - Fork 14.4k
[Serialization] Fix source location data loss during decoding. #145529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss. This patch fixes the issue by preserving the lower 33 bits for the encode.
@llvm/pr-subscribers-clang-modules @llvm/pr-subscribers-clang Author: Haojian Wu (hokein) ChangesThe delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss. This patch fixes the issue by preserving the lower 33 bits for the encode. Full diff: https://github.com/llvm/llvm-project/pull/145529.diff 2 Files Affected:
diff --git a/clang/include/clang/Serialization/SourceLocationEncoding.h b/clang/include/clang/Serialization/SourceLocationEncoding.h
index 33ca1728fa479..85a7379a5fb55 100644
--- a/clang/include/clang/Serialization/SourceLocationEncoding.h
+++ b/clang/include/clang/Serialization/SourceLocationEncoding.h
@@ -48,6 +48,9 @@ class SourceLocationEncoding {
using UIntTy = SourceLocation::UIntTy;
constexpr static unsigned UIntBits = CHAR_BIT * sizeof(UIntTy);
+ // The maximum number of bits we use for the encoding.
+ constexpr static unsigned EncodingBits = UIntBits + 1;
+
static UIntTy encodeRaw(UIntTy Raw) {
return (Raw << 1) | (Raw >> (UIntBits - 1));
}
@@ -179,20 +182,20 @@ SourceLocationEncoding::encode(SourceLocation Loc, UIntTy BaseOffset,
// 16 bits should be sufficient to store the module file index.
assert(BaseModuleFileIndex < (1 << 16));
- Encoded |= (RawLocEncoding)BaseModuleFileIndex << 32;
+ Encoded |= (RawLocEncoding)BaseModuleFileIndex << EncodingBits;
return Encoded;
}
inline std::pair<SourceLocation, unsigned>
SourceLocationEncoding::decode(RawLocEncoding Encoded,
SourceLocationSequence *Seq) {
- unsigned ModuleFileIndex = Encoded >> 32;
+ unsigned ModuleFileIndex = Encoded >> EncodingBits;
if (!ModuleFileIndex)
return {Seq ? Seq->decode(Encoded)
: SourceLocation::getFromRawEncoding(decodeRaw(Encoded)),
ModuleFileIndex};
- Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(32);
+ Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(EncodingBits);
SourceLocation Loc = SourceLocation::getFromRawEncoding(decodeRaw(Encoded));
return {Loc, ModuleFileIndex};
diff --git a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
index c80a8fd0e52b1..b21035af47cfc 100644
--- a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
+++ b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
@@ -104,6 +104,8 @@ TEST(SourceLocationEncoding, Sequence) {
roundTrip(
{123 | MacroBit, 1, 9, Biggest, Big, Big + 1, 0, MacroBit | Big, 0});
+
+ roundTrip({1, (1u << 30) + 1});
}
} // namespace
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. I failed to understand the problem. In what case, may the encoder produce up to 33 bits?
The newly-added testcase shows the issue -- the encoded value for the delta is llvm-project/clang/include/clang/Serialization/SourceLocationEncoding.h Lines 123 to 125 in c3c923c
And when decoding, we only read the lower 32 bits, which misses the 33rd bit. |
The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic? |
I don’t have a concrete failure case, but I noticed this while working on 64-bit source locations. Currently, we encode offsets in two ways depending on the module file index:
The 2) case is fine, as the encoded value fits into a 32-bit integer. However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits. It appears we only use 16 bits for the module file index, the fix here is to preserve an additional bit for the offset to avoid this issue. |
Yeah, the current situation conflicts with 64-bit source location. We discussed this. The conclusion is, if we need large source location space, use 48 bits and leave upper 16 bits for module file index. 48 bits should be enough for most cases.
|
In this case, I am fine to drop the optimizations. I don't like these small optimization which is hard to maintain and not significant for users. |
Thanks, this makes sense. In my implementation,
I had a similar thought when reading through that part of the code -- it’s becoming non-trivial to maintain. Removing the optimization would simplify things quite a bit. I’ll prepare a patch for that. The downside is that this may increase the on-disk size of PCM/preamble files. Based on D125403, the size increase could be up to 10%?, but that might be acceptable given the simplification and better long-term maintainability. |
Yeah, and for what it worth, maybe you want to take a look at: https://discourse.llvm.org/t/rfc-c-modules-stop-using-abbrev-and-drop-the-maintainance/87063/2 |
See the discussion in #145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB
See the discussion in llvm/llvm-project#145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB
See the discussion in llvm#145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB
See the discussion in llvm#145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB
The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss.
This patch fixes the issue by preserving the lower 33 bits for the encode.