Skip to content

[Serialization] Fix source location data loss during decoding. #145529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

hokein
Copy link
Collaborator

@hokein hokein commented Jun 24, 2025

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss.

This patch fixes the issue by preserving the lower 33 bits for the encode.

The delta encoding can produce values up to 33 bits, but the current decoding
logic only preserves the lower 32 bits, potentially causing data loss.

This patch fixes the issue by preserving the lower 33 bits for the
encode.
@hokein hokein requested a review from ChuanqiXu9 June 24, 2025 15:16
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:modules C++20 modules and Clang Header Modules labels Jun 24, 2025
@llvmbot
Copy link
Member

llvmbot commented Jun 24, 2025

@llvm/pr-subscribers-clang-modules

@llvm/pr-subscribers-clang

Author: Haojian Wu (hokein)

Changes

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss.

This patch fixes the issue by preserving the lower 33 bits for the encode.


Full diff: https://github.com/llvm/llvm-project/pull/145529.diff

2 Files Affected:

  • (modified) clang/include/clang/Serialization/SourceLocationEncoding.h (+6-3)
  • (modified) clang/unittests/Serialization/SourceLocationEncodingTest.cpp (+2)
diff --git a/clang/include/clang/Serialization/SourceLocationEncoding.h b/clang/include/clang/Serialization/SourceLocationEncoding.h
index 33ca1728fa479..85a7379a5fb55 100644
--- a/clang/include/clang/Serialization/SourceLocationEncoding.h
+++ b/clang/include/clang/Serialization/SourceLocationEncoding.h
@@ -48,6 +48,9 @@ class SourceLocationEncoding {
   using UIntTy = SourceLocation::UIntTy;
   constexpr static unsigned UIntBits = CHAR_BIT * sizeof(UIntTy);
 
+  // The maximum number of bits we use for the encoding.
+  constexpr static unsigned EncodingBits = UIntBits + 1;
+
   static UIntTy encodeRaw(UIntTy Raw) {
     return (Raw << 1) | (Raw >> (UIntBits - 1));
   }
@@ -179,20 +182,20 @@ SourceLocationEncoding::encode(SourceLocation Loc, UIntTy BaseOffset,
 
   // 16 bits should be sufficient to store the module file index.
   assert(BaseModuleFileIndex < (1 << 16));
-  Encoded |= (RawLocEncoding)BaseModuleFileIndex << 32;
+  Encoded |= (RawLocEncoding)BaseModuleFileIndex << EncodingBits;
   return Encoded;
 }
 inline std::pair<SourceLocation, unsigned>
 SourceLocationEncoding::decode(RawLocEncoding Encoded,
                                SourceLocationSequence *Seq) {
-  unsigned ModuleFileIndex = Encoded >> 32;
+  unsigned ModuleFileIndex = Encoded >> EncodingBits;
 
   if (!ModuleFileIndex)
     return {Seq ? Seq->decode(Encoded)
                 : SourceLocation::getFromRawEncoding(decodeRaw(Encoded)),
             ModuleFileIndex};
 
-  Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(32);
+  Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(EncodingBits);
   SourceLocation Loc = SourceLocation::getFromRawEncoding(decodeRaw(Encoded));
 
   return {Loc, ModuleFileIndex};
diff --git a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
index c80a8fd0e52b1..b21035af47cfc 100644
--- a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
+++ b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
@@ -104,6 +104,8 @@ TEST(SourceLocationEncoding, Sequence) {
 
   roundTrip(
       {123 | MacroBit, 1, 9, Biggest, Big, Big + 1, 0, MacroBit | Big, 0});
+
+  roundTrip({1, (1u << 30) + 1});
 }
 
 } // namespace

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I failed to understand the problem. In what case, may the encoder produce up to 33 bits?

@hokein
Copy link
Collaborator Author

hokein commented Jun 25, 2025

Sorry. I failed to understand the problem. In what case, may the encoder produce up to 33 bits?

The newly-added testcase shows the issue -- the encoded value for the delta is 1<<32, relevant code

// Exactly one 33 bit value is possible! (1 << 32).
// This is because we have two representations of zero: trivial & relative.
return 1 + EncodedTy{zigZag(Delta)};
.

And when decoding, we only read the lower 32 bits, which misses the 33rd bit.

@ChuanqiXu9
Copy link
Member

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

@hokein
Copy link
Collaborator Author

hokein commented Jun 25, 2025

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

I don’t have a concrete failure case, but I noticed this while working on 64-bit source locations.

Currently, we encode offsets in two ways depending on the module file index:

  1. If the module file index is 0, we use delta encoding (see this code path).
  2. otherwise, we use raw encoding (see this code path).

The 2) case is fine, as the encoded value fits into a 32-bit integer. However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

It appears we only use 16 bits for the module file index, the fix here is to preserve an additional bit for the offset to avoid this issue.

@ChuanqiXu9
Copy link
Member

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

I don’t have a concrete failure case, but I noticed this while working on 64-bit source locations.

Yeah, the current situation conflicts with 64-bit source location. We discussed this. The conclusion is, if we need large source location space, use 48 bits and leave upper 16 bits for module file index. 48 bits should be enough for most cases.

Currently, we encode offsets in two ways depending on the module file index:

  1. If the module file index is 0, we use delta encoding (see this code path).
  2. otherwise, we use raw encoding (see this code path).

The 2) case is fine, as the encoded value fits into a 32-bit integer. However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

It appears we only use 16 bits for the module file index, the fix here is to preserve an additional bit for the offset to avoid this issue.

@ChuanqiXu9
Copy link
Member

If the module file index is 0, we use delta encoding (see this code path).
However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

In this case, I am fine to drop the optimizations. I don't like these small optimization which is hard to maintain and not significant for users.

@hokein
Copy link
Collaborator Author

hokein commented Jun 25, 2025

Yeah, the current situation conflicts with 64-bit source location. We discussed this. The conclusion is, if we need large source location space, use 48 bits and leave upper 16 bits for module file index. 48 bits should be enough for most cases.

Thanks, this makes sense. In my implementation, SourceLocation class would be 64 bits, but the actual used bits is no more than 48 bits (the limit is easy to adjust if needed).

In this case, I am fine to drop the optimizations. I don't like these small optimization which is hard to maintain and not significant for users.

I had a similar thought when reading through that part of the code -- it’s becoming non-trivial to maintain. Removing the optimization would simplify things quite a bit. I’ll prepare a patch for that.

The downside is that this may increase the on-disk size of PCM/preamble files. Based on D125403, the size increase could be up to 10%?, but that might be acceptable given the simplification and better long-term maintainability.

@ChuanqiXu9
Copy link
Member

Yeah, and for what it worth, maybe you want to take a look at: https://discourse.llvm.org/t/rfc-c-modules-stop-using-abbrev-and-drop-the-maintainance/87063/2

hokein added a commit that referenced this pull request Jun 25, 2025
See the discussion in #145529.

This will slightly increase the PCM size (~5%), some data (in-memory
preamble size in clangd):

- SemaExpr.cpp: 77MB  -> 80MB
- FindTarget.cpp: 71MB -> 75MB
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 25, 2025
See the discussion in llvm/llvm-project#145529.

This will slightly increase the PCM size (~5%), some data (in-memory
preamble size in clangd):

- SemaExpr.cpp: 77MB  -> 80MB
- FindTarget.cpp: 71MB -> 75MB
@ChuanqiXu9 ChuanqiXu9 closed this Jun 26, 2025
anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025
See the discussion in llvm#145529.

This will slightly increase the PCM size (~5%), some data (in-memory
preamble size in clangd):

- SemaExpr.cpp: 77MB  -> 80MB
- FindTarget.cpp: 71MB -> 75MB
rlavaee pushed a commit to rlavaee/llvm-project that referenced this pull request Jul 1, 2025
See the discussion in llvm#145529.

This will slightly increase the PCM size (~5%), some data (in-memory
preamble size in clangd):

- SemaExpr.cpp: 77MB  -> 80MB
- FindTarget.cpp: 71MB -> 75MB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:modules C++20 modules and Clang Header Modules clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants