[Serialization] Fix source location data loss during decoding. #145529

hokein · 2025-06-24T15:16:34Z

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss.

This patch fixes the issue by preserving the lower 33 bits for the encode.

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss. This patch fixes the issue by preserving the lower 33 bits for the encode.

llvmbot · 2025-06-24T15:17:07Z

@llvm/pr-subscribers-clang-modules

@llvm/pr-subscribers-clang

Author: Haojian Wu (hokein)

Changes

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss.

This patch fixes the issue by preserving the lower 33 bits for the encode.

Full diff: https://github.com/llvm/llvm-project/pull/145529.diff

2 Files Affected:

(modified) clang/include/clang/Serialization/SourceLocationEncoding.h (+6-3)
(modified) clang/unittests/Serialization/SourceLocationEncodingTest.cpp (+2)

diff --git a/clang/include/clang/Serialization/SourceLocationEncoding.h b/clang/include/clang/Serialization/SourceLocationEncoding.h
index 33ca1728fa479..85a7379a5fb55 100644
--- a/clang/include/clang/Serialization/SourceLocationEncoding.h
+++ b/clang/include/clang/Serialization/SourceLocationEncoding.h
@@ -48,6 +48,9 @@ class SourceLocationEncoding {
   using UIntTy = SourceLocation::UIntTy;
   constexpr static unsigned UIntBits = CHAR_BIT * sizeof(UIntTy);
 
+  // The maximum number of bits we use for the encoding.
+  constexpr static unsigned EncodingBits = UIntBits + 1;
+
   static UIntTy encodeRaw(UIntTy Raw) {
     return (Raw << 1) | (Raw >> (UIntBits - 1));
   }
@@ -179,20 +182,20 @@ SourceLocationEncoding::encode(SourceLocation Loc, UIntTy BaseOffset,
 
   // 16 bits should be sufficient to store the module file index.
   assert(BaseModuleFileIndex < (1 << 16));
-  Encoded |= (RawLocEncoding)BaseModuleFileIndex << 32;
+  Encoded |= (RawLocEncoding)BaseModuleFileIndex << EncodingBits;
   return Encoded;
 }
 inline std::pair<SourceLocation, unsigned>
 SourceLocationEncoding::decode(RawLocEncoding Encoded,
                                SourceLocationSequence *Seq) {
-  unsigned ModuleFileIndex = Encoded >> 32;
+  unsigned ModuleFileIndex = Encoded >> EncodingBits;
 
   if (!ModuleFileIndex)
     return {Seq ? Seq->decode(Encoded)
                 : SourceLocation::getFromRawEncoding(decodeRaw(Encoded)),
             ModuleFileIndex};
 
-  Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(32);
+  Encoded &= llvm::maskTrailingOnes<RawLocEncoding>(EncodingBits);
   SourceLocation Loc = SourceLocation::getFromRawEncoding(decodeRaw(Encoded));
 
   return {Loc, ModuleFileIndex};
diff --git a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
index c80a8fd0e52b1..b21035af47cfc 100644
--- a/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
+++ b/clang/unittests/Serialization/SourceLocationEncodingTest.cpp
@@ -104,6 +104,8 @@ TEST(SourceLocationEncoding, Sequence) {
 
   roundTrip(
       {123 | MacroBit, 1, 9, Biggest, Big, Big + 1, 0, MacroBit | Big, 0});
+
+  roundTrip({1, (1u << 30) + 1});
 }
 
 } // namespace

ChuanqiXu9

Sorry. I failed to understand the problem. In what case, may the encoder produce up to 33 bits?

hokein · 2025-06-25T06:46:18Z

Sorry. I failed to understand the problem. In what case, may the encoder produce up to 33 bits?

The newly-added testcase shows the issue -- the encoded value for the delta is 1<<32, relevant code

llvm-project/clang/include/clang/Serialization/SourceLocationEncoding.h

Lines 123 to 125 in c3c923c

    
           // Exactly one 33 bit value is possible! (1 << 32). 
        
           // This is because we have two representations of zero: trivial & relative. 
        
           return 1 + EncodedTy{zigZag(Delta)};

.

And when decoding, we only read the lower 32 bits, which misses the 33rd bit.

ChuanqiXu9 · 2025-06-25T06:55:34Z

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

hokein · 2025-06-25T07:19:12Z

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

I don’t have a concrete failure case, but I noticed this while working on 64-bit source locations.

Currently, we encode offsets in two ways depending on the module file index:

If the module file index is 0, we use delta encoding (see this code path).
otherwise, we use raw encoding (see this code path).

The 2) case is fine, as the encoded value fits into a 32-bit integer. However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

It appears we only use 16 bits for the module file index, the fix here is to preserve an additional bit for the offset to avoid this issue.

ChuanqiXu9 · 2025-06-25T07:33:27Z

The design is, the higher 32 bits are used for module file index and the lower bits are used for offsets. Could you give a concrete example why the current implementation is problematic?

I don’t have a concrete failure case, but I noticed this while working on 64-bit source locations.

Yeah, the current situation conflicts with 64-bit source location. We discussed this. The conclusion is, if we need large source location space, use 48 bits and leave upper 16 bits for module file index. 48 bits should be enough for most cases.

Currently, we encode offsets in two ways depending on the module file index:

If the module file index is 0, we use delta encoding (see this code path).

otherwise, we use raw encoding (see this code path).

The 2) case is fine, as the encoded value fits into a 32-bit integer. However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

It appears we only use 16 bits for the module file index, the fix here is to preserve an additional bit for the offset to avoid this issue.

ChuanqiXu9 · 2025-06-25T07:40:49Z

If the module file index is 0, we use delta encoding (see this code path).
However, the 1) case can produce a 33-bit value, which doesn’t fit in the lower 32 bits.

In this case, I am fine to drop the optimizations. I don't like these small optimization which is hard to maintain and not significant for users.

hokein · 2025-06-25T10:20:49Z

Yeah, the current situation conflicts with 64-bit source location. We discussed this. The conclusion is, if we need large source location space, use 48 bits and leave upper 16 bits for module file index. 48 bits should be enough for most cases.

Thanks, this makes sense. In my implementation, SourceLocation class would be 64 bits, but the actual used bits is no more than 48 bits (the limit is easy to adjust if needed).

In this case, I am fine to drop the optimizations. I don't like these small optimization which is hard to maintain and not significant for users.

I had a similar thought when reading through that part of the code -- it’s becoming non-trivial to maintain. Removing the optimization would simplify things quite a bit. I’ll prepare a patch for that.

The downside is that this may increase the on-disk size of PCM/preamble files. Based on D125403, the size increase could be up to 10%?, but that might be acceptable given the simplification and better long-term maintainability.

ChuanqiXu9 · 2025-06-25T10:52:33Z

Yeah, and for what it worth, maybe you want to take a look at: https://discourse.llvm.org/t/rfc-c-modules-stop-using-abbrev-and-drop-the-maintainance/87063/2

See the discussion in #145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB

See the discussion in llvm/llvm-project#145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB

See the discussion in llvm#145529. This will slightly increase the PCM size (~5%), some data (in-memory preamble size in clangd): - SemaExpr.cpp: 77MB -> 80MB - FindTarget.cpp: 71MB -> 75MB

[Serialization] Fix source location data loss during decoding.

d0ddd1d

The delta encoding can produce values up to 33 bits, but the current decoding logic only preserves the lower 32 bits, potentially causing data loss. This patch fixes the issue by preserving the lower 33 bits for the encode.

hokein requested a review from ChuanqiXu9 June 24, 2025 15:16

llvmbot added clang Clang issues not falling into any other category clang:modules C++20 modules and Clang Header Modules labels Jun 24, 2025

ChuanqiXu9 reviewed Jun 25, 2025

View reviewed changes

hokein mentioned this pull request Jun 25, 2025

[Serialization] Remove delta encoding optimization #145670

Merged

ChuanqiXu9 closed this Jun 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Serialization] Fix source location data loss during decoding. #145529

[Serialization] Fix source location data loss during decoding. #145529

Uh oh!

hokein commented Jun 24, 2025

Uh oh!

llvmbot commented Jun 24, 2025 •

edited

Loading

Uh oh!

ChuanqiXu9 left a comment

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

Uh oh!

[Serialization] Fix source location data loss during decoding. #145529

[Serialization] Fix source location data loss during decoding. #145529

Uh oh!

Conversation

hokein commented Jun 24, 2025

Uh oh!

llvmbot commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ChuanqiXu9 left a comment

Choose a reason for hiding this comment

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

hokein commented Jun 25, 2025

Uh oh!

ChuanqiXu9 commented Jun 25, 2025

Uh oh!

Uh oh!

llvmbot commented Jun 24, 2025 •

edited

Loading