[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

robozati · 2023-10-21T15:16:18Z

When getting the AST from clangd, if there is an unclosed token, a segfault or an UB can happen when accessing File.SpelledTokens because the supposed closing token is not present, so its index is greater than the vec's size.

This fixes the issue by returning a nullptr on the function that caused the UB and then checking if the spelled token is null, returning nullopt for the whole expanded token.

Fixes clangd/clangd#1559.

robozati · 2023-10-21T16:13:39Z

Tests are failing because of the findExpanded function, but it didn’t result in UB. Will fix the test later.

robozati · 2023-10-21T22:46:21Z

Rewrote the test and now it passes.

Sorry for pinging you @HighCommander4, but I don't know who to ping and you participated in that issue.

HighCommander4 · 2023-11-27T02:38:10Z

clang/unittests/Tooling/Syntax/TokensTest.cpp

@@ -585,6 +585,17 @@ TEST_F(TokenCollectorTest, DelayedParsing) {
  EXPECT_THAT(collectAndDump(Code), StartsWith(ExpectedTokens));
 }

+TEST_F(TokenCollectorTest, UnclosedToken) {


I started looking at this to try and make some progress towards getting this landed. However, when running this test locally, I'm not seeing either of the branches added to spelledForExpandedToken() being taken.

Am I missing something, or is this test not exercising the intended codepath?

Indeed, it looks like TokenBuffer::spelledForExpanded() (which is the only call site of spelledForExpandedToken()) is not called by anything currently in this test.

To exercise spelledForExpanded(), we will need to make an explicit call to it, the way a number of other tests in this file do.

robozati · 2023-11-28T13:48:20Z

@HighCommander4, thank you for reviewing this PR!

Unfortunately, I’m currently in my finals and will only have time to look at this next week. So, I’ll update this later.

HighCommander4 · 2023-12-03T08:34:52Z

I spent a bit more time looking at this. I'd say I'm still quite far from having a complete understanding of this code, but two things occur to me:

TokenBuffer operates at the level of the preprocessor. "spelledForExpanded" is basically an operation which takes as input a range of tokens in the preprocessed token stream ("expanded tokens"), and maps them back to a range of tokens in the un-preprocessed token stream ("spelled tokens").
- The preprocessor has no notion of matching or balancing braces. Braces only become relevant later, during parsing.
- So, I don't understand why an input token stream that contains unbalanced braces would pose a problem for the "spelledForExpanded" operation.
- I'm hesitant to approve a patch that adds an early exit to "spelledForExpanded" to prevent an out-of-bounds array access, without understanding in more detail how the out-of-bounds array access arises / how it's connected to the unbalanced braces. It's possible that the out-of-bounds access is indicative of a deeper logic error which we'd be covering up.
Since we've observed the problem arise during the dumpAST operation (frame 10 of the stack trace here), it would be good to (also) add a test that exercises the dumpAST operation as a whole.
- We have such tests in DumpASTTests.cpp.
- For an input with unbalanced braces, such as int main() {, I think it would be preferable if the AST nodes still had non-empty source ranges in the AST dump. I suspect the current patch may give some AST nodes empty source ranges.

HighCommander4 · 2024-01-08T08:50:18Z

I poked at this a bit more. A few more notes:

Since the test case currently in the patch does not trigger the crash (without the fix), and it was not obvious to me what sort of call to spelledForExpanded() to formulate to trigger the crash, I instead formulated a test case in DumpASTTests.cpp that calls dumpAST() on the TranslationUnitDecl of a file containing the code int main() {. This does trigger the crash.
The crash occurs when dumping the FunctionDecl node for main. When dumping a node, dumpAST() calls Tokens.spelledForExpanded(Tokens.expandedTokens(SR)) where SR is the node's source range.
For this FunctionDecl node, the end location of its source range points to a tok::eof token.
- This comes from the end location of the function's CompoundStmt body, which is what the parser assigns in Parser::ParseCompoundStatementBody() in this situation.
The range of expanded tokens returned by Tok.expandedTokens(SR) thus includes as its last element the tok::eof token.
The comment above TokenBuffer says "the expanded token stream has a tok::eof token at the end, the spelled tokens never store a 'eof' token".

I haven't dug into the implementation of spelledForExpanded() yet, but from the above it's looking to me like spelledForExpanded() is not expecting an input expanded token range that includes the eof token, but we are passing it one (and such a token range can legitimately arise from the source range of an AST node in cases of invalid code like this).

HighCommander4 · 2024-01-08T08:55:07Z

Anyways, in a case like this, the behaviour I'd like to see after the fix would be for spelledForExpanded() to return the spelled tokens covering int main() { (not including the eof token but including everything else), not nullopt.

ilya-biryukov · 2024-01-08T10:44:28Z

I second everything that @HighCommander4 said. It would be really useful to add an assertion that we never get out-of-bounds here, but this indicates a logic error somewhere in the code that should be fixed.

I am slightly suspicious of source locations pointing at eof in the AST in the first place, even in invalid code. I wonder if we would be better off just having an invalid source location there instead of pointing at eof.

Asserting that tok::eof should not be passed to spelledForExpandedToken and returning nullopt in spelledForExpanded looks like a reasonable workaround until we fix this. (The semantics looks reasonable as we don't really have a spelled token we can map eof back to).

HighCommander4 · 2024-01-08T19:44:30Z

I am slightly suspicious of source locations pointing at eof in the AST in the first place, even in invalid code. I wonder if we would be better off just having an invalid source location there instead of pointing at eof.

I think in this case it's deliberate. If you look at the compiler error for the missing brace:

test.cpp:1:13: error: expected '}'
    1 | int main() {
      |             ^
test.cpp:1:12: note: to match this '{'
    1 | int main() {
      |            ^

you can see that the diagnostic location points to one character past the opening brace, which is the eof token (there is no source character there). I'm not sure how else the diagnostic location could end up there.

Asserting that tok::eof should not be passed to spelledForExpandedToken and returning nullopt in spelledForExpanded looks like a reasonable workaround until we fix this. (The semantics looks reasonable as we don't really have a spelled token we can map eof back to).

What do you think about this as an alternative: if spelledForExpanded receives as input an expanded token range [firstExpanded, lastExpanded] where lastExpanded is the eof token, return the spelled tokens corresponding to [firstExpanded, lastExpanded - 1] instead? (In the case where the eof token is the only one in the range, we could return nullopt.)

This would have the advantage of gracefully handling AST nodes like this one whose end location is the eof, and return all the spelled tokens actually making up the node.

HighCommander4 · 2024-01-14T08:51:05Z

What do you think about this as an alternative: if spelledForExpanded receives as input an expanded token range [firstExpanded, lastExpanded] where lastExpanded is the eof token, return the spelled tokens corresponding to [firstExpanded, lastExpanded - 1] instead? (In the case where the eof token is the only one in the range, we could return nullopt.)

This would have the advantage of gracefully handling AST nodes like this one whose end location is the eof, and return all the spelled tokens actually making up the node.

I wrote this up at #78092

HighCommander4 · 2024-01-18T07:11:33Z

Superseded by #78092.

Thanks for putting together this patch, @robozati. While we didn't go with the fix approach taken in this patch, it did prod me to look at the issue in more detail 😆

llvmbot added the clang Clang issues not falling into any other category label Oct 21, 2023

robozati marked this pull request as draft October 21, 2023 16:13

robozati force-pushed the bounds-check-on-tokens branch from 474e43f to 239302d Compare October 21, 2023 22:26

Fix UB on unclosed parentheses, brackets or braces in Tokens

239302d

robozati marked this pull request as ready for review October 21, 2023 22:36

HighCommander4 added the clangd label Oct 21, 2023

HighCommander4 requested review from sam-mccall, ilya-biryukov, hlopko and kadircet October 21, 2023 23:19

HighCommander4 reviewed Nov 27, 2023

View reviewed changes

HighCommander4 mentioned this pull request Jan 14, 2024

[clangd] Handle an expanded token range that ends in the eof token in TokenBuffer::spelledForExpanded() #78092

Merged

HighCommander4 closed this Jan 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

Uh oh!

robozati commented Oct 21, 2023

Uh oh!

robozati commented Oct 21, 2023

Uh oh!

robozati commented Oct 21, 2023

Uh oh!

HighCommander4 Nov 27, 2023

Uh oh!

HighCommander4 Nov 27, 2023

Uh oh!

robozati commented Nov 28, 2023

Uh oh!

HighCommander4 commented Dec 3, 2023 •

edited

Loading

Uh oh!

HighCommander4 commented Jan 8, 2024 •

edited

Loading

Uh oh!

HighCommander4 commented Jan 8, 2024 •

edited

Loading

Uh oh!

ilya-biryukov commented Jan 8, 2024

Uh oh!

HighCommander4 commented Jan 8, 2024 •

edited

Loading

Uh oh!

HighCommander4 commented Jan 14, 2024

Uh oh!

HighCommander4 commented Jan 18, 2024

Uh oh!

Uh oh!

[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

Uh oh!

Conversation

robozati commented Oct 21, 2023

Uh oh!

robozati commented Oct 21, 2023

Uh oh!

robozati commented Oct 21, 2023

Uh oh!

HighCommander4 Nov 27, 2023

Choose a reason for hiding this comment

Uh oh!

HighCommander4 Nov 27, 2023

Choose a reason for hiding this comment

Uh oh!

robozati commented Nov 28, 2023

Uh oh!

HighCommander4 commented Dec 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HighCommander4 commented Jan 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HighCommander4 commented Jan 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilya-biryukov commented Jan 8, 2024

Uh oh!

HighCommander4 commented Jan 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HighCommander4 commented Jan 14, 2024

Uh oh!

HighCommander4 commented Jan 18, 2024

Uh oh!

Uh oh!

HighCommander4 commented Dec 3, 2023 •

edited

Loading

HighCommander4 commented Jan 8, 2024 •

edited

Loading

HighCommander4 commented Jan 8, 2024 •

edited

Loading

HighCommander4 commented Jan 8, 2024 •

edited

Loading