Skip to content

[clang] Bounds checking on unclosed parentheses, brackets or braces in Expanded Tokens #69849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

robozati
Copy link
Contributor

When getting the AST from clangd, if there is an unclosed token, a segfault or an UB can happen when accessing File.SpelledTokens because the supposed closing token is not present, so its index is greater than the vec's size.

This fixes the issue by returning a nullptr on the function that caused the UB and then checking if the spelled token is null, returning nullopt for the whole expanded token.

Fixes clangd/clangd#1559.

@llvmbot llvmbot added the clang Clang issues not falling into any other category label Oct 21, 2023
@robozati
Copy link
Contributor Author

Tests are failing because of the findExpanded function, but it didn’t result in UB. Will fix the test later.

@robozati robozati marked this pull request as draft October 21, 2023 16:13
@robozati robozati force-pushed the bounds-check-on-tokens branch from 474e43f to 239302d Compare October 21, 2023 22:26
@robozati robozati marked this pull request as ready for review October 21, 2023 22:36
@robozati
Copy link
Contributor Author

Rewrote the test and now it passes.

Sorry for pinging you @HighCommander4, but I don't know who to ping and you participated in that issue.

@@ -585,6 +585,17 @@ TEST_F(TokenCollectorTest, DelayedParsing) {
EXPECT_THAT(collectAndDump(Code), StartsWith(ExpectedTokens));
}

TEST_F(TokenCollectorTest, UnclosedToken) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started looking at this to try and make some progress towards getting this landed. However, when running this test locally, I'm not seeing either of the branches added to spelledForExpandedToken() being taken.

Am I missing something, or is this test not exercising the intended codepath?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it looks like TokenBuffer::spelledForExpanded() (which is the only call site of spelledForExpandedToken()) is not called by anything currently in this test.

To exercise spelledForExpanded(), we will need to make an explicit call to it, the way a number of other tests in this file do.

@robozati
Copy link
Contributor Author

@HighCommander4, thank you for reviewing this PR!

Unfortunately, I’m currently in my finals and will only have time to look at this next week. So, I’ll update this later.

@HighCommander4
Copy link
Collaborator

HighCommander4 commented Dec 3, 2023

I spent a bit more time looking at this. I'd say I'm still quite far from having a complete understanding of this code, but two things occur to me:

  1. TokenBuffer operates at the level of the preprocessor. "spelledForExpanded" is basically an operation which takes as input a range of tokens in the preprocessed token stream ("expanded tokens"), and maps them back to a range of tokens in the un-preprocessed token stream ("spelled tokens").
    • The preprocessor has no notion of matching or balancing braces. Braces only become relevant later, during parsing.
    • So, I don't understand why an input token stream that contains unbalanced braces would pose a problem for the "spelledForExpanded" operation.
    • I'm hesitant to approve a patch that adds an early exit to "spelledForExpanded" to prevent an out-of-bounds array access, without understanding in more detail how the out-of-bounds array access arises / how it's connected to the unbalanced braces. It's possible that the out-of-bounds access is indicative of a deeper logic error which we'd be covering up.
  2. Since we've observed the problem arise during the dumpAST operation (frame 10 of the stack trace here), it would be good to (also) add a test that exercises the dumpAST operation as a whole.
    • We have such tests in DumpASTTests.cpp.
    • For an input with unbalanced braces, such as int main() {, I think it would be preferable if the AST nodes still had non-empty source ranges in the AST dump. I suspect the current patch may give some AST nodes empty source ranges.

@HighCommander4
Copy link
Collaborator

HighCommander4 commented Jan 8, 2024

I poked at this a bit more. A few more notes:

  • Since the test case currently in the patch does not trigger the crash (without the fix), and it was not obvious to me what sort of call to spelledForExpanded() to formulate to trigger the crash, I instead formulated a test case in DumpASTTests.cpp that calls dumpAST() on the TranslationUnitDecl of a file containing the code int main() {. This does trigger the crash.
  • The crash occurs when dumping the FunctionDecl node for main. When dumping a node, dumpAST() calls Tokens.spelledForExpanded(Tokens.expandedTokens(SR)) where SR is the node's source range.
  • For this FunctionDecl node, the end location of its source range points to a tok::eof token.
  • The range of expanded tokens returned by Tok.expandedTokens(SR) thus includes as its last element the tok::eof token.
  • The comment above TokenBuffer says "the expanded token stream has a tok::eof token at the end, the spelled tokens never store a 'eof' token".

I haven't dug into the implementation of spelledForExpanded() yet, but from the above it's looking to me like spelledForExpanded() is not expecting an input expanded token range that includes the eof token, but we are passing it one (and such a token range can legitimately arise from the source range of an AST node in cases of invalid code like this).

@HighCommander4
Copy link
Collaborator

HighCommander4 commented Jan 8, 2024

Anyways, in a case like this, the behaviour I'd like to see after the fix would be for spelledForExpanded() to return the spelled tokens covering int main() { (not including the eof token but including everything else), not nullopt.

@ilya-biryukov
Copy link
Contributor

I second everything that @HighCommander4 said. It would be really useful to add an assertion that we never get out-of-bounds here, but this indicates a logic error somewhere in the code that should be fixed.

I am slightly suspicious of source locations pointing at eof in the AST in the first place, even in invalid code. I wonder if we would be better off just having an invalid source location there instead of pointing at eof.

Asserting that tok::eof should not be passed to spelledForExpandedToken and returning nullopt in spelledForExpanded looks like a reasonable workaround until we fix this. (The semantics looks reasonable as we don't really have a spelled token we can map eof back to).

@HighCommander4
Copy link
Collaborator

HighCommander4 commented Jan 8, 2024

I am slightly suspicious of source locations pointing at eof in the AST in the first place, even in invalid code. I wonder if we would be better off just having an invalid source location there instead of pointing at eof.

I think in this case it's deliberate. If you look at the compiler error for the missing brace:

test.cpp:1:13: error: expected '}'
    1 | int main() {
      |             ^
test.cpp:1:12: note: to match this '{'
    1 | int main() {
      |            ^

you can see that the diagnostic location points to one character past the opening brace, which is the eof token (there is no source character there). I'm not sure how else the diagnostic location could end up there.

Asserting that tok::eof should not be passed to spelledForExpandedToken and returning nullopt in spelledForExpanded looks like a reasonable workaround until we fix this. (The semantics looks reasonable as we don't really have a spelled token we can map eof back to).

What do you think about this as an alternative: if spelledForExpanded receives as input an expanded token range [firstExpanded, lastExpanded] where lastExpanded is the eof token, return the spelled tokens corresponding to [firstExpanded, lastExpanded - 1] instead? (In the case where the eof token is the only one in the range, we could return nullopt.)

This would have the advantage of gracefully handling AST nodes like this one whose end location is the eof, and return all the spelled tokens actually making up the node.

@HighCommander4
Copy link
Collaborator

What do you think about this as an alternative: if spelledForExpanded receives as input an expanded token range [firstExpanded, lastExpanded] where lastExpanded is the eof token, return the spelled tokens corresponding to [firstExpanded, lastExpanded - 1] instead? (In the case where the eof token is the only one in the range, we could return nullopt.)

This would have the advantage of gracefully handling AST nodes like this one whose end location is the eof, and return all the spelled tokens actually making up the node.

I wrote this up at #78092

@HighCommander4
Copy link
Collaborator

Superseded by #78092.

Thanks for putting together this patch, @robozati. While we didn't go with the fix approach taken in this patch, it did prod me to look at the issue in more detail 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang Clang issues not falling into any other category clangd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assertions triggered with unclosed parentheses, brackets, or braces with _GLIBCXX_ASSERTIONS
4 participants