Skip to content

Skip escaped newlines before checking for whitespace in Lexer::getRawToken. #117548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 5, 2024

Conversation

bazuzi
Copy link
Contributor

@bazuzi bazuzi commented Nov 25, 2024

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing \\nF for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of F.

Edit: No longer need to modify the test and we are solving the issue of whitespace being erroneously ignored by the Lexer instance by checking for whitespace immediately following any escaped newline. getPreviousTokenAndStart was more robust than expected, because Lexer::GetBeginningOfToken returns the location of any immediately preceding escaped newlines.

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at `Loc`, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of `F`.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Nov 25, 2024
@llvmbot
Copy link
Member

llvmbot commented Nov 25, 2024

@llvm/pr-subscribers-clang

Author: Samira Bazuzi (bazuzi)

Changes

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing \\nF for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of F.


Full diff: https://github.com/llvm/llvm-project/pull/117548.diff

2 Files Affected:

  • (modified) clang/lib/Lex/Lexer.cpp (+3-1)
  • (modified) clang/test/Frontend/highlight-text.c (+1-2)
diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e58c8bc72ae5b3..392cce6be0d171 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -527,7 +527,9 @@ bool Lexer::getRawToken(SourceLocation Loc, Token &Result,
 
   const char *StrData = Buffer.data()+LocInfo.second;
 
-  if (!IgnoreWhiteSpace && isWhitespace(StrData[0]))
+  if (!IgnoreWhiteSpace && (isWhitespace(StrData[0]) ||
+                            // Treat escaped newlines as whitespace.
+                            SkipEscapedNewLines(StrData) != StrData))
     return true;
 
   // Create a lexer starting at the beginning of this token.
diff --git a/clang/test/Frontend/highlight-text.c b/clang/test/Frontend/highlight-text.c
index a81d26caa4c24c..eefa4ebeec8ca4 100644
--- a/clang/test/Frontend/highlight-text.c
+++ b/clang/test/Frontend/highlight-text.c
@@ -12,8 +12,7 @@ int a = M;
 // CHECK-NEXT: :5:11: note: expanded from macro 'M'
 // CHECK-NEXT:     5 | #define M \
 // CHECK-NEXT:       |           ^
-// CHECK-NEXT: :3:14: note: expanded from macro '\
-// CHECK-NEXT: F'
+// CHECK-NEXT: :3:14: note: expanded from here
 // CHECK-NEXT:     3 | #define F (1 << 99)
 // CHECK-NEXT:       |              ^  ~~
 // CHECK-NEXT: :8:9: warning: shift count >= width of type [-Wshift-count-overflow]

@cor3ntin
Copy link
Contributor

line splicing does not introduce spaces (https://compiler-explorer.com/z/ohaq6Wzv7).

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc

That sounds like a more promising approach. Is that something you would be willing to explore?

@bazuzi
Copy link
Contributor Author

bazuzi commented Nov 26, 2024

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc

That sounds like a more promising approach. Is that something you would be willing to explore?

Certainly. Some initial testing reveals that getPreviousTokenAndStart is not in conflict with this approach as I had feared, because Lexer::GetBeginningOfToken will include an escaped newline immediately followed by non-whitespace as being part of the token.

I see no failing existing tests with just the new change to getRawToken, so I will update this PR to take that approach.

@bazuzi bazuzi changed the title Treat escaped newlines as whitespace in Lexer::getRawToken. Skip escaped newlines before checking for whitespace in Lexer::getRawToken. Nov 26, 2024
@cor3ntin
Copy link
Contributor

This looks good. Do you have any tests where this change is observable?

@bazuzi
Copy link
Contributor Author

bazuzi commented Nov 27, 2024

Added a test for getRawToken where the EXPECT_TRUE failed before this change.

@bazuzi
Copy link
Contributor Author

bazuzi commented Dec 4, 2024

Anything else I should add before merging this?

Copy link
Contributor

@cor3ntin cor3ntin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks
(I don't think we need a changelog for this one)

@bazuzi bazuzi merged commit f7e8be7 into llvm:main Dec 5, 2024
8 checks passed
@bazuzi bazuzi deleted the piper_export_cl_699185710 branch December 5, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants