Skip escaped newlines before checking for whitespace in Lexer::getRawToken. #117548

bazuzi · 2024-11-25T12:54:44Z

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at `Loc`, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of `F`.

Edit: The test no longer needs to be modified. We now solve the issue of whitespace being erroneously ignored by the Lexer instance by checking for whitespace immediately following any escaped newline. getPreviousTokenAndStart was more robust than expected, because Lexer::GetBeginningOfToken returns the location of any immediately preceding escaped newlines.
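To make the difference between the original check and the revised one concrete, here is a minimal standalone sketch. It does not use the real clang API: `isWs`, `skipEscapedNewlines`, `startsAtWhitespaceOld`, and `startsAtWhitespaceNew` are hypothetical stand-ins for isWhitespace, SkipEscapedNewLines, and the condition inside getRawToken, and the skipping logic is simplified (no trigraphs, no whitespace between the backslash and the newline).

```cpp
#include <cassert>

// Hypothetical stand-in for clang's isWhitespace (simplified).
bool isWs(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' ||
         c == '\v';
}

// Hypothetical, simplified stand-in for SkipEscapedNewLines: returns a
// pointer just past any run of backslash-newline sequences starting at p.
// Assumes p points into a NUL-terminated buffer.
const char *skipEscapedNewlines(const char *p) {
  assert(p != nullptr);
  while (p[0] == '\\' && (p[1] == '\n' || p[1] == '\r')) {
    const char *q = p + 2;
    // Treat "\r\n" or "\n\r" as a single line ending.
    if ((*q == '\n' || *q == '\r') && *q != p[1])
      ++q;
    p = q;
  }
  return p;
}

// Original condition: only the character at the location is inspected.
bool startsAtWhitespaceOld(const char *p) { return isWs(p[0]); }

// Revised condition: skip escaped newlines first, then inspect the character.
bool startsAtWhitespaceNew(const char *p) {
  return isWs(skipEscapedNewlines(p)[0]);
}
```

Under these assumptions, a location pointing at a backslash-newline followed by a space is reported as whitespace only by the revised condition, while a backslash-newline followed directly by `F` is not reported as whitespace by either.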

llvmbot · 2024-11-25T12:55:23Z

@llvm/pr-subscribers-clang

Author: Samira Bazuzi (bazuzi)

Changes

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing \\nF for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of F.

Full diff: https://github.com/llvm/llvm-project/pull/117548.diff

2 Files Affected:

(modified) clang/lib/Lex/Lexer.cpp (+3-1)
(modified) clang/test/Frontend/highlight-text.c (+1-2)

diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e58c8bc72ae5b3..392cce6be0d171 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -527,7 +527,9 @@ bool Lexer::getRawToken(SourceLocation Loc, Token &Result,
 
   const char *StrData = Buffer.data()+LocInfo.second;
 
-  if (!IgnoreWhiteSpace && isWhitespace(StrData[0]))
+  if (!IgnoreWhiteSpace && (isWhitespace(StrData[0]) ||
+                            // Treat escaped newlines as whitespace.
+                            SkipEscapedNewLines(StrData) != StrData))
     return true;
 
   // Create a lexer starting at the beginning of this token.
diff --git a/clang/test/Frontend/highlight-text.c b/clang/test/Frontend/highlight-text.c
index a81d26caa4c24c..eefa4ebeec8ca4 100644
--- a/clang/test/Frontend/highlight-text.c
+++ b/clang/test/Frontend/highlight-text.c
@@ -12,8 +12,7 @@ int a = M;
 // CHECK-NEXT: :5:11: note: expanded from macro 'M'
 // CHECK-NEXT:     5 | #define M \
 // CHECK-NEXT:       |           ^
-// CHECK-NEXT: :3:14: note: expanded from macro '\
-// CHECK-NEXT: F'
+// CHECK-NEXT: :3:14: note: expanded from here
 // CHECK-NEXT:     3 | #define F (1 << 99)
 // CHECK-NEXT:       |              ^  ~~
 // CHECK-NEXT: :8:9: warning: shift count >= width of type [-Wshift-count-overflow]

cor3ntin · 2024-11-26T08:50:19Z

Line splicing does not introduce spaces (https://compiler-explorer.com/z/ohaq6Wzv7).
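The line-splicing point can also be checked locally with a small sketch like the one below. The names `spliced`, `kJoined`, and `splicingAddsNoSpace` are made up for illustration; the continuation lines must start in column 0, since any leading whitespace there would become part of the source text.

```cpp
#include <cassert>
#include <cstring>

// The backslash-newline inside the identifier is removed before
// tokenization, so this declares one variable named `spliced`.
int spli\
ced = 42;

// The same applies inside a string literal: no space is inserted.
const char *kJoined = "ab\
cd";

// Returns true when both spliced constructs were joined without a space.
bool splicingAddsNoSpace() {
  return spliced == 42 && std::strcmp(kJoined, "abcd") == 0;
}
```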

> If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc

That sounds like a more promising approach. Is that something you would be willing to explore?

bazuzi · 2024-11-26T14:54:24Z

> > If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at Loc
>
> That sounds like a more promising approach. Is that something you would be willing to explore?

Certainly. Some initial testing reveals that getPreviousTokenAndStart is not in conflict with this approach as I had feared, because Lexer::GetBeginningOfToken will include an escaped newline immediately followed by non-whitespace as being part of the token.

I see no failing existing tests with just the new change to getRawToken, so I will update this PR to take that approach.

cor3ntin · 2024-11-27T05:42:01Z

This looks good. Do you have any tests where this change is observable?

bazuzi · 2024-11-27T15:09:56Z

Added a test for getRawToken where the EXPECT_TRUE failed before this change.

bazuzi · 2024-12-04T20:19:52Z

Anything else I should add before merging this?

cor3ntin

LGTM, thanks
(I don't think we need a changelog for this one)

llvmbot added the clang and clang:frontend labels Nov 25, 2024

bazuzi requested review from ilya-biryukov, kadircet, cor3ntin, alexfh and AaronBallman November 25, 2024 12:55

Switch to checking for whitespace after escaped newlines.

7dec0bb

bazuzi changed the title ~~Treat escaped newlines as whitespace in Lexer::getRawToken.~~ Skip escaped newlines before checking for whitespace in Lexer::getRawToken. Nov 26, 2024

Undo accidental formatting of entire file.

7437706

Add test for getRawToken.

da1df1c

ilya-biryukov removed request for ilya-biryukov and kadircet November 28, 2024 14:25

cor3ntin approved these changes Dec 5, 2024

bazuzi merged commit f7e8be7 into llvm:main Dec 5, 2024
8 checks passed

bazuzi deleted the piper_export_cl_699185710 branch December 5, 2024 14:38
