-
Notifications
You must be signed in to change notification settings - Fork 14.3k
Skip escaped newlines before checking for whitespace in Lexer::getRawToken. #117548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct. If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at `Loc`, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true. The modified test was printing `\\nF` for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of `F`.
@llvm/pr-subscribers-clang Author: Samira Bazuzi (bazuzi) ChangesThe Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct. If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at The modified test was printing Full diff: https://github.com/llvm/llvm-project/pull/117548.diff 2 Files Affected:
diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e58c8bc72ae5b3..392cce6be0d171 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -527,7 +527,9 @@ bool Lexer::getRawToken(SourceLocation Loc, Token &Result,
const char *StrData = Buffer.data()+LocInfo.second;
- if (!IgnoreWhiteSpace && isWhitespace(StrData[0]))
+ if (!IgnoreWhiteSpace && (isWhitespace(StrData[0]) ||
+ // Treat escaped newlines as whitespace.
+ SkipEscapedNewLines(StrData) != StrData))
return true;
// Create a lexer starting at the beginning of this token.
diff --git a/clang/test/Frontend/highlight-text.c b/clang/test/Frontend/highlight-text.c
index a81d26caa4c24c..eefa4ebeec8ca4 100644
--- a/clang/test/Frontend/highlight-text.c
+++ b/clang/test/Frontend/highlight-text.c
@@ -12,8 +12,7 @@ int a = M;
// CHECK-NEXT: :5:11: note: expanded from macro 'M'
// CHECK-NEXT: 5 | #define M \
// CHECK-NEXT: | ^
-// CHECK-NEXT: :3:14: note: expanded from macro '\
-// CHECK-NEXT: F'
+// CHECK-NEXT: :3:14: note: expanded from here
// CHECK-NEXT: 3 | #define F (1 << 99)
// CHECK-NEXT: | ^ ~~
// CHECK-NEXT: :8:9: warning: shift count >= width of type [-Wshift-count-overflow]
|
line splicing does not introduce spaces (https://compiler-explorer.com/z/ohaq6Wzv7).
That sounds like a more promising approach. Is that something you would be willing to explore? |
Certainly. Some initial testing reveals that getPreviousTokenAndStart is not in conflict with this approach as I had feared, because Lexer::GetBeginningOfToken will include an escaped newline immediately followed by non-whitespace as being part of the token. I see no failing existing tests with just the new change to getRawToken, so I will update this PR to take that approach. |
This looks good. Do you have any tests where this change is observable? |
Added a test for getRawToken where the EXPECT_TRUE failed before this change. |
Anything else I should add before merging this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
(I don't think we need a changelog for this one)
The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.
If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at
Loc
, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.The modified test was printing
\\nF
for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location ofF
.Edit: No longer need to modify the test and we are solving the issue of whitespace being erroneously ignored by the Lexer instance by checking for whitespace immediately following any escaped newline. getPreviousTokenAndStart was more robust than expected, because Lexer::GetBeginningOfToken returns the location of any immediately preceding escaped newlines.