[Clang][Preprocessor] Expand UCNs in macro concatenation #145351

Merged: 1 commit, Jun 24, 2025

Conversation

yronglin
Contributor

@yronglin yronglin commented Jun 23, 2025

Fixes #145240.

A UCN (universal character name) in a preprocessor-pasted identifier is not resolved to its Unicode character, which can cause the following error:

#define CAT(a,b) a##b

char foo\u00b5;
char*p = &CAT(foo, \u00b5); // error: use of undeclared identifier 'foo\u00b5'

The identifier that results from the paste is really fooµ. This PR fixes the issue in TokenLexer::pasteTokens: if any of the pasted tokens contains a UCN, the final pasted token is marked with the Token::HasUCN flag, so that Preprocessor::LookUpIdentifierInfo expands the UCNs in that token.
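
To make the flag propagation concrete, here is a minimal, self-contained sketch of the idea. It is a toy model and not Clang's code: `ToyToken`, `lexPiece`, `expandUCNs`, `lookupKey`, and `pasteAll` are hypothetical names, identifiers are compared as sequences of code points, and the model assumes that the pasted result starts out as a fresh token with no flags set.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexer token: raw spelling plus a HasUCN-style flag.
struct ToyToken {
  std::string Spelling;
  bool HasUCN;
};

// Model of normal lexing: the flag is set when the spelling contains \u or \U.
ToyToken lexPiece(const std::string &Text) {
  bool Has = Text.find("\\u") != std::string::npos ||
             Text.find("\\U") != std::string::npos;
  return ToyToken{Text, Has};
}

// Replace each \uXXXX or \UXXXXXXXX with its code point.
// Assumes well-formed input; no validation is attempted.
std::u32string expandUCNs(const std::string &S) {
  std::u32string Out;
  for (std::size_t I = 0; I < S.size();) {
    if (S[I] == '\\' && I + 1 < S.size() &&
        (S[I + 1] == 'u' || S[I + 1] == 'U')) {
      std::size_t Digits = (S[I + 1] == 'u') ? 4 : 8;
      Out.push_back(static_cast<char32_t>(
          std::stoul(S.substr(I + 2, Digits), nullptr, 16)));
      I += 2 + Digits;
    } else {
      Out.push_back(static_cast<unsigned char>(S[I]));
      ++I;
    }
  }
  return Out;
}

// Lookup key for an identifier: UCNs are expanded only when the flag is set.
std::u32string lookupKey(const ToyToken &T) {
  if (T.HasUCN)
    return expandUCNs(T.Spelling);
  return std::u32string(T.Spelling.begin(), T.Spelling.end());
}

// Paste a ## b ## c ... PropagateFlag = false models the behaviour before the
// fix; PropagateFlag = true models the behaviour after it.
ToyToken pasteAll(const std::vector<ToyToken> &Pieces, bool PropagateFlag) {
  assert(!Pieces.empty());
  ToyToken LHS = Pieces.front();
  bool HasUCNs = false;
  for (std::size_t I = 1; I < Pieces.size(); ++I) {
    const ToyToken &RHS = Pieces[I];
    // Remember whether any operand seen so far carried a UCN.
    HasUCNs = LHS.HasUCN || RHS.HasUCN || HasUCNs;
    // Modeling assumption: the pasted result is a fresh token with no flags.
    LHS = ToyToken{LHS.Spelling + RHS.Spelling, false};
  }
  if (PropagateFlag && HasUCNs)
    LHS.HasUCN = true;
  return LHS;
}
```

In this model, a declaration lexed directly as `foo\u00b5` is stored under the expanded key. Pasting `foo` and `\u00b5` with `PropagateFlag` set to `false` produces the unexpanded key, so the lookup misses; with `true` the keys match.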

@llvmbot llvmbot added the clang and clang:frontend labels Jun 23, 2025
@llvmbot
Member

llvmbot commented Jun 23, 2025

@llvm/pr-subscribers-clang

Author: None (yronglin)

Changes

Fixes #145240.


Full diff: https://github.com/llvm/llvm-project/pull/145351.diff

3 Files Affected:

  • (modified) clang/docs/ReleaseNotes.rst (+1)
  • (modified) clang/lib/Lex/TokenLexer.cpp (+11)
  • (added) clang/test/Preprocessor/macro_paste_identifier_ucn.c (+10)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 96477ef6ddc9a..af107a2d51062 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -720,6 +720,7 @@ Bug Fixes in This Version
 - Fixed incorrect token location when emitting diagnostics for tokens expanded from macros. (#GH143216)
 - Fixed an infinite recursion when checking constexpr destructors. (#GH141789)
 - Fixed a crash when a malformed using declaration appears in a ``constexpr`` function. (#GH144264)
+- Fixed a bug when use unicode character name in macro concatenation. (#GH145240) 
 
 Bug Fixes to Compiler Builtins
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/clang/lib/Lex/TokenLexer.cpp b/clang/lib/Lex/TokenLexer.cpp
index 6e93416e01c0c..72f1ffa7ed06e 100644
--- a/clang/lib/Lex/TokenLexer.cpp
+++ b/clang/lib/Lex/TokenLexer.cpp
@@ -748,6 +748,7 @@ bool TokenLexer::pasteTokens(Token &LHSTok, ArrayRef<Token> TokenStream,
   const char *ResultTokStrPtr = nullptr;
   SourceLocation StartLoc = LHSTok.getLocation();
   SourceLocation PasteOpLoc;
+  bool HasUCNs = false;
 
   auto IsAtEnd = [&TokenStream, &CurIdx] {
     return TokenStream.size() == CurIdx;
@@ -885,6 +886,9 @@ bool TokenLexer::pasteTokens(Token &LHSTok, ArrayRef<Token> TokenStream,
 
     // Finally, replace LHS with the result, consume the RHS, and iterate.
     ++CurIdx;
+
+    // Set Token::HasUCN flag if LHS or RHS contains any UCNs.
+    HasUCNs = LHSTok.hasUCN() || RHS.hasUCN() || HasUCNs;
     LHSTok = Result;
   } while (!IsAtEnd() && TokenStream[CurIdx].is(tok::hashhash));
 
@@ -913,6 +917,13 @@ bool TokenLexer::pasteTokens(Token &LHSTok, ArrayRef<Token> TokenStream,
   // token pasting re-lexes the result token in raw mode, identifier information
   // isn't looked up.  As such, if the result is an identifier, look up id info.
   if (LHSTok.is(tok::raw_identifier)) {
+
+    // If there has any UNCs in concated token, we should mark this token
+    // with Token::HasUCN flag, then LookUpIdentifierInfo will expand UCNs in
+    // token.
+    if (HasUCNs)
+      LHSTok.setFlag(Token::HasUCN);
+
     // Look up the identifier info for the token.  We disabled identifier lookup
     // by saying we're skipping contents, so we need to do this manually.
     PP.LookUpIdentifierInfo(LHSTok);
diff --git a/clang/test/Preprocessor/macro_paste_identifier_ucn.c b/clang/test/Preprocessor/macro_paste_identifier_ucn.c
new file mode 100644
index 0000000000000..c9eb8190edfe8
--- /dev/null
+++ b/clang/test/Preprocessor/macro_paste_identifier_ucn.c
@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 -fms-extensions %s -verify
+// RUN: %clang_cc1 -E -fms-extensions %s | FileCheck %s
+// expected-no-diagnostics
+
+#define CAT(a,b) a##b
+
+char foo\u00b5;
+char*p = &CAT(foo, \u00b5);
+// CHECK: char fooµ;
+// CHECK-NEXT: char*p = &fooµ;
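
The CHECK lines above expect the `-E` output to spell the identifier with the character itself rather than with `\u00b5`. As a rough illustration of what that spelling looks like in bytes, assuming the test file is stored as UTF-8, the sketch below encodes a code point as UTF-8. `encodeUTF8` is a hypothetical helper written for this example and is not a Clang API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8 (hypothetical helper for illustration).
std::string encodeUTF8(char32_t CP) {
  // Reject values outside the Unicode range and surrogate code points.
  assert(CP <= 0x10FFFF && !(CP >= 0xD800 && CP <= 0xDFFF));
  std::string Out;
  if (CP < 0x80) {
    // One byte: 0xxxxxxx
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}
```

For U+00B5 this yields the two bytes 0xC2 0xB5, so `std::string("foo") + encodeUTF8(0xB5)` is the five-byte spelling of `fooµ`.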

@yronglin yronglin requested a review from shafik June 23, 2025 16:54
Collaborator

@shafik shafik left a comment

A summary should not just be a link to an issue; at a minimum, it should briefly explain the problem and the fix. For a simple PR, a reviewer should be able to digest the PR without leaving the review to look for more information.

@yronglin
Contributor Author

Thanks for your review!

A summary should not just be a link to an issue; at a minimum, it should briefly explain the problem and the fix. For a simple PR, a reviewer should be able to digest the PR without leaving the review to look for more information.

Sorry about that, I'll update the summary.

Contributor

@cor3ntin cor3ntin left a comment

LGTM

@yronglin
Contributor Author

Thanks for the review!

@yronglin yronglin merged commit 8b0d112 into llvm:main Jun 24, 2025
11 checks passed
DrSergei pushed a commit to DrSergei/llvm-project that referenced this pull request Jun 24, 2025
Fixes llvm#145240.

A UCN (universal character name) in a preprocessor-pasted identifier is not
resolved to its Unicode character, which can cause the following error:
```c
#define CAT(a,b) a##b

char foo\u00b5;
char*p = &CAT(foo, \u00b5); // error: use of undeclared identifier 'foo\u00b5'
```
The identifier that results from the paste is really `fooµ`. This PR fixes
the issue in `TokenLexer::pasteTokens`: if any of the pasted tokens contains
a UCN, the final pasted token is marked with the `Token::HasUCN` flag, so that
`Preprocessor::LookUpIdentifierInfo` expands the UCNs in that token.

Signed-off-by: yronglin <[email protected]>
Labels
clang:frontend (Language frontend issues, e.g. anything involving "Sema")
clang (Clang issues not falling into any other category)
Development

Successfully merging this pull request may close these issues.

[Clang] UCN in preprocessor-pasted identifier not resolved to unicode
4 participants