[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

yronglin · 2025-06-12T13:56:07Z

This PR introduce a new function peekNextPPToken. It's an extension of isNextPPTokenLParen and can makes look ahead one token in preprocessor without side-effects.

It's also the 1st part of #107168 and it was used to look ahead next token then determine whether current lexing pp directive is one of pp-import or pp-module directive.

At the start of phase 4 an import or module token is treated as starting a directive and are converted to their respective keywords iff:

After skipping horizontal whitespace are
- at the start of a logical line, or
- preceded by an export at the start of the logical line.
Are followed by an identifier pp token (before macro expansion), or
- <, ", or : (but not ::) pp tokens for import, or
- ; for module
  Otherwise the token is treated as an identifier.

Signed-off-by: yronglin <[email protected]>

llvmbot · 2025-06-12T13:56:53Z

@llvm/pr-subscribers-clang

Author: None (yronglin)

Changes

This PR introduce a new function peekNextPPToken. It's an extension of isNextPPTokenLParen and can makes look ahead one token in preprocessor without side-effects.

It's also the 1st part of #107168 and it was used to look ahead next token then determine whether current lexing pp directive is one of pp-import or pp-module directive.

At the start of phase 4 an import or module token is treated as starting a directive and are converted to their respective keywords iff:

After skipping horizontal whitespace are
- at the start of a logical line, or
- preceded by an export at the start of the logical line.
Are followed by an identifier pp token (before macro expansion), or
- <, ", or : (but not ::) pp tokens for import, or
- ; for module
  Otherwise the token is treated as an identifier.

Full diff: https://github.com/llvm/llvm-project/pull/143898.diff

6 Files Affected:

(modified) clang/include/clang/Lex/Lexer.h (+5-5)
(modified) clang/include/clang/Lex/Preprocessor.h (+7-1)
(modified) clang/include/clang/Lex/TokenLexer.h (+3-4)
(modified) clang/lib/Lex/Lexer.cpp (+11-10)
(modified) clang/lib/Lex/PPMacroExpansion.cpp (+15-17)
(modified) clang/lib/Lex/TokenLexer.cpp (+5-5)

diff --git a/clang/include/clang/Lex/Lexer.h b/clang/include/clang/Lex/Lexer.h
index bb65ae010cffa..a595cda1eaa77 100644
--- a/clang/include/clang/Lex/Lexer.h
+++ b/clang/include/clang/Lex/Lexer.h
@@ -124,7 +124,7 @@ class Lexer : public PreprocessorLexer {
   //===--------------------------------------------------------------------===//
   // Context that changes as the file is lexed.
   // NOTE: any state that mutates when in raw mode must have save/restore code
-  // in Lexer::isNextPPTokenLParen.
+  // in Lexer::peekNextPPToken.
 
   // BufferPtr - Current pointer into the buffer.  This is the next character
   // to be lexed.
@@ -642,10 +642,10 @@ class Lexer : public PreprocessorLexer {
     BufferPtr = TokEnd;
   }
 
-  /// isNextPPTokenLParen - Return 1 if the next unexpanded token will return a
-  /// tok::l_paren token, 0 if it is something else and 2 if there are no more
-  /// tokens in the buffer controlled by this lexer.
-  unsigned isNextPPTokenLParen();
+  /// peekNextPPToken - Return std::nullopt if there are no more tokens in the
+  /// buffer controlled by this lexer, otherwise return the next unexpanded
+  /// token.
+  std::optional<Token> peekNextPPToken();
 
   //===--------------------------------------------------------------------===//
   // Lexer character reading interfaces.
diff --git a/clang/include/clang/Lex/Preprocessor.h b/clang/include/clang/Lex/Preprocessor.h
index 78be2bd64d61c..4fe7a79393afc 100644
--- a/clang/include/clang/Lex/Preprocessor.h
+++ b/clang/include/clang/Lex/Preprocessor.h
@@ -2288,7 +2288,9 @@ class Preprocessor {
   /// Determine whether the next preprocessor token to be
   /// lexed is a '('.  If so, consume the token and return true, if not, this
   /// method should have no observable side-effect on the lexed tokens.
-  bool isNextPPTokenLParen();
+  bool isNextPPTokenLParen() {
+    return peekNextPPToken().value_or(Token{}).is(tok::l_paren);
+  }
 
 private:
   /// Identifiers used for SEH handling in Borland. These are only
@@ -2667,6 +2669,10 @@ class Preprocessor {
 
   void removeCachedMacroExpandedTokensOfLastLexer();
 
+  /// Peek the next token. If so, return the token, if not, this
+  /// method should have no observable side-effect on the lexed tokens.
+  std::optional<Token> peekNextPPToken();
+
   /// After reading "MACRO(", this method is invoked to read all of the formal
   /// arguments specified for the macro invocation.  Returns null on error.
   MacroArgs *ReadMacroCallArgumentList(Token &MacroName, MacroInfo *MI,
diff --git a/clang/include/clang/Lex/TokenLexer.h b/clang/include/clang/Lex/TokenLexer.h
index 4d229ae610674..777b4e6266c71 100644
--- a/clang/include/clang/Lex/TokenLexer.h
+++ b/clang/include/clang/Lex/TokenLexer.h
@@ -139,10 +139,9 @@ class TokenLexer {
   void Init(const Token *TokArray, unsigned NumToks, bool DisableMacroExpansion,
             bool OwnsTokens, bool IsReinject);
 
-  /// If the next token lexed will pop this macro off the
-  /// expansion stack, return 2.  If the next unexpanded token is a '(', return
-  /// 1, otherwise return 0.
-  unsigned isNextTokenLParen() const;
+  /// If the next token lexed will pop this macro off the expansion stack,
+  /// return std::nullopt, otherwise return the next unexpanded token.
+  std::optional<Token> peekNextPPToken() const;
 
   /// Lex and return a token from this macro stream.
   bool Lex(Token &Tok);
diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index 93200458f04b4..8e977eac8c983 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -3200,18 +3200,19 @@ bool Lexer::LexEndOfFile(Token &Result, const char *CurPtr) {
   return PP->HandleEndOfFile(Result, isPragmaLexer());
 }
 
-/// isNextPPTokenLParen - Return 1 if the next unexpanded token lexed from
-/// the specified lexer will return a tok::l_paren token, 0 if it is something
-/// else and 2 if there are no more tokens in the buffer controlled by the
-/// lexer.
-unsigned Lexer::isNextPPTokenLParen() {
+/// peekNextPPToken - Return std::nullopt if there are no more tokens in the
+/// buffer controlled by this lexer, otherwise return the next unexpanded
+/// token.
+std::optional<Token> Lexer::peekNextPPToken() {
   assert(!LexingRawMode && "How can we expand a macro from a skipping buffer?");
 
   if (isDependencyDirectivesLexer()) {
     if (NextDepDirectiveTokenIndex == DepDirectives.front().Tokens.size())
-      return 2;
-    return DepDirectives.front().Tokens[NextDepDirectiveTokenIndex].is(
-        tok::l_paren);
+      return std::nullopt;
+    Token Result;
+    (void)convertDependencyDirectiveToken(
+        DepDirectives.front().Tokens[NextDepDirectiveTokenIndex], Result);
+    return Result;
   }
 
   // Switch to 'skipping' mode.  This will ensure that we can lex a token
@@ -3240,8 +3241,8 @@ unsigned Lexer::isNextPPTokenLParen() {
   LexingRawMode = false;
 
   if (Tok.is(tok::eof))
-    return 2;
-  return Tok.is(tok::l_paren);
+    return std::nullopt;
+  return Tok;
 }
 
 /// Find the end of a version control conflict marker.
diff --git a/clang/lib/Lex/PPMacroExpansion.cpp b/clang/lib/Lex/PPMacroExpansion.cpp
index 37ac1bf07e9c0..585990f60c98a 100644
--- a/clang/lib/Lex/PPMacroExpansion.cpp
+++ b/clang/lib/Lex/PPMacroExpansion.cpp
@@ -418,42 +418,40 @@ static bool isTrivialSingleTokenExpansion(const MacroInfo *MI,
   return !llvm::is_contained(MI->params(), II);
 }
 
-/// isNextPPTokenLParen - Determine whether the next preprocessor token to be
-/// lexed is a '('.  If so, consume the token and return true, if not, this
-/// method should have no observable side-effect on the lexed tokens.
-bool Preprocessor::isNextPPTokenLParen() {
+/// isNextPPTokenLParen - Peek the next token. If so, return the token, if not,
+/// this method should have no observable side-effect on the lexed tokens.
+std::optional<Token> Preprocessor::peekNextPPToken() {
   // Do some quick tests for rejection cases.
-  unsigned Val;
+  std::optional<Token> Val;
   if (CurLexer)
-    Val = CurLexer->isNextPPTokenLParen();
+    Val = CurLexer->peekNextPPToken();
   else
-    Val = CurTokenLexer->isNextTokenLParen();
+    Val = CurTokenLexer->peekNextPPToken();
 
-  if (Val == 2) {
+  if (!Val) {
     // We have run off the end.  If it's a source file we don't
     // examine enclosing ones (C99 5.1.1.2p4).  Otherwise walk up the
     // macro stack.
     if (CurPPLexer)
-      return false;
+      return std::nullopt;
     for (const IncludeStackInfo &Entry : llvm::reverse(IncludeMacroStack)) {
       if (Entry.TheLexer)
-        Val = Entry.TheLexer->isNextPPTokenLParen();
+        Val = Entry.TheLexer->peekNextPPToken();
       else
-        Val = Entry.TheTokenLexer->isNextTokenLParen();
+        Val = Entry.TheTokenLexer->peekNextPPToken();
 
-      if (Val != 2)
+      if (Val)
         break;
 
       // Ran off the end of a source file?
       if (Entry.ThePPLexer)
-        return false;
+        return std::nullopt;
     }
   }
 
-  // Okay, if we know that the token is a '(', lex it and return.  Otherwise we
-  // have found something that isn't a '(' or we found the end of the
-  // translation unit.  In either case, return false.
-  return Val == 1;
+  // Okay, we found the token and return.  Otherwise we found the end of the
+  // translation unit.
+  return Val;
 }
 
 /// HandleMacroExpandedIdentifier - If an identifier token is read that is to be
diff --git a/clang/lib/Lex/TokenLexer.cpp b/clang/lib/Lex/TokenLexer.cpp
index 6e93416e01c0c..fbb8c4262d6da 100644
--- a/clang/lib/Lex/TokenLexer.cpp
+++ b/clang/lib/Lex/TokenLexer.cpp
@@ -921,13 +921,13 @@ bool TokenLexer::pasteTokens(Token &LHSTok, ArrayRef<Token> TokenStream,
 }
 
 /// isNextTokenLParen - If the next token lexed will pop this macro off the
-/// expansion stack, return 2.  If the next unexpanded token is a '(', return
-/// 1, otherwise return 0.
-unsigned TokenLexer::isNextTokenLParen() const {
+/// expansion stack, return std::nullopt, otherwise return the next unexpanded
+/// token.
+std::optional<Token> TokenLexer::peekNextPPToken() const {
   // Out of tokens?
   if (isAtEnd())
-    return 2;
-  return Tokens[CurTokenIdx].is(tok::l_paren);
+    return std::nullopt;
+  return Tokens[CurTokenIdx];
 }
 
 /// isParsingPreprocessorDirective - Return true if we are in the middle of a

cor3ntin

I'd like @AaronBallman and @Bigcheese to look at this but it looks reasonable

cor3ntin · 2025-06-12T15:43:01Z

clang/include/clang/Lex/Preprocessor.h

@@ -2288,7 +2288,9 @@ class Preprocessor {
  /// Determine whether the next preprocessor token to be
  /// lexed is a '('.  If so, consume the token and return true, if not, this
  /// method should have no observable side-effect on the lexed tokens.
-  bool isNextPPTokenLParen();
+  bool isNextPPTokenLParen() {
+    return peekNextPPToken().value_or(Token{}).is(tok::l_paren);


I don't like that we are creating a new token everytime - I'm not sure that it matters though

The lexer is almost always on the hot path but Tokens are small... I think it may be worth verifying that this has no performance difference from an approach without value_or.

Can we use an if-stmt to avoid create a dummy Token{}. WDYT? (I'm interested in this performance overhead, give me some time to test it).

That's what I was thinking of as an alternative, but Erich's suggestion also seems like a reasonable idea worth exploring. It's a bummer to go back to foo == tok::blah || foo == tok::yada instead of using IsOneOf(), but if that gets too frustrating, we could add a helper function like isTokenKindOneOf() that takes a pack of token kinds.

An additional alternative, is to just have

template<TokenKind TK1, TokenKind...TKs> isNextPPTokenOneOf();

Then just ask the question we REALLY care about.

So uses of THIS function become:
isNextPPTokenOneOf<tok::l_paren>()

Cool! This looks good to me, let me give a try.

erichkeane

I find myself wondering based on the patch: DO we actually care about the token itself, or just the kind? Could we instead have peekNextPPToken instead be peekNextPPTokenKind and just return the next token kind?

That way we don't need the optional, and can just do a direct comparison. We could also just return the eof as well rather than treating that as a nullopt.

WDYT?

yronglin · 2025-06-12T16:26:57Z

Thanks a lot for the review!

I think it may be worth verifying that this has no performance difference from an approach without value_or.

Can we use an if-stmt to avoid create a dummy Token{}. WDYT? (I'm interested in this performance overhead, give me some time to test it).

DO we actually care about the token itself, or just the kind?
Could we instead have peekNextPPToken instead be peekNextPPTokenKind and just return the next token kind?

Yes, currently we just care about token kind. The reason for using Token instead of TokenKind is that when I implement P1857R3, I need utils like Tok.is, Tok.isOneOf in Token to avoid code like Kind == tok::AAA || Kind == tok::BBB. But I have no preference for this. Below is an example from P1857R3 implementation draft:

  auto NextTok = peekNextPPToken().value_or(Token{});
  CurPPLexer->ParsingPreprocessorDirective = SavedParsingPreprocessorDirective;
  if (Result.getIdentifierInfo()->isModulesImport() &&
      NextTok.isOneOf(tok::raw_identifier, tok::less, tok::string_literal,
                      tok::colon)) {
    Result.setKind(tok::kw_import);
    ModuleImportLoc = Result.getLocation();
    IsAtImport = false;
    return true;
  }
  if (Result.getIdentifierInfo()->isModulesDeclaration() &&
      NextTok.isOneOf(tok::raw_identifier, tok::colon, tok::semi)) {
    Result.setKind(tok::kw_module);
    ModuleDeclLoc = Result.getLocation();
    return true;
  }

  // Ok, it's an identifier.
  return false;

That way we don't need the optional, and can just do a direct comparison. We could also just return the eof as well rather than treating that as a nullopt.

The original isNextPPTokenLParen return 1 if the next unexpanded token will return a tok::l_paren token, 0 if it is something else and 2 if there are no more tokens in the buffer controlled by this lexer. so the return value std::nullopt and tok::eof have two different meanings. This function was implemented a long time ago (early last year), I remember that I had ever try to avoid std::optional, but hit some test failures. I'd like to try again to remove std::optional.

Signed-off-by: yronglin <[email protected]>

yronglin · 2025-06-14T06:33:26Z

Do we also need to update peekNextToken in Lexer/TokenLexer?

so the return value std::nullopt and tok::eof have two different meanings.

We depent std::nullopt to walk up in macro stack in isNextTokenOneOf.

yronglin · 2025-06-19T14:09:03Z

friendly ping～

AaronBallman

LGTM!

yronglin · 2025-06-23T16:55:38Z

Many thanks for your review!

[Clang] Add peekNextPPToken, makes peek next token without side-effects

373daed

Signed-off-by: yronglin <[email protected]>

yronglin requested review from Bigcheese, mizvekov, cor3ntin, AaronBallman, shafik, erichkeane and ChuanqiXu9 June 12, 2025 13:56

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Jun 12, 2025

yronglin changed the title ~~[clang][Preprocessor] Add peekNextPPToken, makes peek next token without side-effects~~ [clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects Jun 12, 2025

yronglin self-assigned this Jun 12, 2025

cor3ntin reviewed Jun 12, 2025

View reviewed changes

erichkeane reviewed Jun 12, 2025

View reviewed changes

yronglin mentioned this pull request Jun 12, 2025

[C++20][Modules] Implement P1857R3 Modules Dependency Discovery #107168

Open

3 tasks

yronglin added 2 commits June 14, 2025 14:29

Update peekNextToken to isNextPPTokenOneOf

c963c3a

Signed-off-by: yronglin <[email protected]>

Merge branch 'main' into peek_next_token

a1f62cf

Merge branch 'main' into peek_next_token

9bdb305

yronglin mentioned this pull request Jun 22, 2025

[clang][Preprocessor] Handle the first pp-token in EnterMainSourceFile #145244

Open

AaronBallman approved these changes Jun 23, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

Uh oh!

yronglin commented Jun 12, 2025

Uh oh!

llvmbot commented Jun 12, 2025

Uh oh!

cor3ntin left a comment

Uh oh!

cor3ntin Jun 12, 2025

Uh oh!

AaronBallman Jun 12, 2025

Uh oh!

yronglin Jun 12, 2025

Uh oh!

AaronBallman Jun 12, 2025

Uh oh!

erichkeane Jun 12, 2025

Uh oh!

yronglin Jun 12, 2025

Uh oh!

erichkeane left a comment

Uh oh!

yronglin commented Jun 12, 2025 •

edited

Loading

Uh oh!

yronglin commented Jun 14, 2025

Uh oh!

yronglin commented Jun 19, 2025

Uh oh!

AaronBallman left a comment

Uh oh!

yronglin commented Jun 23, 2025

Uh oh!

Uh oh!

[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

Are you sure you want to change the base?

[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

Uh oh!

Conversation

yronglin commented Jun 12, 2025

Uh oh!

llvmbot commented Jun 12, 2025

Uh oh!

cor3ntin left a comment

Choose a reason for hiding this comment

Uh oh!

cor3ntin Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

AaronBallman Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

yronglin Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

AaronBallman Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

erichkeane Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

yronglin Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

erichkeane left a comment

Choose a reason for hiding this comment

Uh oh!

yronglin commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yronglin commented Jun 14, 2025

Uh oh!

yronglin commented Jun 19, 2025

Uh oh!

AaronBallman left a comment

Choose a reason for hiding this comment

Uh oh!

yronglin commented Jun 23, 2025

Uh oh!

Uh oh!

yronglin commented Jun 12, 2025 •

edited

Loading