Skip to content

[clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects #143898

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

yronglin
Copy link
Contributor

This PR introduce a new function peekNextPPToken. It's an extension of isNextPPTokenLParen and can makes look ahead one token in preprocessor without side-effects.

It's also the 1st part of #107168 and it was used to look ahead next token then determine whether current lexing pp directive is one of pp-import or pp-module directive.

At the start of phase 4 an import or module token is treated as starting a directive and are converted to their respective keywords iff:

  • After skipping horizontal whitespace are
    • at the start of a logical line, or
    • preceded by an export at the start of the logical line.
  • Are followed by an identifier pp token (before macro expansion), or
    • <, ", or : (but not ::) pp tokens for import, or
    • ; for module
      Otherwise the token is treated as an identifier.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Jun 12, 2025
@yronglin yronglin changed the title [clang][Preprocessor] Add peekNextPPToken, makes peek next token without side-effects [clang][Preprocessor] Add peekNextPPToken, makes look ahead next token without side-effects Jun 12, 2025
@llvmbot
Copy link
Member

llvmbot commented Jun 12, 2025

@llvm/pr-subscribers-clang

Author: None (yronglin)

Changes

This PR introduce a new function peekNextPPToken. It's an extension of isNextPPTokenLParen and can makes look ahead one token in preprocessor without side-effects.

It's also the 1st part of #107168 and it was used to look ahead next token then determine whether current lexing pp directive is one of pp-import or pp-module directive.

At the start of phase 4 an import or module token is treated as starting a directive and are converted to their respective keywords iff:

  • After skipping horizontal whitespace are
    • at the start of a logical line, or
    • preceded by an export at the start of the logical line.
  • Are followed by an identifier pp token (before macro expansion), or
    • <, ", or : (but not ::) pp tokens for import, or
    • ; for module
      Otherwise the token is treated as an identifier.

Full diff: https://github.com/llvm/llvm-project/pull/143898.diff

6 Files Affected:

  • (modified) clang/include/clang/Lex/Lexer.h (+5-5)
  • (modified) clang/include/clang/Lex/Preprocessor.h (+7-1)
  • (modified) clang/include/clang/Lex/TokenLexer.h (+3-4)
  • (modified) clang/lib/Lex/Lexer.cpp (+11-10)
  • (modified) clang/lib/Lex/PPMacroExpansion.cpp (+15-17)
  • (modified) clang/lib/Lex/TokenLexer.cpp (+5-5)
diff --git a/clang/include/clang/Lex/Lexer.h b/clang/include/clang/Lex/Lexer.h
index bb65ae010cffa..a595cda1eaa77 100644
--- a/clang/include/clang/Lex/Lexer.h
+++ b/clang/include/clang/Lex/Lexer.h
@@ -124,7 +124,7 @@ class Lexer : public PreprocessorLexer {
   //===--------------------------------------------------------------------===//
   // Context that changes as the file is lexed.
   // NOTE: any state that mutates when in raw mode must have save/restore code
-  // in Lexer::isNextPPTokenLParen.
+  // in Lexer::peekNextPPToken.
 
   // BufferPtr - Current pointer into the buffer.  This is the next character
   // to be lexed.
@@ -642,10 +642,10 @@ class Lexer : public PreprocessorLexer {
     BufferPtr = TokEnd;
   }
 
-  /// isNextPPTokenLParen - Return 1 if the next unexpanded token will return a
-  /// tok::l_paren token, 0 if it is something else and 2 if there are no more
-  /// tokens in the buffer controlled by this lexer.
-  unsigned isNextPPTokenLParen();
+  /// peekNextPPToken - Return std::nullopt if there are no more tokens in the
+  /// buffer controlled by this lexer, otherwise return the next unexpanded
+  /// token.
+  std::optional<Token> peekNextPPToken();
 
   //===--------------------------------------------------------------------===//
   // Lexer character reading interfaces.
diff --git a/clang/include/clang/Lex/Preprocessor.h b/clang/include/clang/Lex/Preprocessor.h
index 78be2bd64d61c..4fe7a79393afc 100644
--- a/clang/include/clang/Lex/Preprocessor.h
+++ b/clang/include/clang/Lex/Preprocessor.h
@@ -2288,7 +2288,9 @@ class Preprocessor {
   /// Determine whether the next preprocessor token to be
   /// lexed is a '('.  If so, consume the token and return true, if not, this
   /// method should have no observable side-effect on the lexed tokens.
-  bool isNextPPTokenLParen();
+  bool isNextPPTokenLParen() {
+    return peekNextPPToken().value_or(Token{}).is(tok::l_paren);
+  }
 
 private:
   /// Identifiers used for SEH handling in Borland. These are only
@@ -2667,6 +2669,10 @@ class Preprocessor {
 
   void removeCachedMacroExpandedTokensOfLastLexer();
 
+  /// Peek the next token. If so, return the token, if not, this
+  /// method should have no observable side-effect on the lexed tokens.
+  std::optional<Token> peekNextPPToken();
+
   /// After reading "MACRO(", this method is invoked to read all of the formal
   /// arguments specified for the macro invocation.  Returns null on error.
   MacroArgs *ReadMacroCallArgumentList(Token &MacroName, MacroInfo *MI,
diff --git a/clang/include/clang/Lex/TokenLexer.h b/clang/include/clang/Lex/TokenLexer.h
index 4d229ae610674..777b4e6266c71 100644
--- a/clang/include/clang/Lex/TokenLexer.h
+++ b/clang/include/clang/Lex/TokenLexer.h
@@ -139,10 +139,9 @@ class TokenLexer {
   void Init(const Token *TokArray, unsigned NumToks, bool DisableMacroExpansion,
             bool OwnsTokens, bool IsReinject);
 
-  /// If the next token lexed will pop this macro off the
-  /// expansion stack, return 2.  If the next unexpanded token is a '(', return
-  /// 1, otherwise return 0.
-  unsigned isNextTokenLParen() const;
+  /// If the next token lexed will pop this macro off the expansion stack,
+  /// return std::nullopt, otherwise return the next unexpanded token.
+  std::optional<Token> peekNextPPToken() const;
 
   /// Lex and return a token from this macro stream.
   bool Lex(Token &Tok);
diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index 93200458f04b4..8e977eac8c983 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -3200,18 +3200,19 @@ bool Lexer::LexEndOfFile(Token &Result, const char *CurPtr) {
   return PP->HandleEndOfFile(Result, isPragmaLexer());
 }
 
-/// isNextPPTokenLParen - Return 1 if the next unexpanded token lexed from
-/// the specified lexer will return a tok::l_paren token, 0 if it is something
-/// else and 2 if there are no more tokens in the buffer controlled by the
-/// lexer.
-unsigned Lexer::isNextPPTokenLParen() {
+/// peekNextPPToken - Return std::nullopt if there are no more tokens in the
+/// buffer controlled by this lexer, otherwise return the next unexpanded
+/// token.
+std::optional<Token> Lexer::peekNextPPToken() {
   assert(!LexingRawMode && "How can we expand a macro from a skipping buffer?");
 
   if (isDependencyDirectivesLexer()) {
     if (NextDepDirectiveTokenIndex == DepDirectives.front().Tokens.size())
-      return 2;
-    return DepDirectives.front().Tokens[NextDepDirectiveTokenIndex].is(
-        tok::l_paren);
+      return std::nullopt;
+    Token Result;
+    (void)convertDependencyDirectiveToken(
+        DepDirectives.front().Tokens[NextDepDirectiveTokenIndex], Result);
+    return Result;
   }
 
   // Switch to 'skipping' mode.  This will ensure that we can lex a token
@@ -3240,8 +3241,8 @@ unsigned Lexer::isNextPPTokenLParen() {
   LexingRawMode = false;
 
   if (Tok.is(tok::eof))
-    return 2;
-  return Tok.is(tok::l_paren);
+    return std::nullopt;
+  return Tok;
 }
 
 /// Find the end of a version control conflict marker.
diff --git a/clang/lib/Lex/PPMacroExpansion.cpp b/clang/lib/Lex/PPMacroExpansion.cpp
index 37ac1bf07e9c0..585990f60c98a 100644
--- a/clang/lib/Lex/PPMacroExpansion.cpp
+++ b/clang/lib/Lex/PPMacroExpansion.cpp
@@ -418,42 +418,40 @@ static bool isTrivialSingleTokenExpansion(const MacroInfo *MI,
   return !llvm::is_contained(MI->params(), II);
 }
 
-/// isNextPPTokenLParen - Determine whether the next preprocessor token to be
-/// lexed is a '('.  If so, consume the token and return true, if not, this
-/// method should have no observable side-effect on the lexed tokens.
-bool Preprocessor::isNextPPTokenLParen() {
+/// isNextPPTokenLParen - Peek the next token. If so, return the token, if not,
+/// this method should have no observable side-effect on the lexed tokens.
+std::optional<Token> Preprocessor::peekNextPPToken() {
   // Do some quick tests for rejection cases.
-  unsigned Val;
+  std::optional<Token> Val;
   if (CurLexer)
-    Val = CurLexer->isNextPPTokenLParen();
+    Val = CurLexer->peekNextPPToken();
   else
-    Val = CurTokenLexer->isNextTokenLParen();
+    Val = CurTokenLexer->peekNextPPToken();
 
-  if (Val == 2) {
+  if (!Val) {
     // We have run off the end.  If it's a source file we don't
     // examine enclosing ones (C99 5.1.1.2p4).  Otherwise walk up the
     // macro stack.
     if (CurPPLexer)
-      return false;
+      return std::nullopt;
     for (const IncludeStackInfo &Entry : llvm::reverse(IncludeMacroStack)) {
       if (Entry.TheLexer)
-        Val = Entry.TheLexer->isNextPPTokenLParen();
+        Val = Entry.TheLexer->peekNextPPToken();
       else
-        Val = Entry.TheTokenLexer->isNextTokenLParen();
+        Val = Entry.TheTokenLexer->peekNextPPToken();
 
-      if (Val != 2)
+      if (Val)
         break;
 
       // Ran off the end of a source file?
       if (Entry.ThePPLexer)
-        return false;
+        return std::nullopt;
     }
   }
 
-  // Okay, if we know that the token is a '(', lex it and return.  Otherwise we
-  // have found something that isn't a '(' or we found the end of the
-  // translation unit.  In either case, return false.
-  return Val == 1;
+  // Okay, we found the token and return.  Otherwise we found the end of the
+  // translation unit.
+  return Val;
 }
 
 /// HandleMacroExpandedIdentifier - If an identifier token is read that is to be
diff --git a/clang/lib/Lex/TokenLexer.cpp b/clang/lib/Lex/TokenLexer.cpp
index 6e93416e01c0c..fbb8c4262d6da 100644
--- a/clang/lib/Lex/TokenLexer.cpp
+++ b/clang/lib/Lex/TokenLexer.cpp
@@ -921,13 +921,13 @@ bool TokenLexer::pasteTokens(Token &LHSTok, ArrayRef<Token> TokenStream,
 }
 
 /// isNextTokenLParen - If the next token lexed will pop this macro off the
-/// expansion stack, return 2.  If the next unexpanded token is a '(', return
-/// 1, otherwise return 0.
-unsigned TokenLexer::isNextTokenLParen() const {
+/// expansion stack, return std::nullopt, otherwise return the next unexpanded
+/// token.
+std::optional<Token> TokenLexer::peekNextPPToken() const {
   // Out of tokens?
   if (isAtEnd())
-    return 2;
-  return Tokens[CurTokenIdx].is(tok::l_paren);
+    return std::nullopt;
+  return Tokens[CurTokenIdx];
 }
 
 /// isParsingPreprocessorDirective - Return true if we are in the middle of a

@yronglin yronglin self-assigned this Jun 12, 2025
Copy link
Contributor

@cor3ntin cor3ntin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like @AaronBallman and @Bigcheese to look at this but it looks reasonable

@@ -2288,7 +2288,9 @@ class Preprocessor {
/// Determine whether the next preprocessor token to be
/// lexed is a '('. If so, consume the token and return true, if not, this
/// method should have no observable side-effect on the lexed tokens.
bool isNextPPTokenLParen();
bool isNextPPTokenLParen() {
return peekNextPPToken().value_or(Token{}).is(tok::l_paren);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we are creating a new token everytime - I'm not sure that it matters though

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lexer is almost always on the hot path but Tokens are small... I think it may be worth verifying that this has no performance difference from an approach without value_or.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use an if-stmt to avoid create a dummy Token{}. WDYT? (I'm interested in this performance overhead, give me some time to test it).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I was thinking of as an alternative, but Erich's suggestion also seems like a reasonable idea worth exploring. It's a bummer to go back to foo == tok::blah || foo == tok::yada instead of using IsOneOf(), but if that gets too frustrating, we could add a helper function like isTokenKindOneOf() that takes a pack of token kinds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An additional alternative, is to just have

template<TokenKind TK1, TokenKind...TKs>
isNextPPTokenOneOf();

Then just ask the question we REALLY care about.

So uses of THIS function become:
isNextPPTokenOneOf<tok::l_paren>()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! This looks good to me, let me give a try.

Copy link
Collaborator

@erichkeane erichkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find myself wondering based on the patch: DO we actually care about the token itself, or just the kind? Could we instead have peekNextPPToken instead be peekNextPPTokenKind and just return the next token kind?

That way we don't need the optional, and can just do a direct comparison. We could also just return the eof as well rather than treating that as a nullopt.

WDYT?

@yronglin
Copy link
Contributor Author

yronglin commented Jun 12, 2025

Thanks a lot for the review!

I think it may be worth verifying that this has no performance difference from an approach without value_or.

Can we use an if-stmt to avoid create a dummy Token{}. WDYT? (I'm interested in this performance overhead, give me some time to test it).

DO we actually care about the token itself, or just the kind?
Could we instead have peekNextPPToken instead be peekNextPPTokenKind and just return the next token kind?

Yes, currently we just care about token kind. The reason for using Token instead of TokenKind is that when I implement P1857R3, I need utils like Tok.is, Tok.isOneOf in Token to avoid code like Kind == tok::AAA || Kind == tok::BBB. But I have no preference for this. Below is an example from P1857R3 implementation draft:

  auto NextTok = peekNextPPToken().value_or(Token{});
  CurPPLexer->ParsingPreprocessorDirective = SavedParsingPreprocessorDirective;
  if (Result.getIdentifierInfo()->isModulesImport() &&
      NextTok.isOneOf(tok::raw_identifier, tok::less, tok::string_literal,
                      tok::colon)) {
    Result.setKind(tok::kw_import);
    ModuleImportLoc = Result.getLocation();
    IsAtImport = false;
    return true;
  }
  if (Result.getIdentifierInfo()->isModulesDeclaration() &&
      NextTok.isOneOf(tok::raw_identifier, tok::colon, tok::semi)) {
    Result.setKind(tok::kw_module);
    ModuleDeclLoc = Result.getLocation();
    return true;
  }

  // Ok, it's an identifier.
  return false;

That way we don't need the optional, and can just do a direct comparison. We could also just return the eof as well rather than treating that as a nullopt.

The original isNextPPTokenLParen return 1 if the next unexpanded token will return a tok::l_paren token, 0 if it is something else and 2 if there are no more tokens in the buffer controlled by this lexer. so the return value std::nullopt and tok::eof have two different meanings. This function was implemented a long time ago (early last year), I remember that I had ever try to avoid std::optional, but hit some test failures. I'd like to try again to remove std::optional.

@yronglin
Copy link
Contributor Author

Do we also need to update peekNextToken in Lexer/TokenLexer?

so the return value std::nullopt and tok::eof have two different meanings.

We depent std::nullopt to walk up in macro stack in isNextTokenOneOf.

@yronglin
Copy link
Contributor Author

friendly ping~

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yronglin
Copy link
Contributor Author

Many thanks for your review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants