[clang-format] Support of TableGen identifiers beginning with a number. #78571

hnakamura5 · 2024-01-18T12:57:30Z

TableGen allows the identifiers beginning with a number.
This patch add the support of the recognition of such identifiers.

llvmbot · 2024-01-18T12:58:05Z

@llvm/pr-subscribers-clang-format

Author: Hirofumi Nakamura (hnakamura5)

Changes

TableGen allows the identifiers beginning with a number.
This patch add the support of the recognition of such identifiers.

Full diff: https://github.com/llvm/llvm-project/pull/78571.diff

3 Files Affected:

(modified) clang/lib/Format/FormatTokenLexer.cpp (+43-1)
(modified) clang/lib/Format/FormatTokenLexer.h (+4)
(modified) clang/unittests/Format/TokenAnnotatorTest.cpp (+21)

diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index 25ac9be57c81a9..f1982533f112c7 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -93,8 +93,10 @@ ArrayRef<FormatToken *> FormatTokenLexer::lex() {
       // string literals are correctly identified.
       handleCSharpVerbatimAndInterpolatedStrings();
     }
-    if (Style.isTableGen())
+    if (Style.isTableGen()) {
       handleTableGenMultilineString();
+      handleTableGenNumericLikeIdentifier();
+    }
     if (Tokens.back()->NewlinesBefore > 0 || Tokens.back()->IsMultiline)
       FirstInLineIndex = Tokens.size() - 1;
   } while (Tokens.back()->isNot(tok::eof));
@@ -804,6 +806,46 @@ void FormatTokenLexer::handleTableGenMultilineString() {
       FirstLineText, MultiLineString->OriginalColumn, Style.TabWidth, Encoding);
 }
 
+void FormatTokenLexer::handleTableGenNumericLikeIdentifier() {
+  FormatToken *Tok = Tokens.back();
+  // TableGen identifiers can begin with digits. Such tokens are lexed as
+  // numeric_constant now.
+  if (Tok->isNot(tok::numeric_constant))
+    return;
+  StringRef Text = Tok->TokenText;
+  // Identifiers cannot begin with + or -.
+  if (Text.size() < 1 || Text[0] == '+' || Text[0] == '-')
+    return;
+  // The following check is based on llvm::TGLexer::LexToken.
+  if (isdigit(Text[0])) {
+    size_t I = 0;
+    char NextChar = (char)0;
+    // Identifiers in TalbleGen may begin with digits. Skip to first non-digit.
+    do {
+      NextChar = Text[I++];
+    } while (I < Text.size() && isdigit(NextChar));
+    // All the characters are digits.
+    if (I >= Text.size())
+      return;
+    // Base character. But it does not check the first 0 and that the base is
+    // the second character.
+    if (NextChar == 'x' || NextChar == 'b') {
+      char NextNextChar = Text[I];
+      // This is regarded as binary number.
+      if (isxdigit(NextNextChar)) {
+        if (NextChar == 'b' && (NextNextChar == '0' || NextNextChar == '1'))
+          return;
+        // Regarded as hex number or decimal number.
+        if (NextChar == 'x' || isdigit(NextNextChar))
+          return;
+      }
+    }
+  }
+  // Otherwise, this is actually a identifier.
+  Tok->Tok.setKind(tok::identifier);
+  Tok->Tok.setIdentifierInfo(nullptr);
+}
+
 void FormatTokenLexer::handleTemplateStrings() {
   FormatToken *BacktickToken = Tokens.back();
 
diff --git a/clang/lib/Format/FormatTokenLexer.h b/clang/lib/Format/FormatTokenLexer.h
index 1dec6bbc41514c..65dd733bd53352 100644
--- a/clang/lib/Format/FormatTokenLexer.h
+++ b/clang/lib/Format/FormatTokenLexer.h
@@ -97,6 +97,10 @@ class FormatTokenLexer {
 
   // Handles TableGen multiline strings. It has the form [{ ... }].
   void handleTableGenMultilineString();
+  // Handles TableGen numeric like identifiers.
+  // They have a forms of [0-9]*[_a-zA-Z]([_a-zA-Z0-9]*). But limited to the
+  // case it is not lexed as an integer.
+  void handleTableGenNumericLikeIdentifier();
 
   void tryParsePythonComment();
 
diff --git a/clang/unittests/Format/TokenAnnotatorTest.cpp b/clang/unittests/Format/TokenAnnotatorTest.cpp
index 117d8fe8f7dc12..753e749befa57e 100644
--- a/clang/unittests/Format/TokenAnnotatorTest.cpp
+++ b/clang/unittests/Format/TokenAnnotatorTest.cpp
@@ -2209,6 +2209,27 @@ TEST_F(TokenAnnotatorTest, UnderstandTableGenTokens) {
   EXPECT_EQ(Tokens[0]->ColumnWidth, sizeof("[{ It can break\n") - 1);
   EXPECT_TRUE(Tokens[0]->IsMultiline);
   EXPECT_EQ(Tokens[0]->LastLineColumnWidth, sizeof("   the string. }]") - 1);
+
+  // Identifier tokens. In TableGen, identifiers can begin with a number.
+  // In ambiguous cases, the lexer tries to lex it as a number.
+  // Even if the try fails, it does not fall back to identifier lexing and
+  // regard as an error.
+  // The ambiguity is not documented. The result of those tests are based on the
+  // implementation of llvm::TGLexer::LexToken.
+  Tokens = Annotate("1234");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  Tokens = Annotate("0x1abC");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  // This is invalid syntax of number, but not an identifier.
+  Tokens = Annotate("0x1234x");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  Tokens = Annotate("identifier");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+  // Identifier beginning with a number.
+  Tokens = Annotate("2dVector");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+  Tokens = Annotate("01234Vector");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
 }
 
 TEST_F(TokenAnnotatorTest, UnderstandConstructors) {

hnakamura5 · 2024-01-18T13:14:35Z

I checked simply the corner cases in unittest of this patch with the following sample.

// test_id.td
class 01234Vector<int i> {
  int 2dVector = 0x1abc;
  int invalid_num = 0x1234x;
  int 0x1234x = i;
}

def Def: 01234Vector<1>;

The followings are the result of

llvm-tblgen .\test_id.td

(1) With the whole sample code above.

.\test_id.td:4:27: error: expected ';' after declaration
  int invalid_num = 0x1234x;
                           ^ ← This caret points to the last x.

We can see 0x1234x is not lexed as a valid token.

(2) When the line of invalid_num is commented out from the sample.

.\test_id.td:5:7: error: Expected identifier in declaration
  int 0x1234x = i;
      ^ ← This caret points to the first 0.

0x1234x is NOT an identifier.

(3) Additionally, the line of int 0x1234x = i is commented out.

.\test_id.td:2:23: warning: unused template argument: 01234Vector:i
class 01234Vector<int i> {
                      ^
------------- Classes -----------------
class 01234Vector<int 01234Vector:i = ?> {
  int 2dVector = 6844;
}
------------- Defs -----------------
def Def {       // 01234Vector
  int 2dVector = 6844;
}

This is the result after the process completes.

So the following remained code has valid syntax.

class 01234Vector<int i> {
  int 2dVector = 0x1abc;
}

def Def: 01234Vector<1>;

rymiel · 2024-01-18T13:31:09Z

clang/lib/Format/FormatTokenLexer.cpp

+    // Base character. But it does not check the first 0 and that the base is
+    // the second character.


Is the "it does not check" a behaviour of this implementation or also of the tablegen compiler implementation?
I noticed below in the tests you mention the lexer errors on some cases, is this logic meant to match that behaviour?

Yes for the both question. This is about TableGen compiler's lexer.
As you wonder, this comment may be not precise enough. Later I will fix it.

For example,
0x1234x is regarded as integer because the lexer assumes it is a integer at the point it have got 0x1 part. This is an syntax error example written in the unittest.
I want to note here by this comment is,
1x1234x is also regarded as integer (and syntax error). This behavior comes from the lexer does not check the character before 'x' is 0 or other number.
(FYI, 1y1234x is a valid identifier. Such a ambiguity is only when the first non-digit character is 'x' or 'b'. )

clang/lib/Format/FormatTokenLexer.cpp

HazardyKnusperkeks · 2024-01-18T20:26:14Z

clang/lib/Format/FormatTokenLexer.cpp

+    if (NextChar == 'x' || NextChar == 'b') {
+      char NextNextChar = Text[I];
+      // This is regarded as binary number.
+      if (isxdigit(NextNextChar)) {


When do you need the x? It seems to me that this if can just be dropped.

Clarified the condition.

hnakamura5 · 2024-01-19T15:42:39Z

I changed the algorithm of handleTableGenNumericLikeIdentifier.
Stopped to follow literally to the original implementation in llvm::TGLexer::LexToken, and clarified the conditions.

[clang-format] Support of TableGen identifiers beginning with a number.

b472c08

llvmbot added the clang-format label Jan 18, 2024

rymiel reviewed Jan 18, 2024

View reviewed changes

HazardyKnusperkeks reviewed Jan 18, 2024

View reviewed changes

Changed the algorithm of handleTableGenNumericLikeIdentifier

70e6c7c

HazardyKnusperkeks approved these changes Jan 19, 2024

View reviewed changes

hnakamura5 merged commit fcb6737 into llvm:main Jan 20, 2024

hnakamura5 mentioned this pull request Jan 20, 2024

[clang-format] Support of TableGen formatting. #76059

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[clang-format] Support of TableGen identifiers beginning with a number. #78571

[clang-format] Support of TableGen identifiers beginning with a number. #78571

Uh oh!

hnakamura5 commented Jan 18, 2024

Uh oh!

llvmbot commented Jan 18, 2024

Uh oh!

hnakamura5 commented Jan 18, 2024

Uh oh!

rymiel Jan 18, 2024

Uh oh!

hnakamura5 Jan 18, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HazardyKnusperkeks Jan 18, 2024

Uh oh!

hnakamura5 Jan 19, 2024

Uh oh!

hnakamura5 commented Jan 19, 2024 •

edited

Loading

Uh oh!

Uh oh!

		// Base character. But it does not check the first 0 and that the base is
		// the second character.

[clang-format] Support of TableGen identifiers beginning with a number. #78571

[clang-format] Support of TableGen identifiers beginning with a number. #78571

Uh oh!

Conversation

hnakamura5 commented Jan 18, 2024

Uh oh!

llvmbot commented Jan 18, 2024

Uh oh!

hnakamura5 commented Jan 18, 2024

Uh oh!

rymiel Jan 18, 2024

Choose a reason for hiding this comment

Uh oh!

hnakamura5 Jan 18, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HazardyKnusperkeks Jan 18, 2024

Choose a reason for hiding this comment

Uh oh!

hnakamura5 Jan 19, 2024

Choose a reason for hiding this comment

Uh oh!

hnakamura5 commented Jan 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

hnakamura5 commented Jan 19, 2024 •

edited

Loading