Skip to content

[clang-format] Support of TableGen identifiers beginning with a number. #78571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

hnakamura5
Copy link
Contributor

TableGen allows the identifiers beginning with a number.
This patch add the support of the recognition of such identifiers.

@llvmbot
Copy link
Member

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-clang-format

Author: Hirofumi Nakamura (hnakamura5)

Changes

TableGen allows the identifiers beginning with a number.
This patch add the support of the recognition of such identifiers.


Full diff: https://github.com/llvm/llvm-project/pull/78571.diff

3 Files Affected:

  • (modified) clang/lib/Format/FormatTokenLexer.cpp (+43-1)
  • (modified) clang/lib/Format/FormatTokenLexer.h (+4)
  • (modified) clang/unittests/Format/TokenAnnotatorTest.cpp (+21)
diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index 25ac9be57c81a9..f1982533f112c7 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -93,8 +93,10 @@ ArrayRef<FormatToken *> FormatTokenLexer::lex() {
       // string literals are correctly identified.
       handleCSharpVerbatimAndInterpolatedStrings();
     }
-    if (Style.isTableGen())
+    if (Style.isTableGen()) {
       handleTableGenMultilineString();
+      handleTableGenNumericLikeIdentifier();
+    }
     if (Tokens.back()->NewlinesBefore > 0 || Tokens.back()->IsMultiline)
       FirstInLineIndex = Tokens.size() - 1;
   } while (Tokens.back()->isNot(tok::eof));
@@ -804,6 +806,46 @@ void FormatTokenLexer::handleTableGenMultilineString() {
       FirstLineText, MultiLineString->OriginalColumn, Style.TabWidth, Encoding);
 }
 
+void FormatTokenLexer::handleTableGenNumericLikeIdentifier() {
+  FormatToken *Tok = Tokens.back();
+  // TableGen identifiers can begin with digits. Such tokens are lexed as
+  // numeric_constant now.
+  if (Tok->isNot(tok::numeric_constant))
+    return;
+  StringRef Text = Tok->TokenText;
+  // Identifiers cannot begin with + or -.
+  if (Text.size() < 1 || Text[0] == '+' || Text[0] == '-')
+    return;
+  // The following check is based on llvm::TGLexer::LexToken.
+  if (isdigit(Text[0])) {
+    size_t I = 0;
+    char NextChar = (char)0;
+    // Identifiers in TalbleGen may begin with digits. Skip to first non-digit.
+    do {
+      NextChar = Text[I++];
+    } while (I < Text.size() && isdigit(NextChar));
+    // All the characters are digits.
+    if (I >= Text.size())
+      return;
+    // Base character. But it does not check the first 0 and that the base is
+    // the second character.
+    if (NextChar == 'x' || NextChar == 'b') {
+      char NextNextChar = Text[I];
+      // This is regarded as binary number.
+      if (isxdigit(NextNextChar)) {
+        if (NextChar == 'b' && (NextNextChar == '0' || NextNextChar == '1'))
+          return;
+        // Regarded as hex number or decimal number.
+        if (NextChar == 'x' || isdigit(NextNextChar))
+          return;
+      }
+    }
+  }
+  // Otherwise, this is actually a identifier.
+  Tok->Tok.setKind(tok::identifier);
+  Tok->Tok.setIdentifierInfo(nullptr);
+}
+
 void FormatTokenLexer::handleTemplateStrings() {
   FormatToken *BacktickToken = Tokens.back();
 
diff --git a/clang/lib/Format/FormatTokenLexer.h b/clang/lib/Format/FormatTokenLexer.h
index 1dec6bbc41514c..65dd733bd53352 100644
--- a/clang/lib/Format/FormatTokenLexer.h
+++ b/clang/lib/Format/FormatTokenLexer.h
@@ -97,6 +97,10 @@ class FormatTokenLexer {
 
   // Handles TableGen multiline strings. It has the form [{ ... }].
   void handleTableGenMultilineString();
+  // Handles TableGen numeric like identifiers.
+  // They have a forms of [0-9]*[_a-zA-Z]([_a-zA-Z0-9]*). But limited to the
+  // case it is not lexed as an integer.
+  void handleTableGenNumericLikeIdentifier();
 
   void tryParsePythonComment();
 
diff --git a/clang/unittests/Format/TokenAnnotatorTest.cpp b/clang/unittests/Format/TokenAnnotatorTest.cpp
index 117d8fe8f7dc12..753e749befa57e 100644
--- a/clang/unittests/Format/TokenAnnotatorTest.cpp
+++ b/clang/unittests/Format/TokenAnnotatorTest.cpp
@@ -2209,6 +2209,27 @@ TEST_F(TokenAnnotatorTest, UnderstandTableGenTokens) {
   EXPECT_EQ(Tokens[0]->ColumnWidth, sizeof("[{ It can break\n") - 1);
   EXPECT_TRUE(Tokens[0]->IsMultiline);
   EXPECT_EQ(Tokens[0]->LastLineColumnWidth, sizeof("   the string. }]") - 1);
+
+  // Identifier tokens. In TableGen, identifiers can begin with a number.
+  // In ambiguous cases, the lexer tries to lex it as a number.
+  // Even if the try fails, it does not fall back to identifier lexing and
+  // regard as an error.
+  // The ambiguity is not documented. The result of those tests are based on the
+  // implementation of llvm::TGLexer::LexToken.
+  Tokens = Annotate("1234");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  Tokens = Annotate("0x1abC");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  // This is invalid syntax of number, but not an identifier.
+  Tokens = Annotate("0x1234x");
+  EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+  Tokens = Annotate("identifier");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+  // Identifier beginning with a number.
+  Tokens = Annotate("2dVector");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+  Tokens = Annotate("01234Vector");
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
 }
 
 TEST_F(TokenAnnotatorTest, UnderstandConstructors) {

@hnakamura5
Copy link
Contributor Author

I checked simply the corner cases in unittest of this patch with the following sample.

// test_id.td
class 01234Vector<int i> {
  int 2dVector = 0x1abc;
  int invalid_num = 0x1234x;
  int 0x1234x = i;
}

def Def: 01234Vector<1>;

The followings are the result of

llvm-tblgen .\test_id.td

(1) With the whole sample code above.

.\test_id.td:4:27: error: expected ';' after declaration
  int invalid_num = 0x1234x;
                           ^ ← This caret points to the last x.

We can see 0x1234x is not lexed as a valid token.

(2) When the line of invalid_num is commented out from the sample.

.\test_id.td:5:7: error: Expected identifier in declaration
  int 0x1234x = i;
      ^ ← This caret points to the first 0.

0x1234x is NOT an identifier.

(3) Additionally, the line of int 0x1234x = i is commented out.

.\test_id.td:2:23: warning: unused template argument: 01234Vector:i
class 01234Vector<int i> {
                      ^
------------- Classes -----------------
class 01234Vector<int 01234Vector:i = ?> {
  int 2dVector = 6844;
}
------------- Defs -----------------
def Def {       // 01234Vector
  int 2dVector = 6844;
}

This is the result after the process completes.

So the following remained code has valid syntax.

class 01234Vector<int i> {
  int 2dVector = 0x1abc;
}

def Def: 01234Vector<1>;

Comment on lines 830 to 831
// Base character. But it does not check the first 0 and that the base is
// the second character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "it does not check" a behaviour of this implementation or also of the tablegen compiler implementation?
I noticed below in the tests you mention the lexer errors on some cases, is this logic meant to match that behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for the both question. This is about TableGen compiler's lexer.
As you wonder, this comment may be not precise enough. Later I will fix it.

For example,
0x1234x is regarded as integer because the lexer assumes it is a integer at the point it have got 0x1 part. This is an syntax error example written in the unittest.
I want to note here by this comment is,
1x1234x is also regarded as integer (and syntax error). This behavior comes from the lexer does not check the character before 'x' is 0 or other number.
(FYI, 1y1234x is a valid identifier. Such a ambiguity is only when the first non-digit character is 'x' or 'b'. )

if (NextChar == 'x' || NextChar == 'b') {
char NextNextChar = Text[I];
// This is regarded as binary number.
if (isxdigit(NextNextChar)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do you need the x? It seems to me that this if can just be dropped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the condition.

@hnakamura5
Copy link
Contributor Author

hnakamura5 commented Jan 19, 2024

I changed the algorithm of handleTableGenNumericLikeIdentifier.
Stopped to follow literally to the original implementation in llvm::TGLexer::LexToken, and clarified the conditions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants