-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[clang-format] Support of TableGen identifiers beginning with a number. #78571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[clang-format] Support of TableGen identifiers beginning with a number. #78571
Conversation
@llvm/pr-subscribers-clang-format Author: Hirofumi Nakamura (hnakamura5) ChangesTableGen allows the identifiers beginning with a number. Full diff: https://github.com/llvm/llvm-project/pull/78571.diff 3 Files Affected:
diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index 25ac9be57c81a9..f1982533f112c7 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -93,8 +93,10 @@ ArrayRef<FormatToken *> FormatTokenLexer::lex() {
// string literals are correctly identified.
handleCSharpVerbatimAndInterpolatedStrings();
}
- if (Style.isTableGen())
+ if (Style.isTableGen()) {
handleTableGenMultilineString();
+ handleTableGenNumericLikeIdentifier();
+ }
if (Tokens.back()->NewlinesBefore > 0 || Tokens.back()->IsMultiline)
FirstInLineIndex = Tokens.size() - 1;
} while (Tokens.back()->isNot(tok::eof));
@@ -804,6 +806,46 @@ void FormatTokenLexer::handleTableGenMultilineString() {
FirstLineText, MultiLineString->OriginalColumn, Style.TabWidth, Encoding);
}
+void FormatTokenLexer::handleTableGenNumericLikeIdentifier() {
+ FormatToken *Tok = Tokens.back();
+ // TableGen identifiers can begin with digits. Such tokens are lexed as
+ // numeric_constant now.
+ if (Tok->isNot(tok::numeric_constant))
+ return;
+ StringRef Text = Tok->TokenText;
+ // Identifiers cannot begin with + or -.
+ if (Text.size() < 1 || Text[0] == '+' || Text[0] == '-')
+ return;
+ // The following check is based on llvm::TGLexer::LexToken.
+ if (isdigit(Text[0])) {
+ size_t I = 0;
+ char NextChar = (char)0;
+ // Identifiers in TalbleGen may begin with digits. Skip to first non-digit.
+ do {
+ NextChar = Text[I++];
+ } while (I < Text.size() && isdigit(NextChar));
+ // All the characters are digits.
+ if (I >= Text.size())
+ return;
+ // Base character. But it does not check the first 0 and that the base is
+ // the second character.
+ if (NextChar == 'x' || NextChar == 'b') {
+ char NextNextChar = Text[I];
+ // This is regarded as binary number.
+ if (isxdigit(NextNextChar)) {
+ if (NextChar == 'b' && (NextNextChar == '0' || NextNextChar == '1'))
+ return;
+ // Regarded as hex number or decimal number.
+ if (NextChar == 'x' || isdigit(NextNextChar))
+ return;
+ }
+ }
+ }
+ // Otherwise, this is actually a identifier.
+ Tok->Tok.setKind(tok::identifier);
+ Tok->Tok.setIdentifierInfo(nullptr);
+}
+
void FormatTokenLexer::handleTemplateStrings() {
FormatToken *BacktickToken = Tokens.back();
diff --git a/clang/lib/Format/FormatTokenLexer.h b/clang/lib/Format/FormatTokenLexer.h
index 1dec6bbc41514c..65dd733bd53352 100644
--- a/clang/lib/Format/FormatTokenLexer.h
+++ b/clang/lib/Format/FormatTokenLexer.h
@@ -97,6 +97,10 @@ class FormatTokenLexer {
// Handles TableGen multiline strings. It has the form [{ ... }].
void handleTableGenMultilineString();
+ // Handles TableGen numeric like identifiers.
+ // They have a forms of [0-9]*[_a-zA-Z]([_a-zA-Z0-9]*). But limited to the
+ // case it is not lexed as an integer.
+ void handleTableGenNumericLikeIdentifier();
void tryParsePythonComment();
diff --git a/clang/unittests/Format/TokenAnnotatorTest.cpp b/clang/unittests/Format/TokenAnnotatorTest.cpp
index 117d8fe8f7dc12..753e749befa57e 100644
--- a/clang/unittests/Format/TokenAnnotatorTest.cpp
+++ b/clang/unittests/Format/TokenAnnotatorTest.cpp
@@ -2209,6 +2209,27 @@ TEST_F(TokenAnnotatorTest, UnderstandTableGenTokens) {
EXPECT_EQ(Tokens[0]->ColumnWidth, sizeof("[{ It can break\n") - 1);
EXPECT_TRUE(Tokens[0]->IsMultiline);
EXPECT_EQ(Tokens[0]->LastLineColumnWidth, sizeof(" the string. }]") - 1);
+
+ // Identifier tokens. In TableGen, identifiers can begin with a number.
+ // In ambiguous cases, the lexer tries to lex it as a number.
+ // Even if the try fails, it does not fall back to identifier lexing and
+ // regard as an error.
+ // The ambiguity is not documented. The result of those tests are based on the
+ // implementation of llvm::TGLexer::LexToken.
+ Tokens = Annotate("1234");
+ EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+ Tokens = Annotate("0x1abC");
+ EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+ // This is invalid syntax of number, but not an identifier.
+ Tokens = Annotate("0x1234x");
+ EXPECT_TOKEN(Tokens[0], tok::numeric_constant, TT_Unknown);
+ Tokens = Annotate("identifier");
+ EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+ // Identifier beginning with a number.
+ Tokens = Annotate("2dVector");
+ EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+ Tokens = Annotate("01234Vector");
+ EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
}
TEST_F(TokenAnnotatorTest, UnderstandConstructors) {
|
I checked simply the corner cases in unittest of this patch with the following sample.
The followings are the result of
(1) With the whole sample code above.
We can see (2) When the line of invalid_num is commented out from the sample.
(3) Additionally, the line of int 0x1234x = i is commented out.
This is the result after the process completes. So the following remained code has valid syntax.
|
// Base character. But it does not check the first 0 and that the base is | ||
// the second character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the "it does not check" a behaviour of this implementation or also of the tablegen compiler implementation?
I noticed below in the tests you mention the lexer errors on some cases, is this logic meant to match that behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes for the both question. This is about TableGen compiler's lexer.
As you wonder, this comment may be not precise enough. Later I will fix it.
For example,
0x1234x
is regarded as integer because the lexer assumes it is a integer at the point it have got 0x1
part. This is an syntax error example written in the unittest.
I want to note here by this comment is,
1x1234x
is also regarded as integer (and syntax error). This behavior comes from the lexer does not check the character before 'x' is 0 or other number.
(FYI, 1y1234x
is a valid identifier. Such a ambiguity is only when the first non-digit character is 'x' or 'b'. )
if (NextChar == 'x' || NextChar == 'b') { | ||
char NextNextChar = Text[I]; | ||
// This is regarded as binary number. | ||
if (isxdigit(NextNextChar)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do you need the x
? It seems to me that this if
can just be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified the condition.
I changed the algorithm of handleTableGenNumericLikeIdentifier. |
TableGen allows the identifiers beginning with a number.
This patch add the support of the recognition of such identifiers.