[clang-format] Handle C++ keywords in other languages better #132941

sstwcw · 2025-03-25T14:43:41Z

The lexer has code to make sure that C++ keywords that are ordinary identifiers in other languages are not treated as keywords. Currently it sets the token kind to identifier and clears the identifier info. Clearing the identifier info is probably meant to keep the code that recognizes C++ structures from matching them by mistake when formatting a language that does not have them. However, we did not find a case where a language allows the token sequence and the code parses it as a C++ structure using the identifier info instead of the token kind, without checking the language setting. On the other hand, some code checks whether the identifier info field is null in places where an identifier and a keyword should be treated the same way, for example the name of a function in JavaScript. This patch removes the lines that clear the identifier info, so that a C++ keyword is treated the same as an identifier in those places.

JavaScript

New

async function
union(
    myparamnameiswaytooloooong) {
}

Old

async function
    union(
        myparamnameiswaytooloooong) {
}

Java

New

enum union { ABC, CDE }

Old

enum
union { ABC, CDE }
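To illustrate why clearing the identifier info matters, here is a minimal toy model of the mechanism described above. It is only a sketch: `Kind`, `IdentInfo`, `Token`, `looksLikeName`, `remapOld`, and `remapNew` are hypothetical stand-ins invented for this example, not clang's real classes or functions. The assumption is that some downstream check accepts a token as a name only when its identifier info is non-null.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for illustration only; these are NOT clang's real types.
enum class Kind { Identifier, KwUnion, Other };

struct IdentInfo {
  std::string Name;
};

struct Token {
  Kind K;
  const IdentInfo *Info; // non-null for identifiers and keywords
};

// Hypothetical downstream check: a token can serve as a name whenever it
// carries identifier info, regardless of whether it is a keyword.
bool looksLikeName(const Token &T) { return T.Info != nullptr; }

// Behavior before the patch: change the kind AND drop the identifier info.
Token remapOld(Token T) {
  if (T.K == Kind::KwUnion) {
    T.K = Kind::Identifier;
    T.Info = nullptr;
  }
  return T;
}

// Behavior after the patch: change the kind only.
Token remapNew(Token T) {
  if (T.K == Kind::KwUnion)
    T.K = Kind::Identifier;
  return T;
}

// Small self-check showing the difference for a token spelled "union".
void selfCheck() {
  IdentInfo Union{"union"};
  Token Lexed{Kind::KwUnion, &Union};

  // Old: the remapped token no longer passes the null check.
  assert(!looksLikeName(remapOld(Lexed)));

  // New: the remapped token is an identifier and still passes the check.
  Token Fixed = remapNew(Lexed);
  assert(Fixed.K == Kind::Identifier);
  assert(looksLikeName(Fixed));
}
```

In this model, the old remapping makes `union` fail the null check that a plain identifier would pass, which corresponds to the different line breaks in the "Old" outputs above.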

llvmbot · 2025-03-25T14:44:44Z

@llvm/pr-subscribers-clang-format

Author: None (sstwcw)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/132941.diff

3 Files Affected:

(modified) clang/lib/Format/FormatTokenLexer.cpp (-3)
(modified) clang/unittests/Format/FormatTestJS.cpp (+10)
(modified) clang/unittests/Format/FormatTestJava.cpp (+2)

diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index eed54a11684b5..014b10b206d90 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -1306,15 +1306,12 @@ FormatToken *FormatTokenLexer::getNextToken() {
         FormatTok->isOneOf(tok::kw_struct, tok::kw_union, tok::kw_delete,
                            tok::kw_operator)) {
       FormatTok->Tok.setKind(tok::identifier);
-      FormatTok->Tok.setIdentifierInfo(nullptr);
     } else if (Style.isJavaScript() &&
                FormatTok->isOneOf(tok::kw_struct, tok::kw_union,
                                   tok::kw_operator)) {
       FormatTok->Tok.setKind(tok::identifier);
-      FormatTok->Tok.setIdentifierInfo(nullptr);
     } else if (Style.isTableGen() && !Keywords.isTableGenKeyword(*FormatTok)) {
       FormatTok->Tok.setKind(tok::identifier);
-      FormatTok->Tok.setIdentifierInfo(nullptr);
     }
   } else if (FormatTok->is(tok::greatergreater)) {
     FormatTok->Tok.setKind(tok::greater);
diff --git a/clang/unittests/Format/FormatTestJS.cpp b/clang/unittests/Format/FormatTestJS.cpp
index 78c9f887a159b..6fedf1e2c0079 100644
--- a/clang/unittests/Format/FormatTestJS.cpp
+++ b/clang/unittests/Format/FormatTestJS.cpp
@@ -834,6 +834,11 @@ TEST_F(FormatTestJS, AsyncFunctions) {
                "}",
                "async function hello(myparamnameiswaytooloooong) {}",
                getGoogleJSStyleWithColumns(10));
+  verifyFormat("async function\n"
+               "union(\n"
+               "    myparamnameiswaytooloooong) {\n"
+               "}",
+               getGoogleJSStyleWithColumns(10));
   verifyFormat("class C {\n"
                "  async hello(\n"
                "      myparamnameiswaytooloooong) {\n"
@@ -1369,6 +1374,7 @@ TEST_F(FormatTestJS, WrapRespectsAutomaticSemicolonInsertion) {
                getGoogleJSStyleWithColumns(10));
   verifyFormat("await theReckoning;", getGoogleJSStyleWithColumns(10));
   verifyFormat("some['a']['b']", getGoogleJSStyleWithColumns(10));
+  verifyFormat("union['a']['b']", getGoogleJSStyleWithColumns(10));
   verifyFormat("x = (a['a']\n"
                "      ['b']);",
                getGoogleJSStyleWithColumns(10));
@@ -2500,6 +2506,10 @@ TEST_F(FormatTestJS, NonNullAssertionOperator) {
 TEST_F(FormatTestJS, CppKeywords) {
   // Make sure we don't mess stuff up because of C++ keywords.
   verifyFormat("return operator && (aa);");
+  verifyFormat("enum operator {\n"
+               "  A = 1,\n"
+               "  B\n"
+               "}");
   // .. or QT ones.
   verifyFormat("const slots: Slot[];");
   // use the "!" assertion operator to validate that clang-format understands
diff --git a/clang/unittests/Format/FormatTestJava.cpp b/clang/unittests/Format/FormatTestJava.cpp
index 33998bc7ff858..e01c1d6d7e684 100644
--- a/clang/unittests/Format/FormatTestJava.cpp
+++ b/clang/unittests/Format/FormatTestJava.cpp
@@ -158,6 +158,8 @@ TEST_F(FormatTestJava, AnonymousClasses) {
 
 TEST_F(FormatTestJava, EnumDeclarations) {
   verifyFormat("enum SomeThing { ABC, CDE }");
+  // A C++ keyword should not mess things up.
+  verifyFormat("enum union { ABC, CDE }");
   verifyFormat("enum SomeThing {\n"
                "  ABC,\n"
                "  CDE,\n"

sstwcw · 2025-04-09T15:00:55Z

Can you have a look again? I rushed last time. I updated the patch and then merged into the main branch right away. There were formatting problems.

…2941)

There is some code to make sure that C++ keywords that are identifiers in the other languages are not treated as keywords. Right now, the kind is set to identifier, and the identifier info is cleared. The latter is probably so that the code for identifying C++ structures does not recognize those structures by mistake when formatting a language that does not have those structures. But we did not find an instance where the language can have the sequence of tokens, the code tries to parse the structure as if it is C++ using the identifier info instead of the token kind, but without checking for the language setting. However, there are places where the code checks whether the identifier info field is null or not. They are places where an identifier and a keyword are treated the same way. For example, the name of a function in JavaScript. This patch removes the lines that clear the identifier info. This way, a C++ keyword gets treated in the same way as an identifier in those places.

JavaScript

New

```JavaScript
async function
union(
    myparamnameiswaytooloooong) {
}
```

Old

```JavaScript
async function
    union(
        myparamnameiswaytooloooong) {
}
```

Java

New

```Java
enum union { ABC, CDE }
```

Old

```Java
enum
union { ABC, CDE }
```

This reverts commit 97dcbde.

llvmbot added the clang-format label Mar 25, 2025

HazardyKnusperkeks approved these changes Mar 25, 2025

owenca reviewed Mar 29, 2025

clang/unittests/Format/FormatTestJS.cpp (outdated, resolved)

clang/unittests/Format/FormatTestJS.cpp (outdated, resolved)

sstwcw closed this Mar 31, 2025

sstwcw deleted the format-keyword branch March 31, 2025 14:07

owenca added a commit that referenced this pull request Apr 2, 2025

Revert "[clang-format] Handle C++ keywords in other languages better (#132941)"

97dcbde

This reverts commit ab7cee8 which had formatting errors.

Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request Apr 2, 2025

Revert "[clang-format] Handle C++ keywords in other languages better (llvm#132941)"

377a784

This reverts commit ab7cee8 which had formatting errors.

Merge branch 'main'

d2f7780

sstwcw reopened this Apr 9, 2025

owenca approved these changes Apr 9, 2025

sstwcw merged commit ed85822 into llvm:main Apr 10, 2025
13 of 15 checks passed
