Skip to content

[clang-format] Limit how much work guessLanguage() can do #78925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

HighCommander4
Copy link
Collaborator

guessLanguage() uses UnwrappedLineParser to process different preprocessor variants of a file. For large files with many preprocessor branches, the number of variants can be very large and the operation can hang for a long time and eventually OOM. (This has been observed particularly for single-header libraries such as miniaudio.h).

This patch implements a limit on how many variants guessLanguage() analyzes, to avoid such a performance cliff.

The limit is expressed as a maximum number of lines (summed over preprocessor variants) to process. This allows shorter files to have more variants processed than longer files.

Fixes clangd/clangd#719
Fixes clangd/clangd#1384
Fixes #70945

This patch does not fix the broader problem of actually trying to format such large headers, which involves using UnwrappedLineParser from call sites other than guessLanguage(), though the approach in the patch could be extended to other call sites as well.

@llvmbot
Copy link
Member

llvmbot commented Jan 22, 2024

@llvm/pr-subscribers-clang-format

Author: Nathan Ridge (HighCommander4)

Changes

guessLanguage() uses UnwrappedLineParser to process different preprocessor variants of a file. For large files with many preprocessor branches, the number of variants can be very large and the operation can hang for a long time and eventually OOM. (This has been observed particularly for single-header libraries such as miniaudio.h).

This patch implements a limit on how many variants guessLanguage() analyzes, to avoid such a performance cliff.

The limit is expressed as a maximum number of lines (summed over preprocessor variants) to process. This allows shorter files to have more variants processed than longer files.

Fixes clangd/clangd#719
Fixes clangd/clangd#1384
Fixes #70945

This patch does not fix the broader problem of actually trying to format such large headers, which involves using UnwrappedLineParser from call sites other than guessLanguage(), though the approach in the patch could be extended to other call sites as well.


Full diff: https://github.com/llvm/llvm-project/pull/78925.diff

5 Files Affected:

  • (modified) clang/lib/Format/Format.cpp (+7-1)
  • (modified) clang/lib/Format/TokenAnalyzer.cpp (+5-3)
  • (modified) clang/lib/Format/TokenAnalyzer.h (+3-1)
  • (modified) clang/lib/Format/UnwrappedLineParser.cpp (+14-2)
  • (modified) clang/lib/Format/UnwrappedLineParser.h (+7-1)
diff --git a/clang/lib/Format/Format.cpp b/clang/lib/Format/Format.cpp
index f798d555bf9929..fbafce38d52b77 100644
--- a/clang/lib/Format/Format.cpp
+++ b/clang/lib/Format/Format.cpp
@@ -2830,7 +2830,7 @@ class Cleaner : public TokenAnalyzer {
 class ObjCHeaderStyleGuesser : public TokenAnalyzer {
 public:
   ObjCHeaderStyleGuesser(const Environment &Env, const FormatStyle &Style)
-      : TokenAnalyzer(Env, Style), IsObjC(false) {}
+      : TokenAnalyzer(Env, Style, MaxLinesToProcess), IsObjC(false) {}
 
   std::pair<tooling::Replacements, unsigned>
   analyze(TokenAnnotator &Annotator,
@@ -2846,6 +2846,12 @@ class ObjCHeaderStyleGuesser : public TokenAnalyzer {
   bool isObjC() { return IsObjC; }
 
 private:
+  // Limit the number of variants of the file TokenAnalyzer processes for
+  // the purpose of guessing the language. An inaccurate guess is better than
+  // hanging for a long time or OOMing, which has been observed with real
+  // libraries which are single-header with many preprocessor branches.
+  static const unsigned MaxLinesToProcess = (1 << 20);
+
   static bool
   guessIsObjC(const SourceManager &SourceManager,
               const SmallVectorImpl<AnnotatedLine *> &AnnotatedLines,
diff --git a/clang/lib/Format/TokenAnalyzer.cpp b/clang/lib/Format/TokenAnalyzer.cpp
index bd648c430f9b0a..4f1195d5f82b55 100644
--- a/clang/lib/Format/TokenAnalyzer.cpp
+++ b/clang/lib/Format/TokenAnalyzer.cpp
@@ -83,12 +83,14 @@ Environment::Environment(StringRef Code, StringRef FileName,
       ID(VirtualSM->get().getMainFileID()), FirstStartColumn(FirstStartColumn),
       NextStartColumn(NextStartColumn), LastStartColumn(LastStartColumn) {}
 
-TokenAnalyzer::TokenAnalyzer(const Environment &Env, const FormatStyle &Style)
+TokenAnalyzer::TokenAnalyzer(const Environment &Env, const FormatStyle &Style,
+                             unsigned MaxLinesToProcess)
     : Style(Style), Env(Env),
       AffectedRangeMgr(Env.getSourceManager(), Env.getCharRanges()),
       UnwrappedLines(1),
       Encoding(encoding::detectEncoding(
-          Env.getSourceManager().getBufferData(Env.getFileID()))) {
+          Env.getSourceManager().getBufferData(Env.getFileID()))),
+      MaxLinesToProcess(MaxLinesToProcess) {
   LLVM_DEBUG(
       llvm::dbgs() << "File encoding: "
                    << (Encoding == encoding::Encoding_UTF8 ? "UTF8" : "unknown")
@@ -109,7 +111,7 @@ TokenAnalyzer::process(bool SkipAnnotation) {
   SmallVector<FormatToken *, 10> Tokens(Toks.begin(), Toks.end());
   UnwrappedLineParser Parser(Env.getSourceManager(), Style, Lex.getKeywords(),
                              Env.getFirstStartColumn(), Tokens, *this,
-                             Allocator, IdentTable);
+                             Allocator, IdentTable, MaxLinesToProcess);
   Parser.parse();
   assert(UnwrappedLines.back().empty());
   unsigned Penalty = 0;
diff --git a/clang/lib/Format/TokenAnalyzer.h b/clang/lib/Format/TokenAnalyzer.h
index 4086dab1c94c3a..65770d4be5ca28 100644
--- a/clang/lib/Format/TokenAnalyzer.h
+++ b/clang/lib/Format/TokenAnalyzer.h
@@ -87,7 +87,8 @@ class Environment {
 
 class TokenAnalyzer : public UnwrappedLineConsumer {
 public:
-  TokenAnalyzer(const Environment &Env, const FormatStyle &Style);
+  TokenAnalyzer(const Environment &Env, const FormatStyle &Style,
+                unsigned MaxLinesToProcess = 0);
 
   std::pair<tooling::Replacements, unsigned>
   process(bool SkipAnnotation = false);
@@ -109,6 +110,7 @@ class TokenAnalyzer : public UnwrappedLineConsumer {
   AffectedRangeManager AffectedRangeMgr;
   SmallVector<SmallVector<UnwrappedLine, 16>, 2> UnwrappedLines;
   encoding::Encoding Encoding;
+  unsigned MaxLinesToProcess;
 };
 
 } // end namespace format
diff --git a/clang/lib/Format/UnwrappedLineParser.cpp b/clang/lib/Format/UnwrappedLineParser.cpp
index 684609747a5513..d925e19f8bf3bc 100644
--- a/clang/lib/Format/UnwrappedLineParser.cpp
+++ b/clang/lib/Format/UnwrappedLineParser.cpp
@@ -151,7 +151,7 @@ UnwrappedLineParser::UnwrappedLineParser(
     const AdditionalKeywords &Keywords, unsigned FirstStartColumn,
     ArrayRef<FormatToken *> Tokens, UnwrappedLineConsumer &Callback,
     llvm::SpecificBumpPtrAllocator<FormatToken> &Allocator,
-    IdentifierTable &IdentTable)
+    IdentifierTable &IdentTable, unsigned MaxLinesToProcess)
     : Line(new UnwrappedLine), MustBreakBeforeNextToken(false),
       CurrentLines(&Lines), Style(Style), Keywords(Keywords),
       CommentPragmasRegex(Style.CommentPragmas), Tokens(nullptr),
@@ -160,7 +160,8 @@ UnwrappedLineParser::UnwrappedLineParser(
                        ? IG_Rejected
                        : IG_Inited),
       IncludeGuardToken(nullptr), FirstStartColumn(FirstStartColumn),
-      Macros(Style.Macros, SourceMgr, Style, Allocator, IdentTable) {}
+      Macros(Style.Macros, SourceMgr, Style, Allocator, IdentTable),
+      MaxLinesToProcess(MaxLinesToProcess) {}
 
 void UnwrappedLineParser::reset() {
   PPBranchLevel = -1;
@@ -194,6 +195,8 @@ void UnwrappedLineParser::reset() {
 void UnwrappedLineParser::parse() {
   IndexedTokenSource TokenSource(AllTokens);
   Line->FirstStartColumn = FirstStartColumn;
+  size_t TotalLinesProcessed = 0;
+  size_t LastReport = 0;
   do {
     LLVM_DEBUG(llvm::dbgs() << "----\n");
     reset();
@@ -235,6 +238,7 @@ void UnwrappedLineParser::parse() {
         Callback.consumeUnwrappedLine(Line);
       }
       Callback.finishRun();
+      TotalLinesProcessed += ExpandedLines.size();
     }
 
     LLVM_DEBUG(llvm::dbgs() << "Unwrapped lines:\n");
@@ -243,6 +247,14 @@ void UnwrappedLineParser::parse() {
       Callback.consumeUnwrappedLine(Line);
     }
     Callback.finishRun();
+    TotalLinesProcessed += Lines.size();
+    if ((TotalLinesProcessed / 10000) > LastReport) {
+      llvm::errs() << "Processed " << TotalLinesProcessed << " lines\n";
+      LastReport = (TotalLinesProcessed / 10000);
+    }
+    if (MaxLinesToProcess > 0 && TotalLinesProcessed >= MaxLinesToProcess) {
+      break;
+    }
     Lines.clear();
     while (!PPLevelBranchIndex.empty() &&
            PPLevelBranchIndex.back() + 1 >= PPLevelBranchCount.back()) {
diff --git a/clang/lib/Format/UnwrappedLineParser.h b/clang/lib/Format/UnwrappedLineParser.h
index 739298690bbd76..f54ad6bbd79a20 100644
--- a/clang/lib/Format/UnwrappedLineParser.h
+++ b/clang/lib/Format/UnwrappedLineParser.h
@@ -111,7 +111,8 @@ class UnwrappedLineParser {
                       unsigned FirstStartColumn, ArrayRef<FormatToken *> Tokens,
                       UnwrappedLineConsumer &Callback,
                       llvm::SpecificBumpPtrAllocator<FormatToken> &Allocator,
-                      IdentifierTable &IdentTable);
+                      IdentifierTable &IdentTable,
+                      unsigned MaxLinesToProcess = 0);
 
   void parse();
 
@@ -406,6 +407,11 @@ class UnwrappedLineParser {
 
   MacroExpander Macros;
 
+  // If set to a nonzero value, stop generating new runs after this
+  // many lines (summed over all the runs generated so far) have been
+  // processed.
+  unsigned MaxLinesToProcess;
+
   friend class ScopedLineState;
   friend class CompoundStatementIndenter;
 };

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Jan 22, 2024

A few additional notes:

  • I chose the value 1 << 20 for MaxLinesToProcess empirically to get guessLanguage() to return in about 1 second on my (average-for-a-developer-machine) hardware. I don't feel strongly about the value and I'm happy to go with a different one.
  • The reason I didn't try to impose the limit on all uses of UnwrappedLineParser is that I wasn't sure what the impact would be on the results of trying to format an affected file. As such, trying to format such a file still hangs / OOMs, but that's still an issue that fewer people will run into than the guessLanguage() hang which occurs any time such a file is even opened in clangd. As such, I think the patch in its current form still has a lot of value.

@@ -243,6 +247,14 @@ void UnwrappedLineParser::parse() {
Callback.consumeUnwrappedLine(Line);
}
Callback.finishRun();
TotalLinesProcessed += Lines.size();
if ((TotalLinesProcessed / 10000) > LastReport) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this verbose output?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I left that in there by accident. Removed now.

Copy link
Contributor

@mydeveloperday mydeveloperday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write a unit test for this?

guessLanguage() uses UnwrappedLineParser to process different
preprocessor variants of a file. For large files with many
preprocessor branches, the number of variants can be very large
and the operation can hang for a long time and eventually OOM.
(This has been observed particularly for single-header libraries
such as miniaudio.h).

This patch implements a limit on how many variants guessLanguage()
analyzes, to avoid such a performance cliff.

The limit is expressed as a maximum number of lines (summed over
preprocessor variants) to process. This allows shorter files to
have more variants processed than longer files.

Fixes clangd/clangd#719
Fixes clangd/clangd#1384
Fixes llvm#70945

This patch does not fix the broader problem of actually trying to
format such large headers, which involves using UnwrappedLineParser
from call sites other than guessLanguage(), though the approach
in the patch could be extended to other call sites as well.
@mydeveloperday
Copy link
Contributor

While I get this fixes the guessLanguage limitation, if its OOM the I guess isn't it going to be OOM for any format? (as you mention above), couldn't we just use the .clang-format-ignore file to say... hey don't bother with this file, we can't do it...

I feel like this will be a piece of code if we accept, might likely never go beyond the guess language usage. So it will just become more technical debt. Fixing the underlying problem isn't easy because it really amounts to changing the whole way clang-format deals with preprocessor paths.

@HighCommander4
Copy link
Collaborator Author

Can we write a unit test for this?

I'm open to trying but there are a few open questions in my mind:

  • Is it fine if the failure mode of a testcase is triggering an OOM?
  • Triggering the problematic codepath requires a combination of a large file (thousands of lines) and many preprocessor branches on those lines. Should I be looking to copy one of the affected single-header libraries (e.g. miniaudio.h) into the test suite for this purpose? Or should I try to synthesize a file with dummy content that meets these criteria?

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Jan 22, 2024

couldn't we just use the .clang-format-ignore file to say... hey don't bother with this file, we can't do it...

I think that's a good solution for avoiding the OOM when actually trying to format the file, in combination with this patch for avoiding the OOM in guessLanguage().

The reason this is such a severe issue for clangd is that clangd calls into guessLanguage() whenever a file is opened, thereby triggering the OOM for any downstream user of such a single-header library who happens to open the header in their editor, not just developers of the library itself. Those users have no reason to try to format the library header, so this would solve what's otherwise a puzzling hang for them with no configuration required.

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Feb 20, 2024

Can we write a unit test for this?

I'm open to trying but there are a few open questions in my mind:

  • Is it fine if the failure mode of a testcase is triggering an OOM?

  • Triggering the problematic codepath requires a combination of a large file (thousands of lines) and many preprocessor branches on those lines. Should I be looking to copy one of the affected single-header libraries (e.g. miniaudio.h) into the test suite for this purpose? Or should I try to synthesize a file with dummy content that meets these criteria?

@mydeveloperday do you have any thoughts on this, or other feedback on how to proceed with the patch?

@owenca
Copy link
Contributor

owenca commented Feb 20, 2024

couldn't we just use the .clang-format-ignore file to say... hey don't bother with this file, we can't do it...

I think that's a good solution for avoiding the OOM when actually trying to format the file, in combination with this patch for avoiding the OOM in guessLanguage().

The reason this is such a severe issue for clangd is that clangd calls into guessLanguage() whenever a file is opened, thereby triggering the OOM for any downstream user of such a single-header library who happens to open the header in their editor, not just developers of the library itself. Those users have no reason to try to format the library header, so this would solve what's otherwise a puzzling hang for them with no configuration required.

The OOM is not limited to guessLanguage(). (See 119a728 for an improvement to the performance of guessLanguage().) It occurs even if you rename the header files to .cpp files and clang-format the renamed files directly. I'm with @mydeveloperday on this and believe a more general solution is needed.

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Feb 21, 2024

The OOM is not limited to guessLanguage(). [...] It occurs even if you rename the header files to .cpp files and clang-format the renamed files directly.

I'm aware of that.

However, for clangd users, a crucial difference between guessLanguage() vs. other code in libFormat that uses UnwrappedLineParser, is that guessLanguage() gets called as soon as a file is opened in the editor, while the other code is only called if the user explicitly tries to format the file.

For the files affected by this OOM (single-file libraries like miniaudio, mathlink, minilzo, blis, where the large number of configuration-related preprocessor branches and the large number of lines in the file conspire to make the combined length of all permutations intractable to work with), the number of users who merely use these libraries (and so may open them in the editor to look at the header, but will not try to format it) far exceeds the number of users who actually develop these libraries (who would edit and format the header). Thus, by avoiding the problem in guessLanguage(), we would resolve the issue for >99% of clangd users that run into it, which would be a significant improvement.

(See 119a728 for an improvement to the performance of guessLanguage().)

Thanks for this improvement. However, it does not meaningfully help avoid the OOM for the above libaries (the OOM happens while we're still in UnwrappedLineParser::parse()).

@owenca
Copy link
Contributor

owenca commented Feb 22, 2024

However, for clangd users, a crucial difference between guessLanguage() vs. other code in libFormat that uses UnwrappedLineParser, is that guessLanguage() gets called as soon as a file is opened in the editor, while the other code is only called if the user explicitly tries to format the file.

For the files affected by this OOM (single-file libraries like miniaudio, mathlink, minilzo, blis, where the large number of configuration-related preprocessor branches and the large number of lines in the file conspire to make the combined length of all permutations intractable to work with), the number of users who merely use these libraries (and so may open them in the editor to look at the header, but will not try to format it) far exceeds the number of users who actually develop these libraries (who would edit and format the header). Thus, by avoiding the problem in guessLanguage(), we would resolve the issue for >99% of clangd users that run into it, which would be a significant improvement.

If we add a bool GuessObjC parameter to guessLanguage(), would that solve the problem?

@HighCommander4
Copy link
Collaborator Author

If we add a bool GuessObjC parameter to guessLanguage(), would that solve the problem?

I assume you mean making it so that when clangd calls into guessLanguage(), the ObjC guessing algorithm is skipped entirely.

I had a similar idea originally (discussed around here) but I realized that this introduces the possibility that a .h file is treated as one language when formatted by clangd, and a different language when formatted by clang-format (and so you could get e.g. a user's editor (which uses clangd) fighting back and forth with e.g. a post-commit hook (which uses clang-format)).

So, I don't think this is a viable approach; we'd be trading a crash affecting a small number of .h files, for a less severe but annoying issue potentially affecting a much larger set of .h files.

@owenca
Copy link
Contributor

owenca commented Feb 23, 2024

If we add a bool GuessObjC parameter to guessLanguage(), would that solve the problem?

I assume you mean making it so that when clangd calls into guessLanguage(), the ObjC guessing algorithm is skipped entirely.

I'm not familiar with clangd use cases, but I thought the GuessObjC flag would allow clangd to skip the guessing ObjC part for those "users who merely use these libraries (and so may open them in the editor to look at the header, but will not try to format it)". If it wouldn't work as you pointed out, then we probably need a new ObjC guesser that only relies on the lexer, similar to what's done in IntegerLiteralSeparatorFixer::process().

@HighCommander4
Copy link
Collaborator Author

I thought the GuessObjC flag would allow clangd to skip the guessing ObjC part for those "users who merely use these libraries (and so may open them in the editor to look at the header, but will not try to format it)".

Ok, I see what you're getting at.

Clangd calls into guessLanguage() via format::getStyle() which calls guessLanguage().

I audited clangd's code to see what it uses the returned FormatStyle for:

  1. When hovering over a symbol, we call format::reformat() on the code snippet shown in the hover (e.g. definition of the variable whose use you are hovering over).
  2. Formatting edits (via format::cleanupAroundReplacements() and format::formatReplacements()) made by various clangd operations (e.g. accepting a code completion proposal, performing a rename or other code action, etc.)
  3. Inserting new #include directives (e.g. associated with a code completion proposal, or a quick-fix for a diagnostic). For this one, clangd inspects IncludeStyle manually to choose an insertion point.
  4. Formatting the whole file (or a selected range of text) using format::reformat().

Of these, (1) is the only one that can be triggered without modifying the file.

So, if format::getStyle() had a GuessObjC parameter (which it propagated into guessLanguage()), I think we could refactor things such that (1) used GuessObjC=false and the others used GuessObjC=true. (We may even be able to use GuessObjC=false for (2) and (3), since those are just formatting a small amount of code.)

@owenca if you're ok in principle with this libFormat API extension (adding an optional GuessObjC parameter to format::getStyle()), I'm happy to try prototyping these changes in clangd.

@owenca
Copy link
Contributor

owenca commented Feb 25, 2024

@HighCommander4 see #82911

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Feb 26, 2024

@HighCommander4 see #82911

Thanks. I've prototyped the clangd changes on top of this, and it seem to be working (i.e. successfully avoiding the OOM for workflows that don't involve formatting the file). I've pushed the changes to my fork; I will submit them for review once the dependent patch merges.

@owenca
Copy link
Contributor

owenca commented Feb 26, 2024

@HighCommander4 I got a better idea. It seems that we don't need to add the parameter! See #82957. I think all you need to do is to call getStyle() with the default "" for the StringRef Code argument to skip the Objective-C guesser.

@HighCommander4
Copy link
Collaborator Author

@HighCommander4 I got a better idea. It seems that we don't need to add the parameter! See #82957. I think all you need to do is to call getStyle() with the default "" for the StringRef Code argument to skip the Objective-C guesser.

Thanks. That does seem simpler, and seems like it should work just as well since the language guesser is the only thing that uses the Code argument.

@owenca
Copy link
Contributor

owenca commented Feb 27, 2024

@HighCommander4 I got a better idea. It seems that we don't need to add the parameter! See #82957. I think all you need to do is to call getStyle() with the default "" for the StringRef Code argument to skip the Objective-C guesser.

Thanks. That does seem simpler, and seems like it should work just as well since the language guesser is the only thing that uses the Code argument.

I've merged #82957.

@HighCommander4
Copy link
Collaborator Author

I've merged #82957.

Thanks. I've written the clangd patch and posted it for review at #84133.

If that gets approved, I will abandon this PR.

@HighCommander4
Copy link
Collaborator Author

HighCommander4 commented Mar 11, 2024

Thanks. I've written the clangd patch and posted it for review at #84133.

If that gets approved, I will abandon this PR.

#84133 has merged now. I'll accordingly close this.

Thank you @owenca for your feedback and helping reach a solution that accomplished the goals of this patch in a cleaner way!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants