-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[clang-format] Limit how much work guessLanguage() can do #78925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-clang-format Author: Nathan Ridge (HighCommander4) ChangesguessLanguage() uses UnwrappedLineParser to process different preprocessor variants of a file. For large files with many preprocessor branches, the number of variants can be very large and the operation can hang for a long time and eventually OOM. (This has been observed particularly for single-header libraries such as miniaudio.h). This patch implements a limit on how many variants guessLanguage() analyzes, to avoid such a performance cliff. The limit is expressed as a maximum number of lines (summed over preprocessor variants) to process. This allows shorter files to have more variants processed than longer files. Fixes clangd/clangd#719 This patch does not fix the broader problem of actually trying to format such large headers, which involves using UnwrappedLineParser from call sites other than guessLanguage(), though the approach in the patch could be extended to other call sites as well. Full diff: https://github.com/llvm/llvm-project/pull/78925.diff 5 Files Affected:
diff --git a/clang/lib/Format/Format.cpp b/clang/lib/Format/Format.cpp
index f798d555bf9929..fbafce38d52b77 100644
--- a/clang/lib/Format/Format.cpp
+++ b/clang/lib/Format/Format.cpp
@@ -2830,7 +2830,7 @@ class Cleaner : public TokenAnalyzer {
class ObjCHeaderStyleGuesser : public TokenAnalyzer {
public:
ObjCHeaderStyleGuesser(const Environment &Env, const FormatStyle &Style)
- : TokenAnalyzer(Env, Style), IsObjC(false) {}
+ : TokenAnalyzer(Env, Style, MaxLinesToProcess), IsObjC(false) {}
std::pair<tooling::Replacements, unsigned>
analyze(TokenAnnotator &Annotator,
@@ -2846,6 +2846,12 @@ class ObjCHeaderStyleGuesser : public TokenAnalyzer {
bool isObjC() { return IsObjC; }
private:
+ // Limit the number of variants of the file TokenAnalyzer processes for
+ // the purpose of guessing the language. An inaccurate guess is better than
+ // hanging for a long time or OOMing, which has been observed with real
+ // libraries which are single-header with many preprocessor branches.
+ static const unsigned MaxLinesToProcess = (1 << 20);
+
static bool
guessIsObjC(const SourceManager &SourceManager,
const SmallVectorImpl<AnnotatedLine *> &AnnotatedLines,
diff --git a/clang/lib/Format/TokenAnalyzer.cpp b/clang/lib/Format/TokenAnalyzer.cpp
index bd648c430f9b0a..4f1195d5f82b55 100644
--- a/clang/lib/Format/TokenAnalyzer.cpp
+++ b/clang/lib/Format/TokenAnalyzer.cpp
@@ -83,12 +83,14 @@ Environment::Environment(StringRef Code, StringRef FileName,
ID(VirtualSM->get().getMainFileID()), FirstStartColumn(FirstStartColumn),
NextStartColumn(NextStartColumn), LastStartColumn(LastStartColumn) {}
-TokenAnalyzer::TokenAnalyzer(const Environment &Env, const FormatStyle &Style)
+TokenAnalyzer::TokenAnalyzer(const Environment &Env, const FormatStyle &Style,
+ unsigned MaxLinesToProcess)
: Style(Style), Env(Env),
AffectedRangeMgr(Env.getSourceManager(), Env.getCharRanges()),
UnwrappedLines(1),
Encoding(encoding::detectEncoding(
- Env.getSourceManager().getBufferData(Env.getFileID()))) {
+ Env.getSourceManager().getBufferData(Env.getFileID()))),
+ MaxLinesToProcess(MaxLinesToProcess) {
LLVM_DEBUG(
llvm::dbgs() << "File encoding: "
<< (Encoding == encoding::Encoding_UTF8 ? "UTF8" : "unknown")
@@ -109,7 +111,7 @@ TokenAnalyzer::process(bool SkipAnnotation) {
SmallVector<FormatToken *, 10> Tokens(Toks.begin(), Toks.end());
UnwrappedLineParser Parser(Env.getSourceManager(), Style, Lex.getKeywords(),
Env.getFirstStartColumn(), Tokens, *this,
- Allocator, IdentTable);
+ Allocator, IdentTable, MaxLinesToProcess);
Parser.parse();
assert(UnwrappedLines.back().empty());
unsigned Penalty = 0;
diff --git a/clang/lib/Format/TokenAnalyzer.h b/clang/lib/Format/TokenAnalyzer.h
index 4086dab1c94c3a..65770d4be5ca28 100644
--- a/clang/lib/Format/TokenAnalyzer.h
+++ b/clang/lib/Format/TokenAnalyzer.h
@@ -87,7 +87,8 @@ class Environment {
class TokenAnalyzer : public UnwrappedLineConsumer {
public:
- TokenAnalyzer(const Environment &Env, const FormatStyle &Style);
+ TokenAnalyzer(const Environment &Env, const FormatStyle &Style,
+ unsigned MaxLinesToProcess = 0);
std::pair<tooling::Replacements, unsigned>
process(bool SkipAnnotation = false);
@@ -109,6 +110,7 @@ class TokenAnalyzer : public UnwrappedLineConsumer {
AffectedRangeManager AffectedRangeMgr;
SmallVector<SmallVector<UnwrappedLine, 16>, 2> UnwrappedLines;
encoding::Encoding Encoding;
+ unsigned MaxLinesToProcess;
};
} // end namespace format
diff --git a/clang/lib/Format/UnwrappedLineParser.cpp b/clang/lib/Format/UnwrappedLineParser.cpp
index 684609747a5513..d925e19f8bf3bc 100644
--- a/clang/lib/Format/UnwrappedLineParser.cpp
+++ b/clang/lib/Format/UnwrappedLineParser.cpp
@@ -151,7 +151,7 @@ UnwrappedLineParser::UnwrappedLineParser(
const AdditionalKeywords &Keywords, unsigned FirstStartColumn,
ArrayRef<FormatToken *> Tokens, UnwrappedLineConsumer &Callback,
llvm::SpecificBumpPtrAllocator<FormatToken> &Allocator,
- IdentifierTable &IdentTable)
+ IdentifierTable &IdentTable, unsigned MaxLinesToProcess)
: Line(new UnwrappedLine), MustBreakBeforeNextToken(false),
CurrentLines(&Lines), Style(Style), Keywords(Keywords),
CommentPragmasRegex(Style.CommentPragmas), Tokens(nullptr),
@@ -160,7 +160,8 @@ UnwrappedLineParser::UnwrappedLineParser(
? IG_Rejected
: IG_Inited),
IncludeGuardToken(nullptr), FirstStartColumn(FirstStartColumn),
- Macros(Style.Macros, SourceMgr, Style, Allocator, IdentTable) {}
+ Macros(Style.Macros, SourceMgr, Style, Allocator, IdentTable),
+ MaxLinesToProcess(MaxLinesToProcess) {}
void UnwrappedLineParser::reset() {
PPBranchLevel = -1;
@@ -194,6 +195,8 @@ void UnwrappedLineParser::reset() {
void UnwrappedLineParser::parse() {
IndexedTokenSource TokenSource(AllTokens);
Line->FirstStartColumn = FirstStartColumn;
+ size_t TotalLinesProcessed = 0;
+ size_t LastReport = 0;
do {
LLVM_DEBUG(llvm::dbgs() << "----\n");
reset();
@@ -235,6 +238,7 @@ void UnwrappedLineParser::parse() {
Callback.consumeUnwrappedLine(Line);
}
Callback.finishRun();
+ TotalLinesProcessed += ExpandedLines.size();
}
LLVM_DEBUG(llvm::dbgs() << "Unwrapped lines:\n");
@@ -243,6 +247,14 @@ void UnwrappedLineParser::parse() {
Callback.consumeUnwrappedLine(Line);
}
Callback.finishRun();
+ TotalLinesProcessed += Lines.size();
+ if ((TotalLinesProcessed / 10000) > LastReport) {
+ llvm::errs() << "Processed " << TotalLinesProcessed << " lines\n";
+ LastReport = (TotalLinesProcessed / 10000);
+ }
+ if (MaxLinesToProcess > 0 && TotalLinesProcessed >= MaxLinesToProcess) {
+ break;
+ }
Lines.clear();
while (!PPLevelBranchIndex.empty() &&
PPLevelBranchIndex.back() + 1 >= PPLevelBranchCount.back()) {
diff --git a/clang/lib/Format/UnwrappedLineParser.h b/clang/lib/Format/UnwrappedLineParser.h
index 739298690bbd76..f54ad6bbd79a20 100644
--- a/clang/lib/Format/UnwrappedLineParser.h
+++ b/clang/lib/Format/UnwrappedLineParser.h
@@ -111,7 +111,8 @@ class UnwrappedLineParser {
unsigned FirstStartColumn, ArrayRef<FormatToken *> Tokens,
UnwrappedLineConsumer &Callback,
llvm::SpecificBumpPtrAllocator<FormatToken> &Allocator,
- IdentifierTable &IdentTable);
+ IdentifierTable &IdentTable,
+ unsigned MaxLinesToProcess = 0);
void parse();
@@ -406,6 +407,11 @@ class UnwrappedLineParser {
MacroExpander Macros;
+ // If set to a nonzero value, stop generating new runs after this
+ // many lines (summed over all the runs generated so far) have been
+ // processed.
+ unsigned MaxLinesToProcess;
+
friend class ScopedLineState;
friend class CompoundStatementIndenter;
};
|
A few additional notes:
|
@@ -243,6 +247,14 @@ void UnwrappedLineParser::parse() { | |||
Callback.consumeUnwrappedLine(Line); | |||
} | |||
Callback.finishRun(); | |||
TotalLinesProcessed += Lines.size(); | |||
if ((TotalLinesProcessed / 10000) > LastReport) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this verbose output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I left that in there by accident. Removed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write a unit test for this?
guessLanguage() uses UnwrappedLineParser to process different preprocessor variants of a file. For large files with many preprocessor branches, the number of variants can be very large and the operation can hang for a long time and eventually OOM. (This has been observed particularly for single-header libraries such as miniaudio.h). This patch implements a limit on how many variants guessLanguage() analyzes, to avoid such a performance cliff. The limit is expressed as a maximum number of lines (summed over preprocessor variants) to process. This allows shorter files to have more variants processed than longer files. Fixes clangd/clangd#719 Fixes clangd/clangd#1384 Fixes llvm#70945 This patch does not fix the broader problem of actually trying to format such large headers, which involves using UnwrappedLineParser from call sites other than guessLanguage(), though the approach in the patch could be extended to other call sites as well.
05c751d
to
47ebe97
Compare
While I get this fixes the guessLanguage limitation, if its OOM the I guess isn't it going to be OOM for any format? (as you mention above), couldn't we just use the .clang-format-ignore file to say... hey don't bother with this file, we can't do it... I feel like this will be a piece of code if we accept, might likely never go beyond the guess language usage. So it will just become more technical debt. Fixing the underlying problem isn't easy because it really amounts to changing the whole way clang-format deals with preprocessor paths. |
I'm open to trying but there are a few open questions in my mind:
|
I think that's a good solution for avoiding the OOM when actually trying to format the file, in combination with this patch for avoiding the OOM in The reason this is such a severe issue for clangd is that clangd calls into |
@mydeveloperday do you have any thoughts on this, or other feedback on how to proceed with the patch? |
The OOM is not limited to |
I'm aware of that. However, for clangd users, a crucial difference between For the files affected by this OOM (single-file libraries like miniaudio, mathlink, minilzo, blis, where the large number of configuration-related preprocessor branches and the large number of lines in the file conspire to make the combined length of all permutations intractable to work with), the number of users who merely use these libraries (and so may open them in the editor to look at the header, but will not try to format it) far exceeds the number of users who actually develop these libraries (who would edit and format the header). Thus, by avoiding the problem in
Thanks for this improvement. However, it does not meaningfully help avoid the OOM for the above libaries (the OOM happens while we're still in |
If we add a |
I assume you mean making it so that when clangd calls into I had a similar idea originally (discussed around here) but I realized that this introduces the possibility that a So, I don't think this is a viable approach; we'd be trading a crash affecting a small number of |
I'm not familiar with clangd use cases, but I thought the |
Ok, I see what you're getting at. Clangd calls into I audited clangd's code to see what it uses the returned
Of these, (1) is the only one that can be triggered without modifying the file. So, if @owenca if you're ok in principle with this libFormat API extension (adding an optional |
Thanks. I've prototyped the clangd changes on top of this, and it seem to be working (i.e. successfully avoiding the OOM for workflows that don't involve formatting the file). I've pushed the changes to my fork; I will submit them for review once the dependent patch merges. |
@HighCommander4 I got a better idea. It seems that we don't need to add the parameter! See #82957. I think all you need to do is to call |
Thanks. That does seem simpler, and seems like it should work just as well since the language guesser is the only thing that uses the |
I've merged #82957. |
guessLanguage() uses UnwrappedLineParser to process different preprocessor variants of a file. For large files with many preprocessor branches, the number of variants can be very large and the operation can hang for a long time and eventually OOM. (This has been observed particularly for single-header libraries such as miniaudio.h).
This patch implements a limit on how many variants guessLanguage() analyzes, to avoid such a performance cliff.
The limit is expressed as a maximum number of lines (summed over preprocessor variants) to process. This allows shorter files to have more variants processed than longer files.
Fixes clangd/clangd#719
Fixes clangd/clangd#1384
Fixes #70945
This patch does not fix the broader problem of actually trying to format such large headers, which involves using UnwrappedLineParser from call sites other than guessLanguage(), though the approach in the patch could be extended to other call sites as well.