-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[libSyntax] Avoid lexing of trivia into pieces if possible #35649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9f8dba8
to
8cc387f
Compare
8cc387f
to
6fc8726
Compare
6fc8726
to
ab036a5
Compare
ab036a5
to
18eaeb0
Compare
The lexer is only responsible for skipping over trivia and noting their length. A separate TriviaLexer can be invoked to split the raw trivia string into its pieces. Since most of the time the trivia pieces aren't needed, this will allow us to later only parse trivia into pieces when they are explicitly needed.
18eaeb0
to
6604eec
Compare
StringRef tokenText = ArenaSourceBuffer.substr(tokStartOffset, tokLength); | ||
StringRef trailingTriviaText = ArenaSourceBuffer.substr( | ||
trailingTriviaStartOffset, trailingTrivia.size()); | ||
|
||
auto ownedText = OwnedString::makeRefCounted(tokenText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that RawSyntax
always have SyntaxArena
and SyntaxArena
owns the memory of ArenaSourceBuffer
. Do we really need to make ref-counted owned string? I feel storing tokenText
in RaySyntax
is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can just abolish OwnedString
, and instead, copyToArenaIfNeeded(tokenText)
just like trivia.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my next PR ;-) I wanted to not inflate this even further.
Specifically, it’s #35733
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
This is an intermediate state in which the lexer delegates the responsibility for trivia lexing to the parser. Later, the parser will delegate this responsibility to SyntaxParsingContext which will hand it over to SyntaxParseAction, which will only lex the pieces if it is really necessary to do so.
This is again a transitional state before SyntaxParsingContext hands the responsibility over to SyntaxTreeCreator and from there to SyntaxParseActions.
Next and final stop: SyntaxParseActions
The SyntaxParseActions can decide how to handle the raw trivia, either lex them into pieces or store them raw to be lexed when needed.
…ntaxArena buffer Referencing a string in arbitrary memory is not safe since the source buffer to which it points may have been freed. Instead copy all strings into the SyntaxArena. Since RawSyntax nodes retain their arena, they can be sure that the string won't disappear if it lives in their arena. To avoid lots of small copies, we copy the entire source buffer once into the syntax arena and make StringRefs point into that buffer.
…ntainsPointer calls In practice SyntaxArena.containsPointer is almost always called with a pointer from the SyntaxArena's source buffer. To avoid walking through all of the bump allocator's slabs until we find the one containing the source buffer, add a hot use memory region (which lives inside the bump allocator) that is checked first before consulting the bump allocator.
If the lexer itself keeps track of where the first comment of a token starts, we can avoid parsing trivia into pieces.
6604eec
to
a8c0136
Compare
@swift-ci Please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -2530,7 +2525,10 @@ Token Lexer::getTokenAtLocation(const SourceManager &SM, SourceLoc Loc, | |||
return L.peekNextToken(); | |||
} | |||
|
|||
void Lexer::lexTrivia(ParsedTrivia &Pieces, bool IsForTrailingTrivia) { | |||
StringRef Lexer::lexTrivia(bool IsForTrailingTrivia) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to return StringRef
here? TrailingTrivia
can also do like:
const char *trailingTriviaStart = CurPtr
lextTrivia(true);
TrailingTrivia = StringRef(trailingTriviaStart, CurPtr - trailingTriviaStart);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but we'd only be creating the StringRef
at the call site to store it in LeadingTrivia
. Without measuring, I also assume that constructing and returning a StringRef
also isn't particularly expensive, so I’d rather have a cleaner API here. Or do you have a different feeling about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging now to get this in. If you have a feeling that this will improve performance or anything else, I can investigate in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just like to keep the consistency between LeadingTrivia
and TrailingTrivia
.
Another design would be lexTrivia
receive the pointer and form the trivia inside it:
void lexTrivia(bool IsForTrailingTrivia, const char *TriviaStart) {
// Advance 'CurPtr' to the end of the trivia ...
return StringRef(TriviaStart, CurPtr - TriviaStart)
}
LeadingTrivia = lexTrivia(false, LeadingTriviaStart)
I'll put up a PR when I'm available.
Most of the time, we don’t actually care about the pieces that trivia consist of. Instead of always lexing them into tokens (and keeping a slightly more expensive data structure that contains the tokens around), lex the trivia first into a
StringRef
that contains their raw content. If needed, this raw content can later be decomposed into pieces.Note that this PR may subtly change the way garbage text is being split into pieces, because the new trivia lexer will only split garbage text on whitespace and comment markers whereas the old lexer was centered around boundaries of Swift identifiers etc.
Since I don’t have strong opinions on how garbage text should be split, I chose to leave the
TriviaLexer
implementation simple instead of teaching it all the Swift identifier boundary rules.Performance checklist (performed on my local machine)
ParseSourceFileRequest
, ~1.5% inParseMembersRequest