Skip to content

[libSyntax] Avoid lexing of trivia into pieces if possible #35649

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 8, 2021

Conversation

ahoppen
Copy link
Member

@ahoppen ahoppen commented Jan 29, 2021

Most of the time, we don’t actually care about the pieces that trivia consist of. Instead of always lexing them into tokens (and keeping a slightly more expensive data structure that contains the tokens around), lex the trivia first into a StringRef that contains their raw content. If needed, this raw content can later be decomposed into pieces.

Note that this PR may subtly change the way garbage text is being split into pieces, because the new trivia lexer will only split garbage text on whitespace and comment markers whereas the old lexer was centered around boundaries of Swift identifiers etc.
Since I don’t have strong opinions on how garbage text should be split, I chose to leave the TriviaLexer implementation simple instead of teaching it all the Swift identifier boundary rules.

Performance checklist (performed on my local machine)

  • No regression during compilation → ~7% improvement in ParseSourceFileRequest, ~1.5% in ParseMembersRequest
  • No regression during code completion
  • No regression during SwiftSyntax parsing

@ahoppen ahoppen requested a review from rintaro February 3, 2021 11:14
@ahoppen ahoppen marked this pull request as ready for review February 3, 2021 11:14
@swiftlang swiftlang deleted a comment from swift-ci Feb 3, 2021
@swiftlang swiftlang deleted a comment from swift-ci Feb 3, 2021
@swiftlang swiftlang deleted a comment from swift-ci Feb 3, 2021
The lexer is only responsible for skipping over trivia and noting their
length. A separate TriviaLexer can be invoked to split the raw trivia
string into its pieces.

Since most of the time the trivia pieces aren't needed, this will allow
us to later only parse trivia into pieces when they are explicitly
needed.
@swiftlang swiftlang deleted a comment from swift-ci Feb 4, 2021
StringRef tokenText = ArenaSourceBuffer.substr(tokStartOffset, tokLength);
StringRef trailingTriviaText = ArenaSourceBuffer.substr(
trailingTriviaStartOffset, trailingTrivia.size());

auto ownedText = OwnedString::makeRefCounted(tokenText);
Copy link
Member

@rintaro rintaro Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that RawSyntax always have SyntaxArena and SyntaxArena owns the memory of ArenaSourceBuffer. Do we really need to make ref-counted owned string? I feel storing tokenText in RaySyntax is sufficient.

Copy link
Member

@rintaro rintaro Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just abolish OwnedString, and instead, copyToArenaIfNeeded(tokenText) just like trivia.

Copy link
Member Author

@ahoppen ahoppen Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my next PR ;-) I wanted to not inflate this even further.

Specifically, it’s #35733

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

This is an intermediate state in which the lexer delegates the
responsibility for trivia lexing to the parser. Later, the parser will
delegate this responsibility to SyntaxParsingContext which will hand it
over to SyntaxParseAction, which will only lex the pieces if it is
really necessary to do so.
This is again a transitional state before SyntaxParsingContext hands
the responsibility over to SyntaxTreeCreator and from there to
SyntaxParseActions.
The SyntaxParseActions can decide how to handle the raw trivia, either
lex them into pieces or store them raw to be lexed when needed.
…ntaxArena buffer

Referencing a string in arbitrary memory is not safe since the source
buffer to which it points may have been freed. Instead copy all strings
into the SyntaxArena. Since RawSyntax nodes retain their arena, they can
be sure that the string won't disappear if it lives in their arena.

To avoid lots of small copies, we copy the entire source buffer once
into the syntax arena and make StringRefs point into that buffer.
…ntainsPointer calls

In practice SyntaxArena.containsPointer is almost always called with a
pointer from the SyntaxArena's source buffer. To avoid walking through
all of the bump allocator's slabs until we find the one containing the
source buffer, add a hot use memory region (which lives inside the bump
allocator) that is checked first before consulting the bump allocator.
If the lexer itself keeps track of where the first comment of a token
starts, we can avoid parsing trivia into pieces.
@ahoppen
Copy link
Member Author

ahoppen commented Feb 5, 2021

@swift-ci Please test

Copy link
Member

@rintaro rintaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -2530,7 +2525,10 @@ Token Lexer::getTokenAtLocation(const SourceManager &SM, SourceLoc Loc,
return L.peekNextToken();
}

void Lexer::lexTrivia(ParsedTrivia &Pieces, bool IsForTrailingTrivia) {
StringRef Lexer::lexTrivia(bool IsForTrailingTrivia) {
Copy link
Member

@rintaro rintaro Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to return StringRef here? TrailingTrivia can also do like:

const char *trailingTriviaStart = CurPtr
lextTrivia(true);
TrailingTrivia = StringRef(trailingTriviaStart, CurPtr - trailingTriviaStart);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but we'd only be creating the StringRef at the call site to store it in LeadingTrivia. Without measuring, I also assume that constructing and returning a StringRef also isn't particularly expensive, so I’d rather have a cleaner API here. Or do you have a different feeling about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging now to get this in. If you have a feeling that this will improve performance or anything else, I can investigate in a follow-up PR.

Copy link
Member

@rintaro rintaro Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just like to keep the consistency between LeadingTrivia and TrailingTrivia.
Another design would be lexTrivia receive the pointer and form the trivia inside it:

void lexTrivia(bool IsForTrailingTrivia, const char *TriviaStart) {
  // Advance 'CurPtr' to the end of the trivia ...

  return StringRef(TriviaStart, CurPtr - TriviaStart)
}

LeadingTrivia = lexTrivia(false, LeadingTriviaStart)

I'll put up a PR when I'm available.

@ahoppen ahoppen merged commit d0e27bb into swiftlang:main Feb 8, 2021
@ahoppen ahoppen deleted the trivia-parsing branch April 5, 2022 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants