[libSyntax] Avoid lexing of trivia into pieces if possible #35649

ahoppen · 2021-01-29T16:40:18Z

Most of the time, we don’t actually care about the pieces that trivia consist of. Instead of always lexing them into tokens (and keeping a slightly more expensive data structure that contains the tokens around), lex the trivia first into a StringRef that contains their raw content. If needed, this raw content can later be decomposed into pieces.

Note that this PR may subtly change the way garbage text is being split into pieces, because the new trivia lexer will only split garbage text on whitespace and comment markers whereas the old lexer was centered around boundaries of Swift identifiers etc.
Since I don’t have strong opinions on how garbage text should be split, I chose to leave the TriviaLexer implementation simple instead of teaching it all the Swift identifier boundary rules.

Performance checklist (performed on my local machine)

No regression during compilation → ~7% improvement in ParseSourceFileRequest, ~1.5% in ParseMembersRequest
No regression during code completion
No regression during SwiftSyntax parsing

The lexer is only responsible for skipping over trivia and noting their length. A separate TriviaLexer can be invoked to split the raw trivia string into its pieces. Since most of the time the trivia pieces aren't needed, this will allow us to later only parse trivia into pieces when they are explicitly needed.
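To make the split between "skip trivia" and "decompose trivia" concrete, here is a minimal C++17 sketch of a piece lexer that only splits on whitespace and comment markers and groups everything else into garbage. `PieceKind`, `TriviaPiece` and `lexTriviaPieces` are hypothetical names for illustration, not the actual TriviaLexer API, and the sketch ignores tabs, carriage returns and nested block comments.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical piece kinds; the real trivia kinds in libSyntax are richer.
enum class PieceKind { Spaces, Newlines, LineComment, BlockComment, Garbage };

struct TriviaPiece {
  PieceKind kind;
  std::string_view text; // Points into the raw trivia, nothing is copied.
};

// Splits raw trivia only on whitespace and comment markers. Any other text
// is grouped into one Garbage piece until the next such boundary.
inline std::vector<TriviaPiece> lexTriviaPieces(std::string_view raw) {
  std::vector<TriviaPiece> pieces;
  auto startsWith = [&](std::size_t pos, std::string_view marker) {
    return raw.substr(pos, marker.size()) == marker;
  };
  std::size_t i = 0;
  while (i < raw.size()) {
    const std::size_t start = i;
    PieceKind kind = PieceKind::Garbage;
    if (raw[i] == ' ') {
      kind = PieceKind::Spaces;
      while (i < raw.size() && raw[i] == ' ') ++i;
    } else if (raw[i] == '\n') {
      kind = PieceKind::Newlines;
      while (i < raw.size() && raw[i] == '\n') ++i;
    } else if (startsWith(i, "//")) {
      kind = PieceKind::LineComment;
      while (i < raw.size() && raw[i] != '\n') ++i;
    } else if (startsWith(i, "/*")) {
      kind = PieceKind::BlockComment;
      // Simplification: no nesting; an unterminated comment runs to the end.
      const std::size_t end = raw.find("*/", i + 2);
      i = (end == std::string_view::npos) ? raw.size() : end + 2;
    } else {
      // Garbage: consume until whitespace or a comment marker.
      while (i < raw.size() && raw[i] != ' ' && raw[i] != '\n' &&
             !startsWith(i, "//") && !startsWith(i, "/*"))
        ++i;
    }
    pieces.push_back({kind, raw.substr(start, i - start)});
  }
  assert(i == raw.size()); // Every byte of the raw trivia ends up in a piece.
  return pieces;
}
```

Because the pieces are views into the raw string, a caller that never asks for them pays only for the `std::string_view` that the main lexer produced while skipping.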

rintaro · 2021-02-04T21:55:43Z

lib/SyntaxParse/SyntaxTreeCreator.cpp

+  StringRef tokenText = ArenaSourceBuffer.substr(tokStartOffset, tokLength);
+  StringRef trailingTriviaText = ArenaSourceBuffer.substr(
+      trailingTriviaStartOffset, trailingTrivia.size());
+
  auto ownedText = OwnedString::makeRefCounted(tokenText);


Now that RawSyntax always has a SyntaxArena, and the SyntaxArena owns the memory of ArenaSourceBuffer, do we really need to make a ref-counted owned string? I feel storing tokenText in RawSyntax is sufficient.

Maybe we can just abolish OwnedString and instead call copyToArenaIfNeeded(tokenText), just like we do for trivia.

That's my next PR ;-) I wanted to not inflate this even further.

Specifically, it’s #35733

lib/Parse/Lexer.cpp

This is an intermediate state in which the lexer delegates the responsibility for trivia lexing to the parser. Later, the parser will delegate this responsibility to SyntaxParsingContext which will hand it over to SyntaxParseAction, which will only lex the pieces if it is really necessary to do so.

This is again a transitional state before SyntaxParsingContext hands the responsibility over to SyntaxTreeCreator and from there to SyntaxParseActions.

Next and final stop: SyntaxParseActions

The SyntaxParseActions can decide how to handle the raw trivia, either lex them into pieces or store them raw to be lexed when needed.
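The "store them raw to be lexed when needed" option could look roughly like the C++17 sketch below: the node keeps the raw text and only decomposes it on the first request, caching the result. `LazyTrivia` and `splitIntoRuns` are hypothetical stand-ins for illustration; the splitter is deliberately trivial and is not the real trivia lexer.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a trivia lexer: splits raw trivia into maximal
// runs of one repeated character, e.g. "  \n\n" becomes {"  ", "\n\n"}.
inline std::vector<std::string_view> splitIntoRuns(std::string_view raw) {
  std::vector<std::string_view> runs;
  std::size_t start = 0;
  for (std::size_t i = 1; i <= raw.size(); ++i) {
    if (i == raw.size() || raw[i] != raw[start]) {
      runs.push_back(raw.substr(start, i - start));
      start = i;
    }
  }
  assert(start == raw.size()); // All input has been assigned to a run.
  return runs;
}

class LazyTrivia {
public:
  explicit LazyTrivia(std::string_view raw) : raw_(raw) {}

  // Cheap queries that never need the pieces.
  std::string_view rawText() const { return raw_; }
  std::size_t length() const { return raw_.size(); }

  // Pieces are computed on the first request and cached afterwards.
  const std::vector<std::string_view> &pieces() {
    if (!pieces_) {
      pieces_ = splitIntoRuns(raw_);
      ++lexCount_;
    }
    return *pieces_;
  }

  // How often the raw text was actually decomposed (for illustration only).
  int lexCount() const { return lexCount_; }

private:
  std::string_view raw_;
  std::optional<std::vector<std::string_view>> pieces_;
  int lexCount_ = 0;
};
```

A client that only needs `length()` or `rawText()` never triggers the split, which is the case the PR description says is the common one.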

…s when requested

…ntaxArena buffer

Referencing a string in arbitrary memory is not safe since the source buffer to which it points may have been freed. Instead copy all strings into the SyntaxArena. Since RawSyntax nodes retain their arena, they can be sure that the string won't disappear if it lives in their arena. To avoid lots of small copies, we copy the entire source buffer once into the syntax arena and make StringRefs point into that buffer.
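As an illustration of the ownership idea, here is a small C++17 sketch in which an arena copies the source buffer once and hands out views into that copy. `MiniArena` and its members are hypothetical; the method names `copyToArenaIfNeeded` and `containsPointer` are borrowed from the discussion, but the implementation is an assumption (it uses heap-allocated `std::string`s where the real arena uses a bump allocator).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified arena. Strings are held through unique_ptr so
// their character data keeps a stable address when the vector grows.
class MiniArena {
public:
  // Copies the whole source once; later substrings are views into the copy.
  std::string_view copySourceBuffer(std::string_view source) {
    assert(!sourceBuffer_ && "source buffer already copied");
    sourceBuffer_ = std::make_unique<std::string>(source);
    return *sourceBuffer_;
  }

  // True if `p` points into memory owned by this arena.
  bool containsPointer(const char *p) const {
    if (sourceBuffer_ && inRange(p, *sourceBuffer_)) return true;
    for (const auto &s : extraStrings_)
      if (inRange(p, *s)) return true;
    return false;
  }

  // Returns `text` unchanged if the arena already owns it, else copies it.
  std::string_view copyToArenaIfNeeded(std::string_view text) {
    if (containsPointer(text.data())) return text;
    extraStrings_.push_back(std::make_unique<std::string>(text));
    return *extraStrings_.back();
  }

private:
  static bool inRange(const char *p, const std::string &s) {
    std::less<const char *> lt; // Total order, even across unrelated objects.
    return !lt(p, s.data()) && lt(p, s.data() + s.size());
  }

  std::unique_ptr<std::string> sourceBuffer_;
  std::vector<std::unique_ptr<std::string>> extraStrings_;
};
```

Token and trivia text cut from the arena's copy stays valid for as long as the arena lives, regardless of what happens to the original buffer.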

…ntainsPointer calls

In practice SyntaxArena.containsPointer is almost always called with a pointer from the SyntaxArena's source buffer. To avoid walking through all of the bump allocator's slabs until we find the one containing the source buffer, add a hot use memory region (which lives inside the bump allocator) that is checked first before consulting the bump allocator.
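The fast path described here can be sketched as follows in C++. `Region` and `SlabIndex` are hypothetical names, and the slab list is a plain vector standing in for the bump allocator; the counter exists only to make the effect of the hot region observable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Region {
  const char *begin = nullptr;
  std::size_t size = 0;

  bool contains(const char *p) const {
    std::less<const char *> lt; // Total order, even across unrelated objects.
    return !lt(p, begin) && lt(p, begin + size);
  }
};

// Hypothetical stand-in for an allocator that tracks its slabs.
class SlabIndex {
public:
  void addSlab(const char *begin, std::size_t size) {
    slabs_.push_back({begin, size});
  }

  // Marks a region, which must lie inside an existing slab, as hot.
  void setHotUseRegion(const char *begin, std::size_t size) {
    assert(size == 0 || scanSlabs(begin));
    hot_ = {begin, size};
  }

  bool containsPointer(const char *p) {
    if (hot_.contains(p)) return true; // Fast path: one range check.
    ++slowLookups_;                    // Slow path: walk every slab.
    return scanSlabs(p);
  }

  int slowLookups() const { return slowLookups_; }

private:
  bool scanSlabs(const char *p) const {
    for (const Region &slab : slabs_)
      if (slab.contains(p)) return true;
    return false;
  }

  std::vector<Region> slabs_;
  Region hot_;
  int slowLookups_ = 0;
};
```

When nearly all queries point into the source buffer, almost every call is answered by the single range check and the slab walk becomes the exception.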

If the lexer itself keeps track of where the first comment of a token starts, we can avoid parsing trivia into pieces.
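A rough C++17 sketch of that idea: while skipping trivia, the lexer records the offset of the first comment it sees, so a client that only wants the comment location does not have to decompose the trivia afterwards. `SkippedTrivia` and `skipLeadingTrivia` are hypothetical names, and the sketch only recognizes spaces, newlines and `//` comments.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

struct SkippedTrivia {
  std::size_t length = 0;                  // Bytes of trivia skipped.
  std::optional<std::size_t> commentStart; // Offset of the first comment.
};

// Skips spaces, newlines and '//' line comments starting at `offset`, and
// notes where the first comment begins without building any pieces.
inline SkippedTrivia skipLeadingTrivia(std::string_view source,
                                       std::size_t offset) {
  assert(offset <= source.size());
  SkippedTrivia result;
  std::size_t i = offset;
  while (i < source.size()) {
    if (source[i] == ' ' || source[i] == '\n') {
      ++i;
    } else if (source.substr(i, 2) == "//") {
      if (!result.commentStart) result.commentStart = i;
      while (i < source.size() && source[i] != '\n') ++i;
    } else {
      break; // First byte of the actual token.
    }
  }
  result.length = i - offset;
  return result;
}
```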

ahoppen · 2021-02-05T07:16:28Z

@swift-ci Please test

rintaro

LGTM!

rintaro · 2021-02-05T18:34:43Z

lib/Parse/Lexer.cpp

@@ -2530,7 +2525,10 @@ Token Lexer::getTokenAtLocation(const SourceManager &SM, SourceLoc Loc,
  return L.peekNextToken();
 }

-void Lexer::lexTrivia(ParsedTrivia &Pieces, bool IsForTrailingTrivia) {
+StringRef Lexer::lexTrivia(bool IsForTrailingTrivia) {


Do we need to return a StringRef here? TrailingTrivia could also be formed like this:

    const char *trailingTriviaStart = CurPtr;
    lexTrivia(true);
    TrailingTrivia = StringRef(trailingTriviaStart, CurPtr - trailingTriviaStart);

We could, but we'd only be creating the StringRef at the call site to store it in LeadingTrivia. Without measuring, I assume that constructing and returning a StringRef isn't particularly expensive, so I’d rather have a cleaner API here. Or do you have a different feeling about this?

Merging now to get this in. If you have a feeling that this will improve performance or anything else, I can investigate in a follow-up PR.

I'd just like to keep the consistency between LeadingTrivia and TrailingTrivia.
Another design would be for lexTrivia to receive the start pointer and form the trivia StringRef itself:

    StringRef lexTrivia(bool IsForTrailingTrivia, const char *TriviaStart) {
      // Advance 'CurPtr' to the end of the trivia
      ...
      return StringRef(TriviaStart, CurPtr - TriviaStart);
    }

    LeadingTrivia = lexTrivia(false, LeadingTriviaStart);

I'll put up a PR when I'm available.
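For reference, the second design from this thread can be written as a self-contained C++17 toy. `MiniLexer` is hypothetical, `std::string_view` stands in for `StringRef`, and the rule that trailing trivia stops before a newline is an assumption of this sketch, not something stated in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical toy lexer that only understands spaces and newlines as trivia.
struct MiniLexer {
  const char *CurPtr;
  const char *End;

  // Advances CurPtr past the trivia and returns the text from TriviaStart to
  // the new CurPtr, so leading and trailing trivia are formed the same way.
  std::string_view lexTrivia(bool IsForTrailingTrivia,
                             const char *TriviaStart) {
    assert(TriviaStart <= CurPtr && CurPtr <= End);
    while (CurPtr != End) {
      if (*CurPtr == ' ') {
        ++CurPtr;
      } else if (*CurPtr == '\n' && !IsForTrailingTrivia) {
        ++CurPtr; // Assumption: newlines belong to the next leading trivia.
      } else {
        break;
      }
    }
    return std::string_view(TriviaStart,
                            static_cast<std::size_t>(CurPtr - TriviaStart));
  }
};
```

Passing the start pointer explicitly lets the caller begin the trivia before the current position if something was already consumed, while the call sites for leading and trailing trivia stay symmetric.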

ahoppen force-pushed the trivia-parsing branch from 9f8dba8 to 8cc387f Compare February 3, 2021 10:56

ahoppen requested a review from rintaro February 3, 2021 11:14

ahoppen marked this pull request as ready for review February 3, 2021 11:14

ahoppen force-pushed the trivia-parsing branch from 8cc387f to 6fc8726 Compare February 3, 2021 13:09

swiftlang deleted a comment from swift-ci Feb 3, 2021

ahoppen force-pushed the trivia-parsing branch from 6fc8726 to ab036a5 Compare February 3, 2021 13:11

ahoppen mentioned this pull request Feb 3, 2021

[libSyntax] Store the token's text in the SyntaxArena #35733

Merged

3 tasks

ahoppen force-pushed the trivia-parsing branch from ab036a5 to 18eaeb0 Compare February 3, 2021 14:28

swiftlang deleted a comment from swift-ci Feb 3, 2021

ahoppen force-pushed the trivia-parsing branch from 18eaeb0 to 6604eec Compare February 4, 2021 13:28

swiftlang deleted a comment from swift-ci Feb 4, 2021

rintaro reviewed Feb 4, 2021

ahoppen added 10 commits February 5, 2021 08:15

[Lexer] Adjust tests for new delayed trivia lexing

a7641a7

[Lexer] Push trivia piece lexing down to SyntaxParsingContext

08ad703

This is again a transitional state before SyntaxParsingContext hands the responsibility over to SyntaxTreeCreator and from there to SyntaxParseActions.

[Lexer] Push trivia piece lexing down to ParsedRawSyntaxRecorder

6d5d8da

Next and final stop: SyntaxParseActions

[Lexer] Push trivia piece lexing down to SyntaxParseActions

3adefd3

The SyntaxParseActions can decide how to handle the raw trivia, either lex them into pieces or store them raw to be lexed when needed.

[libSyntax] Store raw trivia inside RawSyntax and only lex into piece…

5e1ba8b

…s when requested

[libSyntax] Adjust tests for raw trivia being stored in RawSyntax

db3a520

[Lexer] Eliminate unnecessary calls to TriviaLexer::lexTrivia

a8c0136

If the lexer itself keeps track of where the first comment of a token starts, we can avoid parsing trivia into pieces.

ahoppen force-pushed the trivia-parsing branch from 6604eec to a8c0136 Compare February 5, 2021 07:16

rintaro approved these changes Feb 5, 2021

rintaro reviewed Feb 5, 2021

ahoppen merged commit d0e27bb into swiftlang:main Feb 8, 2021

ahoppen mentioned this pull request Feb 11, 2021

[Lexer] Improve lexing of BOM trivia #35917

Merged

ahoppen deleted the trivia-parsing branch April 5, 2022 14:48
