Refactor the lexer to make it easier to understand, maintain and more swifty #1227

ahoppen · 2023-01-13T14:43:07Z

Honestly, I don’t know how best to review this. Essentially, I’ve gone through the entire lexer (without string lexing because I want to refactor that anyway) and made the following changes

peek is now safe by default and returns nil if we reached the end of the file. There is peek(matches:) and peek(doesntMatch:) to check if the current token is one of the expected
Split the lexer into multiple files
Re-arranges some methods to be grouped more logically
All the lex* methods now expect to be placed at the first character they are lexing (previously they expected the first character to already be lexed). This means we could get rid of the tokStart parameter. Also I think it’s more natural this way
Added a few more assertions
Made all variables lowercase
Added some documentation
Enabled argument labels
Plus probably a few miscellaneous more that I forgot

There should be no functionality change in this PR.

ahoppen · 2023-01-13T14:44:19Z

@swift-ci Please test

bnbarham · 2023-01-13T18:24:55Z

Sources/SwiftParser/Lexer/Cursor.swift

+  /// If the current character is in `matching`, advance the cursor and return `true`.
+  /// Otherwise, this is a no-op and returns `false`.
+  @_disfavoredOverload  // favor the stamped out copies
+  mutating func advance(matching characters: CharacterByte...) -> Bool {


What's the max number of characters we pass here (and similar for the peek case)? I'd prefer to just not have this one at all, I don't think we ever want to pay the cost of the ARC traffic.

It’s three. I just stamped out three copies.

bnbarham · 2023-01-13T18:27:33Z

Sources/SwiftParser/Lexer.swift

+extension Lexer.Cursor {
+  func peek(at offset: Int = 0) -> UInt8? {
+    assert(offset >= 0)
+    guard offset < self.input.count else {


It would be good to check the performance cost of running this on every single peek, ie. do we still need an unsafe version for when we know we're not at EOF?

Performance measurements of testSelfParsePerformance (release builds):

Current main: 6.061s

With Produce separate tokens for raw string delimiters and string quotes in the lexer #1192: 6.285s (+3.7%)

This PR: 6.330s (+ 0.7%)

I think this is acceptable given the added safety this gives us.

Raw data

main Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.062, relative standard deviation: 1.030%, values: [6.188462, 6.127593, 6.119602, 6.032848, 6.075716, 6.015036, 5.986559, 6.063117, 6.017368, 5.995290], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100 Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.061, relative standard deviation: 0.619%, values: [6.136058, 6.019346, 6.036355, 6.052679, 6.034555, 6.046183, 6.101317, 6.109356, 6.043739, 6.031751], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100 lex-string-delimiters Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.298, relative standard deviation: 0.360%, values: [6.314035, 6.311459, 6.294189, 6.284121, 6.263693, 6.275958, 6.329505, 6.316271, 6.323488, 6.268774], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100 Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.272, relative standard deviation: 0.664%, values: [6.369508, 6.263280, 6.288378, 6.267686, 6.229722, 6.224330, 6.228591, 6.259549, 6.308582, 6.278303], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100 This PR Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.340, relative standard deviation: 0.672%, values: [6.456374, 6.327926, 6.361954, 6.321446, 6.342133, 6.335581, 6.328410, 6.287928, 6.318708, 6.322288], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.319, relative standard deviation: 0.740%, values: [6.443643, 6.314616, 6.300575, 6.312995, 6.347714, 6.276697, 6.307492, 6.329889, 6.280654, 6.279295], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100

I'm somewhat surprised by the ~4% from the delimiters split. Any idea why that's so high?

EDIT: I assume just the check on state for every token

I just measured again using Instruments (and a command line tool that contains the code from testSelfParsePerformance instead of testSelfParsePerformance itself) and couldn’t find a significant difference between any of the branches. Not sure what introduced the variation before.

bnbarham · 2023-01-13T18:30:39Z

Sources/SwiftParser/Lexer.swift

+
+  /// Stamped out copy of `peek(at:matches:)` in case we are matching a single
+  /// character to avoid construction of an array.
+  func peek(at offset: Int = 0, matches character: PeekMatch) -> Bool {


advance has matching which IMO reads better. You could also make doesntMatch unless instead. It sort of sounds like a condition, but not as much as if and when would for the positive case 🤷.

Discussed offline to rename this to is(at:)

ahoppen · 2023-01-16T11:20:11Z

@swift-ci Please test

ahoppen · 2023-01-16T11:48:28Z

@swift-ci Please test

bnbarham reviewed Jan 13, 2023

View reviewed changes

ahoppen added 19 commits January 16, 2023 12:19

Make peeking in the lexer automatically check for end of file

fb72bc6

Split Lexer into multiple files

9dbe501

A few miscellaneous lexer improvements

88e4b84

Lowercase all variables in the lexer

2f27a9a

Use argument labels in the lexer

e7dbd8e

Refactor and document lexer cusor advancing

a1362a8

Refactor operator boundness

fc75223

Refactor advanceValidatingUTF8Character

879053c

Re-arrange string literal lexing and main lexer entry point

5710473

Refactor lexUnknown

39caacd

Refactor lexTrivia

34b13fe

Add test cases for BOM

dc8cd52

Re-arrange some more string-related lexing functions

95b1c33

Refactor lexNumber

701d13b

Refactor lexMagicPoundLiteral

02994da

Make operator lexing peek-based

a728bd2

Refactor tryLexRegexLiteral

c5e4ef3

Make lexUnknown peek-based

c648e85

Make lexImpl peek-based

050d162

ahoppen changed the title ~~WIP: Refactor the lexer to make it easier to understand, maintain and more swifty 🚥 #1192~~ Refactor the lexer to make it easier to understand, maintain and more swifty Jan 16, 2023

ahoppen force-pushed the ahoppen/refactor-lexer branch from 0a99f1d to ec2b1b9 Compare January 16, 2023 11:20

Rename peek(matches:) to is(at:)

8addb9a

ahoppen force-pushed the ahoppen/refactor-lexer branch from ec2b1b9 to 8addb9a Compare January 16, 2023 11:48

ahoppen marked this pull request as ready for review January 17, 2023 15:06

ahoppen merged commit 759043e into swiftlang:main Jan 17, 2023

ahoppen deleted the ahoppen/refactor-lexer branch January 17, 2023 15:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor the lexer to make it easier to understand, maintain and more swifty #1227

Refactor the lexer to make it easier to understand, maintain and more swifty #1227

Uh oh!

ahoppen commented Jan 13, 2023 •

edited

Loading

Uh oh!

ahoppen commented Jan 13, 2023

Uh oh!

bnbarham Jan 13, 2023

Uh oh!

ahoppen Jan 13, 2023

Uh oh!

bnbarham Jan 13, 2023

Uh oh!

ahoppen Jan 13, 2023 •

edited

Loading

Uh oh!

bnbarham Jan 13, 2023 •

edited

Loading

Uh oh!

ahoppen Jan 14, 2023

Uh oh!

bnbarham Jan 13, 2023

Uh oh!

ahoppen Jan 13, 2023

Uh oh!

ahoppen commented Jan 16, 2023

Uh oh!

ahoppen commented Jan 16, 2023

Uh oh!

Uh oh!

Refactor the lexer to make it easier to understand, maintain and more swifty #1227

Refactor the lexer to make it easier to understand, maintain and more swifty #1227

Uh oh!

Conversation

ahoppen commented Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahoppen commented Jan 13, 2023

Uh oh!

bnbarham Jan 13, 2023

Choose a reason for hiding this comment

Uh oh!

ahoppen Jan 13, 2023

Choose a reason for hiding this comment

Uh oh!

bnbarham Jan 13, 2023

Choose a reason for hiding this comment

Uh oh!

ahoppen Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bnbarham Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahoppen Jan 14, 2023

Choose a reason for hiding this comment

Uh oh!

bnbarham Jan 13, 2023

Choose a reason for hiding this comment

Uh oh!

ahoppen Jan 13, 2023

Choose a reason for hiding this comment

Uh oh!

ahoppen commented Jan 16, 2023

Uh oh!

ahoppen commented Jan 16, 2023

Uh oh!

Uh oh!

ahoppen commented Jan 13, 2023 •

edited

Loading

ahoppen Jan 13, 2023 •

edited

Loading

bnbarham Jan 13, 2023 •

edited

Loading