-
Notifications
You must be signed in to change notification settings - Fork 441
Refactor the lexer to make it easier to understand, maintain and more swifty #1227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci Please test |
/// If the current character is in `matching`, advance the cursor and return `true`. | ||
/// Otherwise, this is a no-op and returns `false`. | ||
@_disfavoredOverload // favor the stamped out copies | ||
mutating func advance(matching characters: CharacterByte...) -> Bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the max number of characters we pass here (and similar for the peek case)? I'd prefer to just not have this one at all, I don't think we ever want to pay the cost of the ARC traffic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s three. I just stamped out three copies.
extension Lexer.Cursor { | ||
func peek(at offset: Int = 0) -> UInt8? { | ||
assert(offset >= 0) | ||
guard offset < self.input.count else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to check the performance cost of running this on every single peek, ie. do we still need an unsafe version for when we know we're not at EOF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance measurements of testSelfParsePerformance
(release builds):
- Current
main
: 6.061s - With Produce separate tokens for raw string delimiters and string quotes in the lexer #1192: 6.285s (+3.7%)
- This PR: 6.330s (+ 0.7%)
I think this is acceptable given the added safety this gives us.
Raw data
main
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.062, relative standard deviation: 1.030%, values: [6.188462, 6.127593, 6.119602, 6.032848, 6.075716, 6.015036, 5.986559, 6.063117, 6.017368, 5.995290], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.061, relative standard deviation: 0.619%, values: [6.136058, 6.019346, 6.036355, 6.052679, 6.034555, 6.046183, 6.101317, 6.109356, 6.043739, 6.031751], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100
lex-string-delimiters
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.298, relative standard deviation: 0.360%, values: [6.314035, 6.311459, 6.294189, 6.284121, 6.263693, 6.275958, 6.329505, 6.316271, 6.323488, 6.268774], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.272, relative standard deviation: 0.664%, values: [6.369508, 6.263280, 6.288378, 6.267686, 6.229722, 6.224330, 6.228591, 6.259549, 6.308582, 6.278303], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100
This PR
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.340, relative standard deviation: 0.672%, values: [6.456374, 6.327926, 6.361954, 6.321446, 6.342133, 6.335581, 6.328410, 6.287928, 6.318708, 6.322288], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage
Test Case '-[SwiftParserTest.ParserTests testSelfParsePerformance]' measured [Time, seconds] average: 6.319, relative standard deviation: 0.740%, values: [6.443643, 6.314616, 6.300575, 6.312995, 6.347714, 6.276697, 6.307492, 6.329889, 6.280654, 6.279295], performanceMetricID:com.apple.XCTPerformanceMetric_WallClockTime, baselineName: "", baselineAverage: , polarity: prefers smaller, maxPercentRegression: 10.000%, maxPercentRelativeStandardDeviation: 10.000%, maxRegression: 0.100, maxStandardDeviation: 0.100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat surprised by the ~4% from the delimiters split. Any idea why that's so high?
EDIT: I assume just the check on state
for every token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just measured again using Instruments (and a command line tool that contains the code from testSelfParsePerformance
instead of testSelfParsePerformance
itself) and couldn’t find a significant difference between any of the branches. Not sure what introduced the variation before.
Sources/SwiftParser/Lexer.swift
Outdated
|
||
/// Stamped out copy of `peek(at:matches:)` in case we are matching a single | ||
/// character to avoid construction of an array. | ||
func peek(at offset: Int = 0, matches character: PeekMatch) -> Bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
advance
has matching
which IMO reads better. You could also make doesntMatch
unless
instead. It sort of sounds like a condition, but not as much as if
and when
would for the positive case 🤷.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline to rename this to is(at:)
0a99f1d
to
ec2b1b9
Compare
@swift-ci Please test |
ec2b1b9
to
8addb9a
Compare
@swift-ci Please test |
Honestly, I don’t know how best to review this. Essentially, I’ve gone through the entire lexer (without string lexing because I want to refactor that anyway) and made the following changes
peek
is now safe by default and returnsnil
if we reached the end of the file. There ispeek(matches:)
andpeek(doesntMatch:)
to check if the current token is one of the expectedlex*
methods now expect to be placed at the first character they are lexing (previously they expected the first character to already be lexed). This means we could get rid of thetokStart
parameter. Also I think it’s more natural this wayThere should be no functionality change in this PR.