Skip to content

Produce separate tokens for raw string delimiters and string quotes in the lexer #1192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 16, 2023

Conversation

ahoppen
Copy link
Member

@ahoppen ahoppen commented Jan 5, 2023

The eventual goal of this change is that we no longer need to re-lex string literals from the parser to separate them into their components. Instead, the lexer should just produce the lexemes that will later be put into the syntax tree as tokens.

The downside of this is that the lexer now needs to carry state and know whether it is lexing a string literal. On the upside, the string literal parser could be significantly simplified and the diagnostics got better without any further changes.

continue
} else {
// diagnose(TokStart, diag::lex_unterminated_string)
return (.unknown, [])
return LexerResult(.stringLiteralContents, newMode: .normal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is dangerous without possibly corresponding changes to string literal contents parsing. We need to make sure that its scanning and indexing logic can tolerate an early eof and strings like "\(xyz \v"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be working… At least the existing test cases didn't catch anything. I haven't banged it harder than that though

@@ -456,7 +493,7 @@ extension Lexer.Cursor {
return isDelimeter
})

guard clone.advance(matching: UInt8(ascii: #"""#)) != nil else {
guard !clone.isAtEndOfFile && clone.peek() == UInt8(ascii: #"""#) else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not for this PR, but it'd be nice if we had a safePeek or something similar. Soooo many isAtEndOfFile && peek checks

@@ -492,6 +529,58 @@ extension Lexer.Cursor {
return false
}

mutating func lexStringQuote() -> LexerResult {
func newMode(currentMode: LexerCursorMode, kind: StringLiteralKind) -> LexerCursorMode {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer nextState/state over mode, but I don't really care. state just makes it super clear that it's a state machine :P.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea 👍 I renamed it to state.

Comment on lines +24 to +25
DiagnosticSpec(message: "expected string literal in '#assert' directive"),
DiagnosticSpec(message: "unexpected code '123' in '#assert' directive"),
Copy link
Contributor

@bnbarham bnbarham Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two diagnostics here is a little weird. Maybe we should just mark the next unexpected after a missing token and then have the fix-it remove the unexpected? I haven't checked whether that would make other cases worse though.

@@ -73,16 +67,13 @@ final class RawStringErrorsTests: XCTestCase {
func testRawStringErrors5() {
AssertParse(
#####"""
let _ = ###"invalid"###1️⃣#2️⃣#3️⃣#4️⃣
let _ = ###"invalid"###1️⃣###
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate of 3 isn't it?

Comment on lines +137 to +140
DiagnosticSpec(locationMarker: "1️⃣", message: #"expected '"' to end string literal"#),
DiagnosticSpec(locationMarker: "1️⃣", message: "unexpected code in string literal"),
DiagnosticSpec(locationMarker: "2️⃣", message: "expected ')' in string literal"),
DiagnosticSpec(locationMarker: "3️⃣", message: #"expected '"""' to end string literal"#),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a weird case I suppose, but it's a little strange that we don't treat this as bar being part of the string within the interpolation. Though having a string within an interpolation seems fairly unusual anyway, so... maybe doesn't actually matter.

I think ideally we'd have just two diagnostics: one for the missing ") (or two, whatever) and another for the missing """.

Comment on lines +32 to +34
DiagnosticSpec(locationMarker: "1️⃣", message: #"unexpected code '"' in string literal"#),
DiagnosticSpec(locationMarker: "2️⃣", message: "expected ')' in string literal"),
DiagnosticSpec(locationMarker: "2️⃣", message: #"expected '"' to end string literal"#),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to handle this one better, ie. don't eat the " as unexpected. That makes 7 worse, but IMO this would be by far the more common case.

Comment on lines +45 to +47
DiagnosticSpec(message: #"expected '"' to end string literal"#),
DiagnosticSpec(message: "expected ')' in string literal"),
DiagnosticSpec(message: #"expected '"' to end string literal"#),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as 2, though honestly I'm not sure it's worth keeping both. That's true for most of the testUncloseStringInterpolation* TBH.

@ahoppen ahoppen changed the title Produce separate tokens for raw string delimiters and string quotes in the lexer Produce separate tokens for raw string delimiters and string quotes in the lexer 🚥 #1191 Jan 10, 2023
@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch 2 times, most recently from 5735bd2 to 5a6c6bd Compare January 11, 2023 18:51
@ahoppen ahoppen changed the title Produce separate tokens for raw string delimiters and string quotes in the lexer 🚥 #1191 Produce separate tokens for raw string delimiters and string quotes in the lexer Jan 11, 2023
@ahoppen
Copy link
Member Author

ahoppen commented Jan 11, 2023

@swift-ci Please test

@ahoppen ahoppen changed the title Produce separate tokens for raw string delimiters and string quotes in the lexer Produce separate tokens for raw string delimiters and string quotes in the lexer 🚥 #1176 Jan 12, 2023
@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch from 5a6c6bd to df7fad7 Compare January 12, 2023 13:42
@ahoppen ahoppen changed the title Produce separate tokens for raw string delimiters and string quotes in the lexer 🚥 #1176 Produce separate tokens for raw string delimiters and string quotes in the lexer Jan 12, 2023
@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch from df7fad7 to f126e6c Compare January 12, 2023 16:18
@ahoppen
Copy link
Member Author

ahoppen commented Jan 12, 2023

@swift-ci Please test

@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch 2 times, most recently from cdf6e52 to ecf42cf Compare January 13, 2023 14:32
@ahoppen
Copy link
Member Author

ahoppen commented Jan 13, 2023

@swift-ci Please test

@rintaro rintaro self-assigned this Jan 13, 2023
@ahoppen
Copy link
Member Author

ahoppen commented Jan 14, 2023

@swift-ci Please test

@ahoppen
Copy link
Member Author

ahoppen commented Jan 14, 2023

@swift-ci Please test macOS

@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch 2 times, most recently from b853853 to df7c412 Compare January 14, 2023 11:57
@ahoppen
Copy link
Member Author

ahoppen commented Jan 14, 2023

@swift-ci Please test

…n the lexer

The eventual goal of this change is that we no longer need to re-lex string literals from the parser to separate them into their components. Instead, the lexer should just produce the lexemes that will later be put into the syntax tree as tokens.

The downside of this is that the lexer now needs to carry state and know whether it is lexing a string literal. On the upside, the string literal parser could be significantly simplified and the diagnostics got better without any further changes.
@ahoppen ahoppen force-pushed the ahoppen/lex-string-delimiteres branch from df7c412 to 41807be Compare January 16, 2023 08:44
@ahoppen
Copy link
Member Author

ahoppen commented Jan 16, 2023

@swift-ci Please test

@ahoppen ahoppen merged commit b931534 into swiftlang:main Jan 16, 2023
@ahoppen ahoppen deleted the ahoppen/lex-string-delimiteres branch January 16, 2023 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants