[Do not merge] [Syntax] support invalid characters as trivia #14967
This PR follows my previous PR #14962. I will rebase this one after the previous PR is merged.
This PR updates the Lexer to support invalid characters as trivia.
Currently, libSyntax does not support them, so round-trip conversion fails if a source file contains them.
Here, "invalid characters" means invalid UTF-8 byte sequences and Unicode code points that are invalid in Swift source.
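To make the first category concrete, here is a minimal sketch of a well-formedness check for a single UTF-8 sequence. It is not the Swift lexer's code: the function name `validUtf8Length` is hypothetical, and it covers only invalid byte sequences, not code points that are well-formed UTF-8 but disallowed in Swift source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only.
// Returns the byte length (1 to 4) of the well-formed UTF-8 sequence that
// starts at s[pos], or 0 if the bytes there do not form one.
std::size_t validUtf8Length(const std::string &s, std::size_t pos) {
  assert(pos < s.size() && "pos must index a byte of s");
  auto byteAt = [&](std::size_t i) {
    return static_cast<unsigned char>(s[i]);
  };

  const unsigned char lead = byteAt(pos);
  if (lead < 0x80)
    return 1; // plain ASCII

  std::size_t len = 0;
  std::uint32_t cp = 0;
  if ((lead & 0xE0) == 0xC0) {
    len = 2;
    cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    len = 3;
    cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    len = 4;
    cp = lead & 0x07;
  } else {
    return 0; // stray continuation byte or impossible lead byte
  }

  if (pos + len > s.size())
    return 0; // sequence is cut off by the end of the buffer

  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char c = byteAt(pos + i);
    if ((c & 0xC0) != 0x80)
      return 0; // expected a continuation byte (10xxxxxx)
    cp = (cp << 6) | (c & 0x3F);
  }

  // Smallest code point that legitimately needs `len` bytes.
  static const std::uint32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
  if (cp < minForLen[len])
    return 0; // overlong encoding
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0; // UTF-16 surrogate, not a scalar value
  if (cp > 0x10FFFF)
    return 0; // beyond the Unicode range
  return len;
}
```

Under this sketch, a lone `0x80` byte or the overlong pair `0xC0 0x80` is rejected, while the three bytes of U+201C are accepted as well-formed UTF-8 even though the lexer treats that code point specially.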
In some cases, `lexImpl` skips them and diagnoses them, suggesting replacement with whitespace. In other cases they become unknown tokens:
- Contiguous characters that are valid in the body of an identifier but not at its start become an unknown token.
- U+201D (Unicode right quote) becomes a single-character unknown token.
- U+201C (Unicode left quote) becomes a single-character unknown token.
- U+201C [text...] U+201D becomes one long unknown token.
To keep this behavior while converting the skipped case to trivia, I split the logic out of `lexImpl` into a `lexInvalidCharacters` function. I also added an `isStartOfInvalidCharacters` function, which judges whether a position is the start of such a sequence.
The implementation of `isStartOfInvalidCharacters` repeats logic from the switch-case flow in `lexImpl`, which I think is bad, but I do not have a better idea yet.
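The idea of keeping invalid bytes as trivia can be illustrated with a toy lexer. This is only a sketch under loud assumptions, not the code in this PR: everything lives in a hypothetical `sketch` namespace, the two functions merely borrow the names used above, and "invalid" is simplified to mean any byte `>= 0x80`, which is much cruder than the real rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Toy rule (an assumption of this sketch): any byte >= 0x80 is "invalid".
bool isStartOfInvalidCharacters(const std::string &src, std::size_t pos) {
  return pos < src.size() && static_cast<unsigned char>(src[pos]) >= 0x80;
}

// Consumes the run of invalid bytes starting at pos and returns it verbatim.
std::string lexInvalidCharacters(const std::string &src, std::size_t &pos) {
  const std::size_t start = pos;
  while (isStartOfInvalidCharacters(src, pos))
    ++pos;
  return src.substr(start, pos - start);
}

struct Token {
  std::string leadingTrivia; // whitespace and invalid bytes, kept as-is
  std::string text;          // empty only for the end-of-file token
};

bool isSpace(char c) { return c == ' ' || c == '\n'; }

std::vector<Token> lex(const std::string &src) {
  std::vector<Token> tokens;
  std::size_t pos = 0;
  while (true) {
    Token tok;
    // Collect trivia: whitespace and runs of invalid bytes.
    while (pos < src.size()) {
      if (isSpace(src[pos]))
        tok.leadingTrivia += src[pos++];
      else if (isStartOfInvalidCharacters(src, pos))
        tok.leadingTrivia += lexInvalidCharacters(src, pos);
      else
        break;
    }
    // Collect the token text up to the next piece of trivia.
    while (pos < src.size() && !isSpace(src[pos]) &&
           !isStartOfInvalidCharacters(src, pos))
      tok.text += src[pos++];

    const bool atEnd = tok.text.empty();
    tokens.push_back(tok);
    if (atEnd)
      return tokens; // the last token carries any trailing trivia
  }
}

// Rebuilds the source by concatenating trivia and text in order.
std::string roundTrip(const std::vector<Token> &tokens) {
  std::string out;
  for (const Token &tok : tokens)
    out += tok.leadingTrivia + tok.text;
  return out;
}

} // namespace sketch
```

Because the invalid bytes are attached to the next token as leading trivia instead of being dropped, `sketch::roundTrip(sketch::lex(src))` reproduces `src` byte for byte in this toy model, including for inputs that are not valid UTF-8.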
The test case `round_trip_invalids.swift` checks round-tripping of these characters.
The test case `tokens_invalids.swift` checks the diagnostic messages for them and their conversion to tokens and trivia.
Writing invalid bytes and characters directly in a test file would make it hard to read and edit, so I use `sed` in the `RUN` lines to embed these bytes into the file when the test runs.
I think this PR finally makes libSyntax round-trip perfectly with arbitrary source code, even if it is not valid UTF-8 text.