Skip to content

[Do not merge] [Syntax] support invalid characters as trivia #14967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

omochi
Copy link
Contributor

@omochi omochi commented Mar 4, 2018

This PR follows my previous PR #14962.
I will rebase this after previous PR is merged.

This PR update Lexer to support invalid characters as trivia.
Currently, libSyntax does not support them, so round trip conversion is failed if source file contains them.

Invalid characters meaning in here are invalid utf-8 byte sequence and invalid unicode code point for Swift source.
lexImpl skip them and diagnose to replace them to white space.
In others, contiguous characters which is valid for body of identifier but not start becomes to unknown token.
U+201D (unicode right quote) becomes single character unknown token.
U+201C (unicode left quote) becomes single character unknown token.
U+201C [text...] U+201D becomes one long unknown token.

To keep this behavior and convert skipping case to trivia,
I split logic into lexInvalidCharacters function from lexImpl.
And I made isStartOfInvalidCharacters function which judge a position is start point of this.

Implementation of isStartOfInvalidCharacters repeats logic in switch-case flow in lexImpl,
so I think that is bad. But I still not have better idea.

A test case round_trip_invalids.swift checks round trip about these characters.
A test case tokens_invalids.swift checks diagnostic messages about them and token, trivia conversion.
If invalid bytes and chars written in test file directly,
it makes hard to read and edit.
So I use sed in RUN to embed these bytes to file in runtime.

I think that finally this PR makes libSyntax perfect for round trip functionality
with arbitraly source code even if it is not valid UTF-8 text.

@omochi omochi force-pushed the syntax-invalid-chars branch 4 times, most recently from 087ad98 to 4b31531 Compare March 5, 2018 07:56
@omochi omochi force-pushed the syntax-invalid-chars branch from 4b31531 to c56ae83 Compare March 5, 2018 10:23
@omochi
Copy link
Contributor Author

omochi commented Mar 5, 2018

I rebased it. Please review this.

@omochi
Copy link
Contributor Author

omochi commented Mar 5, 2018

I share my design ideas.

Design A

This PR style. Make lexInvalidCharacters and implement full pattern dispatch in this which must equals to lexImpl dispatch.

Pros: Simple control flow. All leading trivias parsed in lexTrivia at once.
Cons: Keep lexInvalidCharacters and lexImpl having same dispatch logic is hard.

Design B

Implemented in here.
#14979

Appending leading trivia in where each skipping invalid chars point.
To implement this, another internal loop is needed in lexTrivia below Restart: label.

Pros: Patch difference is small and no double repeated logics.
Cons: lexImpl control flow is more complex. Trivia modification code are spread at multiple position in source code.

Large problem: I lost consideration about trailing trivia!

Design C

Refactor design A to make abstract character dispatch flow.
Using virtual method and override or make higher kind function which take template lambda in arguments.
And (if use virtual method,) believe C++ optimizer to devirtualize method call.

Pros: Simple flow and no double repeated logics.
Cons: Performance may be worse. Syntactic overhead grow code size and worse readability.

@omochi omochi changed the title [Syntax] support invalid characters as trivia [Do not merge] [Syntax] support invalid characters as trivia Mar 5, 2018
@omochi
Copy link
Contributor Author

omochi commented Mar 5, 2018

To make discussion clear, I am making design B.
In working, I start to feel B is better than A.
I will share this later. Please wait merge this.

@omochi
Copy link
Contributor Author

omochi commented Mar 5, 2018

I clearly understood that design B is better and it does not need another inside loop.
I close this PR and please look #14979.

@omochi omochi closed this Mar 5, 2018
@omochi omochi deleted the syntax-invalid-chars branch March 7, 2018 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant