[Do not merge] [Syntax] support invalid characters as trivia #14967

omochi · 2018-03-04T13:06:02Z

This PR follows my previous PR #14962.
I will rebase this after the previous PR is merged.

This PR updates the Lexer to support invalid characters as trivia.
Currently, libSyntax does not support them, so round-trip conversion fails if a source file contains them.

"Invalid characters" here means invalid UTF-8 byte sequences and Unicode code points that are invalid in Swift source.
lexImpl skips them and emits a diagnostic suggesting that they be replaced with whitespace.
In the other cases, a run of contiguous characters that are valid in the body of an identifier but not at its start becomes an unknown token.
U+201D (Unicode right quote) becomes a single-character unknown token.
U+201C (Unicode left quote) becomes a single-character unknown token.
U+201C [text...] U+201D becomes one long unknown token.
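As a rough illustration of why keeping skipped bytes as trivia restores round trip, here is a minimal Python sketch. It is not the C++ Lexer code: the function names and the "text"/"invalid" labels are hypothetical, and it only covers malformed UTF-8 bytes, not the identifier or quote cases above. The point is that if every skipped byte is stored somewhere, concatenating the pieces gives back the original file.

```python
from typing import List, Tuple

# (kind, raw bytes). The kind labels "text" and "invalid" are hypothetical.
Piece = Tuple[str, bytes]


def valid_utf8_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at data[i], or 0."""
    first = data[i]
    if first < 0x80:
        return 1
    if 0xC2 <= first <= 0xDF:
        need = 2
    elif 0xE0 <= first <= 0xEF:
        need = 3
    elif 0xF0 <= first <= 0xF4:
        need = 4
    else:
        return 0  # stray continuation byte, or a byte that never starts a sequence
    try:
        data[i:i + need].decode("utf-8")  # strict decoding rejects malformed input
    except UnicodeDecodeError:
        return 0
    return need


def split_pieces(data: bytes) -> List[Piece]:
    """Split data into runs of valid text and runs of invalid bytes, losing nothing."""
    pieces: List[Piece] = []
    i = 0
    while i < len(data):
        n = valid_utf8_length(data, i)
        kind = "text" if n else "invalid"
        step = n if n else 1  # an invalid byte is consumed one at a time
        chunk = data[i:i + step]
        if pieces and pieces[-1][0] == kind:
            pieces[-1] = (kind, pieces[-1][1] + chunk)  # extend the current run
        else:
            pieces.append((kind, chunk))
        i += step
    return pieces


source = b"let a\xff\xfe = 1\n"
pieces = split_pieces(source)
# Round trip: joining every piece reproduces the input byte for byte.
assert b"".join(raw for _, raw in pieces) == source
```

If the "invalid" runs were dropped instead of stored, the final assertion would fail, which is the round-trip failure described above.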

To keep this behavior while converting the skipped cases to trivia,
I split the logic out of lexImpl into a lexInvalidCharacters function.
I also added an isStartOfInvalidCharacters function, which judges whether a position is the start of such a sequence.

The implementation of isStartOfInvalidCharacters repeats the logic of the switch-case flow in lexImpl,
which I think is bad, but I do not have a better idea yet.

The test case round_trip_invalids.swift checks round trip for these characters.
The test case tokens_invalids.swift checks the diagnostic messages for them and the conversion to tokens and trivia.
If the invalid bytes and characters were written directly in the test files,
the files would be hard to read and edit.
So I use sed in the RUN lines to embed these bytes into the file at test run time.
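The same idea, sketched in Python rather than sed: the checked-in test file stays plain ASCII with placeholders, and the raw bytes are substituted just before the test runs. The placeholder names `Z1` and `Z2` and the `embed_bytes` helper are hypothetical and only illustrate the technique; the actual tests do this with sed in their RUN lines.

```python
from typing import Dict


def embed_bytes(template: bytes, replacements: Dict[bytes, bytes]) -> bytes:
    """Replace each ASCII placeholder in template with its raw byte sequence."""
    result = template
    for placeholder, raw in replacements.items():
        result = result.replace(placeholder, raw)
    return result


# The template is readable ASCII; Z1 and Z2 are hypothetical placeholders.
template = b"let a = Z1 + Z2\n"
generated = embed_bytes(
    template,
    {
        b"Z1": b"\xc2",          # a lone UTF-8 lead byte (malformed sequence)
        b"Z2": b"\xe2\x80\x9d",  # U+201D encoded as UTF-8
    },
)
```

The placeholders must not collide with any other text in the template, otherwise unrelated parts of the test would be rewritten too.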

I think this PR finally makes libSyntax's round-trip functionality complete
for arbitrary source code, even if it is not valid UTF-8 text.

omochi · 2018-03-05T10:26:04Z

I rebased it. Please review this.

omochi · 2018-03-05T11:39:18Z

Here are my design ideas.

Design A

The style of this PR. Add lexInvalidCharacters and implement the full pattern dispatch in it, which must match the dispatch in lexImpl.

Pros: Simple control flow. All leading trivia is parsed in lexTrivia at once.
Cons: Keeping the dispatch logic of lexInvalidCharacters and lexImpl identical is hard.

Design B

Implemented in #14979.

Append leading trivia at each point where invalid characters are skipped.
~~To implement this, another internal loop is needed in lexTrivia below Restart: label.~~

Pros: The patch is small and there is no duplicated logic.
Cons: ~~lexImpl control flow is more complex.~~ Trivia modification code is spread across multiple places in the source code.

Large problem: I forgot to consider trailing trivia!

Design C

Refactor design A to abstract the character dispatch flow,
using virtual methods with overrides, or a higher-order function that takes a template lambda as an argument.
And (if using virtual methods) trust the C++ optimizer to devirtualize the method calls.

Pros: Simple flow and no duplicated logic.
Cons: Performance may be worse. The syntactic overhead increases code size and hurts readability.
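To make the shape of design C concrete, here is a minimal Python sketch of the higher-order-function variant: the per-character decision exists once, and each caller passes in its own actions. Everything here is an assumption for illustration. The names `dispatch`, `is_start_of_skipped`, and `describe` are hypothetical, the "skipped" rule (C0 control characters other than tab, newline, carriage return) is a stand-in, and Python's identifier rules are used in place of Swift's.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def dispatch(
    ch: str,
    on_skipped: Callable[[str], T],
    on_unknown: Callable[[str], T],
    on_other: Callable[[str], T],
) -> T:
    """Single copy of the per-character decision; callers supply the actions."""
    if ch in ("\u201c", "\u201d"):
        return on_unknown(ch)  # curly quotes become unknown tokens
    if ord(ch) >= 0x80 and not ch.isidentifier() and ("_" + ch).isidentifier():
        return on_unknown(ch)  # valid in an identifier body but not at its start
    if ord(ch) < 0x20 and ch not in "\t\n\r":
        return on_skipped(ch)  # stand-in rule for characters that are skipped
    return on_other(ch)


def is_start_of_skipped(ch: str) -> bool:
    """Predicate for the trivia lexer, built on the shared dispatch."""
    return dispatch(ch, lambda c: True, lambda c: False, lambda c: False)


def describe(ch: str) -> str:
    """What the main lexer would do with ch, built on the same dispatch."""
    return dispatch(
        ch,
        lambda c: "trivia",
        lambda c: "unknown token",
        lambda c: "normal lexing",
    )
```

Both consumers stay in sync automatically because neither repeats the conditions; the cost is the extra indirection through the callbacks, which is the performance and readability concern listed above.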

omochi · 2018-03-05T13:58:27Z

To make the discussion clearer, I am implementing design B.
While working on it, I have started to feel that B is better than A.
I will share it later. Please hold off on merging this.

omochi · 2018-03-05T14:58:52Z

I now clearly understand that design B is better and that it does not need another inner loop.
I am closing this PR; please look at #14979.

omochi force-pushed the syntax-invalid-chars branch 4 times, most recently from 087ad98 to 4b31531 Compare March 5, 2018 07:56

omochi added 3 commits March 5, 2018 19:23

[Parse] split lexInvalidCharacters from lexImpl

4ee8506

[Parse] add InLexTrivia flag to lexInvalidCharacters

ca19fe4

[Parse] handle invalid chars in lexTrivia

c56ae83

omochi force-pushed the syntax-invalid-chars branch from 4b31531 to c56ae83 Compare March 5, 2018 10:23

omochi changed the title ~~[Syntax] support invalid characters as trivia~~ [Do not merge] [Syntax] support invalid characters as trivia Mar 5, 2018

omochi mentioned this pull request Mar 5, 2018

[Do not merge] [Syntax] parse invalid chars as trivia #14979

Closed

omochi closed this Mar 5, 2018

omochi mentioned this pull request Mar 6, 2018

[Syntax] Parse invalid characters as trivia #15011

Merged

omochi deleted the syntax-invalid-chars branch March 7, 2018 14:37
