[Do not merge] [Syntax] parse invalid chars as trivia #14979

omochi · 2018-03-05T14:56:44Z

I reject my PR and will reconstruct improved one.

This implementation does not lex invalid chars in trailing trivia.
Token sequence is split at invalid byte sequence and it becomes leading trivia.
It keeps round trip but does not follow ground trivia lexing rules.
I believe that to achieve this perfectly is very important for user of libSyntax in future.

I will close this PR later after reviewers read my response to each review comment.

I submitted new PR.
#15011

This PR add ability to Lexer that it handles invalid chars as trivia.
It makes libSyntax can round trip convert swift source which contains invalid chars.

Invalid chars means here is invalid UTF-8 byte sequence and invalid Unicode codepoint in Swift.
~~They are skipped in lexTrivia.~~
They are skipped in lexImpl before.

I added logic that push them to LeadingTrivia.

I refactor around there to make code more readable.

And I add round trip test case and token, trivia syntax handling.

I use replacing trick to embed such invalid byte sequence to test input.

I submitted another PR #14967 before which achieve same purpose with this PR.
But I found better design so write this PR and rejected #14967 by myself.

I think that finally this PR makes libSyntax perfect for round trip functionality
with arbitraly source code even if it is not valid UTF-8 text.

Please review this PR.

harlanhaskins · 2018-03-05T19:15:48Z

test/Syntax/tokens_invalids.swift

+// CHECK-LABEL: 31:1
+// CHECK-NEXT:  (Token unknown
+// CHECK-NEXT:   (trivia newline 1)
+// CHECK-NEXT:   (text="\xE2\x80\x9C"))


Are these characters invalid? If so, should they be garbage_text trivia instead of token text?

@harlanhaskins
Ah.. no.

E2 80 9C is U+201C, unicode style left quote.
http://unicode.org/cldr/utility/character.jsp?a=201C

E2 80 9D is U+201D, unicode style right quote.
https://unicode.org/cldr/utility/character.jsp?a=201D

They becomes to single character unknown token, or,
If found that pattern they enclosure text, it becomes long unknown token.
e.g. <201C>hello world<201D

The reason of this test in here, just that the logic about handling them are there in invalid character detection control flow.

And these characters are safe in text file,
I will split test case file about U+201C/201D.

harlanhaskins · 2018-03-05T19:16:15Z

Thanks for tackling this big oversight!

nkcsgexi

LGTM. Thank you for preserving roundtripness for invalid chars! In future, please try to put the functional change part and the refactoring (NFC) part to separate commits to help code reviewers better understand the impact 😄

rintaro · 2018-03-06T00:32:47Z

@omochi
To make it clear, this PR doesn't lex invalid UTF8 sequence as a trailing trivia, right?
For example identifier SP <invalid UTF8 seq> SP LF identifier will be:

(Token identifier
  (text="...")
  (trivia space 1))
(Token identifier
  (trivia garbage_text <invalid UTF8 seq>)
  (trivia space 1)
  (trivia newline 1)
  (text="..."))

Could you add test case for this?

rintaro · 2018-03-06T00:35:23Z

@nkcsgexi Adding ?w=1 makes it easy to review this kind of diffs :)
https://github.com/apple/swift/pull/14979/files?w=1

rintaro · 2018-03-06T00:52:28Z

test/Syntax/tokens_invalids.swift

+// 5a is Z. "ZN" style marker is used for marker. N is number.
+
+// C2 is utf8 2 byte character start byte.
+// RUN: cat %t | sed 's/'$(echo -ne "\x5a1")'/'$(echo -ne "\xc2")'/g' > %t.sed


// RUN: cat %t | sed 's/Z1/'$(echo -ne "\xc2")'/g' > %t.sed ^^

doesn't work? I think it's easier to read.

No. If I do so, Z1 in sed line also replaced by this sed itself.
It makes problematic input source.

If Lexer is correct, it also skipped as line comment of course.
But for example it introduce other complex discussion about diagnostics about invalid character in comment.

omochi · 2018-03-06T02:23:29Z

@nkcsgexi OK I will split commit.

@rintaro Oh no, you are right.
I missed consideration about trailing trivia.

Ofcourse I will add test case about it.
But adding this lexing is difficult from here...
We need character dispatch in lexTrivia same as in lexImpl
which is represented as switch-case logic.

My previous PR did not have trailing trivia test case either.
But It may handle trailing trivia correctly.
#14967

During my design refactoring, I lost this functionality.
(I call previous PR as design A)

Please see my lexInvalidCharacters function implementation in previous PR
and my design discussion #14967 (comment) .

How approach is best about this problem?

omochi · 2018-03-06T02:59:39Z

I reject my PR and will reconstruct improved one.

This implementation does not lex invalid chars in trailing trivia.
Token sequence is split at invalid byte sequence and it becomes leading trivia.
It keeps round trip but does not follow ground trivia lexing rules.
I believe that to achieve this perfectly is very important for user of libSyntax in future.

I will close this PR later after reviewers read my response to each review comment.

[Syntax] parse invalid chars as trivia

75c3397

omochi mentioned this pull request Mar 5, 2018

[Do not merge] [Syntax] support invalid characters as trivia #14967

Closed

nkcsgexi requested review from rintaro and harlanhaskins March 5, 2018 19:12

harlanhaskins reviewed Mar 5, 2018

View reviewed changes

nkcsgexi approved these changes Mar 5, 2018

View reviewed changes

rintaro reviewed Mar 6, 2018

View reviewed changes

omochi changed the title ~~[Syntax] parse invalid chars as trivia~~ [Do not merge] [Syntax] parse invalid chars as trivia Mar 6, 2018

omochi mentioned this pull request Mar 6, 2018

[Syntax] Parse invalid characters as trivia #15011

Merged

omochi closed this Mar 6, 2018

omochi deleted the syntax-invalid-chars-B branch March 7, 2018 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Do not merge] [Syntax] parse invalid chars as trivia #14979

[Do not merge] [Syntax] parse invalid chars as trivia #14979

Uh oh!

omochi commented Mar 5, 2018 •

edited

Loading

Uh oh!

harlanhaskins Mar 5, 2018

Uh oh!

omochi Mar 6, 2018

Uh oh!

harlanhaskins commented Mar 5, 2018

Uh oh!

nkcsgexi left a comment

Uh oh!

rintaro commented Mar 6, 2018 •

edited

Loading

Uh oh!

rintaro commented Mar 6, 2018

Uh oh!

rintaro Mar 6, 2018 •

edited

Loading

Uh oh!

omochi Mar 6, 2018

Uh oh!

omochi commented Mar 6, 2018 •

edited

Loading

Uh oh!

omochi commented Mar 6, 2018

Uh oh!

Uh oh!

[Do not merge] [Syntax] parse invalid chars as trivia #14979

[Do not merge] [Syntax] parse invalid chars as trivia #14979

Uh oh!

Conversation

omochi commented Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

harlanhaskins Mar 5, 2018

Choose a reason for hiding this comment

Uh oh!

omochi Mar 6, 2018

Choose a reason for hiding this comment

Uh oh!

harlanhaskins commented Mar 5, 2018

Uh oh!

nkcsgexi left a comment

Choose a reason for hiding this comment

Uh oh!

rintaro commented Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rintaro commented Mar 6, 2018

Uh oh!

rintaro Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

omochi Mar 6, 2018

Choose a reason for hiding this comment

Uh oh!

omochi commented Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

omochi commented Mar 6, 2018

Uh oh!

Uh oh!

omochi commented Mar 5, 2018 •

edited

Loading

rintaro commented Mar 6, 2018 •

edited

Loading

rintaro Mar 6, 2018 •

edited

Loading

omochi commented Mar 6, 2018 •

edited

Loading