Skip to content

[Do not merge] [Syntax] parse invalid chars as trivia #14979

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

omochi
Copy link
Contributor

@omochi omochi commented Mar 5, 2018

I reject my PR and will reconstruct improved one.

This implementation does not lex invalid chars in trailing trivia.
Token sequence is split at invalid byte sequence and it becomes leading trivia.
It keeps round trip but does not follow ground trivia lexing rules.
I believe that to achieve this perfectly is very important for user of libSyntax in future.

I will close this PR later after reviewers read my response to each review comment.

I submitted new PR.
#15011


This PR add ability to Lexer that it handles invalid chars as trivia.
It makes libSyntax can round trip convert swift source which contains invalid chars.

Invalid chars means here is invalid UTF-8 byte sequence and invalid Unicode codepoint in Swift.
They are skipped in lexTrivia.
They are skipped in lexImpl before.

I added logic that push them to LeadingTrivia.

I refactor around there to make code more readable.

And I add round trip test case and token, trivia syntax handling.

I use replacing trick to embed such invalid byte sequence to test input.

I submitted another PR #14967 before which achieve same purpose with this PR.
But I found better design so write this PR and rejected #14967 by myself.

I think that finally this PR makes libSyntax perfect for round trip functionality
with arbitraly source code even if it is not valid UTF-8 text.

Please review this PR.

// CHECK-LABEL: 31:1
// CHECK-NEXT: (Token unknown
// CHECK-NEXT: (trivia newline 1)
// CHECK-NEXT: (text="\xE2\x80\x9C"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these characters invalid? If so, should they be garbage_text trivia instead of token text?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harlanhaskins
Ah.. no.

E2 80 9C is U+201C, unicode style left quote.
http://unicode.org/cldr/utility/character.jsp?a=201C

E2 80 9D is U+201D, unicode style right quote.
https://unicode.org/cldr/utility/character.jsp?a=201D

They becomes to single character unknown token, or,
If found that pattern they enclosure text, it becomes long unknown token.
e.g. <201C>hello world<201D

The reason of this test in here, just that the logic about handling them are there in invalid character detection control flow.

And these characters are safe in text file,
I will split test case file about U+201C/201D.

@harlanhaskins
Copy link
Contributor

Thanks for tackling this big oversight!

Copy link
Contributor

@nkcsgexi nkcsgexi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for preserving roundtripness for invalid chars! In future, please try to put the functional change part and the refactoring (NFC) part to separate commits to help code reviewers better understand the impact 😄

@rintaro
Copy link
Member

rintaro commented Mar 6, 2018

@omochi
To make it clear, this PR doesn't lex invalid UTF8 sequence as a trailing trivia, right?
For example identifier SP <invalid UTF8 seq> SP LF identifier will be:

(Token identifier
  (text="...")
  (trivia space 1))
(Token identifier
  (trivia garbage_text <invalid UTF8 seq>)
  (trivia space 1)
  (trivia newline 1)
  (text="..."))

Could you add test case for this?

@rintaro
Copy link
Member

rintaro commented Mar 6, 2018

@nkcsgexi Adding ?w=1 makes it easy to review this kind of diffs :)
https://github.com/apple/swift/pull/14979/files?w=1

// 5a is Z. "ZN" style marker is used for marker. N is number.

// C2 is utf8 2 byte character start byte.
// RUN: cat %t | sed 's/'$(echo -ne "\x5a1")'/'$(echo -ne "\xc2")'/g' > %t.sed
Copy link
Member

@rintaro rintaro Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// RUN: cat %t | sed 's/Z1/'$(echo -ne "\xc2")'/g' > %t.sed
                        ^^

doesn't work? I think it's easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. If I do so, Z1 in sed line also replaced by this sed itself.
It makes problematic input source.

If Lexer is correct, it also skipped as line comment of course.
But for example it introduce other complex discussion about diagnostics about invalid character in comment.

@omochi
Copy link
Contributor Author

omochi commented Mar 6, 2018

@nkcsgexi OK I will split commit.

@rintaro Oh no, you are right.
I missed consideration about trailing trivia.

Ofcourse I will add test case about it.
But adding this lexing is difficult from here...
We need character dispatch in lexTrivia same as in lexImpl
which is represented as switch-case logic.

My previous PR did not have trailing trivia test case either.
But It may handle trailing trivia correctly.
#14967

During my design refactoring, I lost this functionality.
(I call previous PR as design A)

Please see my lexInvalidCharacters function implementation in previous PR
and my design discussion #14967 (comment) .

How approach is best about this problem?

@omochi omochi changed the title [Syntax] parse invalid chars as trivia [Do not merge] [Syntax] parse invalid chars as trivia Mar 6, 2018
@omochi
Copy link
Contributor Author

omochi commented Mar 6, 2018

I reject my PR and will reconstruct improved one.

This implementation does not lex invalid chars in trailing trivia.
Token sequence is split at invalid byte sequence and it becomes leading trivia.
It keeps round trip but does not follow ground trivia lexing rules.
I believe that to achieve this perfectly is very important for user of libSyntax in future.

I will close this PR later after reviewers read my response to each review comment.

@omochi omochi closed this Mar 6, 2018
@omochi omochi deleted the syntax-invalid-chars-B branch March 7, 2018 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants