-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[Do not merge] [Syntax] parse invalid chars as trivia #14979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// CHECK-LABEL: 31:1 | ||
// CHECK-NEXT: (Token unknown | ||
// CHECK-NEXT: (trivia newline 1) | ||
// CHECK-NEXT: (text="\xE2\x80\x9C")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these characters invalid? If so, should they be garbage_text
trivia instead of token text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harlanhaskins
Ah.. no.
E2 80 9C is U+201C, unicode style left quote.
http://unicode.org/cldr/utility/character.jsp?a=201C
E2 80 9D is U+201D, unicode style right quote.
https://unicode.org/cldr/utility/character.jsp?a=201D
They becomes to single character unknown token, or,
If found that pattern they enclosure text, it becomes long unknown token.
e.g. <201C>hello world<201D
The reason of this test in here, just that the logic about handling them are there in invalid character detection control flow.
And these characters are safe in text file,
I will split test case file about U+201C/201D.
Thanks for tackling this big oversight! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for preserving roundtripness for invalid chars! In future, please try to put the functional change part and the refactoring (NFC) part to separate commits to help code reviewers better understand the impact 😄
@omochi
Could you add test case for this? |
@nkcsgexi Adding |
// 5a is Z. "ZN" style marker is used for marker. N is number. | ||
|
||
// C2 is utf8 2 byte character start byte. | ||
// RUN: cat %t | sed 's/'$(echo -ne "\x5a1")'/'$(echo -ne "\xc2")'/g' > %t.sed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// RUN: cat %t | sed 's/Z1/'$(echo -ne "\xc2")'/g' > %t.sed
^^
doesn't work? I think it's easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. If I do so, Z1
in sed line also replaced by this sed itself.
It makes problematic input source.
If Lexer is correct, it also skipped as line comment of course.
But for example it introduce other complex discussion about diagnostics about invalid character in comment.
@nkcsgexi OK I will split commit. @rintaro Oh no, you are right. Ofcourse I will add test case about it. My previous PR did not have trailing trivia test case either. During my design refactoring, I lost this functionality. Please see my How approach is best about this problem? |
I reject my PR and will reconstruct improved one. This implementation does not lex invalid chars in trailing trivia. I will close this PR later after reviewers read my response to each review comment. |
I reject my PR and will reconstruct improved one.
This implementation does not lex invalid chars in trailing trivia.
Token sequence is split at invalid byte sequence and it becomes leading trivia.
It keeps round trip but does not follow ground trivia lexing rules.
I believe that to achieve this perfectly is very important for user of libSyntax in future.
I will close this PR later after reviewers read my response to each review comment.
I submitted new PR.
#15011
This PR add ability to Lexer that it handles invalid chars as trivia.
It makes libSyntax can round trip convert swift source which contains invalid chars.
Invalid chars means here is invalid UTF-8 byte sequence and invalid Unicode codepoint in Swift.
They are skipped inlexTrivia
.They are skipped in
lexImpl
before.I added logic that push them to
LeadingTrivia
.I refactor around there to make code more readable.
And I add round trip test case and token, trivia syntax handling.
I use replacing trick to embed such invalid byte sequence to test input.
I submitted another PR #14967 before which achieve same purpose with this PR.
But I found better design so write this PR and rejected #14967 by myself.
I think that finally this PR makes libSyntax perfect for round trip functionality
with arbitraly source code even if it is not valid UTF-8 text.
Please review this PR.