-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[Parse] Refine UTF8 validation-related aspects #70763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Parse] Refine UTF8 validation-related aspects #70763
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Looks good to me 👍🏽
Could you also open a corresponding PR to fix this in the new lexer that’s implemented in Swift? https://github.com/apple/swift-syntax/blob/44f69680412c7802a7512f4e34f143c210b4dee2/Sources/SwiftParser/Lexer/UnicodeScalarExtensions.swift
return C <= 0x80 || (C >= 0xC2 && C < 0xF5); | ||
return C < 0x80 || (C >= 0xC2 && C < 0xF5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: Do you know of any character that uses a 0x80
continuation byte? If so, it would be nice to add a test case containing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahoppen
Added a test case that uses edge continuation bytes (0x80 and 0xBF), and another case for the lexer to accurately diagnose when there's a 0x80 in the source.
@ahoppen BTW, recently started studying Swift compiler infrastructure and am a big fan of your playground series introducing the compiler. I appreciate your efforts for the community; it helps a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extensive test cases!
@swift-ci Please smoke test |
@swift-ci Please smoke test Linux |
Refined some UTF8 validation-related aspects.
Ensure that the function call
isStartOfUTF8Character(0x80)
returns false.The expression
!isStartOfUTF8Character(CurByte)
should handle the pattern of 0b10xxxxxx when the EncodedBytes evaluates to 1.Guess there have been API changes in the LLVM function, and we no longer require the type conversion wrapper function for it. This is because the llvm::countl_one function now takes a parameter as a template type.