Skip to content

[Parse] Refine UTF8 validation-related aspects #70763

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 10, 2024

Conversation

pinkjuice66
Copy link
Contributor

Refined some UTF8 validation-related aspects.

  • '0x80' is not a valid byte for the start of a UTF8 character; it's a continuation byte.(RFC 3629)
    Ensure that the function call isStartOfUTF8Character(0x80) returns false.
  • Removed a duplicate condition check : (EncodedBytes == 1 || !isStartOfUTF8Character(CurByte))
    The expression !isStartOfUTF8Character(CurByte) should handle the pattern of 0b10xxxxxx when the EncodedBytes evaluates to 1.
  • Removed CLO8(:) function.
    Guess there have been API changes in the LLVM function, and we no longer require the type conversion wrapper function for it. This is because the llvm::countl_one function now takes a parameter as a template type.

Copy link
Member

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks good to me 👍🏽

Could you also open a corresponding PR to fix this in the new lexer that’s implemented in Swift? https://github.com/apple/swift-syntax/blob/44f69680412c7802a7512f4e34f143c210b4dee2/Sources/SwiftParser/Lexer/UnicodeScalarExtensions.swift

Comment on lines -105 to +99
return C <= 0x80 || (C >= 0xC2 && C < 0xF5);
return C < 0x80 || (C >= 0xC2 && C < 0xF5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Do you know of any character that uses a 0x80 continuation byte? If so, it would be nice to add a test case containing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahoppen
Added a test case that uses edge continuation bytes (0x80 and 0xBF), and another case for the lexer to accurately diagnose when there's a 0x80 in the source.

@pinkjuice66
Copy link
Contributor Author

@ahoppen
Sure, will be following up on it in the Swift version.

BTW, recently started studying Swift compiler infrastructure and am a big fan of your playground series introducing the compiler. I appreciate your efforts for the community; it helps a lot.

Copy link
Member

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extensive test cases!

@ahoppen
Copy link
Member

ahoppen commented Jan 10, 2024

@swift-ci Please smoke test

@ahoppen
Copy link
Member

ahoppen commented Jan 10, 2024

@swift-ci Please smoke test Linux

@ahoppen ahoppen merged commit 5edd379 into swiftlang:main Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants