[Parse] Refine UTF8 validation-related aspects #70763

pinkjuice66 · 2024-01-08T14:59:22Z

Refined some UTF8 validation-related aspects.

0x80 is not a valid start byte for a UTF8 character; it is a continuation byte (RFC 3629).
Ensure that the function call isStartOfUTF8Character(0x80) returns false.
Removed a duplicate condition check: (EncodedBytes == 1 || !isStartOfUTF8Character(CurByte))
The expression !isStartOfUTF8Character(CurByte) already covers the 0b10xxxxxx pattern, which is the case where EncodedBytes evaluates to 1.
Removed the CLO8(:) function.
It appears the LLVM API has changed: llvm::countl_one now takes its parameter as a template type, so the type conversion wrapper function is no longer required.
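To make the first two points concrete, here is a minimal, self-contained sketch rather than the actual lexer code. The predicate body matches the corrected line in the diff; the parameter type is an assumption, and countLeadingOnes and encodedLength are hypothetical helpers (the former stands in for llvm::countl_one so the example builds without LLVM).

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for llvm::countl_one: counts the leading 1 bits of an
// unsigned value, taking the bit width from the template parameter.
template <typename T>
unsigned countLeadingOnes(T Value) {
  static_assert(std::is_unsigned<T>::value, "expects an unsigned type");
  unsigned Count = 0;
  for (unsigned Bit = sizeof(T) * 8; Bit-- > 0;) {
    if (((Value >> Bit) & 1) == 0)
      break;
    ++Count;
  }
  return Count;
}

// Corrected predicate: 0x80 is a continuation byte, so it must be rejected.
// Valid start bytes are ASCII (0x00-0x7F) and lead bytes 0xC2-0xF4.
bool isStartOfUTF8Character(unsigned char C) {
  return C < 0x80 || (C >= 0xC2 && C < 0xF5);
}

// Hypothetical helper: the sequence length implied by a first byte,
// or 0 if the byte cannot start a UTF8 character.
unsigned encodedLength(unsigned char CurByte) {
  if (CurByte < 0x80)
    return 1; // ASCII
  if (!isStartOfUTF8Character(CurByte))
    return 0; // Includes every 0b10xxxxxx byte, so no separate "== 1" check.
  return countLeadingOnes<std::uint8_t>(CurByte);
}
```

Every byte with exactly one leading 1 bit lies in 0x80-0xBF, which the predicate rejects, so an additional EncodedBytes == 1 test would be redundant in a check shaped like this one.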

ahoppen

Thanks. Looks good to me 👍🏽

Could you also open a corresponding PR to fix this in the new lexer that’s implemented in Swift? https://github.com/apple/swift-syntax/blob/44f69680412c7802a7512f4e34f143c210b4dee2/Sources/SwiftParser/Lexer/UnicodeScalarExtensions.swift

ahoppen · 2024-01-08T22:42:44Z

lib/Parse/Lexer.cpp

-  return C <= 0x80 || (C >= 0xC2 && C < 0xF5);
+  return C < 0x80 || (C >= 0xC2 && C < 0xF5);


Out of curiosity: Do you know of any character that uses a 0x80 continuation byte? If so, it would be nice to add a test case containing it.

@ahoppen
Added a test case that uses edge continuation bytes (0x80 and 0xBF), and another case for the lexer to accurately diagnose when there's a 0x80 in the source.
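As an illustration of which characters carry these edge continuation bytes, the sketch below encodes a code point by hand. encodeUTF8 is a hypothetical helper written for this example; it is not part of the lexer and not the test added in this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one Unicode scalar value as UTF8 bytes.
// Returns an empty vector for surrogates and for values above U+10FFFF.
std::vector<std::uint8_t> encodeUTF8(std::uint32_t CodePoint) {
  std::vector<std::uint8_t> Bytes;
  // Builds a continuation byte 0b10xxxxxx from the low six bits.
  auto continuation = [](std::uint32_t Bits) {
    return static_cast<std::uint8_t>(0x80 | (Bits & 0x3F));
  };
  if (CodePoint < 0x80) {
    Bytes.push_back(static_cast<std::uint8_t>(CodePoint));
  } else if (CodePoint < 0x800) {
    Bytes.push_back(static_cast<std::uint8_t>(0xC0 | (CodePoint >> 6)));
    Bytes.push_back(continuation(CodePoint));
  } else if (CodePoint < 0x10000) {
    if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
      return Bytes; // Surrogates are not encodable in UTF8.
    Bytes.push_back(static_cast<std::uint8_t>(0xE0 | (CodePoint >> 12)));
    Bytes.push_back(continuation(CodePoint >> 6));
    Bytes.push_back(continuation(CodePoint));
  } else if (CodePoint <= 0x10FFFF) {
    Bytes.push_back(static_cast<std::uint8_t>(0xF0 | (CodePoint >> 18)));
    Bytes.push_back(continuation(CodePoint >> 12));
    Bytes.push_back(continuation(CodePoint >> 6));
    Bytes.push_back(continuation(CodePoint));
  }
  return Bytes;
}
```

With this helper, U+0080 encodes to C2 80 and U+00C0 (À) to C3 80, both ending in the lowest continuation byte, while U+07FF encodes to DF BF, ending in the highest one.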

pinkjuice66 · 2024-01-09T10:47:30Z

@ahoppen
Sure, will be following up on it in the Swift version.

BTW, recently started studying Swift compiler infrastructure and am a big fan of your playground series introducing the compiler. I appreciate your efforts for the community; it helps a lot.

ahoppen

Thanks for the extensive test cases!

ahoppen · 2024-01-10T00:40:15Z

@swift-ci Please smoke test

ahoppen · 2024-01-10T20:42:33Z

@swift-ci Please smoke test Linux

pinkjuice66 added 3 commits January 8, 2024 20:20

[Parse] Correct the range for the start of a UTF8 character

1269b97

[Parse] Remove duplicate condition check

d1ac870

[Parse] Eliminate unnecessary type conversion wrapper function

e6d8d39

pinkjuice66 requested review from ahoppen, bnbarham, CodaFi, DougGregor, hamishknight and rintaro as code owners January 8, 2024 14:59

ahoppen approved these changes Jan 8, 2024

[Parse] Add test cases for validating UTF-8 correctness

50d6b1f

ahoppen approved these changes Jan 10, 2024

ahoppen enabled auto-merge January 10, 2024 00:40

pinkjuice66 mentioned this pull request Jan 10, 2024

[Parser] Correct the start byte range for UTF8 characters. swiftlang/swift-syntax#2426

Merged

ahoppen merged commit 5edd379 into swiftlang:main Jan 10, 2024
