[Parser] Correct the start byte range for UTF8 characters. #2426

Merged 2 commits on Jan 12, 2024

33 changes: 20 additions & 13 deletions Sources/SwiftParser/Lexer/UnicodeScalarExtensions.swift
@@ -156,12 +156,6 @@ extension Unicode.Scalar {
// including and above the DEL character U+7F.
return self.value >= 0x20 && self.value < 0x7F
}

- var isStartOfUTF8Character: Bool {
- // RFC 2279: The octet values FE and FF never appear.
- // RFC 3629: The octet values C0, C1, F5 to FF never appear.
- return self.value <= 0x80 || (self.value >= 0xC2 && self.value < 0xF5)
- }
}

extension Unicode.Scalar {
@@ -179,20 +173,25 @@ extension Unicode.Scalar {
return Unicode.Scalar(curByte)
}

- // Read the number of high bits set, which indicates the number of bytes in
- // the character.
- let encodedBytes = (~(UInt32(curByte) << 24)).leadingZeroBitCount
-
- // If this is 0b10XXXXXX, then it is a continuation character.
- if encodedBytes == 1 || !Unicode.Scalar(curByte).isStartOfUTF8Character {
+ // If this is not the start of a UTF8 character,
+ // then it is either a continuation byte or an invalid UTF8 code point.
+ if !curByte.isStartOfUTF8Character {
  // Skip until we get the start of another character. This is guaranteed to
  // at least stop at the nul at the end of the buffer.
- while let peeked = peek(), !Unicode.Scalar(peeked).isStartOfUTF8Character {
+ while let peeked = peek(), !peeked.isStartOfUTF8Character {
  _ = advance()
  }
  return nil
  }
+
+ // Read the number of high bits set, which indicates the number of bytes in
+ // the character.
+ let encodedBytes = (~curByte).leadingZeroBitCount
+ // We have a multi-byte UTF-8 scalar.
+ // Single-byte UTF-8 scalars are handled at the start of the function by checking `curByte < 0x80`.
+ // `isStartOfUTF8Character` guaranteed that the `curByte` has 2 to 4 leading ones.
+ precondition(encodedBytes >= 2 && encodedBytes <= 4)
Member

Could you add a doc comment explaining why this precondition holds? Something like:

// We have a multi-byte UTF-8 scalar.
// Single-byte UTF-8 scalars are handled at the start of the function by checking `curByte < 0x80`.
// `isStartOfUTF8Character` guaranteed that the `curByte` has 2 to 4 leading ones.

Just to make clear that we should never hit this precondition, even when running the lexer on invalid UTF-8.

Contributor Author

@pinkjuice66, Jan 12, 2024

Added!
Your wording captures the rationale for the precondition accurately, so I used it without modification.
@ahoppen
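
The claim in the suggested comment, that any byte accepted as a multi-byte start byte has 2 to 4 leading ones, can be checked exhaustively over all 256 byte values. The sketch below is illustrative only: `isUTF8StartByteSketch` is a hypothetical stand-in that copies the range check from this PR and is not part of swift-syntax.

```swift
// Hypothetical stand-in for `UInt8.isStartOfUTF8Character` from this PR.
func isUTF8StartByteSketch(_ byte: UInt8) -> Bool {
  return byte < 0x80 || (byte >= 0xC2 && byte < 0xF5)
}

// Record the number of leading one bits for every accepted non-ASCII byte.
// Inverting the byte turns leading ones into leading zeros, which
// `leadingZeroBitCount` then counts.
var leadingOnesByByte: [UInt8: Int] = [:]
for byte in UInt8.min...UInt8.max where byte >= 0x80 && isUTF8StartByteSketch(byte) {
  leadingOnesByByte[byte] = (~byte).leadingZeroBitCount
}

// The precondition in the diff holds if every count lies in 2...4.
let allInRange = leadingOnesByByte.values.allSatisfy { (2...4).contains($0) }
print(allInRange)
```

Because ASCII bytes return early and every other accepted byte lies in 0xC2 through 0xF4, the precondition should be unreachable for any input, well-formed or not.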


// Drop the high bits indicating the # bytes of the result.
var charValue = UInt32(curByte << encodedBytes) >> encodedBytes
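
As a minimal sketch of what the shift pair on the line above does: shifting left inside `UInt8` discards the length-marker bits, and shifting right again moves the payload bits back into place. The continuation-byte loop is collapsed in this view, so the `decodeSketch` helper below is a hypothetical reconstruction of standard UTF-8 decoding rather than the lexer's actual code, and it assumes well-formed input.

```swift
// Hypothetical helper: decodes one well-formed multi-byte UTF-8 sequence.
func decodeSketch(_ bytes: [UInt8]) -> UInt32 {
  let startByte = bytes[0]
  // Number of leading ones equals the number of bytes in the sequence.
  let encodedBytes = (~startByte).leadingZeroBitCount
  // Drop the high marker bits, keeping only the payload of the start byte.
  var charValue = UInt32(startByte << encodedBytes) >> encodedBytes
  // Each continuation byte contributes its low six bits.
  for continuation in bytes.dropFirst() {
    charValue = (charValue << 6) | UInt32(continuation & 0x3F)
  }
  return charValue
}

let twoByte = decodeSketch([0xC3, 0x80])
let threeByte = decodeSketch([0xE4, 0xBF, 0xBF])
let fourByte = decodeSketch([0xF0, 0x9F, 0x80, 0x80])
```

For the three sample sequences, taken from the test comment in this PR, the results are the code points U+00C0, U+4FFF, and U+1F000.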

@@ -252,3 +251,11 @@ extension Unicode.Scalar {
return self.lexing(advance: advance, peek: peek)
}
}

+ extension UInt8 {
+ var isStartOfUTF8Character: Bool {
+ // RFC 2279: The octet values FE and FF never appear.
+ // RFC 3629: The octet values C0, C1, F5 to FF never appear.
+ return self < 0x80 || (self >= 0xC2 && self < 0xF5)
+ }
+ }
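
Apart from moving the property from `Unicode.Scalar` to `UInt8`, the diff changes the first comparison from `<= 0x80` to `< 0x80`. The sketch below compares the two range checks over every byte value; `oldIsStart` and `newIsStart` are hypothetical free-function copies written for illustration, not names from the repository.

```swift
// Hypothetical copy of the removed check (originally on `Unicode.Scalar.value`).
func oldIsStart(_ byte: UInt8) -> Bool {
  return byte <= 0x80 || (byte >= 0xC2 && byte < 0xF5)
}

// Hypothetical copy of the added check on `UInt8`.
func newIsStart(_ byte: UInt8) -> Bool {
  return byte < 0x80 || (byte >= 0xC2 && byte < 0xF5)
}

// Collect every byte on which the two predicates disagree.
let disagreements = (UInt8.min...UInt8.max).filter { oldIsStart($0) != newIsStart($0) }
print(disagreements.map { String($0, radix: 16) })
```

The only disagreement is 0x80, the lowest continuation byte, which the old check treated as the start of a character.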
19 changes: 19 additions & 0 deletions Tests/SwiftParserTest/LexerTests.swift
@@ -1504,4 +1504,23 @@ public class LexerTests: ParserTestCase {
]
)
}

+ func testUnicodeContainTheEdgeContinuationByte() {
+ // A continuation byte must be in the range greater than or
+ // equal to 0x80 and less than or equal to 0xBF
+
+ // À(0xC3 0x80), 㗀(0xE3 0x97 0x80), 🀀(0xF0 0x9F 0x80 0x80),
+ // ÿ(0xC3 0xBF), 俿(0xE4 0xBF 0xBF), 𐐿(0xF0 0x90 0x90 0xBF)
+ assertLexemes(
+ "À 㗀 🀀 ÿ 俿 𐐿",
+ lexemes: [
+ LexemeSpec(.identifier, text: "À", trailing: " "),
+ LexemeSpec(.identifier, text: "㗀", trailing: " "),
+ LexemeSpec(.identifier, text: "🀀", trailing: " "),
+ LexemeSpec(.identifier, text: "ÿ", trailing: " "),
+ LexemeSpec(.identifier, text: "俿", trailing: " "),
+ LexemeSpec(.identifier, text: "𐐿"),
+ ]
+ )
+ }
}
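
The byte sequences listed in the test comment can be cross-checked against Swift's own UTF-8 encoding. In the sketch below the scalars are written as `\u{...}` escapes whose code points were derived by hand from those byte sequences, which is an assumption of this example; the `expected` table and `allMatch` are hypothetical names and not part of the test suite.

```swift
// Code points paired with the byte sequences given in the test comment.
let expected: [(scalar: String, bytes: [UInt8])] = [
  ("\u{C0}", [0xC3, 0x80]),
  ("\u{35C0}", [0xE3, 0x97, 0x80]),
  ("\u{1F000}", [0xF0, 0x9F, 0x80, 0x80]),
  ("\u{FF}", [0xC3, 0xBF]),
  ("\u{4FFF}", [0xE4, 0xBF, 0xBF]),
  ("\u{1043F}", [0xF0, 0x90, 0x90, 0xBF]),
]

// Compare each string's UTF-8 view with the listed bytes.
let allMatch = expected.allSatisfy { Array($0.scalar.utf8) == $0.bytes }

// Each sequence ends in 0x80 or 0xBF, the two edges of the continuation range.
let allEndOnEdge = expected.allSatisfy { $0.bytes.last == 0x80 || $0.bytes.last == 0xBF }
print(allMatch, allEndOnEdge)
```

The first three entries end in 0x80, the continuation byte that the old `<= 0x80` check misclassified as a start byte; the last three end in 0xBF, the upper edge of the continuation range.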