-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Okay, I added some tests for the issue. |
int LIndex, VIndex; | ||
LIndex = code - LBase; | ||
VIndex = PyUnicode_READ(kind, data, i+1) - VBase; | ||
code = SBase + (LIndex*VCount+VIndex)*TCount; | ||
i+=2; | ||
if (i < len && | ||
TBase <= PyUnicode_READ(kind, data, i) && | ||
PyUnicode_READ(kind, data, i) <= (TBase+TCount)) { | ||
TBase < PyUnicode_READ(kind, data, i) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this should be <
rather than <=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
That code determines PyUnicode_READ(kind, data, i)
is a trailing(final) consonant while TBase(0x11A7) is the last Vowel in Hangul (Hangul Jamo).
So <
is correct rather than <=
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of ) the standard.
Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high
looks surprising. Is it worth adding a comment explaining what's going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll add some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just added some comments. Is it enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Yes, that's helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give a supplement:
Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition
After Unicode 4.1.0, here is TBase < code < TBase+TCount
, which in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition
This change happened in 2005.
I think it can be merged. Is there anything I need to do? |
Hello? |
@Pusnow |
Done, thank you for response. |
I closed and reopened the PR to force to reschedule a test on AppVeyor: it just started a new job, https://ci.appveyor.com/project/python/cpython/build/3.8build17701 |
Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7, 3.6, 3.7. |
GH-7702 is a backport of this pull request to the 3.7 branch. |
…ythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
GH-7703 is a backport of this pull request to the 3.6 branch. |
…ythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to |
…H-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
…u11c3 (pythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]).. (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
GH-7704 is a backport of this pull request to the 2.7 branch. |
…H-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
…H-1958) (GH-7704) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]).. (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
https://bugs.python.org/issue29456