Skip to content

bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 15, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions Lib/test/test_unicodedata.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,19 @@ def test_issue10254(self):
b = 'C\u0338' * 20 + '\xC7'
self.assertEqual(self.db.normalize('NFC', a), b)

def test_issue29456(self):
# Fix #29456
u1176_str_a = '\u1100\u1176\u11a8'
u1176_str_b = '\u1100\u1176\u11a8'
u11a7_str_a = '\u1100\u1175\u11a7'
u11a7_str_b = '\uae30\u11a7'
u11c3_str_a = '\u1100\u1175\u11c3'
u11c3_str_b = '\uae30\u11c3'
self.assertEqual(self.db.normalize('NFC', u1176_str_a), u1176_str_b)
self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)


def test_east_asian_width(self):
eaw = self.db.east_asian_width
self.assertRaises(TypeError, eaw, b'a')
Expand Down
1 change: 1 addition & 0 deletions Misc/ACKS
Original file line number Diff line number Diff line change
Expand Up @@ -1729,6 +1729,7 @@ Jason Yeo
EungJun Yi
Bob Yodlowski
Danny Yoo
Wonsup Yoon
Rory Yorke
George Yoshida
Kazuhiro Yoshida
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix bugs in hangul normalization: u1176, u11a7 and u11c3
10 changes: 7 additions & 3 deletions Modules/unicodedata.c
Original file line number Diff line number Diff line change
Expand Up @@ -681,15 +681,19 @@ nfc_nfkc(PyObject *self, PyObject *input, int k)
if (LBase <= code && code < (LBase+LCount) &&
i + 1 < len &&
VBase <= PyUnicode_READ(kind, data, i+1) &&
PyUnicode_READ(kind, data, i+1) <= (VBase+VCount)) {
PyUnicode_READ(kind, data, i+1) < (VBase+VCount)) {
/* check L character is a modern leading consonant (0x1100 ~ 0x1112)
and V character is a modern vowel (0x1161 ~ 0x1175). */
int LIndex, VIndex;
LIndex = code - LBase;
VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
code = SBase + (LIndex*VCount+VIndex)*TCount;
i+=2;
if (i < len &&
TBase <= PyUnicode_READ(kind, data, i) &&
PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
TBase < PyUnicode_READ(kind, data, i) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this should be < rather than <=?

Copy link
Contributor Author

@Pusnow Pusnow Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
That code determines PyUnicode_READ(kind, data, i) is a trailing(final) consonant while TBase(0x11A7) is the last Vowel in Hangul (Hangul Jamo).
So < is correct rather than <=.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of ) the standard.

Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll add some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added some comments. Is it enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Yes, that's helpful.

Copy link

@ghost ghost Aug 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give a supplement:

Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

PyUnicode_READ(kind, data, i) < (TBase+TCount)) {
/* check T character is a modern trailing consonant
(0x11A8 ~ 0x11C2). */
code += PyUnicode_READ(kind, data, i)-TBase;
i++;
}
Expand Down