Commit 1889c4c

zhangyangyu and Pusnow authored
bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) (GH-7704)
The Hangul composition check used the wrong boundaries: the second character must fall in [0x1161, 0x1176) rather than [0x1161, 0x1176], and the third character in (0x11A7, 0x11C3) rather than [0x11A7, 0x11C3]. (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <[email protected]>
1 parent fc8ea20 commit 1889c4c
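
The boundary arithmetic in the commit message can be checked directly. The sketch below assumes the standard Hangul constants from the Unicode Standard (VBase = 0x1161, VCount = 21, TBase = 0x11A7, TCount = 28), which the diff refers to by name but does not show. The helper names are hypothetical and exist only for illustration.

```python
# Hangul jamo constants from the Unicode Standard (assumed; not shown in the diff).
VBASE, VCOUNT = 0x1161, 21  # modern vowels occupy U+1161..U+1175
TBASE, TCOUNT = 0x11A7, 28  # index 0 (TBASE itself) means "no trailing consonant"


# Hypothetical helpers mirroring the bounds before and after the fix.
def old_v_ok(cp):
    return VBASE <= cp <= VBASE + VCOUNT  # [0x1161, 0x1176], one too many


def new_v_ok(cp):
    return VBASE <= cp < VBASE + VCOUNT   # [0x1161, 0x1176)


def old_t_ok(cp):
    return TBASE <= cp <= TBASE + TCOUNT  # [0x11A7, 0x11C3], both ends too wide


def new_t_ok(cp):
    return TBASE < cp < TBASE + TCOUNT    # (0x11A7, 0x11C3)


# The three code points named in the commit title.
for cp in (0x1176, 0x11A7, 0x11C3):
    print(f"U+{cp:04X}: V old={old_v_ok(cp)} new={new_v_ok(cp)} "
          f"T old={old_t_ok(cp)} new={new_t_ok(cp)}")
```

Each of the three code points named in the title passes one of the old checks and is rejected by the corresponding new one.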

4 files changed, with 21 additions and 2 deletions

Lib/test/test_unicodedata.py

Lines changed: 13 additions & 0 deletions
@@ -204,6 +204,19 @@ def test_issue10254(self):
         b = u'C\u0338' * 20 + u'\xC7'
         self.assertEqual(self.db.normalize('NFC', a), b)
 
+    def test_issue29456(self):
+        # Fix #29456
+        u1176_str_a = u'\u1100\u1176\u11a8'
+        u1176_str_b = u'\u1100\u1176\u11a8'
+        u11a7_str_a = u'\u1100\u1175\u11a7'
+        u11a7_str_b = u'\uae30\u11a7'
+        u11c3_str_a = u'\u1100\u1175\u11c3'
+        u11c3_str_b = u'\uae30\u11c3'
+        self.assertEqual(self.db.normalize('NFC', u1176_str_a), u1176_str_b)
+        self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
+        self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)
+
+
     def test_east_asian_width(self):
         eaw = self.db.east_asian_width
         self.assertRaises(TypeError, eaw, 'a')
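
The same three cases can be tried outside the test suite. This is a minimal sketch in Python 3 syntax that assumes the interpreter already includes this fix. The `cases` mapping and the `codepoints` helper are illustrative names, not part of the commit.

```python
import unicodedata

# Input -> expected NFC output, taken from test_issue29456.
cases = {
    "\u1100\u1176\u11a8": "\u1100\u1176\u11a8",  # U+1176 is not a modern vowel
    "\u1100\u1175\u11a7": "\uae30\u11a7",        # U+11A7 is not a trailing consonant
    "\u1100\u1175\u11c3": "\uae30\u11c3",        # U+11C3 is past the modern range
}


def codepoints(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


results = {src: unicodedata.normalize("NFC", src) for src in cases}
for src, got in results.items():
    status = "ok" if got == cases[src] else "MISMATCH"
    print(f"{codepoints(src)} -> {codepoints(got)} [{status}]")
```

In the first case nothing composes at all. In the other two only the leading consonant and vowel combine, and the third character is left as is.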

Misc/ACKS

Lines changed: 1 addition & 0 deletions
@@ -1578,6 +1578,7 @@ Jason Yeo
 EungJun Yi
 Bob Yodlowski
 Danny Yoo
+Wonsup Yoon
 Rory Yorke
 George Yoshida
 Kazuhiro Yoshida

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Fix bugs in hangul normalization: u1176, u11a7 and u11c3

Modules/unicodedata.c

Lines changed: 6 additions & 2 deletions
@@ -664,14 +664,18 @@ nfc_nfkc(PyObject *self, PyObject *input, int k)
          pairs, since we always have decomposed data. */
       if (LBase <= *i && *i < (LBase+LCount) &&
           i + 1 < end &&
-          VBase <= i[1] && i[1] <= (VBase+VCount)) {
+          VBase <= i[1] && i[1] < (VBase+VCount)) {
+          /* check L character is a modern leading consonant (0x1100 ~ 0x1112)
+             and V character is a modern vowel (0x1161 ~ 0x1175). */
           int LIndex, VIndex;
           LIndex = i[0] - LBase;
           VIndex = i[1] - VBase;
           code = SBase + (LIndex*VCount+VIndex)*TCount;
           i+=2;
           if (i < end &&
-              TBase <= *i && *i <= (TBase+TCount)) {
+              TBase < *i && *i < (TBase+TCount)) {
+              /* check T character is a modern trailing consonant
+                 (0x11A8 ~ 0x11C2). */
               code += *i-TBase;
               i++;
           }
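
To see what the off-by-one bounds did in practice, the Hangul step of this hunk can be ported to Python. This is a simplified sketch, not the code of `nfc_nfkc`: it handles only the L+V(+T) step, assumes the input is already decomposed, and uses the standard Unicode Hangul constants, which the hunk references but does not define. The function name `compose_hangul` and the `fixed` flag are hypothetical.

```python
# Standard Hangul composition constants (assumed; the hunk only names them).
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
LCOUNT, VCOUNT, TCOUNT = 19, 21, 28


def compose_hangul(text, fixed=True):
    """Compose decomposed jamo into syllables.

    fixed=True uses the bounds after this commit; fixed=False replays
    the bounds from before it.
    """
    out = []
    i, end = 0, len(text)
    while i < end:
        cp = ord(text[i])
        if LBASE <= cp < LBASE + LCOUNT and i + 1 < end:
            v = ord(text[i + 1])
            if fixed:
                v_ok = VBASE <= v < VBASE + VCOUNT
            else:
                v_ok = VBASE <= v <= VBASE + VCOUNT
            if v_ok:
                code = SBASE + ((cp - LBASE) * VCOUNT + (v - VBASE)) * TCOUNT
                i += 2
                if i < end:
                    t = ord(text[i])
                    if fixed:
                        t_ok = TBASE < t < TBASE + TCOUNT
                    else:
                        t_ok = TBASE <= t <= TBASE + TCOUNT
                    if t_ok:
                        code += t - TBASE
                        i += 1
                out.append(chr(code))
                continue
        # Not an L+V pair: copy the character through unchanged.
        out.append(text[i])
        i += 1
    return "".join(out)


for src in ("\u1100\u1176\u11a8", "\u1100\u1175\u11a7", "\u1100\u1175\u11c3"):
    before = compose_hangul(src, fixed=False)
    after = compose_hangul(src, fixed=True)
    print([hex(ord(c)) for c in before], "->", [hex(ord(c)) for c in after])
```

With `fixed=False`, the sketch composes U+1176 and U+11C3 as if they were modern jamo, so the index spills over into a different syllable. It also silently consumes U+11A7, because adding a trailing index of 0 leaves the syllable unchanged while still advancing past the character.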
