bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958

Pusnow · 2017-06-05T15:48:55Z

https://bugs.python.org/issue29456

corona10 · 2017-07-28T04:35:27Z

@Pusnow
I am not a committer on this library, but there is one thing I would like to raise in review.
Could you add tests for your change?
You can add your test cases here.

Thank you.

Pusnow · 2017-07-29T06:35:15Z

Okay, I added some tests for the issue.
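The tests added in the PR are not shown in this thread. A hypothetical regression test for the three code points named in the title might look like the sketch below. The class and method names are made up for illustration, and the expected values assume an interpreter that already contains this fix.

```python
import unicodedata
import unittest


class HangulBoundaryTest(unittest.TestCase):
    """Hypothetical test: U+1176, U+11A7 and U+11C3 must not be composed."""

    def test_nfc_leaves_boundary_code_points_alone(self):
        def nfc(text):
            return unicodedata.normalize("NFC", text)

        # U+1176 lies just past the last composable vowel (VBase + VCount).
        self.assertEqual(nfc("\u1100\u1176"), "\u1100\u1176")
        # U+11A7 is TBase itself, which is not a trailing consonant.
        self.assertEqual(nfc("\u1100\u1161\u11a7"), "\uac00\u11a7")
        # U+11C3 is TBase + TCount, just past the last composable consonant.
        self.assertEqual(nfc("\u1100\u1161\u11c3"), "\uac00\u11c3")
```

Saved to a file, this could be run with python -m unittest.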

mdickinson · 2017-08-02T14:56:06Z

Modules/unicodedata.c

          int LIndex, VIndex;
          LIndex = code - LBase;
          VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
          code = SBase + (LIndex*VCount+VIndex)*TCount;
          i+=2;
          if (i < len &&
-              TBase <= PyUnicode_READ(kind, data, i) &&
-              PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
+              TBase < PyUnicode_READ(kind, data, i) &&


Are you sure this should be < rather than <=?

Pusnow · 2017-08-02

Yes.
That code checks whether PyUnicode_READ(kind, data, i) is a trailing (final) consonant, whereas TBase (0x11A7) is the last vowel in the Hangul Jamo block, not a consonant.
So < is correct rather than <=.

mdickinson · 2017-08-02

Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of) the standard.

Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?

Pusnow · 2017-08-02

Okay, I'll add some comments.

Pusnow · 2017-08-02

I've just added some comments. Is that enough?

mdickinson · 2017-08-03

Thanks! Yes, that's helpful.
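For readers less familiar with Hangul composition, the check under review can be sketched in pure Python. This is only an illustration, not the code in Modules/unicodedata.c: the function name compose_hangul is made up here, the constants are the ones given in section 3.12 of the Unicode standard, and only L V [T] jamo sequences are handled.

```python
# Constants from section 3.12 of the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(text):
    """Compose leading + vowel (+ trailing) jamo into Hangul syllables."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        code = ord(text[i])
        if (L_BASE <= code < L_BASE + L_COUNT
                and i + 1 < n
                and V_BASE <= ord(text[i + 1]) < V_BASE + V_COUNT):
            l_index = code - L_BASE
            v_index = ord(text[i + 1]) - V_BASE
            code = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
            i += 2
            # Both bounds are exclusive: T_BASE itself is a vowel, and
            # T_BASE + T_COUNT is one past the last trailing consonant.
            if i < n and T_BASE < ord(text[i]) < T_BASE + T_COUNT:
                code += ord(text[i]) - T_BASE
                i += 1
        else:
            i += 1
        out.append(chr(code))
    return "".join(out)


print(compose_hangul("\u1100\u1161\u11a8") == "\uac01")        # True
print(compose_hangul("\u1100\u1161\u11a7") == "\uac00\u11a7")  # True
```

With a trailing index of 0 reserved for "no trailing consonant", the exclusive lower bound is what keeps U+11A7 from being silently absorbed into the syllable.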

ghost · 2017-08-10

Let me add some background:

Before Unicode 4.1.0 (draft), the condition was TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, the condition is TBase < code < TBase+TCount, which is in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.
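The two conditions differ on exactly two code points. The following Python sketch makes that concrete; the helper names are hypothetical, and the values of SBase, TBase and TCount are taken from the Unicode standard.

```python
S_BASE = 0xAC00
T_BASE = 0x11A7
T_COUNT = 28


def is_trailing_old(code):
    """Inclusive check from the older draft: TBase <= code <= TBase+TCount."""
    return T_BASE <= code <= T_BASE + T_COUNT


def is_trailing_new(code):
    """Exclusive check: TBase < code < TBase+TCount."""
    return T_BASE < code < T_BASE + T_COUNT


# Scan the Hangul Jamo block for code points where the two checks disagree.
disagreements = [
    code for code in range(0x1100, 0x1200)
    if is_trailing_old(code) != is_trailing_new(code)
]
print([hex(code) for code in disagreements])  # ['0x11a7', '0x11c3']

# Accepting TBase+TCount yields a trailing index of 28, which pushes the
# result onto the next LV syllable instead of staying within the current one.
lv_syllable = S_BASE                      # U+AC00, from U+1100 U+1161
spilled = lv_syllable + (T_BASE + T_COUNT - T_BASE)
print(hex(spilled))                       # 0xac1c
print(spilled == S_BASE + 1 * T_COUNT)    # True: the LV syllable for the next vowel
```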

Pusnow · 2017-08-07T13:09:58Z

I think it can be merged. Is there anything I need to do?

Pusnow · 2017-08-24T08:44:34Z

Hello?

corona10 · 2017-08-24T08:47:47Z

@Pusnow
There should be a Misc/NEWS.d entry for this change using blurb.
See https://devguide.python.org/committing/#what-s-new-and-news-entries

Pusnow · 2017-08-24T09:17:17Z

Done, thank you for the response.

vstinner · 2018-06-15T08:56:50Z

I closed and reopened the PR to force AppVeyor to reschedule a test run: it just started a new job, https://ci.appveyor.com/project/python/cpython/build/3.8build17701

miss-islington · 2018-06-15T12:03:17Z

Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7, 3.6, 3.7.
🐍🍒⛏🤖

bedevere-bot · 2018-06-15T12:03:38Z

GH-7702 is a backport of this pull request to the 3.7 branch.

…ythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon

bedevere-bot · 2018-06-15T12:04:27Z

GH-7703 is a backport of this pull request to the 3.6 branch.

miss-islington · 2018-06-15T12:05:20Z

Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker d134809cd3764c6a634eab7bb8995e3e2eff14d5 2.7

bedevere-bot · 2018-06-15T12:23:29Z

GH-7704 is a backport of this pull request to the 2.7 branch.

bpo-29456: fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3

2a7c327

the-knights-who-say-ni added the CLA signed label Jun 5, 2017

bpo-29456: Add test for #29456

445ff49

bpo-29456: fix white space

b8a62b2

mdickinson reviewed Aug 2, 2017

bpo-29456: Add comments

f14ba8b

bpo-29456: Update ACKS

2fc7fb8

bpo-29456: Add Misc/NEWS.d entry

823716d

brettcannon added the awaiting review label Feb 2, 2018

zhangyangyu approved these changes Jun 15, 2018

bedevere-bot added awaiting merge and removed awaiting review labels Jun 15, 2018

zhangyangyu added needs backport to 3.6 labels Jun 15, 2018

vstinner closed this Jun 15, 2018

vstinner reopened this Jun 15, 2018

zhangyangyu merged commit d134809 into python:master Jun 15, 2018

bedevere-bot removed the awaiting merge label Jun 15, 2018

bedevere-bot removed the needs backport to 3.7 label Jun 15, 2018

bedevere-bot removed the needs backport to 3.6 label Jun 15, 2018

miss-islington assigned zhangyangyu Jun 15, 2018

bedevere-bot removed the needs backport to 2.7 label Jun 15, 2018
