Skip to content

bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 15, 2018
Merged

bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958

merged 6 commits into from
Jun 15, 2018

Conversation

Pusnow
Copy link
Contributor

@Pusnow Pusnow commented Jun 5, 2017

@corona10
Copy link
Member

@Pusnow
I am not a committer of this library.
But here is a one thing I want to review.
Can you add test codes about your changing?
You can add your test cases in here.

Thank you.

@Pusnow
Copy link
Contributor Author

Pusnow commented Jul 29, 2017

Okay, I added some tests for the issue.

int LIndex, VIndex;
LIndex = code - LBase;
VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
code = SBase + (LIndex*VCount+VIndex)*TCount;
i+=2;
if (i < len &&
TBase <= PyUnicode_READ(kind, data, i) &&
PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
TBase < PyUnicode_READ(kind, data, i) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this should be < rather than <=?

Copy link
Contributor Author

@Pusnow Pusnow Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
That code determines PyUnicode_READ(kind, data, i) is a trailing(final) consonant while TBase(0x11A7) is the last Vowel in Hangul (Hangul Jamo).
So < is correct rather than <=.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of ) the standard.

Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll add some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added some comments. Is it enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Yes, that's helpful.

Copy link

@ghost ghost Aug 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give a supplement:

Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

@Pusnow
Copy link
Contributor Author

Pusnow commented Aug 7, 2017

I think it can be merged. Is there anything I need to do?

@Pusnow
Copy link
Contributor Author

Pusnow commented Aug 24, 2017

Hello?

@corona10
Copy link
Member

@Pusnow
There should be a Misc/NEWS.d entry for this change using blurb.
See https://devguide.python.org/committing/#what-s-new-and-news-entries

@Pusnow
Copy link
Contributor Author

Pusnow commented Aug 24, 2017

Done, thank you for response.

@vstinner
Copy link
Member

I closed and reopened the PR to force to reschedule a test on AppVeyor: it just started a new job, https://ci.appveyor.com/project/python/cpython/build/3.8build17701

@zhangyangyu zhangyangyu merged commit d134809 into python:master Jun 15, 2018
@miss-islington
Copy link
Contributor

Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7, 3.6, 3.7.
🐍🍒⛏🤖

@bedevere-bot
Copy link

GH-7702 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 15, 2018
…ythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
@bedevere-bot
Copy link

GH-7703 is a backport of this pull request to the 3.6 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 15, 2018
…ythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
@miss-islington
Copy link
Contributor

Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker d134809cd3764c6a634eab7bb8995e3e2eff14d5 2.7

miss-islington added a commit that referenced this pull request Jun 15, 2018
…H-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
zhangyangyu pushed a commit to zhangyangyu/cpython that referenced this pull request Jun 15, 2018
…u11c3 (pythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3])..
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
@bedevere-bot
Copy link

GH-7704 is a backport of this pull request to the 2.7 branch.

miss-islington added a commit that referenced this pull request Jun 15, 2018
…H-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
zhangyangyu added a commit that referenced this pull request Jun 15, 2018
…H-1958) (GH-7704)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3])..
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants