bpo-30736: upgrade to Unicode 10.0 #2344

Merged: 1 commit, Jun 23, 2017
2 changes: 1 addition & 1 deletion Doc/library/stdtypes.rst
@@ -354,7 +354,7 @@ Notes:
The numeric literals accepted include the digits ``0`` to ``9`` or any
Unicode equivalent (code points with the ``Nd`` property).

-See http://www.unicode.org/Public/8.0.0/ucd/extracted/DerivedNumericType.txt
+See http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedNumericType.txt
for a complete list of code points with the ``Nd`` property.
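The ``Nd`` rule above means :func:`int` accepts non-ASCII decimal digits, not just ASCII ``0``-``9``. A quick illustration (hypothetical snippet, not part of the patch):

```python
import unicodedata

# int() accepts any code point carrying the Nd (decimal digit) property,
# not only ASCII 0-9. Devanagari digits are one example.
print(int("123"))                      # 123
print(int("\u0967\u0968\u0969"))       # Devanagari digits -> 123
print(unicodedata.category("\u0967"))  # 'Nd'
```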


8 changes: 4 additions & 4 deletions Doc/library/unicodedata.rst
@@ -17,8 +17,8 @@

This module provides access to the Unicode Character Database (UCD) which
defines character properties for all Unicode characters. The data contained in
-this database is compiled from the `UCD version 9.0.0
-<http://www.unicode.org/Public/9.0.0/ucd>`_.
+this database is compiled from the `UCD version 10.0.0
+<http://www.unicode.org/Public/10.0.0/ucd>`_.
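One way to see the UCD bump in action: U+20BF (BITCOIN SIGN) was assigned in Unicode 10.0, so its name lookup only succeeds against the new database. An illustrative snippet (assumes a Python carrying this change, i.e. 3.7+):

```python
import unicodedata

# U+20BF entered the standard in Unicode 10.0; on a UCD 9.0.0 build this
# lookup raises ValueError, on a 10.0.0 (or later) build it resolves.
print(unicodedata.unidata_version)   # UCD version the module was built from
print(unicodedata.name("\u20bf"))    # 'BITCOIN SIGN'
```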

The module uses the same names and symbols as defined by Unicode
Standard Annex #44, `"Unicode Character Database"
@@ -168,6 +168,6 @@ Examples:

.. rubric:: Footnotes

-.. [#] http://www.unicode.org/Public/9.0.0/ucd/NameAliases.txt
+.. [#] http://www.unicode.org/Public/10.0.0/ucd/NameAliases.txt

-.. [#] http://www.unicode.org/Public/9.0.0/ucd/NamedSequences.txt
+.. [#] http://www.unicode.org/Public/10.0.0/ucd/NamedSequences.txt
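The name aliases defined in the NameAliases.txt file referenced by these footnotes are resolvable through :func:`unicodedata.lookup` (supported since Python 3.3). A small illustration (hypothetical example, not from the patch):

```python
import unicodedata

# NameAliases.txt entries are accepted by lookup(); e.g. the formal
# correction alias for U+01A3.
print(repr(unicodedata.lookup("LATIN SMALL LETTER GHA")))  # '\u01a3'
# Control characters have no primary name, only aliases:
print(repr(unicodedata.lookup("LINE FEED")))               # '\n'
```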
4 changes: 2 additions & 2 deletions Doc/reference/lexical_analysis.rst
@@ -313,7 +313,7 @@ The Unicode category codes mentioned above stand for:
* *Nd* - decimal numbers
* *Pc* - connector punctuations
* *Other_ID_Start* - explicit list of characters in `PropList.txt
-  <http://www.unicode.org/Public/8.0.0/ucd/PropList.txt>`_ to support backwards
+  <http://www.unicode.org/Public/10.0.0/ucd/PropList.txt>`_ to support backwards
compatibility
* *Other_ID_Continue* - likewise
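Identifiers built from these categories can be checked with :meth:`str.isidentifier`; a quick illustration (hypothetical snippet, not part of the patch):

```python
# str.isidentifier() applies the XID_Start/XID_Continue derivation of the
# categories listed above.
print("π".isidentifier())    # True:  Ll may start an identifier
print("_x1".isidentifier())  # True:  Pc start, Nd continuation
print("1x".isidentifier())   # False: Nd may not start an identifier
```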

@@ -875,4 +875,4 @@ occurrence outside string literals and comments is an unconditional error:

.. rubric:: Footnotes

-.. [#] http://www.unicode.org/Public/8.0.0/ucd/NameAliases.txt
+.. [#] http://www.unicode.org/Public/10.0.0/ucd/NameAliases.txt
7 changes: 7 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -237,6 +237,13 @@ xmlrpc.server
its subclasses can be used as a decorator. (Contributed by Xiang Zhang in
:issue:`7769`.)

+unicodedata
+-----------
+
+The internal :mod:`unicodedata` database has been upgraded to use `Unicode 10
+<http://www.unicode.org/versions/Unicode10.0.0/>`_. (Contributed by Benjamin
+Peterson.)

urllib.parse
------------

4 changes: 2 additions & 2 deletions Lib/test/test_unicodedata.py
@@ -20,7 +20,7 @@
 class UnicodeMethodsTest(unittest.TestCase):
 
     # update this, if the database changes
-    expectedchecksum = 'c1fa98674a683aa8a8d8dee0c84494f8d36346e6'
+    expectedchecksum = '727091e0fd5807eb41c72912ae95cdd74c795e27'
 
     def test_method_checksum(self):
         h = hashlib.sha1()
@@ -80,7 +80,7 @@ class UnicodeFunctionsTest(UnicodeDatabaseTest):

     # Update this if the database changes. Make sure to do a full rebuild
     # (e.g. 'make distclean && make') to get the correct checksum.
-    expectedchecksum = 'f891b1e6430c712531b9bc935a38e22d78ba1bf3'
+    expectedchecksum = 'db6f92bb5010f8e85000634b08e77233355ab37a'
     def test_function_checksum(self):
         data = []
         h = hashlib.sha1()
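Both tests pin a SHA-1 digest over data derived from the database, so any UCD change flips the checksum and forces the constants above to be updated. A simplified sketch of the pattern (the real tests hash different fields; this is illustrative only):

```python
import hashlib
import unicodedata

# Sketch: hash a stable text rendering of selected properties per code
# point. Any change in the underlying UCD changes the digest.
h = hashlib.sha1()
for cp in range(0x1000):
    ch = chr(cp)
    line = f"{cp};{unicodedata.category(ch)};{unicodedata.bidirectional(ch)}"
    h.update(line.encode("ascii"))
print(h.hexdigest())  # changes whenever the underlying UCD does
```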
3 changes: 3 additions & 0 deletions Misc/NEWS
@@ -10,6 +10,9 @@ What's New in Python 3.7.0 alpha 1?
Core and Builtins
-----------------

+- bpo-30736: The internal unicodedata database has been upgraded to Unicode
+  10.0.

- bpo-30604: Move co_extra_freefuncs from per-thread to per-interpreter to
avoid crashes.

5 changes: 3 additions & 2 deletions Modules/unicodedata.c
@@ -921,11 +921,12 @@ is_unified_ideograph(Py_UCS4 code)
 {
     return
         (0x3400 <= code && code <= 0x4DB5)   || /* CJK Ideograph Extension A */
-        (0x4E00 <= code && code <= 0x9FD5)   || /* CJK Ideograph */
+        (0x4E00 <= code && code <= 0x9FEA)   || /* CJK Ideograph */
         (0x20000 <= code && code <= 0x2A6D6) || /* CJK Ideograph Extension B */
         (0x2A700 <= code && code <= 0x2B734) || /* CJK Ideograph Extension C */
         (0x2B740 <= code && code <= 0x2B81D) || /* CJK Ideograph Extension D */
-        (0x2B820 <= code && code <= 0x2CEA1);   /* CJK Ideograph Extension E */
+        (0x2B820 <= code && code <= 0x2CEA1) || /* CJK Ideograph Extension E */
+        (0x2CEB0 <= code && code <= 0x2EBEF);   /* CJK Ideograph Extension F */
 }
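The same range test, sketched in Python against the Unicode 10.0 ranges (using the 2EBE0 end point that makeunicodedata.py records for Extension F; illustrative only, not part of the patch):

```python
import unicodedata

# Unicode 10.0 unified-ideograph ranges; Extension F (2CEB0..2EBE0) is new.
CJK_RANGES = [
    (0x3400, 0x4DB5), (0x4E00, 0x9FEA), (0x20000, 0x2A6D6),
    (0x2A700, 0x2B734), (0x2B740, 0x2B81D), (0x2B820, 0x2CEA1),
    (0x2CEB0, 0x2EBE0),
]

def is_unified_ideograph(cp):
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

print(is_unified_ideograph(0x2CEB0))   # True: first Extension F code point
print(unicodedata.name(chr(0x2CEB0))) # unified ideographs get algorithmic names
```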

/* macros used to determine if the given code point is in the PUA range that
13,198 changes: 6,630 additions & 6,568 deletions Modules/unicodedata_db.h


50,236 changes: 25,629 additions & 24,607 deletions Modules/unicodename_db.h


4,557 changes: 2,295 additions & 2,262 deletions Objects/unicodetype_db.h


9 changes: 5 additions & 4 deletions Tools/unicode/makeunicodedata.py
@@ -42,7 +42,7 @@
# * Doc/library/stdtypes.rst, and
# * Doc/library/unicodedata.rst
# * Doc/reference/lexical_analysis.rst (two occurrences)
-UNIDATA_VERSION = "9.0.0"
+UNIDATA_VERSION = "10.0.0"
UNICODE_DATA = "UnicodeData%s.txt"
COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
EASTASIAN_WIDTH = "EastAsianWidth%s.txt"
@@ -99,11 +99,12 @@
 # these ranges need to match unicodedata.c:is_unified_ideograph
 cjk_ranges = [
     ('3400', '4DB5'),
-    ('4E00', '9FD5'),
+    ('4E00', '9FEA'),
     ('20000', '2A6D6'),
     ('2A700', '2B734'),
     ('2B740', '2B81D'),
     ('2B820', '2CEA1'),
+    ('2CEB0', '2EBE0'),
 ]

def maketables(trace=0):
@@ -1262,12 +1263,12 @@ def dump(self, file, trace=0):
         for item in self.data:
             i = str(item) + ", "
             if len(s) + len(i) > 78:
-                file.write(s + "\n")
+                file.write(s.rstrip() + "\n")
                 s = "    " + i
             else:
                 s = s + i
         if s.strip():
-            file.write(s + "\n")
+            file.write(s.rstrip() + "\n")
         file.write("};\n\n")
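The added ``rstrip()`` calls matter because wrapped lines are assembled with a trailing ``", "`` separator, which would leave trailing whitespace on every flushed line of the generated header. A minimal illustration (hypothetical values):

```python
# Each flushed line ends with ", "; without rstrip() the generated file
# gains a trailing blank before every newline.
s = "    1, 2, 3, "
print(repr(s + "\n"))           # trailing space survives before the newline
print(repr(s.rstrip() + "\n"))  # '    1, 2, 3,\n'
```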

def getsize(data):