bpo-16285: Update urllib quoting to RFC 3986 #173

Merged 5 commits, Feb 25, 2017
6 changes: 5 additions & 1 deletion Doc/library/urllib.parse.rst
@@ -451,13 +451,17 @@ task isn't already covered by the URL parsing functions above.
.. function:: quote(string, safe='/', encoding=None, errors=None)

Replace special characters in *string* using the ``%xx`` escape. Letters,
digits, and the characters ``'_.-'`` are never quoted. By default, this
digits, and the characters ``'_.-~'`` are never quoted. By default, this
function is intended for quoting the path section of URL. The optional *safe*
parameter specifies additional ASCII characters that should not be quoted
--- its default value is ``'/'``.

*string* may be either a :class:`str` or a :class:`bytes`.

.. versionchanged:: 3.7
Moved from RFC 2396 to RFC 3986 for quoting URL strings. "~" is now
included in the set of reserved characters.

The optional *encoding* and *errors* parameters specify how to deal with
non-ASCII characters, as accepted by the :meth:`str.encode` method.
*encoding* defaults to ``'utf-8'``.
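
To make the documented behaviour concrete, here is a minimal sketch of how quote() treats "~", the *safe* parameter, and the *encoding* parameter. It assumes Python 3.7 or later (earlier versions escape "~" as %7E); the sample strings are made up for illustration.

```python
from urllib.parse import quote

# '~' is left alone on Python 3.7+; the space is still escaped.
tilde = quote("/~user/file name.txt")

# safe='' also escapes '/', which the default safe='/' leaves intact.
no_slash = quote("a/b", safe="")

# Non-ASCII text is encoded (UTF-8 by default) before escaping.
utf8 = quote("é")
latin1 = quote("é", encoding="latin-1")

print(tilde)     # /~user/file%20name.txt
print(no_slash)  # a%2Fb
print(utf8)      # %C3%A9
print(latin1)    # %E9
```

Passing safe='' is the usual choice when the value is a single path segment or query value rather than a whole path.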
7 changes: 7 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -103,6 +103,13 @@ The :const:`~unittest.mock.sentinel` attributes now preserve their identity
when they are :mod:`copied <copy>` or :mod:`pickled <pickle>`.
(Contributed by Serhiy Storchaka in :issue:`20804`.)

urllib.parse
------------

:func:`urllib.parse.quote` has been updated to from RFC 2396 to RFC 3986,
adding `~` to the set of characters that is never quoted by default.
(Contributed by Christian Theune and Ratnadeep Debnath in :issue:`16285`.)


Optimizations
=============
4 changes: 2 additions & 2 deletions Lib/test/test_urllib.py
@@ -733,7 +733,7 @@ def test_short_content_raises_ContentTooShortError_without_reporthook(self):
class QuotingTests(unittest.TestCase):
r"""Tests for urllib.quote() and urllib.quote_plus()

According to RFC 2396 (Uniform Resource Identifiers), to escape a
According to RFC 3986 (Uniform Resource Identifiers), to escape a
character you write it as '%' + <2 character US-ASCII hex value>.
The Python code of ``'%' + hex(ord(<character>))[2:]`` escapes a
character properly. Case does not matter on the hex letters.
@@ -761,7 +761,7 @@ def test_never_quote(self):
do_not_quote = '' .join(["ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"abcdefghijklmnopqrstuvwxyz",
"0123456789",
"_.-"])
"_.-~"])
Contributor

And this is the test case update I missed in my earlier review :)

result = urllib.parse.quote(do_not_quote)
self.assertEqual(do_not_quote, result,
"using quote(): %r != %r" % (do_not_quote, result))
9 changes: 6 additions & 3 deletions Lib/urllib/parse.py
@@ -704,7 +704,7 @@ def unquote_plus(string, encoding='utf-8', errors='replace'):
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b'abcdefghijklmnopqrstuvwxyz'
b'0123456789'
b'_.-')
b'_.-~')
_ALWAYS_SAFE_BYTES = bytes(_ALWAYS_SAFE)
_safe_quoters = {}
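
The membership test behind an always-safe set like the one above can be sketched as follows. This is a simplified illustration, not the library's actual implementation; the names ALWAYS_SAFE, quote_byte and quote_bytes_sketch are hypothetical. It relies on the fact that iterating over a bytes object yields integers, so a frozenset built from bytes holds integer code points.

```python
# Hypothetical stand-in for the always-safe set shown in the diff.
ALWAYS_SAFE = frozenset(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        b"abcdefghijklmnopqrstuvwxyz"
                        b"0123456789"
                        b"_.-~")


def quote_byte(byte_value):
    """Return the byte as a character if safe, else as a %XX escape."""
    if byte_value in ALWAYS_SAFE:
        return chr(byte_value)
    return "%{:02X}".format(byte_value)


def quote_bytes_sketch(data):
    """Escape every byte of `data` that is not in ALWAYS_SAFE."""
    return "".join(quote_byte(b) for b in data)


print(quote_bytes_sketch(b"a~b c"))  # a~b%20c
```

Adding b'~' to the set is therefore enough to stop the tilde from being escaped; no other logic has to change.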

@@ -736,15 +736,18 @@ def quote(string, safe='/', encoding=None, errors=None):
Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.

RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
RFC 3986 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
"$" | "," | "~"
Contributor

This is wrong: "~" is in the set of _UN_reserved characters in RFC 3986. Please see https://bugs.python.org/issue12910 and its PR #2568.
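
The reviewer's point can be checked directly. The sketch below spells out the character classes from RFC 3986 (unreserved in section 2.3, reserved as gen-delims plus sub-delims in section 2.2) and compares them with what quote() does when nothing extra is marked safe. The constant names are illustrative only, and Python 3.7 or later is assumed.

```python
import string
from urllib.parse import quote

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# RFC 3986 section 2.2: reserved = gen-delims / sub-delims
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS

# "~" is unreserved, not reserved.
tilde_is_unreserved = "~" in UNRESERVED and "~" not in RESERVED

# With safe='', quote() leaves unreserved characters untouched ...
unreserved_unchanged = quote(UNRESERVED, safe="") == UNRESERVED

# ... and percent-encodes every reserved character.
escaped = {c: quote(c, safe="") for c in RESERVED}
reserved_all_escaped = all(
    escaped[c] == "%{:02X}".format(ord(c)) for c in RESERVED
)

print(tilde_is_unreserved, unreserved_unchanged, reserved_all_escaped)
```

The code change in this PR matches the RFC (the tilde is never quoted); only the wording in the docstring and documentation calls it "reserved".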


Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.

Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
Now, "~" is included in the set of reserved characters.

Contributor

Sorry, I missed this one earlier: there's no need to have the version change info in the docstring. The change in the RFC reference and the addition of ~ to the set of reserved characters are sufficient here.

Contributor

It's also wrong, as stated above.

By default, the quote function is intended for quoting the path
section of a URL. Thus, it will not encode '/'. This character
is reserved, but in typical usage the quote function is being
4 changes: 3 additions & 1 deletion Misc/ACKS
@@ -344,6 +344,7 @@ Kushal Das
Jonathan Dasteel
Pierre-Yves David
A. Jesse Jiryu Davis
Ratnadeep Debnath
Merlijn van Deen
John DeGood
Ned Deily
@@ -1518,6 +1519,7 @@ Mikhail Terekhov
Victor Terrón
Richard M. Tew
Tobias Thelen
Christian Theune
Févry Thibault
Lowe Thiderman
Nicolas M. Thiéry
@@ -1528,7 +1530,7 @@ Stephen Thorne
Jeremy Thurgood
Eric Tiedemann
July Tikhonov
Tracy Tims
Tracy Tims
Oren Tirosh
Tim Tisdall
Jason Tishler
4 changes: 4 additions & 0 deletions Misc/NEWS
@@ -249,6 +249,10 @@ Extension Modules
Library
-------

- Issue #16285: urrlib.parse.quote is now based on RFC 3986 and hence includes
'~' in the set of characters that is not quoted by default. Patch by
Christian Theune and Ratnadeep Debnath.

- bpo-29532: Altering a kwarg dictionary passed to functools.partial()
no longer affects a partial object after creation.
