bpo-39087: Add _PyUnicode_GetUTF8Buffer() #17659

methane · 2019-12-19T12:36:20Z

Ref: bpo-39087

Lib/test/test_unicode.py

Modules/_testcapimodule.c

Doc/c-api/unicode.rst

Objects/unicodeobject.c

Lib/test/test_unicode.py

serhiy-storchaka · 2019-12-21T18:30:00Z

Doc/c-api/unicode.rst

@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs:
   raised by the codec.


+.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char *errors, Py_buffer *view)


I am not sure about the order of parameters. There is logic in placing the output parameter at the end, but in PyObject_GetBuffer() (on which I modelled the name) it is not the last parameter. Also, if we add more similar functions in the future with multiple additional parameters (like flags and errors), it is better to place them at the end.

I like having view at the end.

serhiy-storchaka · 2019-12-21T18:35:45Z

Modules/_testcapimodule.c

+    }
+    Py_ssize_t refcnt = Py_REFCNT(str);
+
+    if (PyUnicode_GetUTF8Buffer(str, NULL,  &buf) < 0) {


It is not possible that PyUnicode_GetUTF8Buffer() fails for an ASCII string.
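
The claim can be illustrated from the Python side, as a rough sketch rather than a test of the C function itself: an ASCII-only str always encodes to UTF-8 (the bytes are identical to its ASCII encoding), whereas a string containing a lone surrogate is one case where strict UTF-8 encoding fails. The helper name can_encode_utf8 is hypothetical and used only for illustration.

```python
def can_encode_utf8(text, errors="strict"):
    """Return True if text can be encoded to UTF-8 with the given error handler."""
    try:
        text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return False
    return True


ascii_str = "foo"
lone_surrogate = "\udc80"

# An ASCII-only string always encodes, and its UTF-8 bytes equal its ASCII bytes.
assert ascii_str.isascii()
assert ascii_str.encode("utf-8") == ascii_str.encode("ascii")
assert can_encode_utf8(ascii_str)

# A lone surrogate cannot be encoded with the default "strict" handler...
assert not can_encode_utf8(lone_surrogate)
# ...but it can with "surrogatepass", which is why an errors argument matters.
assert can_encode_utf8(lone_surrogate, "surrogatepass")
```

So an error path is only reachable for non-ASCII input, which is the point of the review comment.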

serhiy-storchaka · 2019-12-21T18:40:02Z

Lib/test/test_unicode.py

+    def test_getutf8buffer(self):
+        from _testcapi import unicode_getutf8buffer
+
+        ascii_ = "foo"


Why not inline these variables?

Follow other tests

Variable name is a hint why this value is tested.

serhiy-storchaka · 2019-12-21T18:42:15Z

Objects/unicodeobject.c

+        return PyBuffer_FillInfo(view, unicode,
+                PyUnicode_UTF8(unicode),
+                PyUnicode_UTF8_LENGTH(unicode),
+                1, PyBUF_SIMPLE);


Suggested change

1, PyBUF_SIMPLE);

1 /* readonly */, PyBUF_SIMPLE);
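
For readers less familiar with the buffer protocol, the effect of the readonly flag can be observed from Python. This is only an analogy: it uses a memoryview over an immutable bytes object rather than the Py_buffer filled by the function under review. The exporter marks the buffer read-only, so writes through the view are rejected.

```python
data = "foo".encode("utf-8")  # immutable bytes, standing in for the UTF-8 buffer
view = memoryview(data)

# bytes exports a read-only buffer, comparable to passing readonly=1
# to PyBuffer_FillInfo() in C.
assert view.readonly

try:
    view[0] = ord("b")
except TypeError:
    write_rejected = True
else:
    write_rejected = False

assert write_rejected
assert bytes(view) == b"foo"
```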

vstinner · 2019-12-21T23:18:27Z

IMHO it's a bad old habit that test_capi runs "all" tests. I prefer that test_unicode tests the C API of Unicode objects. So rename the method instead.

serhiy-storchaka · 2019-12-25T08:57:43Z

Modules/_testcapimodule.c

+{
+    PyObject *unicode;
+    const char *errors = NULL;
+    if(!PyArg_ParseTuple(args, "U|s", &unicode, &errors)) {


The U format calls PyUnicode_READY(), which can affect the test. I suggest using O instead.

serhiy-storchaka · 2019-12-25T09:05:48Z

Modules/_testcapimodule.c

+                         "without exception set. (%s:%d)",
+                         __FILE__, __LINE__);
+        }
+        Py_DECREF(str);


Why decref twice?

serhiy-storchaka · 2019-12-25T09:10:10Z

Modules/_testcapimodule.c

+    // Test 3: There is a UTF-8 cache
+    // Reuse str of the previous test.
+
+    const char *cache = PyUnicode_AsUTF8(str);


Would it be too difficult to also test that there was no cache before calling PyUnicode_AsUTF8()?

vstinner · 2020-01-09T09:22:31Z

Doc/c-api/unicode.rst

@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs:
   raised by the codec.


+.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char *errors, Py_buffer *view)


I like having view at the end.

vstinner · 2020-01-09T09:29:16Z

Doc/c-api/unicode.rst

+      :c:function:`PyUnicode_AsUTF8AndSize`, this function does not cache the
+      UTF-8 representation of the string in the *unicode* object.
+      So this API is faster and more efficient when the *unicode* object is
+      not ASCII string and it is encoded into UTF-8 only once.


It's not obvious why avoiding a cache is more efficient. It seems these functions are slow because they internally need to copy the UTF-8 encoded byte string using malloc + memcpy, whereas PyUnicode_GetUTF8Buffer() doesn't. Maybe this sentence can be vaguer about the rationale:

When the *unicode* object is not an ASCII string and it is encoded into UTF-8 only once, this API is more efficient than :c:function:`PyUnicode_AsUTF8` and :c:function:`PyUnicode_AsUTF8AndSize`.

> It's not obvious why avoiding a cache is more efficient.

When the cache is never used, it consumes memory with no benefit.

> these functions are slow because they internally need to copy the UTF-8 encoded byte string using malloc + memcpy,

That is too detailed, and it is a fixable issue. See #18327.
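
The memory cost of an unused cache can be made visible with a small CPython-only sketch. It assumes ctypes.pythonapi is available and relies on an implementation detail, namely that sys.getsizeof() on a str accounts for the cached UTF-8 copy once PyUnicode_AsUTF8() has created it. Other interpreters or builds may behave differently.

```python
import ctypes
import sys

# Bind the C API function through ctypes (CPython only).
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p

# Build a non-ASCII string at run time so it starts without a UTF-8 cache.
text = "".join(["\u3042"] * 1000)
size_before = sys.getsizeof(text)

encoded = as_utf8(text)  # creates and stores the UTF-8 cache on the str object
size_after = sys.getsizeof(text)

assert encoded == text.encode("utf-8")

# The string object now also carries the cached UTF-8 bytes.
cache_overhead = size_after - size_before
print(f"UTF-8 cache overhead: {cache_overhead} bytes")
```

If the encoded bytes are used only once, that extra memory stays attached to the string for the rest of its lifetime, which is the cost the reply refers to.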

vstinner · 2020-01-09T09:31:06Z

Lib/test/test_unicode.py

+        # Run tests written in C.  Raise an error when test failed.
+        unicode_test_getutf8buffer()
+
+        ascii_ = "foo"


I suggest renaming ascii_ to asciistr or ascii_str. The "_" suffix looks strange :-)

vstinner · 2020-01-09T09:33:12Z

Modules/_testcapimodule.c

+{
+    Py_buffer buf;
+
+    // Test 1: ASCII string


Suggestion: would it be possible to factor out the code of each test into a helper function? It seems like the code of each test is basically copy/pasted. I don't think it matters to provide an accurate error message.

Add a parameter "utf8_cache" for test 3 to control whether PyUnicode_AsUTF8() is called.

vstinner · 2020-03-02T17:37:18Z

@methane: What's the status of this issue?

methane · 2020-03-03T03:38:47Z

@vstinner Now I have doubts about how useful this API is, since PyUnicode_AsUTF8AndSize is as fast as this API.

One of the merits of this API is cross-interpreter efficiency. But HPy will provide better cross-interpreter APIs. Will HPy have a similar API?

vstinner · 2020-03-03T10:05:41Z

> But HPy will provide better cross-interpreter APIs. Will HPy have a similar API?

No idea, you can ask at https://github.com/pyhandle/hpy/issues

serhiy-storchaka · 2020-03-10T06:57:58Z

What about adding this function to the private C API first? If it significantly speeds up and simplifies the code, we can rename it and make it public.

methane · 2020-03-12T08:26:47Z

> What about adding this function to the private C API first? If it significantly speeds up and simplifies the code, we can rename it and make it public.

I did it.

vstinner

LGTM. Let's add a private function and see if it's used or not :-)

My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.

I also added minor comments but I don't think that it's worth it to hold the PR for that. It's up to you to apply suggested changes or not.

methane · 2020-03-14T04:09:31Z

> My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.

GH-18984 is one example I found in the stdlib.

I found another candidate too, in cpython/Modules/_sqlite/connection.c (line 510 in c7ad974):

const char *str = PyUnicode_AsUTF8(py_val);

This function uses PyUnicode_AsUTF8(), but the UTF-8 cache of py_val will not be used afterwards.
However, py_val is freed soon after this function returns, so this example shows that the new API is not worthwhile...

I'm sorry for adding noise to the commit log, but I will revert this pull request and abandon this new API. I will spend some time looking for more use cases before the revert, though.

bpo-39087: Add PyUnicode_GetUTF8Buffer().

b923398

methane requested a review from a team as a code owner December 19, 2019 12:36

the-knights-who-say-ni added the CLA signed label Dec 19, 2019

bedevere-bot added the awaiting core review label Dec 19, 2019

methane requested review from vstinner and serhiy-storchaka December 19, 2019 12:36

vstinner reviewed Dec 19, 2019

methane added 3 commits December 20, 2019 19:45

Update doc

3fb6235

Write tests in C

3bac143

Add a comment

ec18bac

vstinner reviewed Dec 20, 2019

Don't call test_unicode_getutf8buffer from test_unicode.

2f1f8ac

serhiy-storchaka reviewed Dec 21, 2019

methane mentioned this pull request Dec 23, 2019

bpo-39087: Make PyUnicode_AsUTF8AndSize faster. #17683

Closed

fixup

3a27b8b

serhiy-storchaka reviewed Dec 25, 2019

fixup

7cef9a1

vstinner reviewed Jan 9, 2020

Update Include/cpython/unicodeobject.h

d92ed64

Co-Authored-By: Victor Stinner <[email protected]>

Merge branch 'master' into utf8-buffer

a32837f

methane added 2 commits March 12, 2020 17:12

Merge remote-tracking branch 'origin/master' into utf8-buffer

f8f8a91

Make the API private.

98ec45f

vstinner approved these changes Mar 12, 2020

bedevere-bot removed the awaiting core review label Mar 12, 2020

bedevere-bot added the awaiting merge label Mar 12, 2020

methane changed the title ~~bpo-39087: Add PyUnicode_GetUTF8Buffer().~~ bpo-39087: Add _PyUnicode_GetUTF8Buffer() Mar 14, 2020

methane merged commit c7ad974 into python:master Mar 14, 2020

methane deleted the utf8-buffer branch March 14, 2020 03:43

bedevere-bot removed the awaiting merge label Mar 14, 2020

methane added a commit that referenced this pull request Mar 14, 2020

Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)"

fd84166

This reverts commit c7ad974.

methane mentioned this pull request Mar 14, 2020

Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" #18985

Merged

methane added a commit that referenced this pull request Mar 14, 2020

Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)

3a8c562

* Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)" This reverts commit c7ad974. * Update unicodeobject.h
