Skip to content

bpo-39087: Add _PyUnicode_GetUTF8Buffer() #17659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 14, 2020
Merged

Conversation

methane
Copy link
Member

@methane methane commented Dec 19, 2019

Ref: bpo-39087

@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs:
raised by the codec.


.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the order of parameters. There is a logic in placing the output parameter at the end, but in PyObject_GetBuffer() (for which I modelled the name) it is not the last parameter. Also, if we will add more similar functions in future, with multiple additional parameters (like flags and errors), it is better to place them at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having view at the end.

}
Py_ssize_t refcnt = Py_REFCNT(str);

if (PyUnicode_GetUTF8Buffer(str, NULL, &buf) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not possible that PyUnicode_GetUTF8Buffer() fails for an ASCII string.

def test_getutf8buffer(self):
from _testcapi import unicode_getutf8buffer

ascii_ = "foo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not inline these variables?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Follow other tests
  • Variable name is a hint why this value is tested.

return PyBuffer_FillInfo(view, unicode,
PyUnicode_UTF8(unicode),
PyUnicode_UTF8_LENGTH(unicode),
1, PyBUF_SIMPLE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1, PyBUF_SIMPLE);
1 /* readonly */, PyBUF_SIMPLE);

@vstinner
Copy link
Member

IMHO it's a bad old habit that test_capi runs "all" tests. I prefer that test_unicode tests the C API of Unicode objects. So rename the method instead.

{
PyObject *unicode;
const char *errors = NULL;
if(!PyArg_ParseTuple(args, "U|s", &unicode, &errors)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U calls PyUnicode_READY(). This can affect the testing. I suggest to use O.

"without exception set. (%s:%d)",
__FILE__, __LINE__);
}
Py_DECREF(str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why decref twice?

// Test 3: There is a UTF-8 cache
// Reuse str of the previoss test.

const char *cache = PyUnicode_AsUTF8(str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be too difficult to test also that there was no cache before calling PyUnicode_AsUTF8()?

@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs:
raised by the codec.


.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having view at the end.

:c:function:`PyUnicode_AsUTF8AndSize`, this function does not cache the
UTF-8 representation of the string in the *unicode* object.
So this API is faster and more efficient when the *unicode* object is
not ASCII string and it is encoded into UTF-8 only once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non obvious why avoiding a cache is more efficient. It seems like these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy, whereas PyUnicode_GetUTF8Buffer() doesn't. Maybe this sentence can be more vague about the rationale:

When the *unicode* object is not ASCII string and it is encoded into UTF-8 only once,
this API is more efficient than :c:function:`PyUnicode_AsUTF8` and
:c:function:`PyUnicode_AsUTF8AndSize`.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non obvious why avoiding a cache is more efficient.

When the cache is never used, it consume memory with no benfit.

these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy,

It is too detailed and fixable issue. See #18327.

# Run tests wrtten in C. Raise an error when test failed.
unicode_test_getutf8buffer()

ascii_ = "foo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to rename ascii_ to asciistr or ascii_str. The "_" suffix looks strange :-)

{
Py_buffer buf;

// Test 1: ASCII string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion. Would it be possible to factorize the code of each test with an helper function? It seems like the code of each test is basically copy/pasted. I don't think that it matters to provide accurate error message.

Add a paremeter "utf8_cache" for test 3 to call or not PyUnicode_AsUTF8().

@vstinner
Copy link
Member

vstinner commented Mar 2, 2020

@methane: What's the status of this issue?

@methane
Copy link
Member Author

methane commented Mar 3, 2020

@vstinner Now I have doubts about how this API is useful since PyUnicode_AsUTF8AndSize is as fast as this API.

One of the merits of this API is efficiency in cross-interpreter. But HPy will provide better cross interpreter APIs. Will HPy have similar API?

@vstinner
Copy link
Member

vstinner commented Mar 3, 2020 via email

@serhiy-storchaka
Copy link
Member

What about adding this function first to private C API? If it significantly speeds up and simplifies the code we can rename it and make public.

@methane
Copy link
Member Author

methane commented Mar 12, 2020

What about adding this function first to private C API? If it significantly speeds up and simplifies the code we can rename it and make public.

I did it.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's add a private function and see if it's used or not :-)

My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.

I also added minor comments but I don't think that it's worth it to hold the PR for that. It's up to you to apply suggested changes or not.

@methane methane changed the title bpo-39087: Add PyUnicode_GetUTF8Buffer(). bpo-39087: Add _PyUnicode_GetUTF8Buffer() Mar 14, 2020
@methane methane merged commit c7ad974 into python:master Mar 14, 2020
@methane methane deleted the utf8-buffer branch March 14, 2020 03:43
@methane
Copy link
Member Author

methane commented Mar 14, 2020

My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.

GH-18984 is one example I found in the stdlib.

I found

const char *str = PyUnicode_AsUTF8(py_val);
too.
This function used PyUnicode_AsUTF8 but utf8 cache of the py_val will not be used in the future.
But the py_val will be freed soon after this function. So this example demonstrates this new API is not worth enough...

I'm sorry about making garbage in commit log, but I will revert this pull request and abandon this new API. But I will spend some time to find more usage before the revert.

methane added a commit that referenced this pull request Mar 14, 2020
methane added a commit that referenced this pull request Mar 14, 2020
* Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)"

This reverts commit c7ad974.

* Update unicodeobject.h
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants