-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
bpo-39087: Add _PyUnicode_GetUTF8Buffer() #17659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Doc/c-api/unicode.rst
Outdated
@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs: | |||
raised by the codec. | |||
|
|||
|
|||
.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the order of parameters. There is a logic in placing the output parameter at the end, but in PyObject_GetBuffer()
(for which I modelled the name) it is not the last parameter. Also, if we will add more similar functions in future, with multiple additional parameters (like flags
and errors
), it is better to place them at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having view at the end.
Modules/_testcapimodule.c
Outdated
} | ||
Py_ssize_t refcnt = Py_REFCNT(str); | ||
|
||
if (PyUnicode_GetUTF8Buffer(str, NULL, &buf) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not possible that PyUnicode_GetUTF8Buffer()
fails for an ASCII string.
def test_getutf8buffer(self): | ||
from _testcapi import unicode_getutf8buffer | ||
|
||
ascii_ = "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not inline these variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Follow other tests
- Variable name is a hint why this value is tested.
Objects/unicodeobject.c
Outdated
return PyBuffer_FillInfo(view, unicode, | ||
PyUnicode_UTF8(unicode), | ||
PyUnicode_UTF8_LENGTH(unicode), | ||
1, PyBUF_SIMPLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1, PyBUF_SIMPLE); | |
1 /* readonly */, PyBUF_SIMPLE); |
IMHO it's a bad old habit that test_capi runs "all" tests. I prefer that test_unicode tests the C API of Unicode objects. So rename the method instead. |
Modules/_testcapimodule.c
Outdated
{ | ||
PyObject *unicode; | ||
const char *errors = NULL; | ||
if(!PyArg_ParseTuple(args, "U|s", &unicode, &errors)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U
calls PyUnicode_READY()
. This can affect the testing. I suggest to use O
.
Modules/_testcapimodule.c
Outdated
"without exception set. (%s:%d)", | ||
__FILE__, __LINE__); | ||
} | ||
Py_DECREF(str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why decref twice?
// Test 3: There is a UTF-8 cache | ||
// Reuse str of the previoss test. | ||
|
||
const char *cache = PyUnicode_AsUTF8(str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too difficult to test also that there was no cache before calling PyUnicode_AsUTF8()
?
Doc/c-api/unicode.rst
Outdated
@@ -1061,6 +1061,28 @@ These are the UTF-8 codec APIs: | |||
raised by the codec. | |||
|
|||
|
|||
.. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having view at the end.
Doc/c-api/unicode.rst
Outdated
:c:function:`PyUnicode_AsUTF8AndSize`, this function does not cache the | ||
UTF-8 representation of the string in the *unicode* object. | ||
So this API is faster and more efficient when the *unicode* object is | ||
not ASCII string and it is encoded into UTF-8 only once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's non obvious why avoiding a cache is more efficient. It seems like these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy, whereas PyUnicode_GetUTF8Buffer() doesn't. Maybe this sentence can be more vague about the rationale:
When the *unicode* object is not ASCII string and it is encoded into UTF-8 only once,
this API is more efficient than :c:function:`PyUnicode_AsUTF8` and
:c:function:`PyUnicode_AsUTF8AndSize`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's non obvious why avoiding a cache is more efficient.
When the cache is never used, it consume memory with no benfit.
these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy,
It is too detailed and fixable issue. See #18327.
# Run tests wrtten in C. Raise an error when test failed. | ||
unicode_test_getutf8buffer() | ||
|
||
ascii_ = "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to rename ascii_ to asciistr or ascii_str. The "_" suffix looks strange :-)
{ | ||
Py_buffer buf; | ||
|
||
// Test 1: ASCII string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. Would it be possible to factorize the code of each test with an helper function? It seems like the code of each test is basically copy/pasted. I don't think that it matters to provide accurate error message.
Add a paremeter "utf8_cache" for test 3 to call or not PyUnicode_AsUTF8().
Co-Authored-By: Victor Stinner <[email protected]>
@methane: What's the status of this issue? |
@vstinner Now I have doubts about how this API is useful since PyUnicode_AsUTF8AndSize is as fast as this API. One of the merits of this API is efficiency in cross-interpreter. But HPy will provide better cross interpreter APIs. Will HPy have similar API? |
But HPy will provide better cross interpreter APIs. Will HPy have similar API?
No idea, you can ask at https://github.com/pyhandle/hpy/issues
|
What about adding this function first to private C API? If it significantly speeds up and simplifies the code we can rename it and make public. |
I did it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's add a private function and see if it's used or not :-)
My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.
I also added minor comments but I don't think that it's worth it to hold the PR for that. It's up to you to apply suggested changes or not.
GH-18984 is one example I found in the stdlib. I found cpython/Modules/_sqlite/connection.c Line 510 in c7ad974
This function used PyUnicode_AsUTF8 but utf8 cache of the py_val will not be used in the future. But the py_val will be freed soon after this function. So this example demonstrates this new API is not worth enough... I'm sorry about making garbage in commit log, but I will revert this pull request and abandon this new API. But I will spend some time to find more usage before the revert. |
Ref: bpo-39087