-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
bpo-31993: Do not use memoryview when pickle large strings. #5154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-31993: Do not use memoryview when pickle large strings. #5154
Conversation
PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize
but off-course I agree that the priority is to fix the safety issue.
Lib/test/pickletester.py
Outdated
@@ -2179,57 +2179,60 @@ def write(self, chunk): | |||
def concatenate_chunks(self): | |||
# Some chunks can be memoryview instances, we need to convert | |||
# them to bytes to be able to call join |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should now be removed.
Lib/test/pickletester.py
Outdated
|
||
# Actually read the binary content of the chunks after the end | ||
# of the call to dump: ant memoryview passed to write should not | ||
# of the call to dump: and memoryview passed to write should not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should read "any memoryview" instead.
Lib/test/pickletester.py
Outdated
self.assertGreaterEqual(9, chunk_size) | ||
self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET, | ||
chunk_sizes) | ||
# There shouldn't bee too much small chunks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too many
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.
Modules/_pickle.c
Outdated
@@ -2184,8 +2184,7 @@ _Pickler_write_bytes(PicklerObject *self, | |||
/* Stream write the payload into the file without going through the | |||
output buffer. */ | |||
if (payload == NULL) { | |||
payload = mem = PyMemoryView_FromMemory((char *) data, data_size, | |||
PyBUF_READ); | |||
payload = mem = PyBytes_FromStringAndSize(data, data_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ascii str
object to avoid the large memory overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.
Here I just try to fix a regression introduced by #4353.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review @ogrisel.
Modules/_pickle.c
Outdated
@@ -2184,8 +2184,7 @@ _Pickler_write_bytes(PicklerObject *self, | |||
/* Stream write the payload into the file without going through the | |||
output buffer. */ | |||
if (payload == NULL) { | |||
payload = mem = PyMemoryView_FromMemory((char *) data, data_size, | |||
PyBUF_READ); | |||
payload = mem = PyBytes_FromStringAndSize(data, data_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.
Here I just try to fix a regression introduced by #4353.
This looks fine to me. Does the previous NEWS file need updating? |
Misc/NEWS.d/3.7.0a4.rst
Outdated
The picklers no longer allocate temporary memory when dumping large | ||
``bytes`` and ``str`` objects into a file object. Instead the data is | ||
directly streamed into the underlying file object. | ||
The pickler now uses less memory when serialize large bytes and str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when serializing
PyMemoryView_FromMemory()
created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.https://bugs.python.org/issue31993