
bpo-31993: Do not use memoryview when pickle large strings. #5154


Merged

Conversation

@serhiy-storchaka (Member) commented Jan 11, 2018:

PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed, the memoryview ends up referring to freed memory.

https://bugs.python.org/issue31993
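A minimal sketch of the hazard (illustrative only, not code from the patch; the function name is hypothetical): PyMemoryView_FromMemory() wraps a raw pointer and keeps no reference to the object that owns the memory, so the view dangles once that owner is deallocated.

    #include <Python.h>

    /* Sketch of the use-after-free, assuming a large ASCII str whose
       UTF-8 representation is its internal data. Error checks elided. */
    static void dangling_view_demo(void)
    {
        PyObject *str = PyUnicode_FromString("large ASCII payload ...");
        Py_ssize_t size;
        /* Borrowed pointer into str's internal buffer: */
        const char *data = PyUnicode_AsUTF8AndSize(str, &size);
        /* The memoryview records only the raw pointer; no reference
           to str is taken: */
        PyObject *view = PyMemoryView_FromMemory((char *)data, size,
                                                 PyBUF_READ);
        Py_DECREF(str);   /* str can be freed here...                  */
        /* ...and any later read through `view` is a use-after-free.   */
        Py_DECREF(view);
    }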

@ogrisel (Contributor) left a comment:

Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize, but of course I agree that the priority is to fix the safety issue.

@@ -2179,57 +2179,60 @@ def write(self, chunk):
def concatenate_chunks(self):
# Some chunks can be memoryview instances, we need to convert
# them to bytes to be able to call join
Contributor:

This comment should now be removed.


  # Actually read the binary content of the chunks after the end
- # of the call to dump: ant memoryview passed to write should not
+ # of the call to dump: and memoryview passed to write should not
Contributor:

I think this should read "any memoryview" instead.

self.assertGreaterEqual(9, chunk_size)
self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET,
chunk_sizes)
# There shouldn't bee too much small chunks.
Contributor:

too many

Contributor:

Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.

@@ -2184,8 +2184,7 @@ _Pickler_write_bytes(PicklerObject *self,
     /* Stream write the payload into the file without going through the
        output buffer. */
     if (payload == NULL) {
-        payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
-                                                PyBUF_READ);
+        payload = mem = PyBytes_FromStringAndSize(data, data_size);
Contributor:

This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ASCII str object to avoid the large memory overhead?

Member Author (@serhiy-storchaka):

This is not easy. Currently there is no API for creating a memoryview over an area of memory with a linked Python object. Designing such an API is a separate issue. There are other issues on the tracker that need this API; for example, BytesIO seems to suffer from the same problem.

Here I am just trying to fix a regression introduced by #4353.

Contributor:

I agree.
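For contrast, a sketch of why the merged fix is safe (assuming the same data/data_size as in the hunk above; surrounding error handling trimmed): PyBytes_FromStringAndSize() copies the bytes into a new object that owns its buffer, so the payload stays valid no matter when the source string is released, at the cost of the transient copy noted above.

    /* The bytes object owns a private copy of `data`, so the payload
       cannot dangle; the trade-off is one extra allocation and memcpy. */
    PyObject *payload = PyBytes_FromStringAndSize(data, data_size);
    if (payload == NULL)
        return -1;                 /* propagate the MemoryError */
    /* ... stream `payload` into the underlying file object ... */
    Py_DECREF(payload);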

@serhiy-storchaka (Member Author) left a comment:

Thank you for your review @ogrisel.

@pitrou (Member) commented Jan 11, 2018:

This looks fine to me. Does the previous NEWS file need updating?

-The picklers no longer allocate temporary memory when dumping large
-``bytes`` and ``str`` objects into a file object. Instead the data is
-directly streamed into the underlying file object.
+The pickler now uses less memory when serialize large bytes and str
Contributor:

when serializing

@serhiy-storchaka serhiy-storchaka merged commit 5b76bdb into python:master Jan 12, 2018
@serhiy-storchaka serhiy-storchaka deleted the pickle-large-str-memoryview branch January 12, 2018 22:28
Labels: skip news, type-bug (an unexpected behavior, bug, or error)