Skip to content

bpo-31993: Do not use memoryview when pickle large strings.#5154

Merged
serhiy-storchaka merged 4 commits into
python:masterfrom
serhiy-storchaka:pickle-large-str-memoryview
Jan 12, 2018
Merged

bpo-31993: Do not use memoryview when pickle large strings.#5154
serhiy-storchaka merged 4 commits into
python:masterfrom
serhiy-storchaka:pickle-large-str-memoryview

Conversation

@serhiy-storchaka

@serhiy-storchaka serhiy-storchaka commented Jan 11, 2018

Copy link
Copy Markdown
Member

PyMemoryView_FromMemory() created a memoryview referring to the internal data of the string. When the string is destroyed the memoryview become referring to a freed memory.

https://bugs.python.org/issue31993

PyMemoryView_FromMemory() created a memoryview referring to
the internal data of the string.  When the string is destroyed
the memoryview become referring to a freed memory.

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. I am concerned about the high memory usage incurred by the use of PyBytes_FromStringAndSize but off-course I agree that the priority is to fix the safety issue.

Comment thread Lib/test/pickletester.py Outdated
@@ -2179,57 +2179,60 @@ def write(self, chunk):
def concatenate_chunks(self):
# Some chunks can be memoryview instances, we need to convert
# them to bytes to be able to call join

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should now be removed.

Comment thread Lib/test/pickletester.py Outdated

# Actually read the binary content of the chunks after the end
# of the call to dump: ant memoryview passed to write should not
# of the call to dump: and memoryview passed to write should not

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read "any memoryview" instead.

Comment thread Lib/test/pickletester.py Outdated
self.assertGreaterEqual(9, chunk_size)
self.assertLess(chunk_size, 2 * self.FRAME_SIZE_TARGET,
chunk_sizes)
# There shouldn't bee too much small chunks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment: the protocol header, the frame headers and the large string headers are written in small chunks.

Comment thread Modules/_pickle.c
if (payload == NULL) {
payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
PyBUF_READ);
payload = mem = PyBytes_FromStringAndSize(data, data_size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces a memory copy. Wouldn't it be possible to create a read-only memoryview that keeps a reference to the original ascii str object to avoid the large memory overhead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@serhiy-storchaka serhiy-storchaka left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review @ogrisel.

Comment thread Modules/_pickle.c
if (payload == NULL) {
payload = mem = PyMemoryView_FromMemory((char *) data, data_size,
PyBUF_READ);
payload = mem = PyBytes_FromStringAndSize(data, data_size);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not easy. Currently there is no API for creating a memoryview to the area of memory with linked Python object. Designing such API is a separate issue. There should be other issues on the tracker that needs this API. For example bytes IO seems suffers from this issue.

Here I just try to fix a regression introduced by #4353.

@pitrou

pitrou commented Jan 11, 2018

Copy link
Copy Markdown
Member

This looks fine to me. Does the previous NEWS file need updating?

Comment thread Misc/NEWS.d/3.7.0a4.rst Outdated
The picklers no longer allocate temporary memory when dumping large
``bytes`` and ``str`` objects into a file object. Instead the data is
directly streamed into the underlying file object.
The pickler now uses less memory when serialize large bytes and str

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when serializing

@serhiy-storchaka serhiy-storchaka merged commit 5b76bdb into python:master Jan 12, 2018
@serhiy-storchaka serhiy-storchaka deleted the pickle-large-str-memoryview branch January 12, 2018 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip news type-bug An unexpected behavior, bug, or error

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants