high memory usage caused by chunks being combined in-mem #64
Thanks for highlighting this issue, @trondhindenes, and taking the time to investigate it. When django-drf-filepond was originally created, I have to say that the handling of very large file uploads was not something that was really considered in detail. However, I now see from this and a couple of other recent issues that this is an important use case, so it would be great to resolve this.

I wonder if the "2x" memory usage that you're seeing is coming from the fact that the chunks are loaded into a `BytesIO` object which is then copied when it is wrapped for storage. More generally, this approach was originally taken so that a suitable file object can be passed in the creation of the `TemporaryUpload`. For chunked uploads, an alternative to the use of an in-memory buffer would certainly be workable.

I think the general aim, where possible, was to avoid writing files to disk for both performance and security reasons. However, where we're looking at large uploads that need to be chunked, if the upload fails due to poor connectivity or other issues, it was decided that providing the ability to resume an upload, rather than waste time/bandwidth re-uploading all the data that had already been successfully transferred, was a primary aim, since files are ultimately going to be stored on disk anyway when the `TemporaryUpload` is created.

If you'd like to open a PR to resolve this issue, please feel free to go ahead; all contributions are much appreciated. Equally, if you'd like me to make the updates, I can do this, but there will be a bit of a delay due to other work.
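To make the pattern under discussion concrete, here is a minimal sketch of in-memory chunk assembly as described above. This is an illustration, not the library's actual code; the function name, the chunk-path list, and the content type are assumptions for the example.

```python
import io
from django.core.files.uploadedfile import InMemoryUploadedFile

def assemble_in_memory(chunk_paths, upload_name):
    """Illustrative only: read every stored chunk into a BytesIO buffer
    and wrap it in an InMemoryUploadedFile. Holding the buffer plus any
    copy of it is a plausible source of the observed ~2x memory use."""
    buf = io.BytesIO()
    for path in chunk_paths:
        with open(path, 'rb') as cf:
            buf.write(cf.read())
    buf.seek(0)
    return InMemoryUploadedFile(
        buf, None, upload_name, 'application/octet-stream',
        buf.getbuffer().nbytes, None)
```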
Hi, thanks for responding. I think your assumption about the 2x memory usage is correct. I'm not too familiar with the codebase to understand every nuance and consequence, but I can tell you what we've done to get around it short-term: instead of loading a `BytesIO` object, we use a temporary file:

```python
import os
import tempfile

# `storage` and `tuc` (the chunked upload record) come from the
# surrounding django-drf-filepond code.
# Load each of the file parts into a tempfile
chunk_dir = os.path.join(storage.base_location, tuc.upload_dir)
fd, temp_file_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as file_data:
    for i in range(1, tuc.last_chunk + 1):
        chunk_file = os.path.join(chunk_dir, '%s_%s' % (tuc.file_id, i))
        if not os.path.exists(chunk_file):
            raise FileNotFoundError('Chunk file not found for chunk <%s>'
                                    % (i))
        with open(chunk_file, 'rb') as cf:
            file_data.write(cf.read())
```

We don't create an `InMemoryUploadedFile` at all; instead, we move the assembled temp file into place so it can be picked up from disk, moving rather than copying to keep the finalizer fast:

```python
import shutil

assembled_tempfile_pth = os.path.join(chunk_dir, tuc.upload_id)
shutil.move(temp_file_path, assembled_tempfile_pth)
```

This required us to make a few changes to the surrounding upload-handling code.
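For context on how an assembled on-disk file could then be handed to Django without pulling it back into memory, here is a hedged sketch using `django.core.files.File`. `tu` stands for a hypothetical `TemporaryUpload` instance with a `FileField` named `file`; it is not taken from the snippet above.

```python
from django.core.files import File

# Hypothetical continuation of the snippet above: `tu` is assumed to be
# a TemporaryUpload model instance. FileField.save() streams the content
# to the storage backend, so the assembled upload never needs to be held
# fully in memory.
with open(assembled_tempfile_pth, 'rb') as f:
    tu.file.save(tuc.upload_name, File(f), save=True)
```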
Hi @trondhindenes, apologies for the long delay in implementing a comprehensive fix for this issue, but I've now finished implementing a new approach to handling chunked uploads.

As you observed when originally reporting this issue, previously, for a chunked upload, the separate file chunks stored to disk were being reconstituted into a complete file in memory, and this `InMemoryUploadedFile` object was then being passed as the content for the `TemporaryUpload`. The new approach avoids rebuilding the complete file in memory when finalizing a chunked upload.

The new functionality has just been merged in #82 and will be released in a release candidate version of 0.5.0 in the next couple of days. If you're still working with this, feel free to go ahead and test the code in main (or try 0.5.0rc1 when this is available) and provide any feedback.
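The exact implementation merged in #82 isn't reproduced here, but the underlying idea can be sketched as a file-like reader that streams the stored chunk files from disk on demand rather than concatenating them in memory. All names below are illustrative assumptions.

```python
import os

class ChunkedFileReader:
    """Illustrative only, NOT the class added in #82: stream previously
    stored chunk files from disk instead of assembling them in memory."""

    def __init__(self, chunk_dir, file_id, num_chunks):
        # Chunk files are assumed to be named '<file_id>_<n>', matching
        # the layout used in the workaround snippet earlier in this thread.
        self.paths = [
            os.path.join(chunk_dir, '%s_%s' % (file_id, i))
            for i in range(1, num_chunks + 1)
        ]

    def chunks(self, chunk_size=64 * 1024):
        # Yield the upload one block at a time, never holding more than
        # chunk_size bytes of file data in memory.
        for path in self.paths:
            with open(path, 'rb') as f:
                while True:
                    data = f.read(chunk_size)
                    if not data:
                        break
                    yield data
```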
Thanks for letting me know!
Just to confirm that this feature has been included in release v0.5.0, which has just been made available on PyPI.
On large file uploads we're seeing heavy memory usage, roughly 2x the size of an upload, while finalizing the upload. After a lot of digging, it turns out that the culprit seems to be how django-drf-filepond pieces together chunks by using an in-memory `BytesIO` object. The "2x" usage seems to be related to copies of that `BytesIO` object.

We've created a proof of concept that uses Python tempfiles instead of `BytesIO`, and moves files rather than copying them, in order to avoid slowing down the upload finalizer. We're prepared to spend time cleaning this up and creating a PR for it, but before doing so I was wondering if we should attempt to make the choice of in-memory or tempfile configurable using a setting. Doing so would cause some extra complexity, so it would take us some extra time, but in terms of backwards compatibility it would be safer, I guess. For smaller uploads I guess the in-memory method is slightly faster too. Maybe we could even make it so that small uploads are pieced together in memory while larger ones are built using tempfiles, as in the sketch below.

In any case, it would be good to get some feedback from the maintainers on this.
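For what it's worth, the standard library already has a primitive matching the small-in-memory/large-on-disk idea: `tempfile.SpooledTemporaryFile` buffers in memory up to `max_size` and then transparently rolls over to a real temporary file. A hedged sketch follows; the threshold and function name are illustrative, not existing settings.

```python
import tempfile

# Illustrative threshold, not a django-drf-filepond setting: uploads up
# to 10 MB are assembled in memory, larger ones spill to a temp file.
ASSEMBLY_MEMORY_LIMIT = 10 * 1024 * 1024

def assemble_chunks(chunk_paths):
    out = tempfile.SpooledTemporaryFile(max_size=ASSEMBLY_MEMORY_LIMIT)
    for path in chunk_paths:
        with open(path, 'rb') as cf:
            # Copy in fixed-size blocks so memory use stays bounded even
            # for very large individual chunk files.
            while True:
                block = cf.read(64 * 1024)
                if not block:
                    break
                out.write(block)
    out.seek(0)
    return out
```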