Improve memory usage and execution time of listing objects with file system backend #556
I am using the GCS Fake Server to develop locally, and it is mostly great. However, I've noticed it is completely unable to list the objects in my file system bucket, even when I give it a prefix that ensures only one object will match. It consumes all of the machine's memory and never finishes, presumably because it spends all of its time swapping. Information about my bucket: 53017 files with an overall size of 20.3 GB. Sadly, the nature of my work is such that this is a relatively small data set.
So I went in and started poking around the code. It quickly became evident that two things were happening:
This PR fixes those two issues in its two commits. Previously, the list objects command was consuming all of my machine's 32 GB of RAM and still had not finished after I'd waited on it for half an hour. Now such list commands use almost no memory (in the range of a few KB) and finish almost instantly.
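To illustrate the general idea (this is a minimal sketch, not the emulator's actual code; `listObjectNames` and the paths are made up): prefix filtering can be done on directory entry names alone, so objects that don't match are never opened, let alone read into memory.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// listObjectNames is a hypothetical helper illustrating the idea:
// the prefix check needs only the file name, so non-matching objects
// are skipped without ever opening their content files.
func listObjectNames(bucketDir, prefix string) ([]string, error) {
	entries, err := os.ReadDir(bucketDir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasPrefix(e.Name(), prefix) {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

func main() {
	// Hypothetical bucket directory and prefix, for illustration only.
	names, err := listObjectNames("/tmp/my-bucket", "reports-")
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```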
While the above is great, I suspect there are many more places where the emulator could become significantly faster; I just haven't measured them. Off the top of my head, I can see that deleting a bucket will now require no extra memory, whereas before it had the same problem as listing objects.
Further Improvements
It would be great if it were possible to read only the metadata for blobs stored on the file system. Unfortunately, with the JSON encoding I don't see how that would be possible: as it stands, one has to load the full file contents for the JSON parser to do its thing. That is a pity when we consider that in many situations we only want the blob metadata.
I think the only way to achieve this cleanly would be to drop the JSON altogether and find another way of storing the metadata. Possible approaches are file headers, similar to the nginx file cache, or separate ".attrs" files like what gocloud.dev/blob does.
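To make the second option concrete, here is a rough sketch of what a ".attrs" sidecar layout could look like, in the spirit of gocloud.dev/blob. All names here (`blobAttrs`, `writeBlob`, `readAttrs`) are illustrative assumptions, not a proposal for the emulator's actual API: the point is only that attributes live in a small separate file, so reading them never touches the (potentially huge) content file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// blobAttrs is a hypothetical metadata record kept in a sidecar file
// next to the blob's content file.
type blobAttrs struct {
	Name        string            `json:"name"`
	ContentType string            `json:"contentType"`
	Size        int64             `json:"size"`
	Metadata    map[string]string `json:"metadata,omitempty"`
}

// writeBlob stores content and attributes in separate files:
// <path> holds the raw bytes, <path>.attrs holds the JSON metadata.
func writeBlob(path string, content []byte, attrs blobAttrs) error {
	if err := os.WriteFile(path, content, 0o644); err != nil {
		return err
	}
	raw, err := json.Marshal(attrs)
	if err != nil {
		return err
	}
	return os.WriteFile(path+".attrs", raw, 0o644)
}

// readAttrs loads only the small sidecar file; the content file is
// never opened, regardless of how large the blob is.
func readAttrs(path string) (blobAttrs, error) {
	var attrs blobAttrs
	raw, err := os.ReadFile(path + ".attrs")
	if err != nil {
		return attrs, err
	}
	err = json.Unmarshal(raw, &attrs)
	return attrs, err
}

func main() {
	attrs := blobAttrs{Name: "hello.txt", ContentType: "text/plain", Size: 5}
	if err := writeBlob("/tmp/hello.txt", []byte("hello"), attrs); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	got, err := readAttrs("/tmp/hello.txt")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("attrs without reading content: %+v\n", got)
}
```

The nginx-style header approach would instead put a fixed-size metadata block at the start of the content file itself, trading an extra file per blob for a seek-and-read on a single file; either way, listing objects stops being proportional to total blob size.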