S3RepositoryBackend: Improve performance of has_objects #20

Open
sphuber opened this issue Feb 27, 2023 · 0 comments

sphuber commented Feb 27, 2023

The has_objects implementation uses list_objects to fetch all existing objects and compares them against the keys whose existence needs to be checked. The problem is that listing objects is typically a very expensive operation for object stores, and listing every key present in the storage even more so.

The has_objects method is called by the AbstractRepositoryBackend.delete_objects method and, indirectly, by AbstractRepositoryBackend.delete_object. We should investigate whether we can avoid using list_objects in has_objects. One approach would be to issue a HEAD request for each object, which returns the object's metadata without retrieving the object itself. This would probably be more efficient when there are few keys to check. But there should be a cross-over point: once the number of keys passed to has_objects is large enough, the cost of the sheer number of requests (one per key) would exceed the cost of list_objects.
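As a rough illustration of the HEAD-per-key idea, here is a minimal sketch. It assumes a boto3 S3 client is passed in as client and that a missing key surfaces as a ClientError with error code "404"; the function name has_objects_via_head and the bucket parameter are hypothetical and not part of the existing backend.

```python
from typing import Iterable, List


def has_objects_via_head(client, bucket: str, keys: Iterable[str]) -> List[bool]:
    """Return one boolean per key, issuing one HEAD request per key.

    Hypothetical helper: ``client`` is assumed to be a boto3 S3 client.
    """
    results = []
    for key in keys:
        try:
            # HEAD fetches only the metadata, never the object body.
            client.head_object(Bucket=bucket, Key=key)
        except client.exceptions.ClientError as exc:
            code = exc.response.get("Error", {}).get("Code")
            # Assumption: a missing key is reported with one of these codes.
            if code in ("404", "NoSuchKey", "NotFound"):
                results.append(False)
            else:
                # Permission or network problems should not be hidden.
                raise
        else:
            results.append(True)
    return results
```

This costs exactly one request per key, which is why it should only win for a small number of keys.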

The trouble is that the best solution therefore probably depends not just on the number of objects in the repository, but also on the number of keys whose existence needs to be checked.
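To make the trade-off concrete, the following sketch compares request counts only. It assumes that an estimate of the number of stored objects is available (which the backend may not have) and that a listing returns at most 1000 keys per request; the function name and the simple cost model are hypothetical, and real costs also depend on latency and pricing.

```python
import math


def cheaper_strategy(n_keys_to_check: int, n_objects_estimate: int, page_size: int = 1000) -> str:
    """Pick 'head' or 'list' by comparing the number of requests each needs.

    Hypothetical cost model: one HEAD request per key to check versus one
    listing request per page of stored objects.
    """
    head_requests = n_keys_to_check
    # Even an empty bucket needs one listing request.
    list_requests = max(1, math.ceil(n_objects_estimate / page_size))
    return "head" if head_requests <= list_requests else "list"
```

Under this model, checking 5 keys in a repository of 100,000 objects favors HEAD (5 requests versus 100), while checking 5,000 keys in the same repository favors listing.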

Alternatively, since the method is now only used directly by the delete methods, maybe these can change their implementation to not check explicitly before deleting, but simply delete and catch errors for non-existing keys. The delete_objects method of the boto3 S3 client supports this and will delete existing objects and return an error message for non-existing ones. The only problem is that AbstractRepositoryBackend.delete_objects is currently implemented such that no files are deleted if any one of the provided keys does not exist. It is not clear whether this behavior can be changed to simply delete the keys that exist and log a message or raise for the keys that do not.
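A minimal sketch of the delete-without-checking approach might look as follows. It assumes a boto3 S3 client and relies on the Errors list in the delete_objects response; whether a non-existing key actually shows up there depends on the storage backend and should be verified before relying on it. The function name and the bucket parameter are hypothetical.

```python
from typing import Iterable, List


def delete_without_check(client, bucket: str, keys: Iterable[str], batch_size: int = 1000) -> List[str]:
    """Delete keys in batches and return the keys reported as failed.

    Hypothetical helper: ``client`` is assumed to be a boto3 S3 client.
    A single delete_objects request accepts at most 1000 keys.
    """
    keys = list(keys)
    failed = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        response = client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )
        # Assumption: keys that could not be deleted are listed under "Errors".
        failed.extend(error["Key"] for error in response.get("Errors", []))
    return failed
```

The caller can then decide whether to log the returned keys or raise, which is exactly the behavioral question left open above. Note that this is not all-or-nothing: keys in the same request that do exist are deleted regardless.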
