-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Middleware to shard objects across a configurable number of buckets #351
Comments
A scheme like this would depend on hashing the object name and placing it in the appropriate shard. There are a few notable issues with this middleware:
|
Resharding is technically and operationally difficult. Perhaps you could defer it to an external mechanism, for example requiring users to duplicate objects into twice as many buckets? This would require a careful use of the hash function so that the mapping works. I believe the Swift ring only supports a fixed range of growth, e.g., starting with 256 virtual buckets in 8 physical buckets that can only grow to the virtual size. You might want to read about extendible hashing more sophisticated approaches. If you want to support listing buckets in lexicographical order it will limit your ability to distribute objects evenly between buckets. Maybe you can just raise HTTP 501 NotImplemented and distribute using a hash of the name instead? I worry that creating many buckets could trigger rate limiting on some providers. It might make more sense to defer this to an external program to provision. I recommend putting a superblock or other kind of metadata in each physical bucket that encodes the overall virtual scheme, e.g., number of buckets, hashing scheme, and a version number. |
Great points!
Agreed. I will defer the resharding to the consumer. I think the way this could work is by setting up a second "virtual" bucket with the new desired number of shards and doing server side copy from one "virtual bucket" to the other.
What do you think about listing all the shards in parallel and stitching the responses? This would amplify the number of requests by the number of shards, which may be undesirable.
I think both could be possible: CreateContainer could attempt to create the buckets in parallel, but nothing would preclude the end user from creating the shard buckets themselves, as long as the naming scheme is documented.
Good idea. |
Adds the sharded bucket middleware, which allows for splitting objects across multiple backend buckets for a given virtual bucket. The middleware should be configured as: s3proxy.sharded-blobstore.<bucket name>.shards=<number of shards> s3proxy.sharded-blobstore.<bucket name>.prefix=<prefix>. All shards are named <prefix>-<index>, where index is an integer from 0 to <number of shards> - 1. If the <prefix> is not supplied, the <bucket name> is used as the prefix. Listing the virtual bucket and multipart uploads are not supported. When listing all containers, the shards are elided from the result. Fixes gaul#325 Fixes gaul#351
Adds the sharded bucket middleware, which allows for splitting objects across multiple backend buckets for a given virtual bucket. The middleware should be configured as: s3proxy.sharded-blobstore.<bucket name>.shards=<number of shards> s3proxy.sharded-blobstore.<bucket name>.prefix=<prefix>. All shards are named <prefix>-<index>, where index is an integer from 0 to <number of shards> - 1. If the <prefix> is not supplied, the <bucket name> is used as the prefix. Listing the virtual bucket and multipart uploads are not supported. When listing all containers, the shards are elided from the result. Fixes gaul#325 Fixes gaul#351
Adds the sharded bucket middleware, which allows for splitting objects across multiple backend buckets for a given virtual bucket. The middleware should be configured as: s3proxy.sharded-blobstore.<bucket name>.shards=<number of shards> s3proxy.sharded-blobstore.<bucket name>.prefix=<prefix>. All shards are named <prefix>-<index>, where index is an integer from 0 to <number of shards> - 1. If the <prefix> is not supplied, the <bucket name> is used as the prefix. Listing the virtual bucket and multipart uploads are not supported. When listing all containers, the shards are elided from the result. Fixes gaul#325 Fixes gaul#351
Adds the sharded bucket middleware, which allows for splitting objects across multiple backend buckets for a given virtual bucket. The middleware should be configured as: s3proxy.sharded-blobstore.<bucket name>.shards=<number of shards> s3proxy.sharded-blobstore.<bucket name>.prefix=<prefix>. All shards are named <prefix>-<index>, where index is an integer from 0 to <number of shards> - 1. If the <prefix> is not supplied, the <bucket name> is used as the prefix. Listing the virtual bucket and multipart uploads are not supported. When listing all containers, the shards are elided from the result. Fixes gaul#325 Fixes gaul#351
Adds the sharded bucket middleware, which allows for splitting objects across multiple backend buckets for a given virtual bucket. The middleware should be configured as: s3proxy.sharded-blobstore.<bucket name>.shards=<number of shards> s3proxy.sharded-blobstore.<bucket name>.prefix=<prefix>. All shards are named <prefix>-<index>, where index is an integer from 0 to <number of shards> - 1. If the <prefix> is not supplied, the <bucket name> is used as the prefix. Listing the virtual bucket and multipart uploads are not supported. When listing all containers, the shards are elided from the result. Fixes gaul#325 Fixes gaul#351
Some object store providers impose rate limits or perform poorly as the number of objects in a bucket exceeds certain amounts. It would be great to have a middleware in S3Proxy that shards content across a configurable number of buckets and presents a virtual bucket to the end user.
The text was updated successfully, but these errors were encountered: