Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Support for S3 version IDs #126

Closed
nlangellier opened this issue May 20, 2023 · 10 comments
Closed

Feature request: Support for S3 version IDs #126

nlangellier opened this issue May 20, 2023 · 10 comments

Comments

@nlangellier
Copy link
Contributor

nlangellier commented May 20, 2023

Hi @liormizr It would be nice to allow accessing specific versions of S3 objects for buckets that have enabled versioning.

This could be accomplished by adding an optional version_id=None argument to PureS3Path.from_uri and PureS3Path.from_bucket_key as well as defining S3Path.__new__ like the following:

def __new__(cls, *args, version_id=None, **kwargs):
    self = super().__new__(cls, *args, **kwargs)
    self.version_id = version_id
    return self

Then _S3Accessor.open would update the transport_params dict to include path.version_id if it is not None.

I already have a working prototype ready for pull request if you approve of the idea.

What do you think?

@liormizr
Copy link
Owner

Hi @nlangellier

Thanks for the feature request

Basically I don't have an issue adding S3 features as long that we are not changing the basic pathlib api and keeping our api as similar as posible

can you show me here what will be the change in the usage of s3path?

I'll give you example:
if we can add if versions are enable in the s3path.StatResult object that we get from S3Path.state
And add a feature to call the specific version without changing the api that can work
We need to check the syntax but maybe something like this:

path = S3Path('/<bucket>/<key>?VersionID=<version_id>')

Does that make sense for you?

@nlangellier
Copy link
Contributor Author

What I was suggesting would look like one of the following

path = S3Path('/<bucket>/<key>', version_id='<version_id>')
path = S3Path.from_uri('s3://<bucket>/<key>', version_id='<version_id>')
path = S3Path.from_bucket_key(bucket='<bucket>', key='<key>', version_id='<version_id>')

This would only be an addition on top of the current API so the following would still work as usual:

path = S3Path('/<bucket>/<key>')
path = S3Path.from_uri('s3://<bucket>/<key>')
path = S3Path.from_bucket_key(bucket='<bucket>', key='<key>')

But you are right that this would constitute a deviation from the pathlib API.

Are you implying with your suggestion of

path = S3Path('/<bucket>/<key>?VersionID=<version_id>')

that we would parse the input S3 path string in a similar way that query params are parsed in a URL? If so I could certainly make that work. I would call out that '?' is a valid S3 object name character, but it is specifically called out by AWS as a character that might need special treatment, so perhaps it's OK. Thoughts?

@liormizr
Copy link
Owner

Thats exactly what I mean yes
We need to do small research, and check if we can add "?" like URL parsing
And what happens when adding "?" to a key - how s3path/boto3 react to it.
If the answer from this research is clear and explicit behaviour, we can go forward.
If not, need to think on a different approach.

@nlangellier
Copy link
Contributor Author

In the class constructors it should be a simple matter to extract out the ?VersionId=<version id> part of the provided string before calling the parent class constructors from pathlib. So I think there should be no problem in making this compatible with s3path and boto3. The only thing we need to decide on is if we are OK with the edge case where someone has an S3 object whose bucket or key contains the string "?VersionId=". This would be an edge case that would fail with this implementation. However, as mentioned in the link in my previous comment, AWS lists the ? and = characters as special characters that may require special handling if they exist in key names. So maybe we are OK with this unlikely event?

If we are not OK with this edge case, I can think of some additional options for implementation:

  1. We could use my original idea of having an optional version=None argument in the constructors, with the change that this optional argument would only exist in PureS3Path.from_uri and PureS3Path.from_bucket_key since these methods don't have analogs in the pathlib API and thus already are a deviation from the pathlib API. i.e. versioned S3 objects would not be accessible through a direct call to S3Path like S3Path("/<bucket>/<key>").
  2. We could add some subset of he following constructors to PureS3Path:
  • from_version e.g. S3Path.from_version("/<bucket>/<key>", version_id="<version id>")
  • from_uri_version e.g. S3Path.from_uri_version("s3://<bucket>/<key>", version_id="<version id>")
  • from_bucket_key_version e.g. S3Path.from_bucket_key_version(bucket="<bucket>", key="<key>", version_id="<version id>")
  1. We could create a class VersionedS3Path that inherits from S3Path and whose constructor takes in a required version_id parameter. e.g. VersionedS3Path("/<bucket>/<key>", version_id="<version id>")

What do you think about the ?VersionId= edge case and the proposed alternate implementations?

@nlangellier
Copy link
Contributor Author

@liormizr When implementing the "?VersionId=" delimiter idea, it occurred to me that pathlib.Path has two methods of creating instances. pathlib.Path("a/b/c.d") produces the same object as pathlib.Path("a", "b", "c.d"). With the former, attaching ?VersionId=<version id> to the end of the path string seems natural. But with the latter, it seems rather awkward. My opinion is that with this awkward nature and combined with the fact that "/<bucket>/<key>?VersionId=<version id>" has an edge case when either <bucket> or <key> actually contain the string "?VersionId=", that we should not use the "?VersionId=" delimiter idea.

I would personally vote for the third option above (i.e. class VersionedS3Path(S3Path):). We already have precedent to extend the pathlib API in certain cases (e.g. PureS3Path.from_uri and PureS3Path.from_bucket_key) and this would just be one more such extension. As suggested in your earlier comment, I would make sure that VersionedS3Path.stat would behave as expected. WDYT?

@liormizr
Copy link
Owner

Hi @nlangellier
First thing I want to thank you for the interesting research that you did for this feature :-)

I agree with you, I think that for the first step to create a VersionedS3Path is the best approach.
It's clean and in the future it will be more flexible to add it to s3path if we will see the need.

So what I think that is the next steps:

  • add version to the general .stat method
  • create the new VersionedS3Path class
  • documentation & testing

What do you think?

@nlangellier
Copy link
Contributor Author

Perfect. I will put up a PR in the coming days fulfilling each of the bullet points you listed.

@nlangellier
Copy link
Contributor Author

Hi @liormizr #139 was just opened to address this feature request (co-authored between @chnpenny and myself) . Please have a look and review when you get a chance.

@liormizr
Copy link
Owner

liormizr commented Aug 7, 2023

We merged the PR
Tomorrow I'll upload a new version and update in this issue

@liormizr
Copy link
Owner

Deployed in version 0.5.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants