Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: VersionedS3Path #139

Merged
merged 4 commits into from
Aug 7, 2023
Merged

Feature: VersionedS3Path #139

merged 4 commits into from
Aug 7, 2023

Conversation

chnpenny
Copy link
Contributor

@chnpenny chnpenny commented Jun 30, 2023

This PR addresses issue #126 by

  1. Adding a VersionedS3Path class to the public API
  2. Adding a st_version_id property to StatResult
  3. Adding automated tests for the new functionality in tests/test_path_operations.py
  4. Adding appropriate documentation to README.rst and docs/interface.rst

Additional notes

  • VersionedS3Path inherits from S3Path but adds a version_id parameter to the various constructor methods. In order to follow the Liskov substitution principle, this parameter is made optional with a default value of None. We thought that if a user does not supply a value for this parameter, then the returned instance should be of type S3Path instead of VersionedS3Path. Hence the necessity for overriding the __new__ constructor.
  • We think that while relative paths and bucket-only paths conceptually make sense for S3Path objects, they do not make sense for VersionedS3Path objects as relative and bucket-only S3 paths have no concept of a version ID. Thus a ValueError is raised if a used attempts to instantiate such an object.
  • We overrode the __repr__ method as the default __repr__ provided by pathlib would omit the value of the version_id argument, leading to a misrepresentation of the construction of the object.
  • The added st_version_id parameter of StatResult has a default value of None and thus is fully backwards compatible with previous versions of s3path. In fact, we were careful to make all changes in this PR backwards compatible.
  • In the new "Versioned S3 Objects:" section of README.rst, we added the line \*New in S3Path version 0.5.0. Please feel free to ask us to remove this line or to change it to whatever version this will be added to.
  • Co-authored between @chnpenny and @nlangellier

* update existing code to prepare for VersionedS3Path

* add tests for VersionedS3Path

* refactor imports

* add tests

* refactor tests

* add documentation

* add exists() logic for versioned s3 objects

* update tests for versioned s3 buckets

* remove unused import

* versioneds3path P1

* from_uri from_bucket_key

* versioneds3path II

* versioneds3path III

* formatting

* debug

* debug

* debug

* doc-string

* uncomment tests

---------

Co-authored-by: Nick Langellier <nick@videa.ai>
Co-authored-by: Nick Langellier <78754494+nlangellier@users.noreply.github.com>
@chnpenny
Copy link
Contributor Author

@liormizr this PR is ready. Can we get a review please?

Copy link
Owner

@liormizr liormizr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chnpenny
Sorry for the delay with this review
First thing I'll like to thank you for your contribution here
I think that this is a great start for a versioned s3path object.

New about the review
I think that we need to start something better decoupled from s3path
For now I don't want to touch S3Path and all his objects

I suggest to create a new:

  • _VersionedS3Accessor (that is based on _S3Accessor)
  • PureVersionedS3Path (that is based on PureS3Path)
  • VersionedS3Path (that is based on PureVersionedS3Path & S3Path)

Maybe in the future we will join them to one with S3Path or maybe not but for the first version
I think that this is the right direction.

I added more comments in the code so you will understand the direction that I want us to go on

Thank you again for the work

s3path.py Outdated Show resolved Hide resolved
s3path.py Outdated
class StatResult(namedtuple('BaseStatResult', 'size, last_modified')):
class VersionedS3Path(S3Path):

def __new__(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I wouldn't recommend adding the logic if to return a S3Path or an VersionedS3Path object
I want first to see how heavily there will be the need for this from the community.
Maybe in the future we will do it from S3Path.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point is type hints
Type hints are very nice but we need to be careful not to abuse them too much

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me know if you want me to remove some type hints

s3path.py Outdated
else:
return super().__new__(cls, *args)

def __init__(self, *args: Union[str, Path], version_id: str) -> None:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will recommend for compatibility with pathlib to not implement the init method

You can do this instead

class VersionedS3Path(S3Path):
    """
    S3Path subclass for AWS S3 service Keys with Versions.

    >> from s3path import VersionedS3Path
    >> VersionedS3Path('/<bucket>/<key>', version_id='<version_id>')
    << VersionedS3Path('/<bucket>/<key>', version_id='<version_id>')
    """

    def __new__(cls, *args, version_id):
        self = cls._from_parts(args, version_id=version_id)
        if not self._flavour.is_supported:
            raise NotImplementedError("cannot instantiate %r on your system"
                                      % (cls.__name__,))
        return self

    @classmethod
    def _from_parts(cls, args, *, version_id):
        self = super()._from_parts(cls, args)
        self._version_id = version_id
        return self

    def __rtruediv__(self, key):
        try:
            return self._from_parts([key] + self._parts, version_id=self._version_id)
        except TypeError:
            return NotImplemented

This way we are steeking with the pathlib methodologies and will stay compatible better for future versions.
The _from_parts class method is like the "init" method here.
Not about the VersionedS3Path version_id attribute, it's your choice if do you want it privet/public attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion. I used most of you suggestion but I did not include the __rtruediv__ because I think it makes more sense that relative paths are not allowed for versioned paths. let me know if you disagree.

Copy link
Owner

@liormizr liormizr Aug 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nlangellier
Why do you don't want to support relative paths?
what do you think about this use case for example:

bucket = VersionedS3Path('/bucket/')
for file_name in ['f0', 'f1', ..., 'fn']:
    path = bucket / VersionedS3Path(file_name, version_id=...)
    # OR
   path = bucket.joinpath(VersionedS3Path(file_name, version_id=...))
   ...

according to pathlib this is a valid api
In S3Path we check if the path is absolute only in the level of methods that do api calls for AWS
So all methods in Pure classes doesn't really need this check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you are right. that should be allowed. i added both __truediv__ and __rtruediv__ methods to properly handle all possible combinations:

  • VersionedS3Path / VersionedS3Path
  • VersionedS3Path / S3Path
  • VersionedS3Path / str
  • S3Path / VersionedS3Path
  • str / VersionedS3Path

s3path.py Outdated Show resolved Hide resolved
s3path.py Outdated
"""

self = S3Path.from_uri(uri)
return cls._from_s3_path(s3_path=self, version_id=version_id)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_from_s3_path is a little to complicated for this situation.
I think we need to think on a simpler solution

First think this is a "Pure" path object class method.
Why not add a PureVersionedS3Path as well?

I recommend something like this:

class PureVersionedS3Path(PureS3Path):
    def __new__(...):
        ...

       @classmethod
    def from_uri(cls, uri: str, *, version_id: str) -> PureVersionedS3Path:
        """
        ...
        """
        if not uri.startswith('s3://'):
            raise ValueError('Provided uri seems to be no S3 URI!')
        return cls(uri[4:], version_id=version_id)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. good idea 👍

s3path.py Outdated Show resolved Hide resolved
@liormizr
Copy link
Owner

liormizr commented Aug 3, 2023

@chnpenny how do you want to continue?
I'm waiting for this PR for a big version
Do you want to continue or do you want me to take when you did as an example and write my own version of this feature?

@nlangellier
Copy link
Contributor

@chnpenny how do you want to continue? I'm waiting for this PR for a big version Do you want to continue or do you want me to take when you did as an example and write my own version of this feature?

@liormizr apologies. we have been quite busy until now. We will work on this now.

* address PR comments

* update versioned s3 bucket test

* remove unused type hint

* remove unused line

* add VersionedS3Path._init method
@nlangellier
Copy link
Contributor

@liormizr thanks for the review. i implemented your suggestions and this is ready for another round of review.

@nlangellier
Copy link
Contributor

Thanks for the comment about relative paths @liormizr. This is again ready for review.

@liormizr
Copy link
Owner

liormizr commented Aug 6, 2023

thanks @chnpenny & @nlangellier
I think that we are almost done
Two notes

  1. after adding PureVersionedS3Path and VersionedS3Path can we update the doc's accordingly? I mean in the docs/interface.rst file
  2. In PureVersionedS3Path why didn't we in the end go with changing the _from_parts method?

@nlangellier
Copy link
Contributor

nlangellier commented Aug 7, 2023

thanks @chnpenny & @nlangellier I think that we are almost done Two notes

  1. after adding PureVersionedS3Path and VersionedS3Path can we update the doc's accordingly? I mean in the docs/interface.rst file
  2. In PureVersionedS3Path why didn't we in the end go with changing the _from_parts method?

Thank you @liormizr

Regarding the first question: Yes forgot to update that. Will do that today.

Regarding the second point: We felt it was unnecessary to do so, but perhaps we missed an important detail as to why you suggested to override the _from_parts method? One of our goals in this PR was to avoid code duplication when possible. _VersionedS3Accessor required a fair bit of code duplication, which is OK. But as another example, you had originally suggested to define PureVersionedS3Path.from_uri as:

def from_uri(cls, uri: str, *, version_id: str) -> PureVersionedS3Path:
    if not uri.startswith('s3://'):
        raise ValueError('Provided uri seems to be no S3 URI!')
    return cls(uri[4:], version_id=version_id)

but we made the decision to recycle the code from PureS3Path.from_uri by implementing PureVersionedS3Path.from_uri as:

def from_uri(cls, uri: str, *, version_id: str) -> PureVersionedS3Path:
    self = PureS3Path.from_uri(uri)
    return cls(self, version_id=version_id)

The idea there was that if a future change is made to PureS3Path.from_uri, the same change will very likely be needed in PureVersionedS3Path.from_uri as well and thus the change will only need to be made in one place. In a similar manner of thinking we implemented PureVersionedS3Path.__rtruediv__ by trying to minimize code duplication and recycle code where possible. By defining it as

def __rtruediv__(self, key):
    if not isinstance(key, (PureS3Path, str)):
        return NotImplemented
    new_path = super().__rtruediv__(key)
    new_path.version_id = self.version_id
    return new_path

then we let pathlib.PurePath.__rtruediv__ "do the heavy lifting" so to speak and we only add the additional functionality needed on top of that. If you don't like this approach or there is some subtle bug we are not thinking of, please let us know and we'll be happy to implement your idea of additionally overriding _from_parts.

@liormizr
Copy link
Owner

liormizr commented Aug 7, 2023

For the first version we can go with your approach
Lets see maybe it's just my preference

Well done and thank you for the hard work

@liormizr liormizr merged commit f560b78 into liormizr:master Aug 7, 2023
@nlangellier
Copy link
Contributor

@liormizr we hadn't finished updating the documentation yet. We'll open a separate PR for that later today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants