Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotations - flyte directory #291

Merged
merged 9 commits into from
Dec 24, 2020
Merged

Conversation

wild-endeavor
Copy link
Contributor

@wild-endeavor wild-endeavor commented Dec 23, 2020

This adds a FlyteDirectory type that maps to multipart Blobs. It's pretty similar to FlyteFile.

  • Found a small issue with FlyteFile, calling FlyteFile(s3://bucket) would prepend an os abspath to it. Added a test.
  • Separate out local and mock_remote directories. If not, when trying to cp -R from one to the other, you get exceptions
  • Change the new_execution_context function to take an existing working dir so that the spark pre_execute stuff doesn't nest working dirs

@wild-endeavor wild-endeavor changed the title [WIP] Annotations - flyte directory Annotations - flyte directory Dec 23, 2020
def extension(cls) -> str:
return ""

def __class_getitem__(cls, item: typing.Type) -> typing.Type[FlyteDirectory]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH when I am using these, I am not loving them to debug. They show weird errors. But, lets move with it, I want to make sure we go through and re-think this interface

return self._path


class FlyteDirectoryTransformer(TypeTransformer[FlyteDirectory]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments

self._remote_directory = remote_directory
self._remote_source = None

def __fspath__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is this invoked? For a directory, should we not support list-dir?
And should we download all at once, or allow a streaming download so that you can read as some are downloaded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it is invoked for listdir as well. I like that too.
So lets keep this simple, I agree. Something that downloads at once a very simple interface. We can later build a streamingDirectory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment about list dir. Yeah a streaming directory of some kind in the future sounds good. I think we'll have a better idea of how to implement things when the streaming story is clearer.

The contents of this local directory will be uploaded to the Flyte store.
return FlyteDirectory("/path/to/dir/")

return FlyteDirectory["svg"]("/path/to/dir/", remote_path="s3://special/output/location")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean for a FlyteDirectory to have a type like svg?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all files are svg? I deally I think I want to create something called as a flyteDataset, where there is a dataset index file, for every file in the dataset, each with some properties etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not really clear, but I wanted to support it cuz the IDL supports it.

@kumare3
Copy link
Contributor

kumare3 commented Dec 23, 2020

@wild-endeavor can we add a comment / NOTE on this type that it should not be used for very large datasets, as merely listing the dataset will cause the entire dataset to be downloaded. Honestly, listing on S3 and other backend object stores is not consistent and should not need data to be downloaded, but the current structure of FlyteDirectory does not support that

@wild-endeavor
Copy link
Contributor Author

Will add comment... but do you think we can one day implement things correctly changing user code? Should be able to right?

@wild-endeavor
Copy link
Contributor Author

We can maybe add list() and other functions that are a bit smarter in the future at least. I don't know if we'll ever be able to differentiate between the system open() call and the os.listdir() call though, that sounds too deep in python internals.

# a subfolder), unless remote_path=False was given
else:
if remote_directory is None:
remote_directory = ctx.file_access.get_random_remote_directory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is get_random_remote_directory deterministic? should you save it once you assign it so to_literal() returns the same value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is random. I'm gonna think about that as part of another PR though if that's okay...

@wild-endeavor wild-endeavor merged commit d68ffbe into annotations Dec 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants