-
Notifications
You must be signed in to change notification settings - Fork 276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Annotations - flyte directory #291
Conversation
def extension(cls) -> str: | ||
return "" | ||
|
||
def __class_getitem__(cls, item: typing.Type) -> typing.Type[FlyteDirectory]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH when I am using these, I am not loving them to debug. They show weird errors. But, lets move with it, I want to make sure we go through and re-think this interface
flytekit/types/flyte_directory.py
Outdated
return self._path | ||
|
||
|
||
class FlyteDirectoryTransformer(TypeTransformer[FlyteDirectory]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add comments
self._remote_directory = remote_directory | ||
self._remote_source = None | ||
|
||
def __fspath__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when is this invoked? For a directory, should we not support list-dir?
And should we download all at once, or allow a streaming download so that you can read as some are downloaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it is invoked for listdir as well. I like that too.
So lets keep this simple, I agree. Something that downloads at once a very simple interface. We can later build a streamingDirectory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment about list dir. Yeah a streaming directory of some kind in the future sounds good. I think we'll have a better idea of how to implement things when the streaming story is clearer.
The contents of this local directory will be uploaded to the Flyte store. | ||
return FlyteDirectory("/path/to/dir/") | ||
|
||
return FlyteDirectory["svg"]("/path/to/dir/", remote_path="s3://special/output/location") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it mean for a FlyteDirectory to have a type like svg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all files are svg? I deally I think I want to create something called as a flyteDataset, where there is a dataset index file, for every file in the dataset, each with some properties etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not really clear, but I wanted to support it cuz the IDL supports it.
@wild-endeavor can we add a comment / NOTE on this type that it should not be used for very large datasets, as merely listing the dataset will cause the entire dataset to be downloaded. Honestly, listing on S3 and other backend object stores is not consistent and should not need data to be downloaded, but the current structure of FlyteDirectory does not support that |
Will add comment... but do you think we can one day implement things correctly changing user code? Should be able to right? |
We can maybe add |
# a subfolder), unless remote_path=False was given | ||
else: | ||
if remote_directory is None: | ||
remote_directory = ctx.file_access.get_random_remote_directory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is get_random_remote_directory deterministic? should you save it once you assign it so to_literal() returns the same value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is random. I'm gonna think about that as part of another PR though if that's okay...
This adds a FlyteDirectory type that maps to multipart Blobs. It's pretty similar to FlyteFile.
FlyteFile(s3://bucket)
would prepend an os abspath to it. Added a test.cp -R
from one to the other, you get exceptionsnew_execution_context
function to take an existing working dir so that the sparkpre_execute
stuff doesn't nest working dirs