File refactor #102
The author of this PR, dberenbaum, is not an activated member of this organization on Codecov.
Good idea! It simplifies a lot.
PS: I also started using `f.read() if isinstance(f, File) else ??` instead of `get_value()`.
@@ -1288,7 +1288,7 @@ def export_files(

     def shuffle(self) -> "Self":
         """Shuffle the rows of the chain deterministically."""
-        return super().shuffle()
+        return self.order_by("sys.rand")
thank you!
@@ -200,6 +200,11 @@ def open(self):
             ) as f:
                 yield f

+    def read(self):
Should this be `read_text()` and `read_bytes()`?
I'd expect `read()` to also support a `size=` arg, similar to `RawIOBase.read()`.
The API name also makes it unclear what it returns: bytes or str.
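A minimal sketch of the split suggested above, using a hypothetical in-memory `SketchFile` class rather than the real `File` implementation. The class name, constructor, and `encoding` parameter are assumptions made for illustration only:

```python
import io


class SketchFile:
    """Hypothetical stand-in for a file object; not the real File class."""

    def __init__(self, data: bytes):
        self._data = data

    def open(self) -> io.BytesIO:
        # A real implementation would open a remote or cached stream here.
        return io.BytesIO(self._data)

    def read(self, size: int = -1) -> bytes:
        # Same convention as RawIOBase.read(): size=-1 reads to the end.
        with self.open() as f:
            return f.read(size)

    def read_bytes(self) -> bytes:
        # The name states the return type: always bytes.
        return self.read()

    def read_text(self, encoding: str = "utf-8") -> str:
        # The name states the return type: always str.
        return self.read_bytes().decode(encoding)


f = SketchFile(b"hello world")
head = f.read(5)        # first five bytes only
text = f.read_text()    # whole content, decoded
```

With this shape, `read()` keeps the familiar stream semantics, while the two named helpers remove the ambiguity about the return type.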
I think it's a good idea to consider whether we can use a single `File` class that conforms with pathlib and uses more typical conventions like `.open(mode="rb")` and `.read_text()`. In that case we would need a separate method to read the data as an image, like `file.read_image()`.
We would still need a way to tell other methods like `to_pytorch()` which read type to use, or a way to set a default read type. Not sure if it's feasible before release, but good to discuss it.
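One possible shape for this idea, sketched with a hypothetical `PathLikeFile` class. The `default_reader` attribute and the `read_default()` method are assumptions invented for this example, and `read_image()` is left out because it would need an imaging library:

```python
import io
from typing import IO, Any


class PathLikeFile:
    """Hypothetical File sketch following pathlib naming; not the real class."""

    def __init__(self, data: bytes, default_reader: str = "read_bytes"):
        self._data = data
        # Name of the method a consumer calls when it has no preference.
        self.default_reader = default_reader

    def open(self, mode: str = "rb", encoding: str = "utf-8") -> IO[Any]:
        if mode == "rb":
            return io.BytesIO(self._data)
        if mode == "r":
            return io.StringIO(self._data.decode(encoding))
        raise ValueError(f"unsupported mode: {mode!r}")

    def read_bytes(self) -> bytes:
        with self.open("rb") as f:
            return f.read()

    def read_text(self, encoding: str = "utf-8") -> str:
        with self.open("r", encoding=encoding) as f:
            return f.read()

    def read_default(self) -> Any:
        # What a consumer such as a data loader could call.
        return getattr(self, self.default_reader)()


text_file = PathLikeFile(b"label: cat", default_reader="read_text")
raw_file = PathLikeFile(b"\x00\x01")
```

Here a consumer only calls `read_default()`, so the choice between text and bytes is made once, when the file object is created.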
I'm going to merge since I don't think your comments are a blocker or that this PR makes the situation any worse. We can keep discussing and follow up on top.
This is a bit close to release 😅, but writing up the docs, the `File` handling still feels slightly awkward. This makes a couple of updates to hopefully clean up the API a bit:
- Drop `FileBasic` since it's redundant with `File` and not used anywhere except in `File` itself. I think it's enough to have the other classes inherit from `File`.
- Drop `get_value()`. `.read()` is more familiar and they are somewhat redundant. In `pytorch()`, check if it is a `File` type and use `File.read()`.

Examples have been updated to reflect the latest API.
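The type check described in the second bullet could look roughly like the sketch below. The `File` and `TextFile` classes here are minimal hypothetical stand-ins, `materialize()` is a made-up helper name, and passing non-file values through unchanged is an assumption of this sketch:

```python
from typing import Any


class File:
    """Hypothetical minimal stand-in for the File class discussed here."""

    def __init__(self, data: bytes):
        self._data = data

    def read(self) -> Any:
        return self._data


class TextFile(File):
    """Subclass inheriting from File directly, with no FileBasic layer."""

    def read(self) -> Any:
        return self._data.decode("utf-8")


def materialize(value: Any) -> Any:
    # File-like values are read; anything else passes through unchanged
    # (the pass-through branch is an assumption of this sketch).
    return value.read() if isinstance(value, File) else value


row = (File(b"\x89PNG"), TextFile(b"a caption"), 7)
loaded = [materialize(v) for v in row]
```

Because every file class inherits from `File`, a single `isinstance` check covers all of them.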
Let me know what you think.