Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File refactor #102

Merged
merged 4 commits into from
Jul 20, 2024
Merged

File refactor #102

merged 4 commits into from
Jul 20, 2024

Conversation

dberenbaum
Copy link
Contributor

This is a bit close to release 😅 , but writing up the docs, the File handling still feels slightly awkward. This makes a couple updates to hopefully clean up the API a bit:

  • Dropped FileBasic since it's redundant with File and not used anywhere except in File itself. I think it's enough to have the other classes inherit from File.
  • Dropped get_value(). read() is more familiar and they are somewhat redundant. In pytorch(), check if it is a File type and use File.read().

Examples have been updated to reflect the latest API.

Let me know what you think.

Copy link

cloudflare-workers-and-pages bot commented Jul 19, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 9fcaf41
Status: ✅  Deploy successful!
Preview URL: https://12e3091e.datachain-documentation.pages.dev
Branch Preview URL: https://file-refactor.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Jul 19, 2024

The author of this PR, dberenbaum, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at support@codecov.io with any questions.

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! It simplifies a lot.

PS: I also started using f.read() if isinstance(f, File) else ?? instead of get_value().

@@ -1288,7 +1288,7 @@ def export_files(

def shuffle(self) -> "Self":
"""Shuffle the rows of the chain deterministically."""
return super().shuffle()
return self.order_by("sys.rand")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

@@ -200,6 +200,11 @@ def open(self):
) as f:
yield f

def read(self):
Copy link
Member

@skshetry skshetry Jul 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be read_text() and read_bytes()?

I'd expect read() to also support size= arg, similar to RawIOBase.read().

The API name also makes it unclear what it returns: bytes or str.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to consider whether we can use a single File class that conforms with pathlib and uses more typical conventions like .open(mode="rb") and .read_text(). In that case we would need a separate method to read the data as an image, like file.read_image().

We would still need a way to tell other methods like to_pytorch() which read type to use or set a default read type. Not sure if it's feasible before release, but good to discuss it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge since I don't think your comments are a blocker or that this PR makes the situation any worse. We can keep discussing and follow up on top.

@dberenbaum dberenbaum merged commit eeb972f into main Jul 20, 2024
19 checks passed
@dberenbaum dberenbaum deleted the file_refactor branch July 20, 2024 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants