from_parquet: use virtual filesystem to preserve partition information when using cache #745

Merged on Dec 26, 2024 (1 commit)

Conversation

Member

@skshetry skshetry commented Dec 24, 2024

Alternative proposal to #744 that uses a ReferenceFileSystem (which acts as a pointer) to preserve partition information.
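
To illustrate the idea, here is a minimal, standard-library-only sketch of why a pointer-style virtual filesystem helps. It is not DataChain's code and does not use fsspec's ReferenceFileSystem API; the paths, the `parse_partitions` helper, and the `references` mapping are all hypothetical. It assumes Hive-style `key=value` directory names and a cache that stores files under opaque names, so partition keys cannot be recovered from the cached path but can be recovered from the virtual one.

```python
from pathlib import PurePosixPath


def parse_partitions(path: str) -> dict[str, str]:
    """Extract Hive-style key=value segments from the directory part of a path."""
    parts = PurePosixPath(path).parent.parts
    return dict(p.split("=", 1) for p in parts if "=" in p)


# Hypothetical: the original object path and where a cache happened to store it.
original = "bucket/data/year=2024/month=12/part-0.parquet"
cached = "/tmp/cache/ab/cdef0123456789"

# Reading the cached file directly loses the partition directories.
assert parse_partitions(cached) == {}

# A reference mapping keeps the original (virtual) path as the key and
# points at the cached file, so readers still see the partitioned layout.
references = {original: cached}

for virtual_path, real_path in references.items():
    print(virtual_path, "->", real_path, parse_partitions(virtual_path))
```

The design point is that the reader is handed the virtual path, while the bytes come from wherever the pointer leads.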

@skshetry skshetry requested a review from a team December 24, 2024 16:40

cloudflare-workers-and-pages bot commented Dec 24, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 097cac3
Status: ✅  Deploy successful!
Preview URL: https://39257ee1.datachain-documentation.pages.dev
Branch Preview URL: https://from-parquet-referencefs.datachain-documentation.pages.dev


codecov bot commented Dec 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.51%. Comparing base (60256d6) to head (097cac3).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #745      +/-   ##
==========================================
+ Coverage   87.47%   87.51%   +0.04%     
==========================================
  Files         114      114              
  Lines       10941    10945       +4     
  Branches     1504     1501       -3     
==========================================
+ Hits         9571     9579       +8     
+ Misses        992      991       -1     
+ Partials      378      375       -3     
| Flag | Coverage Δ |
|---|---|
| datachain | 87.45% <100.00%> (+0.04%) ⬆️ |


@skshetry skshetry force-pushed the from-parquet-referencefs branch from 7b10173 to 17ef8fc Compare December 24, 2024 16:57
@skshetry skshetry changed the title from "from_parquet: use virtual filesystem to preserve partition information" to "from_parquet: use virtual filesystem to preserve partition information when using cache" Dec 24, 2024
@skshetry skshetry force-pushed the from-parquet-referencefs branch from 17ef8fc to 785dfab Compare December 24, 2024 19:15
@shcheklein
Member

@skshetry so, which PR is better to review? :)

@skshetry
Member Author

> so, which PR is better to review? :)

I have closed the other PR as I think using a virtual fs is a much cleaner and less fragile solution.

@@ -190,18 +225,6 @@ def arrow_type_mapper(col_type: pa.DataType, column: str = "") -> type: # noqa:
raise TypeError(f"{col_type!r} datatypes not supported, column: {column}")


def _nrows_file(file: File, nrows: int) -> str:
Member Author

@skshetry skshetry Dec 25, 2024

I have removed this hack. If nrows is a significant number, this is likely to be slower.

We now iterate up to `self.nrows` with the pyarrow dataset.
I tried running the json-csv-reader example, and the new method is slightly faster for a small number of rows.
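
The approach of iterating only up to `nrows` can be sketched with plain Python iterables. This is an illustration under stated assumptions, not the PR's implementation: `iter_rows`, `make_batches`, and the list-of-dicts batches are hypothetical stand-ins for record batches produced by a pyarrow dataset.

```python
from collections.abc import Iterable, Iterator
from itertools import chain, islice
from typing import Optional


def iter_rows(
    batches: Iterable[list[dict]], nrows: Optional[int] = None
) -> Iterator[dict]:
    """Yield rows from successive batches, stopping after nrows rows if set."""
    rows = chain.from_iterable(batches)
    # islice stops pulling from the source once the limit is reached,
    # so later batches are never requested. A falsy nrows means "no limit".
    return islice(rows, nrows) if nrows else rows


def make_batches(log: list[int]) -> Iterator[list[dict]]:
    """Hypothetical batch source that records which batches were read."""
    for i in range(3):
        log.append(i)
        yield [{"id": i * 2}, {"id": i * 2 + 1}]


produced: list[int] = []
rows = list(iter_rows(make_batches(produced), nrows=3))
print(rows)      # the first three rows
print(produced)  # only the batches that had to be read
```

Because the limit is applied while iterating, no truncated copy of the file has to be written first.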

Comment on lines +560 to +561
# disable prefetch if nrows is set
settings = {"prefetch": 0} if nrows else {}
Member Author

@skshetry skshetry Dec 25, 2024

Disabled prefetch when nrows is set. Prefetching a large file is counterproductive when you only want to read a few rows.

Member

should we just let users disable it though?

Member Author

@skshetry skshetry Dec 26, 2024

The fact that we are pre-downloading/prefetching is not going to be obvious to users, so finding the right setting to disable or enable it is going to be difficult.

Similarly, the default prefetch settings are going to be suboptimal when you only want to read a few rows. In fact, I'd have to adjust the example tests to disable prefetch as well, since we use the LAION dataset to read just a couple of rows.

That said, I do not have a strong opinion; please let me know what you think.
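
For illustration only, one way to combine both positions in this thread is to keep the default from the diff while letting an explicit user value win. This is a hypothetical sketch, not what the PR does: `resolve_settings` and its `user_settings` parameter are invented here, and only the `{"prefetch": 0} if nrows else {}` expression comes from the diff.

```python
from typing import Optional


def resolve_settings(
    nrows: Optional[int], user_settings: Optional[dict] = None
) -> dict:
    """Hypothetical helper: default to no prefetch when nrows is set,
    while letting an explicit user-provided value take precedence."""
    # Same default as in the diff: disable prefetch if nrows is set.
    settings = {"prefetch": 0} if nrows else {}
    # Explicit user settings override the default.
    settings.update(user_settings or {})
    return settings


print(resolve_settings(nrows=10))    # {'prefetch': 0}
print(resolve_settings(nrows=None))  # {}
print(resolve_settings(nrows=10, user_settings={"prefetch": 4}))  # {'prefetch': 4}
```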

Comment on lines -1979 to -1980
elif nrows:
    nrows += 1
Member Author

@skshetry skshetry Dec 25, 2024

This was a workaround to make nrows work if the data contains headers. It is no longer needed, as we count the actual rows internally.
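
Why the `+ 1` was needed, and why counting parsed rows avoids it, can be shown with the standard `csv` module. This sketch is only an analogy under stated assumptions: the sample data and both helper functions are hypothetical, and the PR itself counts rows coming out of the pyarrow dataset rather than using `csv`.

```python
import csv
import io
from itertools import islice

DATA = "id,name\n1,a\n2,b\n3,c\n"


def first_lines(text: str, nrows: int) -> list[str]:
    """Line-based truncation: the header consumes one of the nrows lines."""
    return text.splitlines()[:nrows]


def first_rows(text: str, nrows: int) -> list[dict]:
    """Row-based limit: the reader handles the header, we count parsed rows."""
    return list(islice(csv.DictReader(io.StringIO(text)), nrows))


print(first_lines(DATA, 2))  # header plus only one data row
print(first_rows(DATA, 2))   # two data rows
```

With line-based truncation the caller has to ask for `nrows + 1` lines whenever a header is present; with a row-based limit the header never enters the count.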

@@ -55,57 +67,80 @@ def __init__(
def process(self, file: File):
Member Author

I have also split up this function into 2-3 smaller functions for readability.

Member

thanks!

Contributor

@dreadatour dreadatour left a comment

Looks good to me overall 👍 I went through all the changes carefully.
At the same time, I am not yet really familiar with this part of the codebase 😢

Also did:

* a minor refactor,
* removed the `nrows` _hack_, and
* disabled prefetching when `nrows` is set, so that we don't download
    the whole dataset.
@skshetry skshetry force-pushed the from-parquet-referencefs branch from 43580ed to 097cac3 Compare December 26, 2024 09:01
@skshetry skshetry enabled auto-merge (squash) December 26, 2024 09:09
@skshetry skshetry merged commit 195199e into main Dec 26, 2024
34 checks passed
@skshetry skshetry deleted the from-parquet-referencefs branch December 26, 2024 09:19
3 participants