from_parquet: use virtual filesystem to preserve partition information when using cache #745
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main     #745      +/-   ##
==========================================
+ Coverage   87.47%   87.51%   +0.04%
==========================================
  Files         114      114
  Lines       10941    10945       +4
  Branches     1504     1501       -3
==========================================
+ Hits         9571     9579       +8
+ Misses        992      991       -1
+ Partials      378      375       -3
```
Force-pushed from 7b10173 to 17ef8fc.
Force-pushed from 17ef8fc to 785dfab.
@skshetry so, which PR is better to review? :)
Force-pushed from 785dfab to a138a47.
Force-pushed from a138a47 to 3a7e7d2.
Force-pushed from 3a7e7d2 to f87f0a8.
Force-pushed from f87f0a8 to 43580ed.
I have closed the other PR, as I think using a virtual fs is a much cleaner and less fragile solution.
```diff
@@ -190,18 +225,6 @@ def arrow_type_mapper(col_type: pa.DataType, column: str = "") -> type:  # noqa:
     raise TypeError(f"{col_type!r} datatypes not supported, column: {column}")


-def _nrows_file(file: File, nrows: int) -> str:
```
I have removed this hack. If `nrows` is a significant number, the new approach is likely to be slower: we now iterate up to `self.nrows` with the pyarrow dataset. I tried running the `json-csv-reader` example, and the new method is slightly faster for a small number of rows.
```python
# disable prefetch if nrows is set
settings = {"prefetch": 0} if nrows else {}
```
Disabled `prefetch` when `nrows` is set. Prefetching a large file is counter-productive when you only want to read a few rows.
should we just let users disable it though?
The fact that we are pre-downloading/pre-fetching is not going to be obvious to users, so finding the right settings to enable or disable it will be difficult. Similarly, the default prefetch settings are suboptimal when you only want to read a few rows. In fact, I'd have to adjust the example tests to disable `prefetch` as well, since we use the `laion` dataset to read just a couple of rows.
That said, I don't have a strong opinion; please let me know what you think.
```diff
-elif nrows:
-    nrows += 1
```
This was a workaround to make `nrows` work if the data contains headers. It is no longer needed, as we now count the actual rows internally.
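A hypothetical illustration of why the `+1` existed (not DataChain's code): with a line-based limit, a CSV header consumed one of the `nrows` lines, so the limit had to be bumped by one. Counting parsed data rows directly, as below, makes the adjustment unnecessary.

```python
import csv
import io

text = "name,score\nann,1\nbob,2\ncarl,3\n"


def read_rows(text: str, nrows: int):
    """Return up to `nrows` data rows; the header row is never counted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if len(rows) >= nrows:
            break
        rows.append(row)
    return rows


rows = read_rows(text, nrows=2)
```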
```diff
@@ -55,57 +67,80 @@ def __init__(
     def process(self, file: File):
```
I have also split up this function into 2-3 smaller functions for readability.
thanks!
Looks good to me overall 👍 I went through all the changes carefully.
At the same time, I am not really familiar with this part of the codebase yet 😢
Also did:
* a minor refactor,
* removed the `nrows` _hack_, and
* disabled prefetching when `nrows` is set, so that we don't download the whole dataset.
Force-pushed from 43580ed to 097cac3.
Alternative proposal to #744 that uses a `ReferenceFileSystem` (which acts as a pointer) to preserve partition information.