from_parquet: use virtual filesystem to preserve partition information when using cache #745

skshetry · 2024-12-24T16:40:05Z

Alternative proposal to #744 that uses a ReferenceFileSystem (which acts as a pointer) to preserve partition information.
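To illustrate the idea, here is a minimal, stdlib-only sketch. It is not the datachain implementation and does not use the fsspec API. `PointerFS` and `hive_partitions` are hypothetical names, and the sketch assumes (as the PR title suggests) that a cached copy lives at a path that no longer carries the `key=value` directory segments. The virtual path keeps those segments while reads are redirected to the cached file.

```python
import tempfile
from pathlib import Path, PurePosixPath


def hive_partitions(path: str) -> dict[str, str]:
    """Extract `key=value` directory segments from a path (hypothetical helper)."""
    partitions = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


class PointerFS:
    """Hypothetical stand-in for a reference-style filesystem.

    It maps a virtual path (which keeps the partition segments) to the
    real location of the bytes (for example, a file in a local cache).
    """

    def __init__(self, references: dict[str, str]):
        self.references = references

    def open(self, virtual_path: str, mode: str = "rb"):
        # Follow the pointer: open the real file behind the virtual path.
        return open(self.references[virtual_path], mode)


# Simulate a cache entry stored under an opaque name.
cache_dir = Path(tempfile.mkdtemp())
cached_file = cache_dir / "3f2a9c"
cached_file.write_bytes(b"parquet-bytes")

virtual_path = "bucket/data/year=2024/month=12/part-0.parquet"
fs = PointerFS({virtual_path: str(cached_file)})

with fs.open(virtual_path) as f:
    payload = f.read()

# The cache path carries no partition information; the virtual path does.
print(hive_partitions(str(cached_file)))
print(hive_partitions(virtual_path))
```

A reader that infers partitions from paths can then be pointed at the virtual path, while the bytes still come from the cache.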

cloudflare-workers-and-pages · 2024-12-24T16:40:57Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`097cac3`
Status:	✅ Deploy successful!
Preview URL:	https://39257ee1.datachain-documentation.pages.dev
Branch Preview URL:	https://from-parquet-referencefs.datachain-documentation.pages.dev

codecov · 2024-12-24T16:49:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.51%. Comparing base (60256d6) to head (097cac3).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #745      +/-   ##
==========================================
+ Coverage   87.47%   87.51%   +0.04%     
==========================================
  Files         114      114              
  Lines       10941    10945       +4     
  Branches     1504     1501       -3     
==========================================
+ Hits         9571     9579       +8     
+ Misses        992      991       -1     
+ Partials      378      375       -3

| Flag | Coverage Δ | |
|---|---|---|
| datachain | `87.45% <100.00%> (+0.04%)` | ⬆️ |

shcheklein · 2024-12-25T00:08:32Z

@skshetry so, which PR is better to review? :)

skshetry · 2024-12-25T12:07:04Z

> so, which PR is better to review? :)

I have closed the other PR as I think using a virtual fs is a much cleaner and less fragile solution.

skshetry · 2024-12-25T12:09:52Z

src/datachain/lib/arrow.py

@@ -190,18 +225,6 @@ def arrow_type_mapper(col_type: pa.DataType, column: str = "") -> type:  # noqa:
    raise TypeError(f"{col_type!r} datatypes not supported, column: {column}")


-def _nrows_file(file: File, nrows: int) -> str:


I have removed this hack. If nrows is a significant number, this is likely to be slower.

We now iterate up to `self.nrows` with the pyarrow dataset.
I tried running the json-csv-reader example, and the new method is slightly faster for a small number of rows.
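As a rough illustration of "iterate up to `nrows`", the sketch below stops pulling batches as soon as enough rows have been produced. It is not the code from this PR: `iter_rows` and `fake_batches` are hypothetical names, and plain lists of dicts stand in for the record batches a pyarrow dataset would yield.

```python
from itertools import islice


def iter_rows(batches, nrows=None):
    """Yield rows from an iterable of batches, stopping after `nrows` rows.

    `nrows=None` means "no limit". Batches are consumed lazily, so batches
    beyond the limit are never requested.
    """
    rows = (row for batch in batches for row in batch)
    yield from islice(rows, nrows)


consumed = []  # records which batches were actually produced


def fake_batches():
    # Stand-in for a batch reader: 100 batches of 3 rows each.
    for i in range(100):
        consumed.append(i)
        yield [{"id": i * 3 + j} for j in range(3)]


first_five = list(iter_rows(fake_batches(), nrows=5))
print(first_five)
print(consumed)  # only the batches needed to produce 5 rows
```

Because the limit is applied to rows rather than to the raw file, nothing needs to be truncated or copied beforehand.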

skshetry · 2024-12-25T12:10:18Z

src/datachain/lib/dc.py

+        # disable prefetch if nrows is set
+        settings = {"prefetch": 0} if nrows else {}


Disabled prefetch when nrows is set. Prefetching a large file is going to be counterproductive when you only want to read some rows.

> should we just let users disable it though?

The fact that we are pre-downloading/prefetching is not going to be obvious to users, so finding the right setting to enable or disable it is going to be difficult.

Similarly, the default prefetch settings are going to be suboptimal when you only want to read a few rows. In fact, I'd have to adjust the example tests to disable prefetch as well, since we use the laion dataset to read just a couple of rows.

That said, I do not have a strong opinion. Please let me know what you think.
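The conditional in the diff above can be sketched in isolation as follows. The `DEFAULT_SETTINGS` value and the `effective_settings` helper are hypothetical and only show how such an override might be merged over defaults; the actual default prefetch value in datachain is not stated in this thread.

```python
# Hypothetical defaults, for illustration only.
DEFAULT_SETTINGS = {"prefetch": 2}


def effective_settings(nrows=None, defaults=DEFAULT_SETTINGS):
    """Return settings with prefetch disabled whenever `nrows` is set."""
    # Same expression as in the diff: override only when nrows is truthy.
    overrides = {"prefetch": 0} if nrows else {}
    return {**defaults, **overrides}


print(effective_settings(nrows=10))
print(effective_settings(nrows=None))
```

Note that `if nrows` treats `None` and `0` the same way, so the override applies only to a positive row limit.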

skshetry · 2024-12-25T12:11:03Z

src/datachain/lib/dc.py

-        elif nrows:
-            nrows += 1


This was a workaround to make nrows work when the data contains a header row. It is no longer needed, as we now count the actual rows internally.
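A small sketch of why a header row calls for the `+ 1`, assuming the old approach limited the raw file to its first `nrows` lines (the removed helper's body is not shown here, so this is an assumption). `first_lines` and `first_rows` are hypothetical names, and the standard `csv` module stands in for the real reader.

```python
import csv
import io
from itertools import islice

CSV_TEXT = "id,name\n1,a\n2,b\n3,c\n"


def first_lines(text, nlines):
    """Keep only the first `nlines` raw lines (the header counts as a line)."""
    return "".join(islice(io.StringIO(text), nlines))


def first_rows(text, nrows):
    """Parse the data and count actual records instead of raw lines."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, nrows))


# Truncating to 2 raw lines yields the header plus a single record.
rows_from_truncated = list(csv.DictReader(io.StringIO(first_lines(CSV_TEXT, 2))))

# Counting parsed rows yields exactly 2 records, with no adjustment needed.
rows_counted = first_rows(CSV_TEXT, 2)

print(len(rows_from_truncated), len(rows_counted))
```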

skshetry · 2024-12-25T12:12:13Z

src/datachain/lib/arrow.py

@@ -55,57 +67,80 @@ def __init__(
    def process(self, file: File):


I have also split up this function into 2-3 smaller functions for readability.

src/datachain/lib/arrow.py

dreadatour

Looks good to me overall 👍 Went through all the changes carefully.
At the same time, I am not really familiar with this part of the codebase yet 😢

Also did:

* minor refactor,
* removes `nrows` _hack_, and
* disables prefetching when `nrows` is set, so that we don't download the whole dataset.

skshetry temporarily deployed to internal December 24, 2024 16:40 — with GitHub Actions Inactive

skshetry requested a review from a team December 24, 2024 16:40

skshetry force-pushed the from-parquet-referencefs branch from 7b10173 to 17ef8fc Compare December 24, 2024 16:57

skshetry temporarily deployed to internal December 24, 2024 16:57 — with GitHub Actions Inactive

skshetry changed the title ~~from_parquet: use virtual filesystem to preserve partition information~~ from_parquet: use virtual filesystem to preserve partition information when using cache Dec 24, 2024

skshetry force-pushed the from-parquet-referencefs branch from 17ef8fc to 785dfab Compare December 24, 2024 19:15

skshetry temporarily deployed to internal December 24, 2024 19:15 — with GitHub Actions Inactive

skshetry force-pushed the from-parquet-referencefs branch from 785dfab to a138a47 Compare December 25, 2024 07:52

skshetry temporarily deployed to internal December 25, 2024 07:52 — with GitHub Actions Inactive

skshetry force-pushed the from-parquet-referencefs branch from a138a47 to 3a7e7d2 Compare December 25, 2024 07:54

skshetry temporarily deployed to internal December 25, 2024 07:54 — with GitHub Actions Inactive

skshetry force-pushed the from-parquet-referencefs branch from 3a7e7d2 to f87f0a8 Compare December 25, 2024 08:01

skshetry temporarily deployed to internal December 25, 2024 08:01 — with GitHub Actions Inactive

skshetry force-pushed the from-parquet-referencefs branch from f87f0a8 to 43580ed Compare December 25, 2024 09:45

skshetry temporarily deployed to internal December 25, 2024 09:45 — with GitHub Actions Inactive

skshetry commented Dec 25, 2024

shcheklein reviewed Dec 25, 2024

dreadatour approved these changes Dec 25, 2024

shcheklein approved these changes Dec 25, 2024

from_parquet: use virtual filesystem to preserve partition information

097cac3

skshetry force-pushed the from-parquet-referencefs branch from 43580ed to 097cac3 Compare December 26, 2024 09:01

skshetry temporarily deployed to internal December 26, 2024 09:01 — with GitHub Actions Inactive

skshetry enabled auto-merge (squash) December 26, 2024 09:09

skshetry merged commit 195199e into main Dec 26, 2024
34 checks passed

skshetry deleted the from-parquet-referencefs branch December 26, 2024 09:19
