dvcfs: support caching remote streams; use that in plots #9183

skshetry · 2023-03-15T07:02:30Z

Closes #9030.
Depends on iterative/dvc-data#333.

skshetry · 2023-03-15T07:05:27Z

dvc/repo/plots/__init__.py

+        header = props.get("header", True)
+        delim = "\t" if extension == ".tsv" else ","
+        return _load_sv(contents, delimiter=delim, header=header)
+    return PARSERS[extension](contents, path)


The change here was required, because:
a) I did not want to keep passing fs_kwargs everywhere, and
b) passing fs_kwargs to LOADERS is not possible as it already uses **kwargs as kwargs to the parser. The design of dvc.utils.serialize needs to be reconsidered.

skshetry · 2023-03-15T07:06:56Z

dvc/repo/plots/__init__.py

@@ -196,7 +200,7 @@ def show(
            onerror=onerror,
            props=props,
        ):
-            _resolve_data_sources(data)
+            _resolve_data_sources(data, cache_remote_stream=True)


show() is only used by DVC, not by Studio. So we only cache remote streams during dvc plots show/diff commands.

codecov · 2023-03-15T07:19:41Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (e6419b9) 92.91% compared to head (4e7a6d0) 92.91%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #9183   +/-   ##
=======================================
  Coverage   92.91%   92.91%           
=======================================
  Files         456      456           
  Lines       36884    36899   +15     
  Branches     5324     5329    +5     
=======================================
+ Hits        34269    34286   +17     
+ Misses       2091     2090    -1     
+ Partials      524      523    -1

Impacted Files	Coverage Δ
dvc/fs/dvc.py	`96.42% <100.00%> (+0.04%)`	⬆️
dvc/repo/plots/__init__.py	`91.81% <100.00%> (+1.04%)`	⬆️
dvc/utils/serialize/__init__.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

efiop

See comments in iterative/dvc-data#333

efiop · 2023-03-15T12:49:19Z

dvc/fs/dvc.py

+        kw = {}
+        if kwargs.get("cache_remote_stream", False):
+            kw["cache_odb"] = repo.cache.local
+        return dvc_fs.open(dvc_path, mode=mode, **kw)


Plots should just dvc.fetch with revisions they are going to be visiting so that all required data is already in cache. That also allows us to optimize fetching in the most efficient way (the same way we do it in actuall fetch/pull)

Not with revisions as that would involve too many passes. It still has to go through each revision to collect definitions and data (from git and dvc), etc.

This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).

I agree that it's how it should work ideally, and I have been saying that for years, but on short term, this is a very simple/straightforward solution.

@skshetry Passes can be optimized and cached for fetch, so we will be able to only check for one hash and be done with it.

This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).

Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.

I agree that it's how it should work ideally, and I have been saying that for years, but on short term, this is a very simple/straightforward solution.

👍 Alright, then I'm alright with merging this approach for now. We'll get back to it in the near future.

Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.

I don't know what the priority of that is. I am going to ignore Studio till then.

Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.

I am a bit worried that we would have to do two levels of filtering, one for fetch and the other when actually parsing plots, which can easily go out of sync. So even though I am not a fan of DVCFileSystem for plots parsing, it does that job really well in a synchronized manner.

Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.

I guess what you mean is that index.data doesn't have git files and that's true, but thats only for dvc-tracked data. We clearly need smth like index.files that won't ignoro no-cache stuff.

yeah, and not just limited to git files. Top-level plots are by default no-cache, unless they are also recorded in stages or as outputs.

Also, at the moment, index is built from stages/outputs, so top-level plots may or may not be there. So using the present index's definition, they are not just limited to dvc's index (they might be in the workspace or in the git index).

Anyway, you can ignore those in fetch_cache, and only care about dvc's index.
My point was that you'd have to use a different index when parsing plots (i.e. mixing Git + DVC's index).

index is built from stages/outputs right now, but it is not meant to stay that way. So all the stuff that you are talking about should be in the index (we might have data and files, but that's a minor detail).

skshetry · 2023-03-15T15:19:28Z

@efiop, are you okay with merging this then? You still have a Change request.

skshetry · 2023-03-17T02:06:25Z

ping @efiop

skshetry · 2023-03-17T10:19:13Z

Merging for now, please revert if you disagree strongly. :)

skshetry requested a review from efiop March 15, 2023 07:02

skshetry mentioned this pull request Mar 15, 2023

datafs: allow caching remote streams to odb iterative/dvc-data#333

Merged

skshetry requested a review from pmrowla March 15, 2023 07:02

skshetry commented Mar 15, 2023

View reviewed changes

skshetry force-pushed the cache-dvcfs-remote-streams branch from 1b5a455 to 911625e Compare March 15, 2023 07:09

skshetry force-pushed the cache-dvcfs-remote-streams branch 2 times, most recently from 23acc6e to 44eb0ef Compare March 15, 2023 08:20

efiop suggested changes Mar 15, 2023

View reviewed changes

efiop reviewed Mar 15, 2023

View reviewed changes

efiop self-requested a review March 15, 2023 12:58

skshetry added 3 commits March 15, 2023 19:23

dvcfs: support caching remote streams; use that in plots

4d03576

plots: show progress bar, cap max_workers to 16

7df1350

bump dvc-data to 0.44.1

4e7a6d0

skshetry force-pushed the cache-dvcfs-remote-streams branch from 44eb0ef to 4e7a6d0 Compare March 15, 2023 13:39

skshetry mentioned this pull request Mar 16, 2023

plots: improve performance of plots collection #8786

Closed

skshetry merged commit c52b12a into iterative:main Mar 17, 2023

skshetry deleted the cache-dvcfs-remote-streams branch March 17, 2023 10:19

skshetry mentioned this pull request Jul 12, 2023

plots diff/show: caching is broken for files in legacy cache format #9727

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dvcfs: support caching remote streams; use that in plots #9183

dvcfs: support caching remote streams; use that in plots #9183

skshetry commented Mar 15, 2023 •

edited

Loading

skshetry Mar 15, 2023

skshetry Mar 15, 2023

codecov bot commented Mar 15, 2023 •

edited

Loading

efiop left a comment

efiop Mar 15, 2023 •

edited

Loading

skshetry Mar 15, 2023 •

edited

Loading

efiop Mar 15, 2023

skshetry Mar 15, 2023 •

edited

Loading

efiop Mar 15, 2023

skshetry Mar 15, 2023

skshetry Mar 15, 2023

efiop Mar 15, 2023

skshetry commented Mar 15, 2023

skshetry commented Mar 17, 2023

skshetry commented Mar 17, 2023

dvcfs: support caching remote streams; use that in plots #9183

dvcfs: support caching remote streams; use that in plots #9183

Conversation

skshetry commented Mar 15, 2023 • edited Loading

skshetry Mar 15, 2023

Choose a reason for hiding this comment

skshetry Mar 15, 2023

Choose a reason for hiding this comment

codecov bot commented Mar 15, 2023 • edited Loading

Codecov Report

efiop left a comment

Choose a reason for hiding this comment

efiop Mar 15, 2023 • edited Loading

Choose a reason for hiding this comment

skshetry Mar 15, 2023 • edited Loading

Choose a reason for hiding this comment

efiop Mar 15, 2023

Choose a reason for hiding this comment

skshetry Mar 15, 2023 • edited Loading

Choose a reason for hiding this comment

efiop Mar 15, 2023

Choose a reason for hiding this comment

skshetry Mar 15, 2023

Choose a reason for hiding this comment

skshetry Mar 15, 2023

Choose a reason for hiding this comment

efiop Mar 15, 2023

Choose a reason for hiding this comment

skshetry commented Mar 15, 2023

skshetry commented Mar 17, 2023

skshetry commented Mar 17, 2023

skshetry commented Mar 15, 2023 •

edited

Loading

codecov bot commented Mar 15, 2023 •

edited

Loading

efiop Mar 15, 2023 •

edited

Loading

skshetry Mar 15, 2023 •

edited

Loading

skshetry Mar 15, 2023 •

edited

Loading