Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvcfs: support caching remote streams; use that in plots #9183

Merged
merged 3 commits into from
Mar 17, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Mar 15, 2023

Closes #9030.
Depends on iterative/dvc-data#333.

header = props.get("header", True)
delim = "\t" if extension == ".tsv" else ","
return _load_sv(contents, delimiter=delim, header=header)
return PARSERS[extension](contents, path)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here was required, because:
a) I did not want to keep passing fs_kwargs everywhere, and
b) passing fs_kwargs to LOADERS is not possible as it already uses **kwargs as kwargs to the parser. The design of dvc.utils.serialize needs to be reconsidered.

@@ -196,7 +200,7 @@ def show(
onerror=onerror,
props=props,
):
_resolve_data_sources(data)
_resolve_data_sources(data, cache_remote_stream=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show() is only used by DVC, not by Studio. So we only cache remote streams during dvc plots show/diff commands.

@skshetry skshetry force-pushed the cache-dvcfs-remote-streams branch from 1b5a455 to 911625e Compare March 15, 2023 07:09
@codecov
Copy link

codecov bot commented Mar 15, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (e6419b9) 92.91% compared to head (4e7a6d0) 92.91%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9183   +/-   ##
=======================================
  Coverage   92.91%   92.91%           
=======================================
  Files         456      456           
  Lines       36884    36899   +15     
  Branches     5324     5329    +5     
=======================================
+ Hits        34269    34286   +17     
+ Misses       2091     2090    -1     
+ Partials      524      523    -1     
Impacted Files Coverage Δ
dvc/fs/dvc.py 96.42% <100.00%> (+0.04%) ⬆️
dvc/repo/plots/__init__.py 91.81% <100.00%> (+1.04%) ⬆️
dvc/utils/serialize/__init__.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@skshetry skshetry force-pushed the cache-dvcfs-remote-streams branch 2 times, most recently from 23acc6e to 44eb0ef Compare March 15, 2023 08:20
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments in iterative/dvc-data#333

Comment on lines +259 to +262
kw = {}
if kwargs.get("cache_remote_stream", False):
kw["cache_odb"] = repo.cache.local
return dvc_fs.open(dvc_path, mode=mode, **kw)
Copy link
Contributor

@efiop efiop Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plots should just dvc.fetch with revisions they are going to be visiting so that all required data is already in cache. That also allows us to optimize fetching in the most efficient way (the same way we do it in actuall fetch/pull)

Copy link
Member Author

@skshetry skshetry Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not with revisions as that would involve too many passes. It still has to go through each revision to collect definitions and data (from git and dvc), etc.

This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).

I agree that it's how it should work ideally, and I have been saying that for years, but on short term, this is a very simple/straightforward solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry Passes can be optimized and cached for fetch, so we will be able to only check for one hash and be done with it.

This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).

Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.

I agree that it's how it should work ideally, and I have been saying that for years, but on short term, this is a very simple/straightforward solution.

👍 Alright, then I'm alright with merging this approach for now. We'll get back to it in the near future.

Copy link
Member Author

@skshetry skshetry Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.

I don't know what the priority of that is. I am going to ignore Studio till then.

Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.

I am a bit worried that we would have to do two levels of filtering, one for fetch and the other when actually parsing plots, which can easily go out of sync. So even though I am not a fan of DVCFileSystem for plots parsing, it does that job really well in a synchronized manner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.

I guess what you mean is that index.data doesn't have git files and that's true, but thats only for dvc-tracked data. We clearly need smth like index.files that won't ignoro no-cache stuff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, and not just limited to git files. Top-level plots are by default no-cache, unless they are also recorded in stages or as outputs.

Also, at the moment, index is built from stages/outputs, so top-level plots may or may not be there. So using the present index's definition, they are not just limited to dvc's index (they might be in the workspace or in the git index).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, you can ignore those in fetch_cache, and only care about dvc's index.
My point was that you'd have to use a different index when parsing plots (i.e. mixing Git + DVC's index).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index is built from stages/outputs right now, but it is not meant to stay that way. So all the stuff that you are talking about should be in the index (we might have data and files, but that's a minor detail).

@efiop efiop self-requested a review March 15, 2023 12:58
@skshetry skshetry force-pushed the cache-dvcfs-remote-streams branch from 44eb0ef to 4e7a6d0 Compare March 15, 2023 13:39
@skshetry
Copy link
Member Author

@efiop, are you okay with merging this then? You still have a Change request.

@skshetry
Copy link
Member Author

ping @efiop

@skshetry
Copy link
Member Author

Merging for now, please revert if you disagree strongly. :)

@skshetry skshetry merged commit c52b12a into iterative:main Mar 17, 2023
@skshetry skshetry deleted the cache-dvcfs-remote-streams branch March 17, 2023 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc does not cache stream opened files
2 participants