dvcfs: support caching remote streams; use that in plots #9183
    header = props.get("header", True)
    delim = "\t" if extension == ".tsv" else ","
    return _load_sv(contents, delimiter=delim, header=header)
return PARSERS[extension](contents, path)
The change here was required because:
a) I did not want to keep passing fs_kwargs everywhere, and
b) passing fs_kwargs to LOADERS is not possible, as it already uses **kwargs as kwargs to the parser. The design of dvc.utils.serialize needs to be reconsidered.
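To illustrate point (b), here is a minimal sketch of the keyword collision. The names `parse_csv`, `load`, and `read_text` are hypothetical stand-ins, not the actual dvc.utils.serialize API; the only assumption taken from the comment is that the loader forwards every keyword argument to the parser.

```python
import csv
import io


def parse_csv(contents, path, delimiter=",", header=True):
    """Hypothetical parser: works on already-read text, knows nothing about filesystems."""
    rows = list(csv.reader(io.StringIO(contents), delimiter=delimiter))
    if not header:
        return rows
    names, *data = rows
    return [dict(zip(names, row)) for row in data]


def load(path, read_text, **kwargs):
    """Hypothetical loader: forwards every keyword argument to the parser."""
    return parse_csv(read_text(path), path, **kwargs)


def read_text(path):
    # Stand-in for opening the file through a filesystem.
    return "a,b\n1,2\n"


# A filesystem-only keyword leaks into the parser and is rejected.
error = None
try:
    load("data.csv", read_text, fs_kwargs={"cache": True})
except TypeError as exc:
    error = str(exc)

# Reading the contents first and handing them to the parser avoids the clash.
contents = read_text("data.csv")
rows = parse_csv(contents, "data.csv")
```

Opening the file separately and passing `contents` to the parser is the shape the diff above takes with `PARSERS[extension](contents, path)`.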
dvc/repo/plots/__init__.py
Outdated
@@ -196,7 +200,7 @@ def show(
 onerror=onerror,
 props=props,
 ):
-_resolve_data_sources(data)
+_resolve_data_sources(data, cache_remote_stream=True)
show() is only used by DVC, not by Studio. So we only cache remote streams during dvc plots show/diff commands.
1b5a455 to 911625e
Codecov Report
@@ Coverage Diff @@
## main #9183 +/- ##
=======================================
Coverage 92.91% 92.91%
=======================================
Files 456 456
Lines 36884 36899 +15
Branches 5324 5329 +5
=======================================
+ Hits 34269 34286 +17
+ Misses 2091 2090 -1
+ Partials 524 523 -1
23acc6e to 44eb0ef
See comments in iterative/dvc-data#333
kw = {}
if kwargs.get("cache_remote_stream", False):
    kw["cache_odb"] = repo.cache.local
return dvc_fs.open(dvc_path, mode=mode, **kw)
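The opt-in pattern in this hunk can be sketched in isolation as a read-through cache. Everything below is a hypothetical stand-in (`CountingRemote`, `open_bytes`, `load_plot`, and a plain dict as the cache); it does not reproduce how dvcfs or `cache_odb` work internally, only the idea that a remote stream is stored locally on first read when the caller asks for it.

```python
class CountingRemote:
    """Stand-in for remote storage; counts how many times data is downloaded."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.downloads = 0

    def read(self, key):
        self.downloads += 1
        return self.blobs[key]


def open_bytes(key, remote, cache=None):
    """Return the bytes for `key`, reading through `cache` when one is given."""
    if cache is not None and key in cache:
        return cache[key]  # cache hit: no remote access
    data = remote.read(key)
    if cache is not None:
        cache[key] = data  # keep the stream for later reads
    return data


def load_plot(key, remote, local_cache, cache_remote_stream=False):
    # Same shape as the hunk above: the cache is passed only when opted in.
    kw = {}
    if cache_remote_stream:
        kw["cache"] = local_cache
    return open_bytes(key, remote, **kw)


remote = CountingRemote({"plots/loss.csv": b"step,loss\n0,1.0\n"})
local_cache = {}

# Without the flag, every read goes to the remote.
load_plot("plots/loss.csv", remote, local_cache)
load_plot("plots/loss.csv", remote, local_cache)
downloads_without_cache = remote.downloads

# With the flag, only the first read downloads; the second is served locally.
load_plot("plots/loss.csv", remote, local_cache, cache_remote_stream=True)
load_plot("plots/loss.csv", remote, local_cache, cache_remote_stream=True)
downloads_with_cache = remote.downloads - downloads_without_cache
```

Keeping the flag off by default means callers that never asked for caching see no change in behavior.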
Plots should just dvc.fetch with the revisions they are going to be visiting, so that all required data is already in the cache. That also allows us to optimize fetching in the most efficient way (the same way we do it in an actual fetch/pull).
Not "with revisions", as that would involve too many passes. It still has to go through each revision to collect definitions and data (from git and dvc), etc.
This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).
I agree that it's how it should work ideally, and I have been saying that for years, but in the short term, this is a very simple/straightforward solution.
@skshetry Passes can be optimized and cached for fetch, so we will be able to only check for one hash and be done with it.
> This will also require having a good filtering mechanism equivalent to what is being done in plots here (and be kept in sync with one another).
Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.
> I agree that it's how it should work ideally, and I have been saying that for years, but in the short term, this is a very simple/straightforward solution.
👍 Alright, then I'm alright with merging this approach for now. We'll get back to it in the near future.
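For comparison, the fetch-first alternative discussed here could look roughly like the sketch below: collect the content hashes needed across all revisions, deduplicate them, and download only what the local cache lacks. The function names, the revision mapping, and the hash values are all hypothetical; this is not how dvc.fetch is implemented.

```python
def collect_hashes(revisions):
    """Gather the distinct content hashes needed across all revisions."""
    needed = set()
    for files in revisions.values():
        needed.update(files.values())
    return needed


def fetch_missing(needed, cache, remote):
    """Download each hash that is not cached yet; return what was fetched."""
    missing = sorted(needed - set(cache))
    for content_hash in missing:
        cache[content_hash] = remote[content_hash]
    return missing


# Hypothetical data: two revisions that share one file version.
revisions = {
    "main": {"plots/loss.csv": "aaa", "plots/acc.csv": "bbb"},
    "exp-1": {"plots/loss.csv": "aaa", "plots/acc.csv": "ccc"},
}
remote = {"aaa": b"loss", "bbb": b"acc-main", "ccc": b"acc-exp"}
cache = {"bbb": b"acc-main"}

fetched = fetch_missing(collect_hashes(revisions), cache, remote)
```

The shared hash is downloaded once even though two revisions reference it, which is the efficiency argument made above; the cost is the extra pass needed to build the revision mapping up front.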
> Yes, we do need that even for that studio fetch_cache, so I expect to have that sooner than later.
I don't know what the priority of that is. I am going to ignore Studio till then.
Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.
I am a bit worried that we would have to do two levels of filtering, one for fetch and the other when actually parsing plots, which can easily go out of sync. So even though I am not a fan of DVCFileSystem for plots parsing, it does that job really well in a synchronized manner.
> Regarding filtering, one thing that plots support is a directory, which could be a mix of both Git + DVC tracked files.
I guess what you mean is that index.data doesn't have git files, and that's true, but that's only for dvc-tracked data. We clearly need something like index.files that won't ignore no-cache stuff.
Yeah, and not just limited to git files. Top-level plots are by default no-cache, unless they are also recorded in stages or as outputs.
Also, at the moment, index is built from stages/outputs, so top-level plots may or may not be there. So using the present index's definition, they are not just limited to dvc's index (they might be in the workspace or in the git index).
Anyway, you can ignore those in fetch_cache, and only care about dvc's index.
My point was that you'd have to use a different index when parsing plots (i.e. mixing Git + DVC's index).
index is built from stages/outputs right now, but it is not meant to stay that way. So all the stuff that you are talking about should be in the index (we might have data and files, but that's a minor detail).
44eb0ef to 4e7a6d0
@efiop, are you okay with merging this then? You still have a Change request.
ping @efiop
Merging for now, please revert if you disagree strongly. :)
Closes #9030.
Depends on iterative/dvc-data#333.