
plots: improve performance of plots collection #8786

Closed
dberenbaum opened this issue Jan 10, 2023 · 34 comments
Labels
  • A: plots (Related to the plots)
  • p2-medium (Medium priority, should be done, but less important)
  • performance (improvement over resource / time consuming tasks)
  • product: VSCode (Integration with VSCode extension)

Comments

@dberenbaum
Collaborator

Important for VS Code to be able to work smoothly. Also could help Studio rely on less custom logic?

dberenbaum added the p2-medium, performance, product: VSCode, and A: plots labels on Jan 10, 2023
@shcheklein
Member

Related #8777 ... do you have any specific scenarios in mind to test?

@dberenbaum
Collaborator Author

I don't yet. Maybe you do 🙏 ?

@shcheklein
Member

@dberenbaum got it, so it's more of a placeholder rather than something coming from your recent experience? (Just clarifying.) For me the simplest test is to open the vscode-demo project, select 5-7 experiments, and run plots. That was enough to get 50+ seconds the last time I checked (the project evolves a bit), but even after my fix it's still 7-9 seconds.

I think at this point it will be about optimizing things like:

  • caching collection (don't re-collect .dvc / dvc.yaml files, etc. for the same commits). My estimate for this: around 1 week at most, to my mind, if we don't overcomplicate things. It can tremendously improve the user experience in many places.
  • ability to parallelize index collection (some work on brancher, repo objects); see the rough sketch after this list. Can we do some decent first steps in this direction within a week cc @efiop ?
  • make sure that we cache data that we download during these operations (I'm not 100% sure)
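
A rough sketch of the parallelization idea (hypothetical helper names, not DVC internals): collect each revision independently so the per-revision work can overlap on I/O.

from concurrent.futures import ThreadPoolExecutor

def collect_for_rev(repo, rev):
    # hypothetical per-revision collection (walking dvc.yaml/.dvc files, reading plots data)
    return {"rev": rev, "plots": {}}

def collect_parallel(repo, revs, max_workers=4):
    # submit one collection task per revision and gather the results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {rev: pool.submit(collect_for_rev, repo, rev) for rev in revs}
        return {rev: fut.result() for rev, fut in futures.items()}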

Then we can run the profiler again and see where the bottlenecks are.

@efiop @skshetry please correct me on these items.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

This is related to #8787. In both cases, what we need to be doing is serializing repo.index for already collected commits and then caching it somewhere. And then all of our read-only commands (whether it's dvc plots or dvc exp show) would just have to load the cached index instead of doing the complete filesystem or git tree walk()-based collection every time.

@skshetry
Member

skshetry commented Feb 14, 2023

yeah, I am waiting on that persistent index to see what improvement it makes. It's hard to say at this time how much it'll save. Persistent index will still require us to read plots, parse etc.

Another idea that I am thinking of is caching the per-commit dictionary result of all operations so that we can reuse the old computed result (if there are no errors). But if persistent index is enough, we don't have to do that.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

I think there also may be some issues related to error handling, as the way we handle errors across different commands is kind of a mess.

For example, exp show currently supports generating much more granular error information for vscode. If we fail to parse individual params/metrics files, we generate an error at that file level and continue trying to load the rest of a stage/pipeline/repo. But in the current repo index, if we fail to load index.params/index.metrics we just skip loading the entire stage (and generate a stage-level error message), even if the stage contained other valid params/metrics/plots/etc.
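
To illustrate with hypothetical shapes (not the actual output formats): file-level errors keep the rest of a stage's data, while a stage-level error drops everything under it.

# file-level errors (exp show style): the broken file is reported, the rest survives
file_level = {
    "params": {
        "params.yaml": {"data": {"lr": 0.01}},
        "broken.yaml": {"error": {"msg": "unable to parse"}},
    },
}

# stage-level error (current repo index style): one bad file skips the whole stage
stage_level = {
    "train": {"error": {"msg": "failed to load stage 'train'"}},
}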

We also have problems where all of the error handling has been added piecemeal over time to meet specific studio or vscode requirements, so the internal APIs for handling errors in one place are not the same as in other places.

So a quick/easy caching implementation that works for plots show may not necessarily work out of the box for exp show. But ideally we would come up with a solution that actually works in the general DVC case (and unify all of the fragmented ways that we handle collection/parsing errors).

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

yeah, I am waiting on that persistent index to see what improvement it makes. It's hard to say at this time how much it'll save. Persistent index will still require us to read plots, parse etc.

I'm not sure the persistent data index will be enough here, since a lot of the things we need for plots/metrics/params are not DVC-tracked data. I think what we will still need is the serialized "dvc commit" object, which is really just a JSON dump of the aggregate .dvc + dvc.lock files in the repo, saved somewhere like .dvc/cache/commits, using the git commit SHA as object IDs and not the hash of the JSON data (similar to the run cache). At a minimum I think we will need to dump all of the params/metrics values (since that includes git-tracked data); we may be able to get away with not needing to dump dvc-data hashes/meta here, since the persistent data index should take care of that for us.
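
A minimal sketch of that commit-object cache, assuming the .dvc/cache/commits layout mentioned above and hypothetical helper names (this is not an existing DVC API):

import json
import os

def commit_object_path(repo_dir, commit_sha):
    # address the cached object by git commit SHA (prefix split to keep directories small)
    return os.path.join(repo_dir, ".dvc", "cache", "commits", commit_sha[:2], commit_sha[2:])

def dump_commit_object(repo_dir, commit_sha, stages):
    # `stages` stands for the aggregate parsed .dvc + dvc.lock content plus
    # params/metrics values already collected for this commit
    path = commit_object_path(repo_dir, commit_sha)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fobj:
        json.dump(stages, fobj)

def load_commit_object(repo_dir, commit_sha):
    # read-only commands (plots, exp show) could load this instead of re-walking the git tree
    path = commit_object_path(repo_dir, commit_sha)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fobj:
        return json.load(fobj)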

@skshetry
Member

skshetry commented Feb 14, 2023

We had something like that before, but we never used it as it was not very smart, and the serialization/deserialization was slow. Loading all the cached results back to our in-memory format is also not always easy.

dvc/dvc/repo/index.py

Lines 307 to 316 in 1180a2f

def dumpd(self) -> Dict[str, Dict]:
    def dump(stage: "Stage"):
        key = stage.path_in_repo
        try:
            key += ":" + stage.name  # type: ignore[attr-defined]
        except AttributeError:
            pass
        return key, stage.dumpd()

    return dict(dump(stage) for stage in self)

I think it'd be faster/simpler (but a bit fragile) to just cache results at high level.
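
A minimal sketch of that high-level caching, assuming a per-revision result dict with an "errors" key (hypothetical shape) and ignoring invalidation, which is where the fragility comes from:

_RESULT_CACHE = {}  # rev -> previously computed per-revision result dict

def cached_result(rev, collect):
    # `collect` is a callable producing the per-revision result;
    # only error-free results are reused, per the idea above
    if rev in _RESULT_CACHE:
        return _RESULT_CACHE[rev]
    result = collect(rev)
    if not result.get("errors"):
        _RESULT_CACHE[rev] = result
    return result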

@skshetry
Member

skshetry commented Mar 15, 2023

In main, I see ~4.5s out of total ~8s runtime being spent on json.dumps() on dvc-render side.

Repro for the benchmark:

git clone https://github.com/iterative/vscode-dvc-demo.git
cd vscode-dvc-demo
dvc fetch --all-commits
time dvc plots diff workspace $(git log --oneline --pretty=format:"%h" -n 10) --json --cprofile-dump dump.prof > /dev/null

[Screenshot: profiler output, 2023-03-15]

@daavoo
Contributor

daavoo commented Mar 15, 2023

In main, I see ~4.5s out of total ~8s runtime being spent on json.dumps() on dvc-render side.

🙏 I can take that one

@skshetry
Member

@daavoo, do you know what the effort would be to get rid of json.dumps()?

@daavoo
Contributor

daavoo commented Mar 16, 2023

Given that we are making breaking changes to exp show --json in #9170, why don't we include plots in exp show and make VSCode rely on a single command exp show --json instead of also requiring plots diff --json?

We would benefit from the existing exp show cache and VSCode only calls plots diff with revisions already present in exp show

@daavoo
Contributor

daavoo commented Mar 16, 2023

@daavoo, do you know what the effort would be to get rid of json.dumps()?

taking a look today

@skshetry
Member

As you can see above, the rendering/_show_json takes 6.5 s out of 8s. Half of the rest is just imports. That leaves <1s to parse and load data from 10 commits.

I am not sure if vscode-dvc-demo is a good representative, but that's what we are using as a benchmark at the moment. Even if we cache, we still have to do the rendering.

Given that vscode only compares two commits, and since we don't have a way to differentiate retryable/non-retryable errors, I don't find caching to be worth it at this point.

@daavoo
Contributor

daavoo commented Mar 16, 2023

Even if we cache, we still have to do the rendering.

Not if we cache the rendered plot

@daavoo
Contributor

daavoo commented Mar 16, 2023

Anyhow, I already have a patch removing all the overhead from dumps; will send it.
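
For context, one common source of this kind of overhead (not necessarily what iterative/dvc-render#124 changes) is calling json.dumps() on nested pieces and then again on the whole payload, which re-escapes the embedded strings; serializing once avoids the double work:

import json

def render_json_twice(plots):
    # anti-pattern: each plot is dumped to a string, then the whole payload is
    # dumped again, so the embedded JSON gets re-escaped and re-scanned
    payload = {name: json.dumps(spec) for name, spec in plots.items()}
    return json.dumps(payload)

def render_json_once(plots):
    # keep plain dicts/lists all the way down and serialize a single time
    return json.dumps(plots)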

@daavoo
Contributor

daavoo commented Mar 16, 2023

As you can see above, the rendering/_show_json takes 6.5 s out of 8s. Half of the rest is just imports. That leaves <1s to parse and load data from 10 commits.

This is how it looks with iterative/dvc-render#124.

  • Before:

[Screenshot: before, 2023-03-16 12:17]

  • After:

[Screenshot: after, 2023-03-16 12:16]

@dberenbaum
Collaborator Author

Given that vscode only compares two commits

Not sure it makes a meaningful difference, but to clarify, VS Code can compare up to 7 commits.

@skshetry
Member

@dberenbaum, I only saw it invoke dvc plots diff with two revisions. Could you please check in the Output tab?

@dberenbaum
Collaborator Author

[Screenshot 2023-03-16 at 10:13 AM]

make sure that we cache data that we download during these operations (I'm not 100% sure)

Is this an issue? That worries me most.

@skshetry
Member

skshetry commented Mar 16, 2023

@dberenbaum, what happens when you reselect some of those experiments? Does it cache, or does it re-invoke dvc plots diff with the same arguments?

Is this an issue? That worries me most.

I have a patch for that. See #9183

@dberenbaum
Collaborator Author

@dberenbaum, what happens when you reselect some of those experiments? Does it cache, or does it re-invoke dvc plots diff with the same arguments?

It looks like VS Code tries to cache them. cc @mattseddon

By the way, I don't think 2 vs 7 commits is likely to be the difference in whether to spend more time on this issue; I just thought it was worth clarifying VS Code's behavior.

@skshetry
Member

After merging #9197 and #9183, I think we can close this issue.
I am not against caching if we want to do it in exp show, but I don't find it worth it at this time. I'll wait to see if someone complains before we reach for that solution.

@skshetry
Member

skshetry commented Mar 16, 2023

It looks like VS Code tries to cache them. cc @mattseddon

We discussed this today regarding caching. The way we squash definitions/properties prevents us from caching the rendered plots, and with the way VSCode does caching, that squashing won't work anyway. Studio similarly has different behaviour.

We'll have different output on VSCode/Studio/DVC, and it's unclear what the expected behaviour here should be.

@mattseddon
Member

Given that we are making breaking changes to exp show --json in #9170, why don't we include plots in exp show and make VSCode rely on a single command exp show --json instead of also requiring plots diff --json?

We would benefit from the existing exp show cache and VSCode only calls plots diff with revisions already present in exp show

When considering moving plots data into exp show, we need to remember that things like confusion matrices can be present; this will lead to a massive amount of redundant data being collected/sent every time exp show is called. The extension can only show 7 revisions, but there could be 100s/1000s in the data sent for exp show. With 10k+ datapoints per revision we could end up in OOM territory... but maybe we should cross that bridge when we come to it.

It looks like VS Code tries to cache them. cc @mattseddon

Relevant parts of #9025 (comment)

The extension caches revision data (images + datapoints), which means that we should only call plots diff once for each revision (except workspace). It is looking less likely that this approach (caching on the extension side) actually works well. For example, after reading this from #9025 (comment):

It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it's implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.

It seems like revision data is actually mutable depending on the combination of revisions passed to plots diff. This becomes even more error-prone when we throw FileNotFound errors into the mix and have to decide whether or not those errors are important/should be displayed. In order to cache data on the VS Code side we need revisions to be static.
...
Ideally, we would get plots diff performance to a place where we can call it with up to 7 revisions and it returns in a timely enough manner to drop all of the caching (mess) that is currently in the extension (will mention this in #8786 too).

If we can agree on the baseline behaviour and simplify the problem then maybe (hopefully) we can move caching out of the extension and back into DVC. WDYT?

efiop removed their assignment on Mar 17, 2023
@pmrowla
Contributor

pmrowla commented Mar 17, 2023

When considering moving plots data into exp show, we need to remember that things like confusion matrices can be present; this will lead to a massive amount of redundant data being collected/sent every time exp show is called. The extension can only show 7 revisions, but there could be 100s/1000s in the data sent for exp show. With 10k+ datapoints per revision we could end up in OOM territory... but maybe we should cross that bridge when we come to it.

We can account for this with the potential exp show changes. I was thinking we could have a separate internal command that would just return data for specifically requested revisions. The returned result would not account for any nested structuring/grouping by baseline revision/experiment/executor/etc, and would only return params/metrics/plots data.

So you could have something like

$ dvc exp show-data main my-exp abc123 def456
{
  "main": {"params": {}, "metrics": {}, "plots": {}},
  "my-exp": {...},
  "abc123": {...},
  "def456: {"error": {"msg": "invalid revision/not found"}}
}

where the result is just an unordered dictionary mapping the requested named revision to the data field you would normally see in exp show with no other arguments (and where plots is datapoints only, without plots template/configuration). We could also have flags like --metrics/--no-metrics, --params/--no-params, --plots/--no-plots to specify what exactly vscode wants (this could also be extended to include the other data columns for a revision, like outs/deps, but that is probably less useful for vscode).

In this scenario, for a named branch/exp DVC would return the latest/tip matching the given name. So for running experiments, it would return the latest/live value. For a finished experiment it would return the tip commit for checkpoints or the only/single commit for regular experiments.

So when rendering plots, vscode would call some dvc plots command to get the workspace template/configuration, and then something like exp show-data --plots [revs to diff], and DVC would then only return the plots datapoints for those revisions.

This way, we can exclude the plots datapoints from the regular exp show --json output to avoid the potential OOM errors, and internally we can still collect/cache plots data at the same time as params/metrics, using the existing exp show caching mechanism.

This may also be useful if/when vscode wants to refresh data for a specific regular table row without requesting the entire exp show output.


Essentially, this is how a hypothetical dvc.api.exp_show() would probably work (returning only data for specific revisions, and not the nested/grouped exp show --json format), and I think it probably makes sense for vscode to also be able to request only data in the same way.

@skshetry
Member

@mattseddon, a question that I have: how much time does the extension take to render plots?
What is the end-to-end performance like?

@mattseddon
Member

Times can vary greatly, but cached in the extension vs. un-cached does show a significant time difference. Here is an example using the demo project (not much data):

[Screen recording: 2023-03-17]

When a revision is first selected we request the data from dvc plots diff; we then have to process that revision before combining it with the existing datapoints and sending it to the webview. After we've done that initial fetch, things are much quicker.

For reference, in the above project dvc plots diff takes ~1000ms for 2 revisions. Internal collection/transformation takes ~50ms. That time (50ms) goes up with the number of revisions we want to display.

@dberenbaum
Collaborator Author

@mattseddon Is that example pulling the data from the remote, or is it already pulled? I think #9183 was already intended to make sure we avoid re-downloading.

I think caching the plots data on the DVC side seems nice to have, but I'm not clear on whether it's worth prioritizing right now, since it sounds like it will require work from both products to replicate what VS Code is already doing, right?

@skshetry
Member

I'll suggest opening a new issue for caching. @daavoo's last PR halved (3x?) the runtime of plots show/diff, and the above 8s runtime now sits at a more comfortable ~2.6s.

[Screenshot: profiler output, 2023-03-17]

@dberenbaum
Collaborator Author

@skshetry Good idea, but let's try to decide whether it's worth opening a new issue first.

@mattseddon @shcheklein Thoughts?

@shcheklein
Member

Yes, when I was initially complaining about plots performance, it was about dealing with 1-minute+ delays because of some issues (bugs, lack of clear parallelism, etc.). I think we already went above and beyond (thanks!) and solved more than I personally anticipated and expected (at least from reading these threads). I agree that removing the cache from VS Code and caching on the DVC side is nice to have (unless Matt has a strong opinion about this, e.g. if it becomes very painful for us to deal with errors, etc., then we need to evaluate the effort; agree about the separate issue for this).

@mattseddon
Member

All sounds reasonable. Caching is a bit broken on the extension side for the reasons listed here: #9025 (comment)

The TL;DR is that revisions are mutable depending on the combination passed to plots diff. I'll work on improving the implementation on the extension side for now. Hopefully, we can agree on the proper behaviour in #9025.

Happy to close this and open a new issue for moving caching at a later date.

@dberenbaum
Collaborator Author

Closing this one. Let's open a new issue if there are specific issues like caching we need to follow up on.

github-project-automation bot moved this from Backlog to Done in DVC on Mar 29, 2023