
plots: improve performance of plots collection #8786

Closed
dberenbaum opened this issue Jan 10, 2023 · 34 comments
Labels
  • A: plots (Related to the plots)
  • p2-medium (Medium priority, should be done, but less important)
  • performance (improvement over resource / time consuming tasks)
  • product: VSCode (Integration with VSCode extension)

Comments

@dberenbaum
Collaborator

Important for VS Code to be able to work smoothly. Also could help Studio rely on less custom logic?

dberenbaum added the p2-medium, performance, product: VSCode, and A: plots labels on Jan 10, 2023
@shcheklein
Member

Related #8777 ... do you have any specific scenarios in mind to test?

@dberenbaum
Collaborator Author

I don't yet. Maybe you do 🙏 ?

@shcheklein
Member

@dberenbaum got it, so it's more of a placeholder rather than something coming from your recent experience? (Just clarifying.) For me the simplest test is to open the vscode-demo project, select 5-7 experiments, and run plots. That was enough to get 50+ seconds the last time I checked (the project evolves a bit), but even after my fix it's still 7-9 seconds.

I think at this point it will be about optimizing things like:

  • caching collection (don't re-collect .dvc / dvc.yaml files, etc. for the same commits). My estimate for this: around 1 week at most, to my mind, if we don't overcomplicate things. It can tremendously improve the user experience in many places.
  • ability to parallelize index collection (some work on brancher, repo objects); see the rough sketch after this list. Can we do some decent first steps in this direction within a week cc @efiop ?
  • make sure that we cache data that we download during these operations (I'm not 100% sure)
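
A rough sketch of the parallelization idea (hypothetical helper names, not DVC internals): collect each revision independently so the per-revision work can overlap on I/O.

from concurrent.futures import ThreadPoolExecutor

def collect_for_rev(repo, rev):
    # hypothetical per-revision collection (walking dvc.yaml/.dvc files, reading plots data)
    return {"rev": rev, "plots": {}}

def collect_parallel(repo, revs, max_workers=4):
    # submit one collection task per revision and gather the results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {rev: pool.submit(collect_for_rev, repo, rev) for rev in revs}
        return {rev: fut.result() for rev, fut in futures.items()}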

Then we can run the profiler again and see where the bottlenecks are.

@efiop @skshetry please correct me on these items.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

This is related to #8787. In both cases, what we need to be doing is serializing repo.index for already collected commits and then caching it somewhere. And then all of our read-only commands (whether it's dvc plots or dvc exp show) would just have to load the cached index instead of doing the complete filesystem or git tree walk()-based collection every time.

@skshetry
Member

skshetry commented Feb 14, 2023

yeah, I am waiting on that persistent index to see what improvement it makes. It's hard to say at this time how much it'll save. Persistent index will still require us to read plots, parse etc.

Another idea that I am thinking of is caching the per-commit dictionary result of all operations so that we can reuse the old computed result (if there are no errors). But if persistent index is enough, we don't have to do that.

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

I think there also may be some issues related to error handling, as the way we handle errors across different commands is kind of a mess.

For example, exp show currently supports generating much more granular error information for vscode. If we fail to parse individual params/metrics files, we generate an error at that file level and continue trying to load the rest of a stage/pipeline/repo. But in the current repo index, if we fail to load index.params/index.metrics we just skip loading the entire stage (and generate a stage-level error message), even if the stage contained other valid params/metrics/plots/etc.
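
To illustrate with hypothetical shapes (not the actual output formats): file-level errors keep the rest of a stage's data, while a stage-level error drops everything under it.

# file-level errors (exp show style): the broken file is reported, the rest survives
file_level = {
    "params": {
        "params.yaml": {"data": {"lr": 0.01}},
        "broken.yaml": {"error": {"msg": "unable to parse"}},
    },
}

# stage-level error (current repo index style): one bad file skips the whole stage
stage_level = {
    "train": {"error": {"msg": "failed to load stage 'train'"}},
}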

We also have problems where all of the error handling has been added piecemeal over time to meet specific studio or vscode requirements, so the internal APIs for handling errors in one place are not the same as in other places.

So a quick/easy caching implementation that works for plots show may not necessarily work out of the box for exp show. But ideally we would come up with a solution that actually works in the general DVC case (and unify all of the fragmented ways that we handle collection/parsing errors).

@pmrowla
Contributor

pmrowla commented Feb 14, 2023

yeah, I am waiting on that persistent index to see what improvement it makes. It's hard to say at this time how much it'll save. Persistent index will still require us to read plots, parse etc.

I'm not sure the persistent data index will be enough here, since a lot of the things we need for plots/metrics/params are not DVC-tracked data. I think what we will still need is the serialized "dvc commit" object, which is really just a JSON dump of the aggregate .dvc + dvc.lock files in the repo, saved somewhere like .dvc/cache/commits, using the git commit SHA as object IDs and not the hash of the JSON data (similar to the run cache). At a minimum I think we will need to dump all of the params/metrics values (since that includes git-tracked data); we may be able to get away with not needing to dump dvc-data hashes/meta here, since the persistent data index should take care of that for us.
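
A minimal sketch of that commit-object cache, assuming the .dvc/cache/commits layout mentioned above and hypothetical helper names (this is not an existing DVC API):

import json
import os

def commit_object_path(repo_dir, commit_sha):
    # address the cached object by git commit SHA (prefix split to keep directories small)
    return os.path.join(repo_dir, ".dvc", "cache", "commits", commit_sha[:2], commit_sha[2:])

def dump_commit_object(repo_dir, commit_sha, stages):
    # `stages` stands for the aggregate parsed .dvc + dvc.lock content plus
    # params/metrics values already collected for this commit
    path = commit_object_path(repo_dir, commit_sha)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fobj:
        json.dump(stages, fobj)

def load_commit_object(repo_dir, commit_sha):
    # read-only commands (plots, exp show) could load this instead of re-walking the git tree
    path = commit_object_path(repo_dir, commit_sha)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fobj:
        return json.load(fobj)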

@skshetry
Member

skshetry commented Feb 14, 2023

We had something like that before, but we never used it as it was not very smart, and the serialization/deserialization was slow. Loading all the cached results back to our in-memory format is also not always easy.

dvc/dvc/repo/index.py

Lines 307 to 316 in 1180a2f

def dumpd(self) -> Dict[str, Dict]:
    def dump(stage: "Stage"):
        key = stage.path_in_repo
        try:
            key += ":" + stage.name  # type: ignore[attr-defined]
        except AttributeError:
            pass
        return key, stage.dumpd()

    return dict(dump(stage) for stage in self)

I think it'd be faster/simpler (but a bit fragile) to just cache results at high level.
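
A minimal sketch of that high-level caching, assuming a per-revision result dict with an "errors" key (hypothetical shape) and ignoring invalidation, which is where the fragility comes from:

_RESULT_CACHE = {}  # rev -> previously computed per-revision result dict

def cached_result(rev, collect):
    # `collect` is a callable producing the per-revision result;
    # only error-free results are reused, per the idea above
    if rev in _RESULT_CACHE:
        return _RESULT_CACHE[rev]
    result = collect(rev)
    if not result.get("errors"):
        _RESULT_CACHE[rev] = result
    return result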

@skshetry
Member

skshetry commented Mar 15, 2023

In main, I see ~4.5s out of total ~8s runtime being spent on json.dumps() on dvc-render side.

Repro for the benchmark:

git clone https://github.com/iterative/vscode-dvc-demo.git
cd vscode-dvc-demo
dvc fetch --all-commits
time dvc plots diff workspace $(git log --oneline --pretty=format:"%h" -n 10) --json --cprofile-dump dump.prof > /dev/null

[Screenshot: profiler output, 2023-03-15]

@daavoo
Contributor

daavoo commented Mar 15, 2023

In main, I see ~4.5s out of total ~8s runtime being spent on json.dumps() on dvc-render side.

🙏 I can take that one

@skshetry
Member

@daavoo, do you know what the effort would be to get rid of json.dumps()?

@daavoo
Contributor

daavoo commented Mar 16, 2023

Given that we are making breaking changes to exp show --json in #9170, why don't we include plots in exp show and make VSCode rely on a single command exp show --json instead of also requiring plots diff --json?

We would benefit from the existing exp show cache and VSCode only calls plots diff with revisions already present in exp show

@daavoo
Contributor

daavoo commented Mar 16, 2023

@daavoo, do you know what the effort would be to get rid of json.dumps()?

taking a look today

@skshetry
Member

As you can see above, the rendering/_show_json takes 6.5 s out of 8s. Half of the rest is just imports. That leaves <1s to parse and load data from 10 commits.

I am not sure if vscode-dvc-demo is a good representative, but that's what we are using as a benchmark at the moment. Even if we cache, we still have to do the rendering.

Given that vscode only compares two commits, and since we don't have a way to differentiate retryable/non-retryable errors, I don't find caching to be worth it at this point.

@daavoo
Contributor

daavoo commented Mar 16, 2023

Even if we cache, we still have to do the rendering.

Not if we cache the rendered plot

@daavoo
Contributor

daavoo commented Mar 16, 2023

Anyhow, I already have a patch removing all the overhead from dumps; will send it.
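
For context, one common source of this kind of overhead (not necessarily what iterative/dvc-render#124 changes) is calling json.dumps() on nested pieces and then again on the whole payload, which re-escapes the embedded strings; serializing once avoids the double work:

import json

def render_json_twice(plots):
    # anti-pattern: each plot is dumped to a string, then the whole payload is
    # dumped again, so the embedded JSON gets re-escaped and re-scanned
    payload = {name: json.dumps(spec) for name, spec in plots.items()}
    return json.dumps(payload)

def render_json_once(plots):
    # keep plain dicts/lists all the way down and serialize a single time
    return json.dumps(plots)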

@daavoo
Contributor

daavoo commented Mar 16, 2023

As you can see above, the rendering/_show_json takes 6.5 s out of 8s. Half of the rest is just imports. That leaves <1s to parse and load data from 10 commits.

This is how it looks with iterative/dvc-render#124.

  • Before:

[Screenshot: before, 2023-03-16 12:17]

  • After:

[Screenshot: after, 2023-03-16 12:16]

@dberenbaum
Collaborator Author

Given that vscode only compares two commits

Not sure it makes a meaningful difference, but to clarify, VS Code can compare up to 7 commits.

@skshetry
Member

@dberenbaum, I only saw it invoke dvc plots diff with two revisions. Could you please check in the Output tab?

@dberenbaum
Collaborator Author

[Screenshot 2023-03-16 at 10:13 AM]

make sure that we cache data that we download during these operations (I'm not 100% sure)

Is this an issue? That worries me most.

@skshetry
Member

skshetry commented Mar 16, 2023

@dberenbaum, what happens when you reselect some of those experiments? Does it cache, or does it re-invoke dvc plots diff with the same arguments?

Is this an issue? That worries me most.

I have a patch for that. See #9183

@dberenbaum
Collaborator Author

@dberenbaum, what happens when you reselect some of those experiments? Does it cache, or does it re-invoke dvc plots diff with the same arguments?

It looks like VS Code tries to cache them. cc @mattseddon

By the way, I don't think 2 vs 7 commits is likely to be the difference in whether to spend more time on this issue; I just thought it was worth clarifying VS Code's behavior.

@skshetry
Member

After merging #9197 and #9183, I think we can close this issue.
I am not against caching if we want to do it in exp show, but I don't find it worth it at this time. I'll wait to see if someone complains before we reach for that solution.

@skshetry
Member

skshetry commented Mar 16, 2023

It looks like VS Code tries to cache them. cc @mattseddon

We discussed this today regarding caching. The way we squash definitions/properties prevents us from caching the rendered plots, and with the way VSCode does caching, that squashing won't work anyway. Studio similarly has different behaviour.

We'll have different output on VSCode/Studio/DVC, and it's unclear what the expected behaviour here should be.

@mattseddon
Member

Given that we are making breaking changes to exp show --json in #9170, why don't we include plots in exp show and make VSCode rely on a single command exp show --json instead of also requiring plots diff --json?

We would benefit from the existing exp show cache and VSCode only calls plots diff with revisions already present in exp show

When considering moving plots data into exp show, we need to remember that things like confusion matrices can be present; this will lead to a massive amount of redundant data being collected/sent every time exp show is called. The extension can only show 7 revisions, but there could be 100s/1000s in the data sent for exp show. With 10k+ datapoints per revision we could end up in OOM territory... but maybe we should cross that bridge when we come to it.

It looks like VS Code tries to cache them. cc @mattseddon

Relevant parts of #9025 (comment)

The extension caches revision data (images + datapoints), which means that we should only call plots diff once for each revision (except workspace). It is looking less likely that this approach (caching on the extension side) actually works well. For example, after reading this from #9025 (comment):

It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it's implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.

It seems like revision data is actually mutable depending on the combination of revisions passed to plots diff. This becomes even more error-prone when we throw FileNotFound errors into the mix and have to decide whether or not those errors are important/should be displayed. In order to cache data on the VS Code side we need revisions to be static.
...
Ideally, we would get plots diff performance to a place where we can call it with up to 7 revisions and it returns in a timely enough manner to drop all of the caching (mess) that is currently in the extension (will mention this in #8786 too).

If we can agree on the baseline behaviour and simplify the problem then maybe (hopefully) we can move caching out of the extension and back into DVC. WDYT?

efiop removed their assignment on Mar 17, 2023
@pmrowla
Contributor

pmrowla commented Mar 17, 2023

When considering moving plots data into exp show, we need to remember that things like confusion matrices can be present; this will lead to a massive amount of redundant data being collected/sent every time exp show is called. The extension can only show 7 revisions, but there could be 100s/1000s in the data sent for exp show. With 10k+ datapoints per revision we could end up in OOM territory... but maybe we should cross that bridge when we come to it.

We can account for this with the potential exp show changes. I was thinking we could have a separate internal command that would just return data for specifically requested revisions. The returned result would not account for any nested structuring/grouping by baseline revision/experiment/executor/etc, and would only return params/metrics/plots data.

So you could have something like

$ dvc exp show-data main my-exp abc123 def456
{
  "main": {"params": {}, "metrics": {}, "plots": {}},
  "my-exp": {...},
  "abc123": {...},
  "def456: {"error": {"msg": "invalid revision/not found"}}
}

where the result is just an unordered dictionary mapping the requested named revision to the data field you would normally see in exp show with no other arguments (and where plots is datapoints only, without plots template/configuration). We could also have flags like --metrics/--no-metrics, --params/--no-params, --plots/--no-plots to specify what exactly vscode wants (this could also be extended to include the other data columns for a revision, like outs/deps, but that is probably less useful for vscode).

In this scenario, for a named branch/exp DVC would return the latest/tip matching the given name. So for running experiments, it would return the latest/live value. For a finished experiment it would return the tip commit for checkpoints or the only/single commit for regular experiments.

So when rendering plots, vscode would call some dvc plots command to get the workspace template/configuration, and then something like exp show-data --plots [revs to diff], and DVC would then only return the plots datapoints for those revisions.

This way, we can exclude the plots datapoints from the regular exp show --json output to avoid the potential OOM errors, and internally we can still collect/cache plots data at the same time as params/metrics, using the existing exp show caching mechanism.

This may also be useful if/when vscode wants to refresh data for a specific regular table row without requesting the entire exp show output.


Essentially, this is how a hypothetical dvc.api.exp_show() would probably work (returning only data for specific revisions, and not the nested/grouped exp show --json format), and I think it probably makes sense for vscode to also be able to request only data in the same way.

@skshetry
Member

@mattseddon, a question that I have: how much time does the extension take to render plots?
What is the end-to-end performance like?

@mattseddon
Member

Times can vary greatly, but cached in the extension vs. un-cached does show a significant time difference. Here is an example using the demo project (not much data):

[Screen recording: 2023-03-17]

When a revision is first selected we request the data from dvc plots diff; we then have to process that revision before combining it with the existing datapoints and sending it to the webview. After we've done that initial fetch, things are much quicker.

For reference, in the above project dvc plots diff takes ~1000ms for 2 revisions. Internal collection/transformation takes ~50ms. That time (50ms) goes up with the number of revisions we want to display.

@dberenbaum
Collaborator Author

@mattseddon Is that example pulling the data from the remote, or is it already pulled? I think #9183 was already intended to make sure we avoid re-downloading.

I think caching the plots data on the DVC side seems nice to have, but I'm not clear on whether it's worth prioritizing right now, since it sounds like it will require work from both products to replicate what VS Code is already doing, right?

@skshetry
Member

I'll suggest opening a new issue for caching. @daavoo's last PR halved (3x?) the runtime of plots show/diff, and the above 8s runtime now sits at a more comfortable ~2.6s.

[Screenshot: profiler output, 2023-03-17]

@dberenbaum
Collaborator Author

@skshetry Good idea, but let's try to decide whether it's worth opening a new issue first.

@mattseddon @shcheklein Thoughts?

@shcheklein
Member

Yes, when I was initially complaining about plots performance, it was about dealing with 1-minute+ delays because of some issues (bugs, lack of clear parallelism, etc.). I think we already went above and beyond (thanks!) and solved more than I personally anticipated and expected (at least from reading these threads). I agree that removing the cache from VS Code and caching on the DVC side is nice to have (unless Matt has a strong opinion about this, e.g. if it becomes very painful for us to deal with errors, etc., then we need to evaluate the effort; agree about the separate issue for this).

@mattseddon
Member

All sounds reasonable. Caching is a bit broken on the extension side for the reasons listed here: #9025 (comment)

The TL;DR is that revisions are mutable depending on the combination passed to plots diff. I'll work on improving the implementation on the extension side for now. Hopefully, we can agree on the proper behaviour in #9025.

Happy to close this and open a new issue for moving caching at a later date.

@dberenbaum
Collaborator Author

Closing this one. Let's open a new issue if there are specific issues like caching we need to follow up on.

github-project-automation bot moved this from Backlog to Done in DVC on Mar 29, 2023