exp show: cache collected experiments by git revision #9069
Conversation
Force-pushed from 34cb6d3 to 162ba2f
Codecov Report: Base: 93.03% // Head: 92.99% // Decreases project coverage by -0.05%

@@ Coverage Diff @@
##             main    #9069      +/-   ##
==========================================
- Coverage   93.03%   92.99%    -0.05%
==========================================
  Files         457      459        +2
  Lines       36912    37062      +150
  Branches     5339     5358       +19
==========================================
+ Hits        34342    34465      +123
- Misses       2051     2071       +20
- Partials      519      526        +7

View full report at Codecov.
Force-pushed from 162ba2f to a77fddf
On my machine, testing with all-commits in the vscode demo repo:
- dvc main: [timings omitted]
- With PR, before anything is cached (i.e. first time you run …): [timings omitted]
- With PR, after commits have been cached: [timings omitted]
cc @iterative/vs-code: would appreciate it if you could test with this PR and make sure it doesn't break anything before we merge it into dvc main.
I will give it a run tomorrow and get back to you.
Force-pushed from a77fddf to 0688de9
My first observation was that a checkpoint experiment got stuck in the running state after it had finished: [screenshot omitted]

Things that seem fine: [list omitted]
Force-pushed from 0688de9 to 2e90fc7
Force-pushed from 2e90fc7 to c10d745
The checkpoint … Also decided to move this cache to …
Makes perfect sense, thanks @pmrowla!
try:
    self.delete(rev)
except FileNotFoundError:
    pass
Is there a need to delete this? IIRC `add_bytes` overwrites.
It overwrites on linux/mac but not windows. We end up hitting `LocalFileSystem.upload_fobj`, which uses `os.rename`:
https://github.com/iterative/dvc-objects/blob/00ec978f5c55944471fcbf35e47272e4401c5193/src/dvc_objects/fs/local.py#L207
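For illustration, a portable overwrite helper mirroring the delete-before-rename workaround above (`atomic_overwrite` is a hypothetical name, not a dvc-objects API):

```python
import os

def atomic_overwrite(src: str, dst: str) -> None:
    # os.rename overwrites an existing dst on POSIX but raises
    # FileExistsError on Windows, so remove dst first.
    try:
        os.remove(dst)
    except FileNotFoundError:
        pass
    os.rename(src, dst)
```

Note that in modern Python, `os.replace` overwrites on both platforms and avoids the extra delete.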
self,
exp: Union[SerializableExp, SerializableError],
rev: Optional[str] = None,
force: bool = False,
Do we need `force`?
The PR uses `force=True` all the time right now in `show`, but there are some additional scenarios that aren't covered yet where we won't want force (when collecting checkpoints from active task queue runs).
@dataclass(frozen=True)
class SerializableExp:
Our JSON format is arbitrary anyway. It might be easier to just save the JSON directly and return that.
(I do like structure, but our format is arbitrary)
This is one of those places where I'd like to move away from using json/yaml as our default serialization format (when it doesn't need to be human readable) in favor of something that's not text based and is faster. Keeping it structured makes it easier to do that, and dealing with nested dicts of dicts everywhere in params/metrics/exp show is a lot harder to follow than having proper dataclasses.

I am wondering if we want to start moving to attrs in core dvc though? We use attrs in some subprojects and dataclasses in other subprojects and core dvc, but I wasn't sure if we ever made a final decision on using one vs the other.
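As a rough sketch of the structured-but-swappable idea (the field names here are assumptions for illustration, not dvc's actual schema):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class SerializableExp:
    # Hypothetical minimal fields; the real class carries much more.
    rev: str
    name: str = ""
    metrics: Optional[dict] = None

    def as_bytes(self) -> bytes:
        # JSON today, but because the data is a structured dataclass,
        # the encoder can later be swapped for a faster binary format
        # without touching call sites.
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_bytes(cls, raw: bytes) -> "SerializableExp":
        return cls(**json.loads(raw))
```

A frozen dataclass also gives hashability and equality for free, which plain nested dicts do not.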
    if exp_rev != "workspace" and not param_deps:
        cache.put(exp, force=True)
    return _format_exp(exp)
except Exception as exc:  # noqa: BLE001, pylint: disable=broad-except
I don't think we should save the cache if there are any errors in it. We don't know if it was just a transient network error like `FileNotFoundError`/`RemoteDepsMissingError`, etc., or a corrupted dvc.yaml file.
Given that we are only caching experiments by git commit (and not uncommitted workspace states), for things like a corrupted dvc.yaml, if we get the error once, we know we will get the error every time (since the git-committed dvc.yaml is never going to change).

For things like dvc-tracked metrics files, I suppose there is a possibility that they could be corrupted by a bad `dvc fetch`, in which case the cached exp would contain the error in the nested metrics, but in this case the user can resolve it with something like:

git checkout <affected commit>
dvc fetch
dvc exp show --force

The main point of this PR is to cache fixed git commit states, and if we are not caching commits that contain invalid dvc.yaml files, we will try (and fail) to recollect that commit every time. This is made even worse when users run something like `exp show -A`, given that the affected dvc.yaml file probably exists in a range of commits that all need to be re-collected each time.
For reference, the initial version of this PR did not cache entries when we hit errors, but testing with the vscode demo repo made it apparent that if we aren't caching errors, there was almost no overall performance gain due to problematic git commits vastly outnumbering valid ones.
Users should not care about caching.
If we think this is important, then we should actually improve error handling for repo + metrics/params collection and differentiate between errors that we know are permanent (a git-committed file is invalid and will never parse) and errors that should be retryable (a network error occurs while trying to fetch something during collection). Then we can cache only the appropriate cases.
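That permanent-vs-transient split could be sketched along these lines (the exception classes here are illustrative stand-ins, not dvc's actual exception hierarchy):

```python
# Errors that will recur for an immutable git commit: safe to cache.
PERMANENT_ERRORS = (SyntaxError, ValueError)  # e.g. an unparseable dvc.yaml
# Errors a retry might fix: never cache these.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, FileNotFoundError)

def should_cache_error(exc: BaseException) -> bool:
    # Check transient first: an unknown error defaults to "don't cache",
    # which is the safe direction for a cache of error results.
    if isinstance(exc, TRANSIENT_ERRORS):
        return False
    return isinstance(exc, PERMANENT_ERRORS)
```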
SerializableError.
"""
CACHE_DIR = os.path.join(EXEC_TMP_DIR, "cache") |
Should we version the cache, in case we add some fields or make a backward-incompatible change?
This is something that I considered, but the current implementation just invalidates the entry if it contains something unexpected, so in the event that the serialized fields change, the newer version of dvc will just re-collect the commit and overwrite the old cache entry.
One other thing I'd like to consider for a follow-up at some point is doing the collection + caching when we generate an exp commit (instead of waiting for the user to run …). (But this is low priority and probably premature optimization given that there are other bigger performance issues to address still.)
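A minimal sketch of the invalidate-on-unexpected-content behavior described above, assuming a JSON payload (the field names are hypothetical):

```python
import json
from typing import Optional

EXPECTED_FIELDS = {"rev", "name", "metrics"}

def load_cache_entry(raw: bytes) -> Optional[dict]:
    # Return None on any mismatch so the caller falls back to
    # re-collecting the commit and overwriting the stale entry;
    # no explicit schema version number is needed.
    try:
        data = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        return None
    return data
```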
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
related to #8787
- `exp show` data is cached in `.dvc/tmp/exps/cache/...` by git SHA
- `exp show` will prefer loading cached data rather than collecting from git when possible
- `metrics` field is now always returned for consistency; previously the `metrics` field was omitted in certain cases (like queued exp commits). In those cases `metrics` will just contain an empty dictionary (this should not affect vscode as far as I can tell)
- new `exp show -f/--force` option to force exp collection (instead of loading from cache when possible)