exp show: cache collected experiments by git revision #9069

Merged
3 commits merged into iterative:main from the exp-show-cache branch
Feb 24, 2023

Conversation

pmrowla
Contributor

@pmrowla pmrowla commented Feb 22, 2023

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

related to #8787

  • exp show data is cached in .dvc/tmp/exps/cache/... by git SHA
  • exp show will prefer loading cached data rather than collecting from git when possible
    • workspace data is never cached
    • when collecting active/running experiments, cached data is ignored and the exp is re-collected
  • the metrics field is now always returned for consistency; previously it was omitted in certain cases (like queued exp commits)
    • for queued exps metrics will just contain an empty dictionary (this should not affect vscode as far as I can tell)
  • adds exp show -f/--force option to force exp collection (instead of loading from cache when possible)
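
The lookup rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RevCache`, `show_rev`, the JSON on-disk format and the `running` flag are all hypothetical names chosen for the example, and skipping the write for running experiments is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

WORKSPACE = "workspace"  # pseudo-revision for uncommitted state; never cached


class RevCache:
    """Hypothetical on-disk cache keyed by git SHA (not DVC's actual class)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def get(self, rev: str) -> Optional[dict]:
        try:
            return json.loads((self.root / rev).read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return None  # a missing or unreadable entry is a cache miss

    def put(self, rev: str, data: dict) -> None:
        (self.root / rev).write_text(json.dumps(data))


def show_rev(
    cache: RevCache,
    rev: str,
    collect: Callable[[str], dict],
    force: bool = False,
    running: bool = False,
) -> dict:
    """Return data for `rev`, preferring the cache when that is allowed."""
    cacheable = rev != WORKSPACE
    if cacheable and not force and not running:
        cached = cache.get(rev)
        if cached is not None:
            return cached
    data = collect(rev)  # the expensive path: collect from git
    # Assumption: results for still-running experiments are not stored.
    if cacheable and not running:
        cache.put(rev, data)
    return data


calls = []


def fake_collect(rev: str) -> dict:
    """Stand-in for the real collection step; records each call."""
    calls.append(rev)
    return {"rev": rev, "metrics": {}}


cache = RevCache(Path(tempfile.mkdtemp()) / "cache")
show_rev(cache, "abc123", fake_collect)              # collected, then cached
show_rev(cache, "abc123", fake_collect)              # served from the cache
show_rev(cache, "abc123", fake_collect, force=True)  # re-collected
show_rev(cache, WORKSPACE, fake_collect)             # always collected
```

Here `fake_collect` runs twice for the committed revision (the first call and the forced call) and once for the workspace, which is never written to the cache.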

@pmrowla pmrowla added performance improvement over resource / time consuming tasks A: experiments Related to dvc exp labels Feb 22, 2023
@pmrowla pmrowla self-assigned this Feb 22, 2023
@pmrowla pmrowla force-pushed the exp-show-cache branch 2 times, most recently from 34cb6d3 to 162ba2f Compare February 23, 2023 06:25
@codecov

codecov bot commented Feb 23, 2023

Codecov Report

Base: 93.03% // Head: 92.99% // Decreases project coverage by -0.05% ⚠️

Coverage data is based on head (c10d745) compared to base (37761a2).
Patch coverage: 86.03% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9069      +/-   ##
==========================================
- Coverage   93.03%   92.99%   -0.05%     
==========================================
  Files         457      459       +2     
  Lines       36912    37062     +150     
  Branches     5339     5358      +19     
==========================================
+ Hits        34342    34465     +123     
- Misses       2051     2071      +20     
- Partials      519      526       +7     
Impacted Files                             Coverage Δ
tests/unit/command/test_experiments.py     99.59% <ø> (ø)
dvc/commands/experiments/show.py           93.56% <25.00%> (-1.06%) ⬇️
dvc/repo/experiments/cache.py              81.39% <81.39%> (ø)
dvc/repo/experiments/serialize.py          86.07% <86.07%> (ø)
dvc/repo/experiments/show.py               90.79% <93.18%> (-1.07%) ⬇️
dvc/repo/experiments/__init__.py           87.09% <100.00%> (+0.18%) ⬆️
tests/func/experiments/test_show.py        99.40% <100.00%> (+<0.01%) ⬆️
tests/unit/test_analytics.py               98.57% <0.00%> (-1.43%) ⬇️

☔ View full report at Codecov.

@pmrowla
Contributor Author

pmrowla commented Feb 23, 2023

On my machine, testing with all-commits in the vscode demo repo:

dvc main:

$ time dvc exp show -A --show-json
...
dvc exp show -A --show-json  24.83s user 11.66s system 99% cpu 36.670 total

With PR before anything is cached (i.e. the first time you run dvc exp show -A, or when running with --force):

$ time dvc exp show -A --show-json --force
...
dvc exp show -A --show-json --force  21.84s user 9.63s system 99% cpu 31.550 total

With PR after commits have been cached:

$ time dvc exp show -A --show-json
...
dvc exp show -A --show-json  5.29s user 8.34s system 107% cpu 12.725 total
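
For a rough sense of scale, the wall-clock totals reported above work out to about a 2.9x speedup once commits are cached. The arithmetic, using only the three "total" figures from the runs above (variable names are illustrative):

```python
# Wall-clock totals (seconds) copied from the three runs above.
main_total = 36.670     # dvc main
pr_cold_total = 31.550  # PR, nothing cached yet (--force)
pr_warm_total = 12.725  # PR, commits already cached


def speedup(before: float, after: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    return before / after


cold_speedup = speedup(main_total, pr_cold_total)
warm_speedup = speedup(main_total, pr_warm_total)
print(f"cold cache: {cold_speedup:.2f}x, warm cache: {warm_speedup:.2f}x")
```

These are single runs on one machine, so the ratios are indicative rather than a benchmark.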

@pmrowla pmrowla marked this pull request as ready for review February 23, 2023 08:51
@pmrowla pmrowla changed the title [WIP] exp show: cache collected experiments by git revision exp show: cache collected experiments by git revision Feb 23, 2023
@pmrowla
Contributor Author

pmrowla commented Feb 23, 2023

cc: @iterative/vs-code would appreciate it if you can test with this PR and make sure it doesn't break anything before we merge it into dvc main

@pmrowla

This comment was marked as outdated.

@mattseddon
Member

> cc: @iterative/vs-code would appreciate it if you can test with this PR and make sure it doesn't break anything before we merge it into dvc main

I will give it a run tomorrow and get back to you.

@mattseddon
Member

My first observation was that a checkpoint experiment got stuck in the running state after it had finished:

image

exp show -f fixed the issue but I then had to trigger a refresh of our experiments table. I only managed to get the table into this state once but it was not obvious how to get out.

Things that seem fine:

  • Running normal experiments
  • Running the queue
  • Removing experiments (current and previous commits)
  • Apply and branch
  • Removing a branch created from an experiment triggers the experiment's name to be updated (current and previous commits).

@pmrowla
Contributor Author

pmrowla commented Feb 24, 2023

The checkpoint Running state bug should be resolved.

Also decided to move this cache to .dvc/tmp/exps/cache for now. This way the general troubleshooting suggestion of rm -r .dvc/tmp/exps for exp related problems still applies to exp show caching. If we decide we want this to be shareable in the future we can easily move it to .dvc/cache

@pmrowla pmrowla merged commit ad5b91e into iterative:main Feb 24, 2023
@pmrowla pmrowla deleted the exp-show-cache branch February 24, 2023 05:19
@dberenbaum
Collaborator

> Also decided to move this cache to .dvc/tmp/exps/cache for now. This way the general troubleshooting suggestion of rm -r .dvc/tmp/exps for exp related problems still applies to exp show caching. If we decide we want this to be shareable in the future we can easily move it to .dvc/cache

Makes perfect sense, thanks @pmrowla!

Comment on lines +43 to +46
try:
    self.delete(rev)
except FileNotFoundError:
    pass
Member

@skshetry skshetry Feb 24, 2023

Is there a need to delete this? IIRC add_bytes overwrites.

Contributor Author

It overwrites on Linux/macOS but not on Windows. We end up hitting LocalFileSystem.upload_fobj, which uses os.rename:
https://github.com/iterative/dvc-objects/blob/00ec978f5c55944471fcbf35e47272e4401c5193/src/dvc_objects/fs/local.py#L207
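
The platform difference can be illustrated with a small standalone sketch. The helper names below are hypothetical and this is not the dvc-objects code. `overwrite_with_delete` mirrors the delete-then-rename approach from the snippet under review, and `overwrite_with_replace` shows the standard-library alternative `os.replace`, which overwrites an existing destination on both POSIX and Windows.

```python
import os
import tempfile


def overwrite_with_delete(tmp_path: str, dest: str) -> None:
    """Delete any existing entry first, then rename the new file into place."""
    try:
        os.remove(dest)
    except FileNotFoundError:
        pass  # nothing to overwrite
    # On Windows, os.rename raises FileExistsError if dest still exists,
    # which is why the explicit delete above is needed there.
    os.rename(tmp_path, dest)


def overwrite_with_replace(tmp_path: str, dest: str) -> None:
    """Alternative: os.replace overwrites dest on every platform."""
    os.replace(tmp_path, dest)


def _write_tmp(directory: str, data: bytes) -> str:
    """Write data to a fresh temporary file in directory and return its path."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fobj:
        fobj.write(data)
    return path


workdir = tempfile.mkdtemp()
dest = os.path.join(workdir, "entry")
overwrite_with_delete(_write_tmp(workdir, b"first"), dest)
overwrite_with_delete(_write_tmp(workdir, b"second"), dest)
overwrite_with_replace(_write_tmp(workdir, b"third"), dest)
with open(dest, "rb") as fobj:
    final = fobj.read()
```

Note that delete-then-rename leaves a short window in which no entry exists, whereas `os.replace` swaps the file in a single call.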

self,
exp: Union[SerializableExp, SerializableError],
rev: Optional[str] = None,
force: bool = False,
Member

Do we need force?

Contributor Author

The PR uses force=True all the time right now in show, but there are some additional scenarios that aren't covered yet where we won't want force (when collecting checkpoints from active task queue runs).



@dataclass(frozen=True)
class SerializableExp:
Member

The nature of our JSON format is arbitrary. It might be easier to just save the JSON directly and return it as-is.

Member

(I do like structure, but our format is arbitrary)

Contributor Author

@pmrowla pmrowla Feb 25, 2023

This is one of those places I'd like to move away from using json/yaml as our default serialization format (when it doesn't need to be human readable) in favor of something that's not text based and is faster. Keeping it structured makes it easier to do that, and also dealing with nested dicts of dicts everywhere in params/metrics/exp show is a lot harder to follow vs having proper dataclasses.

I am wondering if we want to start moving to attrs in core dvc though? We use attrs in some subprojects and dataclasses in some other subprojects and core dvc, but I wasn't sure if we ever made a final decision on using one vs the other.
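
As a rough illustration of the point about structure, a frozen dataclass can keep the wire format behind two methods, so replacing JSON with a binary format later would only touch those methods. `ExpRecord` and its fields are hypothetical and are not the `SerializableExp` defined in this PR.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ExpRecord:
    """Hypothetical stand-in for a serializable experiment record."""

    rev: str
    name: Optional[str] = None
    params: Dict[str, Any] = field(default_factory=dict)
    metrics: Dict[str, Any] = field(default_factory=dict)

    def as_bytes(self) -> bytes:
        # The only place that knows the on-disk format is JSON.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes) -> "ExpRecord":
        return cls(**json.loads(data))


record = ExpRecord(
    rev="abc123",
    name="exp-1",
    params={"params.yaml": {"lr": 0.01}},
    metrics={"metrics.json": {"acc": 0.9}},
)
restored = ExpRecord.from_bytes(record.as_bytes())
```

Callers work with `record.metrics` and `record.params` as attributes instead of indexing into nested dicts, and the round trip through bytes yields an equal object.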

        if exp_rev != "workspace" and not param_deps:
            cache.put(exp, force=True)
        return _format_exp(exp)
    except Exception as exc:  # noqa: BLE001, pylint: disable=broad-except
Member

I don't think we should save cache if there are any errors in it. We don't know if it was just a transient network error like FileNotFoundError/RemoteDepsMissingError, etc or a corrupted dvc.yaml file.

Contributor Author

@pmrowla pmrowla Feb 25, 2023

Given that we are only caching experiments by git commit (and not uncommitted workspace states), for things like corrupted dvc.yaml, if we get the error once, we know we will get the error every time (since the git committed dvc.yaml is never going to change).

For things like dvc-tracked metrics files, I suppose there is a possibility that they could be corrupted by a bad dvc fetch, in which case the cached exp would contain the error in the nested metrics, but in this case the user can resolve it with something like

git checkout <affected commit>
dvc fetch
dvc exp show --force

The main point of this PR is to cache fixed git commit states, and if we are not caching commits that contain invalid dvc.yaml files, we will try (and fail) to recollect that commit every time. This is made even worse when users run something like exp show -A, given that the affected dvc.yaml file probably exists in a range of commits that all need to be re-collected each time.

Contributor Author

@pmrowla pmrowla Feb 25, 2023

For reference, the initial version of this PR did not cache entries when we hit errors, but testing with the vscode demo repo made it apparent that without caching errors there was almost no overall performance gain, because problematic git commits vastly outnumber valid ones.

Member

Users should not care about caching.

Contributor Author

If we think this is important, then we should actually improve error handling on repo + metrics/params collection, and differentiate between errors that we know are permanent (a git-committed file is invalid and will never parse) and errors that should be retry-able (a network error occurs trying to fetch something during collection), and then only cache certain cases appropriately
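
A minimal sketch of what that split could look like, assuming collection raised two distinct exception types. Both exception classes and `collect_with_cache` are hypothetical; nothing like them exists in DVC as of this PR.

```python
from typing import Callable, Dict


class PermanentCollectError(Exception):
    """Hypothetical: the committed state itself is invalid (e.g. unparsable dvc.yaml)."""


class TransientCollectError(Exception):
    """Hypothetical: a retry might succeed (e.g. a network failure)."""


def collect_with_cache(
    rev: str,
    collect: Callable[[str], dict],
    cache: Dict[str, dict],
) -> dict:
    """Cache successes and permanent errors; never cache transient errors."""
    if rev in cache:
        return cache[rev]
    try:
        result = {"rev": rev, "data": collect(rev)}
    except PermanentCollectError as exc:
        # Same commit, same failure: safe to remember.
        result = {"rev": rev, "error": str(exc)}
    except TransientCollectError as exc:
        # Report the failure, but retry on the next call.
        return {"rev": rev, "error": str(exc)}
    cache[rev] = result
    return result


attempts = {"count": 0}


def flaky(rev: str) -> dict:
    """Fails with a transient error on the first call, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TransientCollectError("network unreachable")
    return {"metrics": {}}


def broken(rev: str) -> dict:
    """Always fails, as an invalid committed file would."""
    raise PermanentCollectError("dvc.yaml is invalid")


store: Dict[str, dict] = {}
first = collect_with_cache("aaa", flaky, store)    # transient error, not cached
second = collect_with_cache("aaa", flaky, store)   # retried, succeeds, cached
failed = collect_with_cache("bbb", broken, store)  # permanent error, cached
```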

SerializableError.
"""

CACHE_DIR = os.path.join(EXEC_TMP_DIR, "cache")
Member

Should we version the cache? In case we add some fields, or make backward incompatible change?

Contributor Author

This is something that I considered, but the current implementation just invalidates the entry if it contains something unexpected, so in the event that the serialized fields change, the newer version of dvc will just re-collect the commit and overwrite the old cache entry.
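
One way to picture this "invalidate on anything unexpected" behaviour: if an entry does not decode into the current fields, treat it as a cache miss so the commit is re-collected and the entry overwritten. `CachedExp` and `load_entry` are hypothetical names, and the JSON encoding is an assumption made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachedExp:
    """Hypothetical current schema for a cache entry."""

    rev: str
    metrics: dict


def load_entry(raw: bytes) -> Optional[CachedExp]:
    """Return the decoded entry, or None if it does not match the schema."""
    try:
        return CachedExp(**json.loads(raw))
    except (TypeError, ValueError):
        # TypeError: missing or unknown fields, or a non-mapping payload.
        # ValueError: the bytes are not valid JSON or not valid UTF-8.
        return None


valid = load_entry(b'{"rev": "abc123", "metrics": {}}')
missing_field = load_entry(b'{"rev": "abc123"}')
unknown_field = load_entry(b'{"rev": "abc123", "metrics": {}, "extra": 1}')
corrupt = load_entry(b"not json")
```

Only the first call returns an entry. The other three return None, which a caller would handle exactly like a missing cache file.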

@pmrowla
Contributor Author

pmrowla commented Feb 25, 2023

One other thing I'd like to consider for a follow-up at some point is doing the collection + caching when we generate an exp commit (instead of waiting for the user to run exp show for a given exp commit). Doing that + supporting push/pull like we do with run cache would reduce the run time for the initial exp show -A/--all-branches/--all-tags when the user has no exp cache yet.

(But this is low priority and probably premature optimization, given that there are other, bigger performance issues still to address.)
cc @dberenbaum

Labels
A: experiments Related to dvc exp performance improvement over resource / time consuming tasks
4 participants