exp show: cache collected experiments by git revision #9069
Conversation
Force-pushed from 34cb6d3 to 162ba2f
Codecov Report: Base: 93.03% // Head: 92.99% // Decreases project coverage by -0.05%

@@ Coverage Diff @@
##             main    #9069      +/-   ##
==========================================
- Coverage   93.03%   92.99%    -0.05%
==========================================
  Files         457      459        +2
  Lines       36912    37062      +150
  Branches     5339     5358       +19
==========================================
+ Hits        34342    34465      +123
- Misses       2051     2071       +20
- Partials      519      526        +7

View full report at Codecov.
Force-pushed from 162ba2f to a77fddf
On my machine, testing with all-commits in the vscode demo repo:
- dvc main: [timings omitted]
- With PR, before anything is cached (i.e. first time you run …): [timings omitted]
- With PR, after commits have been cached: [timings omitted]
cc @iterative/vs-code: would appreciate it if you could test with this PR and make sure it doesn't break anything before we merge it into dvc main.
I will give it a run tomorrow and get back to you.
Force-pushed from a77fddf to 0688de9
My first observation was that a checkpoint experiment got stuck in the running state after it had finished: [screenshot omitted]

Things that seem fine: [list omitted]
Force-pushed from 0688de9 to 2e90fc7
Force-pushed from 2e90fc7 to c10d745
The checkpoint … Also decided to move this cache to …
Makes perfect sense, thanks @pmrowla!
try:
    self.delete(rev)
except FileNotFoundError:
    pass
Is there a need to delete this? IIRC `add_bytes` overwrites.
It overwrites on linux/mac but not windows. We end up hitting `LocalFileSystem.upload_fobj`, which uses `os.rename`:
https://github.com/iterative/dvc-objects/blob/00ec978f5c55944471fcbf35e47272e4401c5193/src/dvc_objects/fs/local.py#L207
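For illustration, a portable overwrite helper mirroring the delete-before-rename workaround above (`atomic_overwrite` is a hypothetical name, not a dvc-objects API):

```python
import os

def atomic_overwrite(src: str, dst: str) -> None:
    # os.rename overwrites an existing dst on POSIX but raises
    # FileExistsError on Windows, so remove dst first.
    try:
        os.remove(dst)
    except FileNotFoundError:
        pass
    os.rename(src, dst)
```

Note that in modern Python, `os.replace` overwrites on both platforms and avoids the extra delete.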
self,
exp: Union[SerializableExp, SerializableError],
rev: Optional[str] = None,
force: bool = False,
Do we need `force`?
The PR uses `force=True` all the time right now in `show`, but there are some additional scenarios that aren't covered yet where we won't want force (when collecting checkpoints from active task queue runs).
@dataclass(frozen=True)
class SerializableExp:
Our JSON format is arbitrary anyway. It might be easier to just save the JSON directly and return that.
(I do like structure, but our format is arbitrary)
This is one of those places where I'd like to move away from using json/yaml as our default serialization format (when it doesn't need to be human readable) in favor of something that's not text based and is faster. Keeping it structured makes it easier to do that, and dealing with nested dicts of dicts everywhere in params/metrics/exp show is a lot harder to follow than having proper dataclasses.

I am wondering if we want to start moving to attrs in core dvc though? We use attrs in some subprojects and dataclasses in other subprojects and core dvc, but I wasn't sure if we ever made a final decision on using one vs the other.
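As a rough sketch of the structured-but-swappable idea (the field names here are assumptions for illustration, not dvc's actual schema):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class SerializableExp:
    # Hypothetical minimal fields; the real class carries much more.
    rev: str
    name: str = ""
    metrics: Optional[dict] = None

    def as_bytes(self) -> bytes:
        # JSON today, but because the data is a structured dataclass,
        # the encoder can later be swapped for a faster binary format
        # without touching call sites.
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_bytes(cls, raw: bytes) -> "SerializableExp":
        return cls(**json.loads(raw))
```

A frozen dataclass also gives hashability and equality for free, which plain nested dicts do not.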
    if exp_rev != "workspace" and not param_deps:
        cache.put(exp, force=True)
    return _format_exp(exp)
except Exception as exc:  # noqa: BLE001, pylint: disable=broad-except
I don't think we should save the cache if there are any errors in it. We don't know if it was just a transient network error like `FileNotFoundError`/`RemoteDepsMissingError`, etc., or a corrupted dvc.yaml file.
Given that we are only caching experiments by git commit (and not uncommitted workspace states), for things like a corrupted dvc.yaml, if we get the error once, we know we will get the error every time (since the git-committed dvc.yaml is never going to change).

For things like dvc-tracked metrics files, I suppose there is a possibility that they could be corrupted by a bad `dvc fetch`, in which case the cached exp would contain the error in the nested metrics, but in this case the user can resolve it with something like:

git checkout <affected commit>
dvc fetch
dvc exp show --force

The main point of this PR is to cache fixed git commit states, and if we are not caching commits that contain invalid dvc.yaml files, we will try (and fail) to recollect that commit every time. This is made even worse when users run something like `exp show -A`, given that the affected dvc.yaml file probably exists in a range of commits that all need to be re-collected each time.
For reference, the initial version of this PR did not cache entries when we hit errors, but testing with the vscode demo repo made it apparent that if we aren't caching errors, there was almost no overall performance gain due to problematic git commits vastly outnumbering valid ones.
Users should not care about caching.
If we think this is important, then we should actually improve error handling for repo + metrics/params collection and differentiate between errors that we know are permanent (a git-committed file is invalid and will never parse) and errors that should be retryable (a network error occurs while trying to fetch something during collection). Then we can cache only the appropriate cases.
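That permanent-vs-transient split could be sketched along these lines (the exception classes here are illustrative stand-ins, not dvc's actual exception hierarchy):

```python
# Errors that will recur for an immutable git commit: safe to cache.
PERMANENT_ERRORS = (SyntaxError, ValueError)  # e.g. an unparseable dvc.yaml
# Errors a retry might fix: never cache these.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, FileNotFoundError)

def should_cache_error(exc: BaseException) -> bool:
    # Check transient first: an unknown error defaults to "don't cache",
    # which is the safe direction for a cache of error results.
    if isinstance(exc, TRANSIENT_ERRORS):
        return False
    return isinstance(exc, PERMANENT_ERRORS)
```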
SerializableError.
"""
CACHE_DIR = os.path.join(EXEC_TMP_DIR, "cache") |
Should we version the cache, in case we add some fields or make a backward-incompatible change?
This is something that I considered, but the current implementation just invalidates the entry if it contains something unexpected, so in the event that the serialized fields change, the newer version of dvc will just re-collect the commit and overwrite the old cache entry.
One other thing I'd like to consider for a follow-up at some point is doing the collection + caching when we generate an exp commit (instead of waiting for the user to run …). (But this is low priority and probably premature optimization given that there are other bigger performance issues to address still.)
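A minimal sketch of the invalidate-on-unexpected-content behavior described above, assuming a JSON payload (the field names are hypothetical):

```python
import json
from typing import Optional

EXPECTED_FIELDS = {"rev", "name", "metrics"}

def load_cache_entry(raw: bytes) -> Optional[dict]:
    # Return None on any mismatch so the caller falls back to
    # re-collecting the commit and overwriting the stale entry;
    # no explicit schema version number is needed.
    try:
        data = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        return None
    return data
```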
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible.
related to #8787
- `exp show` data is cached in `.dvc/tmp/exps/cache/...` by git SHA
- `exp show` will prefer loading cached data rather than collecting from git when possible
- `metrics` field is now always returned for consistency; previously the `metrics` field was omitted in certain cases (like queued exp commits). In those cases `metrics` will just contain an empty dictionary (this should not affect vscode as far as I can tell)
- new `exp show -f/--force` option to force exp collection (instead of loading from cache when possible)