`dvc queue`: unexpected behaviour #8014

Comments
Also realise that I am working with something that isn't even officially released, so if this isn't the correct way to give early feedback then please LMK.
Is it happening for you consistently? I'm not able to reproduce:
This is expected although we have discussed trying to address it. Queued experiments have always (even before
Please give any feedback you have so we can try to resolve any issues before real users get their hands on it 🙏.
Yesterday I tried to recreate 4 times and hit this issue 3 times. Today I tried twice and hit it on the second try, so 4/6 times.

Screen.Recording.2022-07-14.at.10.52.00.am.mov

Notice in the video that I start queue execution with 2 workers but every time I call
Based on the above and iterative/vscode-dvc#1995 (comment) I think we are going to need the relevant queue task sha against each experiment in the
@karajan1001 Shouldn't the "Name" column be populated with the final experiment name? I know this was true at some point (see #7591 (comment)).
Running experiments from the queue and then making a commit leads to a change of behaviour. Details:

```
❯ git rev-parse --short HEAD
30a82db1
❯ git rev-parse --short HEAD~1
26c5bbed
❯ dvc exp show
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Experiment                Created    step  loss     acc     lr      weight_decay  epochs  data/MNIST  train.py
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
workspace                 -          14    0.95962  0.7735  0.003   0             15      0aed307     90f29a9
26c5bbe                   11:21 AM   14    0.95962  0.7735  0.003   0             15      0aed307     90f29a9
│ ╓ 2a83d89 [exp-50d00]   11:35 AM   14    0.9642   0.675   0.0045  0             15      0aed307     90f29a9
│ ╟ e0d0f6c               11:35 AM   13    1.0903   0.6801  0.0045  0             15      0aed307     90f29a9
│ ╟ 814009f               11:35 AM   12    1.0301   0.6605  0.0045  0             15      0aed307     90f29a9
│ ╟ fe2c348               11:35 AM   11    1.1769   0.6408  0.0045  0             15      0aed307     90f29a9
...
```

The git log shows that both
By running a new experiment in the workspace both refs are moved. Do you want a separate issue for this?
Two individual queue tasks collapse into one single result. I looked into the run log of these two tasks and found that both of them are
They just retrieved the cache and gave the same result, and resulted in the same
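The behaviour described above is run-cache deduplication: two queued tasks whose inputs end up identical hash to the same key, so both retrieve the same cached result. A minimal sketch of the idea (hypothetical code, not DVC's actual implementation):

```python
import hashlib
import json

# Hypothetical run cache: maps a hash of the task's inputs to its result.
run_cache = {}

def run_task(params):
    """Execute a task, or reuse the cached result for identical inputs."""
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    if key in run_cache:
        # Identical inputs -> same key -> same (deduplicated) result.
        return run_cache[key]
    result = {"loss": 0.95, "acc": 0.77}  # stand-in for real training
    run_cache[key] = result
    return result

# Two "different" queued tasks with identical params collapse into one result.
a = run_task({"lr": 0.003, "epochs": 15})
b = run_task({"lr": 0.003, "epochs": 15})
assert a is b  # the second task just retrieved the cache
```

Because the key is derived only from the inputs, any two tasks with identical parameters are indistinguishable to the cache, which is why they collapse into a single experiment.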
Moved #8014 (comment) to #8031 because I was getting confused. I'm marking #8031 as p0 since that seems like a critical issue, although we can still make releases and hold off on merging the queue docs until it's resolved.
Yes, I have previously recreated this by having completed experiments in the workspace and then upgrading the CLI, queuing a new experiment and running the queue. The project that I've been testing with has checkpoints as well. It should also be noted here that after upgrading the CLI there is no way back without removing

Hope this helps 👍🏻.
Sorry, I tried to run some experiments using what I see from
Of course, the '{entry.stash_rev[:7]}' here is a message bug: we forgot to use an f-string to format it. But it is not a critical one.
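For illustration, the message bug pattern looks like this (a minimal sketch; `Entry` is a hypothetical stand-in, not the actual DVC class): without the `f` prefix, Python treats the braces as literal text instead of interpolating the expression.

```python
class Entry:
    def __init__(self, stash_rev):
        self.stash_rev = stash_rev

entry = Entry("05100047a341f2fa4a02421289d48f84d8c45e86")

# Missing f-prefix: the braces are printed literally.
broken = "Invalid experiment '{entry.stash_rev[:7]}'."
# With the f-prefix, the expression is evaluated and interpolated.
fixed = f"Invalid experiment '{entry.stash_rev[:7]}'."

print(broken)  # Invalid experiment '{entry.stash_rev[:7]}'.
print(fixed)   # Invalid experiment '0510004'.
```

This is exactly the error text users saw: the placeholder expression leaked into the message verbatim.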
Excuse me, is there any more guidance on how to reproduce this? I had already tried:
But the problem is that the previous queued experiments are stored in
Sorry, @karajan1001, I've been 100% occupied with integrating
@mattseddon No rush, but could you clarify how critical/frequent this issue is? Like @karajan1001, I was unable to reproduce this one, and it's unclear whether it should be a priority.
TL;DR: I can recreate the issue by using
I can definitely recreate it. I just ran into it again when trying to clean up experiments after getting that warning:

```
❯ dvc exp gc -f --all-tags
WARNING: This will remove all experiments except those derived from the workspace and all git tags of the current repo. Run queued experiments will be removed.
ERROR: Invalid experiment '{entry.stash_rev[:7]}'.
```

This will be an issue in the extension because errors generate a popup that the user sees.

Repro steps:

Even these repro steps are a bit hit or miss: from 3 attempts, I hit the error (with a missing experiment) 2 times. I can also recreate it just by using steps 4-8 (no upgrade needed). The error is probably caused by steps 5+6. As j > 1 is a known issue, we can probably close this.
Sounds like it is related to iterative/dvc-task#73. I tried several times but didn't hit this. I guess it is not related to experiments from old versions, but rather to 1. concurrency 2. checkpoints. I can repair the error message '{entry.stash_rev[:7]}' first to see what
@mattseddon No need to close it. We will need to stabilize behavior here and with j > 1 in general, but good to know the context and that it doesn't seem like a release blocker. |
fix: iterative#8014 > ERROR: Invalid experiment '{entry.stash_rev[:7]}'. This happens when the queue task fails with an SCM error, e.g. {"exc_type": "GitMergeError", "exc_message": ["Cannot fast-forward HEAD to '05100047a341f2fa4a02421289d48f84d8c45e86'"], "exc_module": "dvc.scm"}. The failed task creates neither an infofile nor a fail_stash. 1. Don't raise DvcException if no infofile is found for failed tasks (successful tasks still raise it). 2. Add a new unit test for this.
Finally, I accidentally reproduced it. It happened when the queue task failed because of an SCM error, which didn't generate a normal fail_stash or a result info file.
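A sketch of the fix described above (hypothetical names, not the actual DVC code): when collecting queue results, a failed task that left no info file is skipped instead of being reported as an invalid experiment.

```python
class DvcException(Exception):
    pass

def collect_result(infofile_exists: bool, task_failed: bool, stash_rev: str):
    """Collect a finished queue task's result.

    Before the fix, a missing info file always raised, producing the
    "Invalid experiment" error even for tasks that failed on an SCM
    error (which create neither an info file nor a fail_stash).
    """
    if not infofile_exists:
        if task_failed:
            # Failed tasks may legitimately have no info file: skip them.
            return None
        # A successful task without an info file is still an error.
        raise DvcException(f"Invalid experiment '{stash_rev[:7]}'.")
    return {"rev": stash_rev}

# A task that failed on a GitMergeError is skipped instead of raising.
assert collect_result(False, True, "05100047a341f2fa4a02421289d48f84d8c45e86") is None
```

The key point is the asymmetry: for successful tasks a missing info file still signals corruption and raises, while for failed tasks it is an expected state.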
This fixed it for me, thank you. dvc 3.54.0.
Bug Report
Description
Whilst checking out the new `dvc queue` command I have run into some unexpected behaviour. I won't duplicate the steps to reproduce here, but after queueing and running experiments I have run into two different issues:

VS Code demo project: `dvc queue status` returning `ERROR: Invalid experiment '{entry.stash_rev[:7]}'.` (produced when running with the extension).

example-get-started: `dvc queue status` returning (produced without having the extension involved).

In both instances this resulted in the HEAD baseline entry being dropped from the `exp show` data.

example-get-started example
Reproduce

1. example-get-started
2. Add `git+https://github.com/iterative/dvc` to `src/requirements.txt`
3. `dvc pull`
4. `dvc exp run --queue`
5. `dvc queue start -j 2`
6. `dvc exp show`
7. `dvc queue status`
8. `dvc exp show`
When recreating this I can see that both experiments were successful in `dvc queue status` but the second one has not made it into the table.

Final results:

First column of `exp show`:

and the shas don't match?
Expected

Should be able to run `exp show` & `queue status` in parallel with the execution of tasks from the queue.

Environment information

Output of `dvc doctor`:

Additional Information (if any):
Please let me know if you need anything else from me. Thank you.