
dvc queue: unexpected behaviour #8014

Closed
mattseddon opened this issue Jul 13, 2022 · 19 comments · Fixed by #8099
Assignees
Labels
A: experiments (Related to dvc exp), p2-medium (Medium priority, should be done, but less important), product: VSCode (Integration with VSCode extension)

Comments

@mattseddon
Contributor

Bug Report

Description

Whilst checking out the new dvc queue command I have run into some unexpected behaviour. I won't duplicate the steps to reproduce here, but after queueing and running experiments I have run into two different issues.

VS Code demo project: dvc queue status returning ERROR: Invalid experiment '{entry.stash_rev[:7]}'. (produced when running with the extension)
example-get-started: dvc queue status returning

Task     Name    Created    Status
f3d69ee          02:17 PM   Success
08ccb05          02:17 PM   Success

ERROR: unexpected error - Extra data: line 1 column 56 (char 55)

(produced without having the extension involved).
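The "Extra data" text is the message Python's json module raises when a valid JSON document is followed by trailing bytes, for instance if two status objects end up in the same stream. A minimal sketch, assuming that is the shape of the corruption (the content of `corrupted` is invented for illustration):

```python
import json

# Hypothetical reconstruction: a well-formed JSON document followed by
# extra bytes, e.g. two status objects written into the same file.
corrupted = '{"rev": "f3d69ee", "status": "Success"}{"rev": "08ccb05"}'

try:
    json.loads(corrupted)
except json.JSONDecodeError as exc:
    print(exc.msg)                # Extra data
    print(exc.lineno, exc.colno)  # 1 40: where the second document begins
```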

In both instances this resulted in the HEAD baseline entry being dropped from the exp show data:

example-get-started example
❯ dvc exp show --show-json
{
  "workspace": {
    "baseline": {
      "data": {
        "timestamp": null,
        "params": {
          "params.yaml": {
            "data": {
              "prepare": {
                "split": 0.21,
                "seed": 20170428
              },
              "featurize": {
                "max_features": 200,
                "ngrams": 2
              },
              "train": {
                "seed": 20170428,
                "n_est": 50,
                "min_split": 0.01
              }
            }
          }
        },
        "deps": {
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null
          },
          "src/prepare.py": {
            "hash": "f09ea0c15980b43010257ccb9f0055e2",
            "size": 1576,
            "nfiles": null
          },
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2
          },
          "src/featurization.py": {
            "hash": "e0265fc22f056a4b86d85c3056bc2894",
            "size": 2490,
            "nfiles": null
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2
          },
          "src/train.py": {
            "hash": "c3961d777cfbd7727f9fde4851896006",
            "size": 967,
            "nfiles": null
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null
          },
          "src/evaluate.py": {
            "hash": "44e714021a65edf881b1716e791d7f59",
            "size": 2346,
            "nfiles": null
          }
        },
        "outs": {
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": true
          }
        },
        "queued": false,
        "running": false,
        "executor": null,
        "metrics": {
          "evaluation.json": {
            "data": {
              "avg_prec": 0.9249974999612706,
              "roc_auc": 0.9460213440787918
            }
          }
        }
      }
    }
  },
  "f3d69eedda6b1c051b115523cf5c6c210490d0ea": {
    "baseline": {
      "data": {
        "timestamp": "2022-07-13T14:17:20",
        "params": {
          "params.yaml": {
            "data": {
              "prepare": {
                "split": 0.21,
                "seed": 20170428
              },
              "featurize": {
                "max_features": 200,
                "ngrams": 2
              },
              "train": {
                "seed": 20170428,
                "n_est": 50,
                "min_split": 0.01
              }
            }
          }
        },
        "deps": {
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null
          },
          "src/prepare.py": {
            "hash": "f09ea0c15980b43010257ccb9f0055e2",
            "size": 1576,
            "nfiles": null
          },
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2
          },
          "src/featurization.py": {
            "hash": "e0265fc22f056a4b86d85c3056bc2894",
            "size": 2490,
            "nfiles": null
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2
          },
          "src/train.py": {
            "hash": "c3961d777cfbd7727f9fde4851896006",
            "size": 967,
            "nfiles": null
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null
          },
          "src/evaluate.py": {
            "hash": "44e714021a65edf881b1716e791d7f59",
            "size": 2346,
            "nfiles": null
          }
        },
        "outs": {
          "data/prepared": {
            "hash": "153aad06d376b6595932470e459ef42a.dir",
            "size": 8437363,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "data/features": {
            "hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
            "size": 2232588,
            "nfiles": 2,
            "use_cache": true,
            "is_data_source": false
          },
          "model.pkl": {
            "hash": "46865edbf3d62fc5c039dd9d2b0567a4",
            "size": 1763725,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "data/data.xml": {
            "hash": "22a1a2931c8370d3aeedd7183606fd7f",
            "size": 14445097,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": true
          }
        },
        "queued": false,
        "running": false,
        "executor": null,
        "metrics": {
          "evaluation.json": {
            "data": {
              "avg_prec": 0.9249974999612706,
              "roc_auc": 0.9460213440787918
            }
          }
        }
      }
    }
  }
}

Reproduce

  1. clone example-get-started
  2. add git+https://github.com/iterative/dvc to src/requirements.txt
  3. create venv, source activate script and install requirements
  4. dvc pull
  5. change params.yaml and queue x2 with dvc exp run --queue
  6. dvc queue start -j 2
  7. dvc exp show
  8. dvc queue status
  9. dvc exp show

When recreating this I can see that both experiments were successful in dvc queue status but the second one has not made it into the table. Final results:

❯ dvc queue status 
Task     Name    Created    Status
9d22751          02:50 PM   Success
962c834          02:50 PM   Success

Worker status: 0 active, 0 idle

First column of exp show:

  workspace
  bigrams-experiment
  └── 65584bd [exp-c88e8]

and the shas don't match?

Expected

It should be possible to run exp show and queue status in parallel with the execution of tasks from the queue.
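For what it's worth, one general way to keep a status file readable while writers run concurrently is an atomic write (write to a temp file, then os.replace). This is a sketch of the technique under the assumption that the corruption comes from interleaved writes, not a claim about how DVC implements its queue state:

```python
import json
import os
import tempfile

def atomic_write_json(path, obj):
    """Write obj as JSON via a temp file + os.replace, so a concurrent
    reader never observes a partial or concatenated document."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if anything went wrong
        raise
```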

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.13.1.dev87+gc2668110 
---------------------------------
Platform: Python 3.8.9 on macOS-12.2.1-arm64-arm-64bit
Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Please let me know if you need anything else from me. Thank you.

@mattseddon mattseddon added the product: VSCode, A: experiments, and A: task-queue labels Jul 13, 2022
@mattseddon mattseddon changed the title dvc queue: unexpected behaviour dvc queue: unexpected behaviour Jul 13, 2022
@mattseddon
Contributor Author

Also realise that I am working with something that isn't even officially released, so if this isn't the correct way to give early feedback then please LMK.

@dberenbaum
Collaborator

Is it happening for you consistently? I'm not able to reproduce:

$ dvc queue status
Task     Name    Created    Status
16508a7          04:48 PM   Success
94aa5f6          04:48 PM   Success

Worker status: 0 active, 1 idle
$ dvc exp show
 ──────────────────────────────────────────────────────────────────────────────────────>
  Experiment                Created        avg_prec   roc_auc   prepare.split   prepare>
 ──────────────────────────────────────────────────────────────────────────────────────>
  workspace                 -                 0.925   0.94602   0.2             2017042>
  bigrams-experiment        May 19, 2022      0.925   0.94602   0.2             2017042>
  ├── 32a003b [exp-36288]   04:49 PM        0.92017   0.94309   0.2             2017042>
  └── 991e27b [exp-9e5c5]   04:49 PM        0.91783    0.9407   0.2             2017042>
 ──────────────────────────────────────────────────────────────────────────────────────>

and the shas don't match?

This is expected although we have discussed trying to address it. Queued experiments have always (even before dvc queue) had different shas than the completed experiments.

Also realise that I am working with something that isn't even officially released, so if this isn't the correct way to give early feedback then please LMK.

Please give any feedback you have so we can try to resolve before real users get their hands on it 🙏 .

@mattseddon
Contributor Author

Yesterday I tried to recreate 4 times and hit this issue 3 times.

Today I tried twice and hit it on the second attempt, so 4 out of 6 times overall.

Screen.Recording.2022-07-14.at.10.52.00.am.mov

Notice in the video that I start queue execution with 2 workers but every time I call queue status it only shows 1 (yesterday I consistently saw 2). The queue tasks both show success but the second one has not made it into the experiments table.

@mattseddon
Contributor Author

This is expected although we have discussed trying to address it. Queued experiments have always (even before dvc queue) had different shas than the completed experiments.

Based on the above and iterative/vscode-dvc#1995 (comment) I think we are going to need the relevant queue task sha against each experiment in the exp show data. Will that be possible?

@karajan1001 karajan1001 self-assigned this Jul 14, 2022
@dberenbaum
Collaborator

This is expected although we have discussed trying to address it. Queued experiments have always (even before dvc queue) had different shas than the completed experiments.

Based on the above and iterative/vscode-dvc#1995 (comment) I think we are going to need the relevant queue task sha against each experiment in the exp show data. Will that be possible?

@karajan1001 Shouldn't the "Name" column be populated with the final experiment name? I know this was true at some point (see #7591 (comment)).

@mattseddon
Contributor Author

mattseddon commented Jul 18, 2022

Running experiments from the queue and then making a commit leads to a change of behaviour in exp show. The previous commit is not returned by default.

Details:

❯ git rev-parse --short HEAD
30a82db1
❯ git rev-parse --short HEAD~1
26c5bbed
❯ dvc exp show
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                Created    step      loss      acc   lr       weight_decay   epochs   data/MNIST   train.py  
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                 -            14   0.95962   0.7735   0.003    0              15       0aed307      90f29a9   
  26c5bbe                   11:21 AM     14   0.95962   0.7735   0.003    0              15       0aed307      90f29a9   
  │ ╓ 2a83d89 [exp-50d00]   11:35 AM     14    0.9642    0.675   0.0045   0              15       0aed307      90f29a9   
  │ ╟ e0d0f6c               11:35 AM     13    1.0903   0.6801   0.0045   0              15       0aed307      90f29a9   
  │ ╟ 814009f               11:35 AM     12    1.0301   0.6605   0.0045   0              15       0aed307      90f29a9   
  │ ╟ fe2c348               11:35 AM     11    1.1769   0.6408   0.0045   0              15       0aed307      90f29a9  
... 

The git log shows that both EXEC_HEAD & EXEC_BASELINE are still against the previous commit:

commit 26c5bbeddece45965cf518bde132180f7edbe7f2 (refs/exps/exec/EXEC_HEAD, refs/exps/exec/EXEC_BASELINE)

Running a new experiment in the workspace moves both refs.

Do you want a separate issue for this?

@dberenbaum dberenbaum added this to DVC Jul 18, 2022
@dberenbaum dberenbaum moved this to Todo in DVC Jul 18, 2022
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do release-blocker Blocks a release labels Jul 19, 2022
@karajan1001
Contributor

Two individual queue tasks collapse into a single result in exp show because the outputs of both experiments are the same.

I looked into the run logs of these two tasks and found that both of them show:

'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping

They just retrieved the cache and gave the same result, which led to the same hash and name (first 7 chars of the hash).
This phenomenon also exists in dvc exp run without the --queue flag.
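As the comment above notes, byte-identical results collapse to one experiment. A toy content-hash sketch illustrates why; note this is not DVC's actual rev derivation (which goes through Git commits), just an illustration of the collapsing effect:

```python
import hashlib

def short_name(content: bytes) -> str:
    # Illustration only: any content-derived id plus a 7-char prefix
    # behaves the same way as the experiment name collision described.
    return hashlib.sha1(content).hexdigest()[:7]

run_a = b"avg_prec=0.925 roc_auc=0.946"  # first task: cache hit
run_b = b"avg_prec=0.925 roc_auc=0.946"  # second task: identical cache hit

# Byte-identical results yield the same hash, so the two queue tasks
# collapse into a single row in the experiments table.
assert short_name(run_a) == short_name(run_b)
```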

@karajan1001
Contributor

karajan1001 commented Jul 20, 2022

For the ERROR: Invalid experiment '{entry.stash_rev[:7]}'. I can't reproduce it. Are there any steps to reproduce, or is it a compatibility issue that only occurs when dealing with previously generated exps?

@dberenbaum
Collaborator

Moved #8014 (comment) to #8031 because I was getting confused. I'm marking #8031 as p0 since that seems like a critical issue, although we can still make releases and hold off on merging the queue docs until it's resolved.

@mattseddon
Contributor Author

For the ERROR: Invalid experiment '{entry.stash_rev[:7]}'. I can't reproduce it, is there any reproducing steps? or it's a compatible issue that only exists in dealing with previously generated exps?

Yes, I have previously recreated this by having completed experiments in the workspace and then upgrading the CLI, queuing a new experiment and running the queue. The project that I've been testing with has checkpoints as well.

It should also be noted here that after upgrading the CLI there is no way back without removing .dvc/tmp/exps. That is definitely less of a concern than being able to seamlessly upgrade.

Hope this helps 👍🏻.

@karajan1001
Contributor

karajan1001 commented Jul 22, 2022

Yes, I have previously recreated this by having completed experiments in the workspace and then upgrading the CLI, queuing a new experiment and running the queue. The project that I've been testing with has checkpoints as well.

Sorry, I tried to run some experiments using --queue and --run-all in 2.12, in both the checkpoints-tutorial and example-get-started repos, then upgraded dvc to the latest main version, and still cannot reproduce ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

What I see from dvc queue status is:

No experiment tasks in the queue.

Worker status: 0 active, 0 idle

Of course, '{entry.stash_rev[:7]}' here is a message bug; we forgot to use an f-string to format it. But it is not a critical one.
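The message bug described above, a format string missing its f prefix, can be shown with a minimal sketch (Entry here is a stand-in class, not DVC's actual type):

```python
class Entry:
    """Stand-in for the queue entry object referenced in the message."""
    def __init__(self, stash_rev):
        self.stash_rev = stash_rev

entry = Entry("05100047a341f2fa4a02421289d48f84d8c45e86")

# The bug: without the f prefix, the braces are kept verbatim.
broken = "Invalid experiment '{entry.stash_rev[:7]}'."
# The fix: an f-string evaluates the expression inside the braces.
fixed = f"Invalid experiment '{entry.stash_rev[:7]}'."

print(broken)  # Invalid experiment '{entry.stash_rev[:7]}'.
print(fixed)   # Invalid experiment '0510004'.
```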

@karajan1001 karajan1001 moved this from Todo to In Progress in DVC Jul 27, 2022
@karajan1001
Contributor

karajan1001 commented Jul 27, 2022

@mattseddon

Excuse me, is there any more guidance on how to reproduce this? I have already tried:

  1. Switch to 2.12.1, queue an experiment and finish it, then return to the latest main and use queue status.
  2. Switch to 2.12.1, queue an experiment and run it, then return to the latest main and use queue status before it has finished.
  3. Queue an experiment, switch to 2.12.1, and try to run it, but it was missing from exp show.

But the problem is that previously queued experiments are stored in refs/exps/stash while the new celery version stores them in refs/exps/celery/stash. This means that if we switch between versions, dvc exp show/queue status will not show the experiments created on the other side.

@mattseddon
Contributor Author

Sorry, @karajan1001 I've been 100% occupied with integrating data status. I will get back to you soon.

@dberenbaum
Collaborator

@mattseddon No rush, but could you clarify how critical/frequent this issue is? Like @karajan1001, I was unable to reproduce this one, and it's unclear whether it should be a priority.

@mattseddon
Contributor Author

TL;DR: I can recreate the issue by using dvc queue start -j 2. As j > 1 is currently experimental we can probably close this.

I was unable to reproduce this one, and it's unclear whether it should be a priority.

I can definitely recreate it. I just ran into it again:

image

When trying to clean up experiments after getting that warning:

❯ dvc exp gc -f --all-tags 
WARNING: This will remove all experiments except those derived from the workspace and all git tags of the current repo. Run queued experiments will be removed.
ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

This will be an issue in the extension because errors generate a popup that the user sees.

Deleting .dvc/tmp/exps gets rid of the error altogether.

Repro steps:

  1. Using checkpoint-based experiments with 2.11.0
  2. Run an experiment in the workspace.
  3. Upgrade to 2.15.0.
  4. Queue two experiments with different params.
  5. dvc queue start -j 2
  6. run dvc exp show --show-json almost immediately after starting the queue (as per extension).
  7. One experiment will run, the other will disappear.
  8. dvc queue status returns ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

Even these repro steps are a bit hit or miss. Out of 3 attempts I hit the error, along with a missing experiment, 2 times.

I can also recreate just by using steps 4-8 (no upgrade needed).

The error is probably caused by steps 5 and 6. As j > 1 is a known issue we can probably close this.

@karajan1001
Contributor

karajan1001 commented Jul 28, 2022

Sounds like it's related to iterative/dvc-task#73. I tried several times but didn't hit this. I guess it is not related to the experiments from old versions, but rather to 1. concurrency and 2. checkpoints. I can fix the error message '{entry.stash_rev[:7]}' first to see what stash_rev value it is.

@karajan1001 karajan1001 moved this from In Progress to Todo in DVC Jul 28, 2022
@dberenbaum
Collaborator

@mattseddon No need to close it. We will need to stabilize behavior here and with j > 1 in general, but good to know the context and that it doesn't seem like a release blocker.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Jul 28, 2022
@karajan1001 karajan1001 moved this from Todo to Backlog in DVC Aug 4, 2022
karajan1001 added a commit to karajan1001/dvc that referenced this issue Aug 5, 2022
fix: iterative#8014
> ERROR: Invalid experiment '{entry.stash_rev[:7]}'.

This happens when a queue task fails due to an SCM error:

{"exc_type": "GitMergeError", "exc_message": ["Cannot fast-forward HEAD
to '05100047a341f2fa4a02421289d48f84d8c45e86'"], "exc_module": "dvc.scm"}

The failed task creates neither an infofile nor a fail_stash.

1. Don't raise DvcException if no infofile is found for failed tasks
   (successful tasks still raise it).
2. Add a new unit test for this.
@karajan1001
Contributor

karajan1001 commented Aug 5, 2022

Finally, I accidentally reproduced it. It happened when the queue task failed because of an SCM error, which didn't generate a normal fail_stash or a result info file.
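A rough sketch of the fix described in the referenced commit, with hypothetical names (load_executor_info and the signature are invented for illustration, not DVC's actual API):

```python
import os

class DvcException(Exception):
    pass

def load_executor_info(infofile, task_failed):
    """Hypothetical sketch: a task that dies with an SCM error writes
    neither an infofile nor a fail_stash, so a missing infofile is only
    treated as an error for tasks that reportedly succeeded."""
    if not os.path.exists(infofile):
        if task_failed:
            return None  # nothing to collect; queue status keeps going
        # Successful tasks should always have written an infofile.
        raise DvcException(
            f"Invalid experiment '{os.path.basename(infofile)[:7]}'."
        )
    with open(infofile) as f:
        return f.read()
```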

@karajan1001 karajan1001 moved this from Backlog to Review In Progress in DVC Aug 6, 2022
karajan1001 added a commit to karajan1001/dvc that referenced this issue Aug 8, 2022
fix: iterative#8014
Repository owner moved this from Review In Progress to Done in DVC Aug 8, 2022
pmrowla pushed a commit that referenced this issue Aug 8, 2022
fix: #8014
@eliadl

eliadl commented Aug 20, 2024

Deleting .dvc/tmp/exps gets rid of the error altogether.

This fixed it for me, thank you. dvc 3.54.0.
