Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp show: --queued and --failed #7986

Closed
dberenbaum opened this issue Jul 7, 2022 · 10 comments · Fixed by #8318
Closed

exp show: --queued and --failed #7986

dberenbaum opened this issue Jul 7, 2022 · 10 comments · Fixed by #8318
Assignees
Labels
A: experiments Related to dvc exp

Comments

@dberenbaum
Copy link
Collaborator

  • dvc exp show should only include successfully completed experiments by default but should have options to include queued/failed items from the queue. (edit: exp show --queued/--failed)

Originally posted by @dberenbaum in #7592 (comment)

@dberenbaum dberenbaum added the A: experiments Related to dvc exp label Jul 7, 2022
@dberenbaum
Copy link
Collaborator Author

Need to coordinate with VS Code on these changes cc @mattseddon

@dberenbaum dberenbaum added this to DVC Jul 7, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Jul 7, 2022
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Jul 7, 2022
@mattseddon
Copy link
Contributor

@dberenbaum
Copy link
Collaborator Author

Given the discussion of dvc exp being the primary interface over dvc queue, it might make more sense to include all of these by default and make the flags --hide-queued/--hide-failed.

@karajan1001 karajan1001 moved this from Todo to In Progress in DVC Sep 6, 2022
@karajan1001 karajan1001 self-assigned this Sep 6, 2022
@karajan1001
Copy link
Contributor

karajan1001 commented Sep 7, 2022

@mattseddon I'm now working on this, but I found that there is still some data format required to be decided.

In the current json result of exp show there is no flag for failed experiments .

    "a4dedc07945e645a3b5ca0ede77330908688ea47": {
      "data": {
        "timestamp": "2022-08-26T15:55:17",
        "params": {
          "params.yaml": {
            "data": {
              "interval": 1
            }
          }
        },
        "deps": {},
        "outs": {
          "output.txt": {
            "hash": "c4ca4238a0b923820dcc509a6f75849b",
            "size": 1,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          },
          "out.txt": {
            "hash": null,
            "size": null,
            "nfiles": null,
            "use_cache": true,
            "is_data_source": false
          }
        },
        "queued": true,
        "running": false,
        "executor": null
      }
    }

We need a new flag for it. what do you prefer "success": false or "failed": true. Some other things to be noticed:

  1. the default value to compatible with old versions.
  2. for those failed experiments, you should also disregard the info in outs.

@dberenbaum
Copy link
Collaborator Author

Do we need separate fields for each state, or are they all mutually exclusive and we should have a single state field?

@karajan1001
Copy link
Contributor

Do we need separate fields for each state, or are they all mutually exclusive and we should have a single state field?

All are OK for me but will give more trouble to those who are parsing the JSON themselves, inside the DVC the JSON parsing is always going with the JSON format. This is why I'm asking for advice from matt.

@mattseddon
Copy link
Contributor

Is completed/failed a boolean result? Can there be a partial failure? If it is a boolean then I'm happy with succeeded of failed. Will the error message will be nested somewhere within the dict? We will want to display as much information to the user as possible 👍🏻.

@karajan1001
Copy link
Contributor

karajan1001 commented Sep 8, 2022

Is completed/failed a boolean result? Can there be a partial failure? If it is a boolean then I'm happy with succeeded of failed. Will the error message will be nested somewhere within the dict? We will want to display as much information to the user as possible 👍🏻.

For the error message, it could be gotten through the dvc queue logs, this command reads the stored output of the executor. It's not hard for me to read the message from this file and paste them into the JSON, the only question is What should I put into the JSON, is the last line of the output enough? And for current discussion the final JSON format would be something like

"sha_rev": {
  "data": {
    "timestamp": %YYYY-%mm-%ddT%HH-%MM-%SS,
    "params": {
...
      },
    "error_msg": "", (if added)
    },
    "deps": {
...
},
    "outs": {
...
    },
    "queued": bool,
    "running": bool,
    "success":bool,
    "executor": ...
  }
}

and what do you think about using a single "status" field instead of three boolean ones? Because you might get a confused result with more than one true (of course it's a bug but it is still possible).

@mattseddon
Copy link
Contributor

The format that we expect for a record that is completely broken is:

"sha_rev": {
  "error": {
    "msg": error_msg,
    "type": ""
      },
    },
    ...
    "success":false,
  }
}

Where the error key replaces data.

However, if we know what the params/deps are we could simply put the error message into the missing metrics/outs files.

{
  "data": {
    "timestamp": "2022-09-15T11:07:57",
    "params": {
      "params.yaml": {
        "data": {
          "lr": 0.006,
          "weight_decay": 0,
          "epochs": 15
        }
      }
    },
    "deps": {
      "data/MNIST": {
        ...
      },
      ...
    },
    ...
    "metrics": {
      "metric_file": {
        "error": {
          "msg": error_msg
          "type": error_type
        }
      }
    },
  }
}

The second option is IMO more beneficial to users because they'd want to know which of the queued items have failed so that they can retry them. If we omit the param info and only show failed records they would have to reverse engineer/work out which ones have failed.

and what do you think about using a single "status" field instead of three boolean ones? Because you might get a confused result with more than one true (of course it's a bug but it is still possible).

I would be fine to merge the running/queued/success/pending fields into a single status field. This is a breaking change though so we would need to coordinate on the release. Anyone using an old version of the extension with a new version of the CLI is going to have a weird experience.

@karajan1001
Copy link
Contributor

The second option is IMO more beneficial to users because they'd want to know which of the queued items have failed so that they can retry them. If we omit the param info and only show failed records they would have to reverse engineer/work out which ones have failed.

It is more an execution failure instead of metric_file file format wrong. The error message is probably some callstack message. So I think

"sha_rev": {
  "error": {
    "msg": error_msg,
    "type": ""
      },
    },
    ...
    "success":false,
  }
}

Makes more sense.

I would be fine to merge the running/queued/success/pending fields into a single status field. This is a breaking change though so we would need to coordinate on the release. Anyone using an old version of the extension with a new version of the CLI is going to have a weird experience.

Another choice is to keep both

    "queued": bool,
    "running": bool,

and status for a time, the front end can first look if the status field exists and if not back to the old version of the parsing method. And remove the old API 1 or 2 versions later.

karajan1001 added a commit to karajan1001/dvc that referenced this issue Sep 19, 2022
fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
karajan1001 added a commit to karajan1001/dvc that referenced this issue Sep 20, 2022
fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
karajan1001 added a commit to karajan1001/dvc that referenced this issue Sep 27, 2022
fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
@karajan1001 karajan1001 moved this from In Progress to Review In Progress in DVC Sep 27, 2022
karajan1001 added a commit to karajan1001/dvc that referenced this issue Sep 30, 2022
fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
karajan1001 added a commit to karajan1001/dvc that referenced this issue Oct 6, 2022
fix: iterative#7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
5. Add error msg to the `exp show` output
Repository owner moved this from Review In Progress to Done in DVC Oct 6, 2022
karajan1001 added a commit that referenced this issue Oct 6, 2022
fix: #7986
1. Add two new flags  `--hide-queued` and  `--hide-failed` to `exp show`
2. Allow `exp show` to show failed experiments.
3. Add unit test for the failed experiments shown.
4. Add name support for failed exp
5. Add error msg to the `exp show` output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants