queue: track logs #8483

Closed
Tracked by #9442
dberenbaum opened this issue Oct 27, 2022 · 5 comments
Labels
A: experiments Related to dvc exp A: task-queue Related to task queue. feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@dberenbaum
Collaborator

dvc queue saves info into .dvc/tmp/exps, including stdout, stderr, and structured JSON output with info like time and return code. This is useful info, but it is only retrievable through queue-specific commands and is treated like temporary data.

Instead, each experiment can use DVC to track its own logs and keep them somewhere like .dvc/logs with an associated .dvc/logs.dvc file. This would enable the logs to be saved and shared as part of the experiment. Return code, start/end time, and any other info that we decide to collect in the future, can also be included. The logs and this metadata can be used by Studio, VS Code, and any other experiment tracking interface.
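As a rough illustration of the kind of record described above, here is a minimal sketch that runs a command and saves its stdout, stderr, return code, and start/end time to a directory. The function name `run_and_log` and the file names `stdout.log`, `stderr.log`, and `meta.json` are assumptions made for this example; they are not part of DVC, and the sketch does not show how DVC itself would track the directory.

```python
import json
import subprocess
import time
from pathlib import Path


def run_and_log(cmd, log_dir):
    """Run `cmd` and save its output plus basic metadata under `log_dir`.

    Hypothetical helper for illustration only; the file layout is an assumption.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    start = time.time()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    end = time.time()

    # Raw streams are kept as separate files so they can be read on their own.
    (log_dir / "stdout.log").write_text(proc.stdout)
    (log_dir / "stderr.log").write_text(proc.stderr)

    # Structured metadata: return code and wall-clock start/end timestamps.
    meta = {
        "cmd": list(cmd),
        "returncode": proc.returncode,
        "start": start,
        "end": end,
    }
    (log_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return meta
```

The resulting directory is plain files, so in principle it could be tracked and pushed like any other output; where it should live and how it should be tracked is exactly what this issue leaves open.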

For non-queued experiments, there are currently no logs, but maybe we will eventually treat all experiments as queued?

Related: #7160

@dberenbaum dberenbaum added A: experiments Related to dvc exp A: task-queue Related to task queue. labels Oct 27, 2022
@pmrowla
Contributor

pmrowla commented Oct 28, 2022

How would these logs/metadata be shared with other tools (like studio)? Would it need to be pushed to DVC remotes? Or would it need to be stored entirely in git?

The main issue I can see here is that the existing log-capturing behavior is entirely separate from the DVC experiment. The logs that are available through the queue commands are the output of an entire encapsulated dvc exp run call that is made by the celery worker (in the temp workspace). It's basically equivalent to running $ dvc exp run > log.txt in a terminal. In this case, the log contains more than just the pipeline command output (it includes DVC CLI output, warnings, etc. as well), and it's captured separately from the exp commit generation.

In order to actually capture this and track it in git/dvc, you would really need to generate the entire experiment (including multiple potential checkpoint commits), and then generate an additional new commit that contains just the log/metadata changes.

This workflow also seems a bit strange because for any other non-dvc exp DVC or git commands, your repo will still have an outdated git or DVC tracked logs/metadata file that just has not been modified since the last time you ran an experiment. The reason the logs are separated right now is because they really only belong to the celery worker context (and they are not really part of the DVC repo state).

We could maybe implement this using something like git notes, where we would attach logs/metadata to a specific exp commit, without actually tracking them as files/data within the DVC/git repo.
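To make the git notes idea concrete, here is a minimal sketch that attaches a JSON metadata blob to a commit with `git notes` and reads it back. The notes ref name `exp-logs` and all function names are assumptions for this example; nothing in DVC defines them. Only the standard `git notes add` and `git notes show` subcommands are used.

```python
import json
import subprocess

# Hypothetical notes namespace (resolves to refs/notes/exp-logs); not defined by DVC.
NOTES_REF = "exp-logs"


def notes_add_cmd(commit, metadata, ref=NOTES_REF):
    """Build the `git notes add` command that attaches metadata to `commit`."""
    payload = json.dumps(metadata, sort_keys=True)
    # -f overwrites any existing note on the same commit in this notes ref.
    return ["git", "notes", "--ref", ref, "add", "-f", "-m", payload, commit]


def notes_show_cmd(commit, ref=NOTES_REF):
    """Build the `git notes show` command that prints the note on `commit`."""
    return ["git", "notes", "--ref", ref, "show", commit]


def attach_metadata(commit, metadata, ref=NOTES_REF):
    """Attach metadata to a commit; must be run inside a git repository."""
    subprocess.run(notes_add_cmd(commit, metadata, ref), check=True)


def read_metadata(commit, ref=NOTES_REF):
    """Read metadata back from a commit's note as a dict."""
    result = subprocess.run(
        notes_show_cmd(commit, ref), check=True, capture_output=True, text=True
    )
    return json.loads(result.stdout)
```

One consequence of this design is that notes live under their own refs and are not transferred by a plain `git push` or `git fetch` unless those refs are requested explicitly, so sharing them with other tools would need an extra step.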

@dberenbaum
Collaborator Author

> How would these logs/metadata be shared with other tools (like studio)? Would it need to be pushed to DVC remotes? Or would it need to be stored entirely in git?

I think it should be pushed to DVC remotes and just tracked with its own .dvc file or in dvc.lock.

> In order to actually capture this and track it in git/dvc, you would really need to generate the entire experiment (including multiple potential checkpoint commits), and then generate an additional new commit that contains just the log/metadata changes.

Good point. Another option would be to generate the logs only within the stage run so they can be captured as part of the exp commit. I think it would be fine to exclude or keep separate logs of celery operations and dvc boilerplate for setup/teardown of experiments.

> This workflow also seems a bit strange because for any other non-dvc exp DVC or git commands, your repo will still have an outdated git or DVC tracked logs/metadata file that just has not been modified since the last time you ran an experiment.

So maybe dvc.lock makes the most sense to track this info since it will be tied to the experiment run?

> We could maybe implement this using something like git notes, where we would attach logs/metadata to a specific exp commit, without actually tracking them as files/data within the DVC/git repo.

I would prefer if we can avoid it so we don't have to get even deeper into explaining esoteric git features.

@dberenbaum
Collaborator Author

@pmrowla
Contributor

pmrowla commented Oct 28, 2022

If we are looking to track individual stage timings, I agree it makes sense for logs to be kept at the stage level as well (and then the log/metadata files can essentially be handled as a per-stage output in dvc.lock)

@dberenbaum
Collaborator Author

Right, stage-level outputs make more sense. In that case, DVC/VS Code/Studio could merge logs and/or times to provide info for the "full" experiment.
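The merge step mentioned above could look roughly like the sketch below, which combines per-stage records into one summary for the whole experiment. The `StageRun` record and its fields, the `merge_stage_runs` function, and the choice to report the first non-zero return code are all assumptions for illustration; the thread does not specify a format.

```python
from dataclasses import dataclass


@dataclass
class StageRun:
    """Hypothetical per-stage record; field names are assumptions."""
    name: str
    start: float  # start timestamp in seconds
    end: float  # end timestamp in seconds
    returncode: int
    log: str


def merge_stage_runs(runs):
    """Combine per-stage records into a summary for the "full" experiment."""
    if not runs:
        raise ValueError("at least one stage run is required")

    # Order stages by start time so the merged log reads chronologically.
    ordered = sorted(runs, key=lambda r: r.start)
    failed = next((r for r in ordered if r.returncode != 0), None)

    merged_log = "\n".join(
        f"[{r.name}] {line}" for r in ordered for line in r.log.splitlines()
    )
    return {
        "start": ordered[0].start,
        "end": max(r.end for r in ordered),
        # Sum of per-stage durations; excludes any gaps between stages.
        "stage_seconds": sum(r.end - r.start for r in ordered),
        # Report the earliest failing stage's code, or 0 if all succeeded.
        "returncode": failed.returncode if failed else 0,
        "log": merged_log,
    }
```

Prefixing each line with the stage name keeps the merged log readable while still letting a viewer split it back out per stage.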

@karajan1001 karajan1001 added p2-medium Medium priority, should be done, but less important feature request Requesting a new feature labels Nov 3, 2022
@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p2-medium Medium priority, should be done, but less important labels Feb 6, 2023
@dberenbaum dberenbaum closed this as not planned Apr 24, 2024
3 participants