Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

live integration: get rid of dvc from live #5466

Merged
merged 5 commits into from
Mar 2, 2021

Conversation

pared
Copy link
Contributor

@pared pared commented Feb 15, 2021

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Important

Tests are failing because dvclive with corresponding change needs to be released. If change is accepted, please do not merge, I need to synchronize dvclive with dvc release to make the time window of incompatible versions as small as possible.

@@ -67,7 +67,8 @@ def exp_stage(tmp_dir, scm, dvc):


@pytest.fixture
def checkpoint_stage(tmp_dir, scm, dvc):
def checkpoint_stage(tmp_dir, scm, dvc, mocker):
mocker.patch("dvc.stage.run.MonitorConfig.AWAIT", 0.01)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+/- 40% time less for test_checkpoints.py

@pared pared force-pushed the live_no_dvc branch 2 times, most recently from b5d73ba to 49a9b64 Compare February 19, 2021 11:44
dvc/stage/run.py Outdated
def _checkpoint_run(stage, callback_func, done, proc, killed):
"""Run callback_func whenever checkpoint signal file is present."""
signal_path = os.path.join(stage.repo.tmp_dir, CHECKPOINT_SIGNAL_FILE)
@dataclass
Copy link
Member

@skshetry skshetry Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we will need a scheduler for the Runner soon. :)

Copy link
Member

@skshetry skshetry Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just saying that all the logic in the stage/run does not belong here.

We should think of decoupling all of the logic here and in the repro to a Runner and TaskScheduler respectively. Runner should support hooks (on_start/on_end/on_failure/every(1s) etc.) and actually running the scripts.
And, the TaskScheduler that schedules stages to be run from the given DAG (which would help us parallelize in the future).

We'd likely come to it later though, just sharing my thoughts πŸ™‚ . But this also looks good for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry That sounds like a great idea!

@pared pared requested review from pmrowla and a team February 19, 2021 13:53
@pared pared changed the title [WIP] live integration: get rid of dvc from live live integration: get rid of dvc from live Feb 19, 2021
from copy import deepcopy
from textwrap import dedent

import pytest
from funcy import first

from dvc import stage as stage_module
from dvc.exceptions import MetricsError
from dvc.exceptions import MetricDoesNotExistError, MetricsError

LIVE_SCRITP = dedent(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the spelling be fixed here and where it's used below as long as other changes are being made?

Suggested change
LIVE_SCRITP = dedent(
LIVE_SCRIPT = dedent(

Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure I understand what's happening here. You are not only making dvclive independent from dvc, but also keeping dvclive-like html output functionality implemented independently within dvc? Or do I have that completely wrong @pared ?

@pared
Copy link
Contributor Author

pared commented Feb 22, 2021

@dberenbaum You got it right.
We want to make live as lightweight as possible in its basic installation, so instead of adding dvc as a dependency, this change aims to introduce checkpoint-like behaviour.
live, if it finds .dvc dir, will create, under proper path kind of semafor file that dvc will be reacting to. If file is present, prepare summary and delete file, wait for another one and repeat.

@dberenbaum
Copy link
Collaborator

Thanks, @pared! So how does dvclive generate the html summary if no .dvc dir is found?

@pared
Copy link
Contributor Author

pared commented Feb 22, 2021

@dberenbaum Without dvc, dvclive is unable to produce a summary. It can work only as a logger in that case.

@dberenbaum
Copy link
Collaborator

Got it. Thanks, @pared! It seems like live without dvc is not a particularly likely regular scenario, since the functionality will be limited to logging metrics to a file, which users can do pretty easily themselves. However, maybe it's a better code design to keep them separate?

@pared
Copy link
Contributor Author

pared commented Feb 22, 2021

@dberenbaum
So the reasoning for the removal was to provide potential users with a logger that comes with no strings attached, and yet allows close integration with dvc. While it definitely does not sound like a popular scenario, keeping dependencies as scarce as possible was mainly for this purpose.

dvc/stage/run.py Outdated
@dataclass
class MonitorConfig:
name: str
stage: "Stage" # type: ignore[name-defined] # noqa: F821
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be ignored, you just need

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from dvc.stage import Stage
...
    stage: "Stage"

dvc/stage/run.py Outdated
_kill(proc)
killed.set()
_kill(config.proc)
config.killed.set()
finally:
logger.debug("Remove checkpoint signal file")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message should probably be generalized now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feels like the top level _... methods here should go into the base Monitor as class methods now (w/the checkpoint callback one in CheckpointMonitor)

dvc/stage/run.py Outdated
super().__exit__(exc_type, exc_val, exc_tb)


def _monitor_loop(config: MonitorConfig):
Copy link
Contributor

@pmrowla pmrowla Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One general design question, does it make sense for us to have separate threads watching for each of the checkpoint signal files and the dvclive signal files? Or should we have one monitor thread that checks for multiple signals (and runs their callbacks) sequentially like

while True:
    for signal_path, task in signals_to_watch:
        if os.path.exists(signal_path):
            try:
                task()
            finally:
                remove(signal_path)

I guess the real question is would running both callbacks at the same time in parallel threads potentially cause any issues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the user creates the checkpoint signal, their code is supposed to block so that we know the user won't be touching the DVC workspace until we signal them to continue (by removing the signal file). This way we know that we can do whatever is needed in the workspace to create the checkpoint commits safely, and I'm assuming that dvclive works the same way?

In the event that a user creates both a checkpoint and dvclive signal file at the same time, I'm not sure that it is safe for dvclive to be doing anything while the workspace may be in the state where we are moving HEAD around and creating git commits during checkpoint creation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla Your proposition makes more sense, Ill introduce necessary changes

Copy link
Contributor Author

@pared pared Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redesigned so that we have single monitor with multiple tasks provided as needed, depending on stage content. Now Monitor and Tasks starts to look more and more similar to what @skshetry proposes.

@pared pared force-pushed the live_no_dvc branch 2 times, most recently from ae1a368 to e7bf95e Compare February 24, 2021 00:14
@pared pared requested a review from pmrowla February 24, 2021 00:20
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dberenbaum dberenbaum marked this pull request as ready for review February 24, 2021 19:52
Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, changed the status on this, but I don't have anymore comments.

@pared
Copy link
Contributor Author

pared commented Feb 24, 2021

Thanks for review! After discussing with @efiop we decided that I will give it a try and try to prepare TaskScheduler and Runner in this change.

@skshetry
Copy link
Member

@pared, I'll share some libraries that I was looking for inspiration, might be helpful/useful: invoke, doit and prefect (mostly for experiment executors, generalizing it for repro).

@pared
Copy link
Contributor Author

pared commented Feb 26, 2021

@skshetry thanks, though it was misunderstanding between me and @efiop - in this change we will just try to move Monitor logic out of cmd_run. So, definining Scheduler and Runner will not be a scope of this change.

@pared pared changed the title live integration: get rid of dvc from live [WIP] live integration: get rid of dvc from live Mar 1, 2021
@pytest.mark.parametrize("summary", (True, False))
def test_export_config_tmp(tmp_dir, dvc, mocker, summary, report):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the test_export_config_tmp test - now that live is online test_export_config tests the same things

@pared pared changed the title [WIP] live integration: get rid of dvc from live live integration: get rid of dvc from live Mar 1, 2021
@pared
Copy link
Contributor Author

pared commented Mar 1, 2021

Extracted monitor logic out of stage/run.
Looking at problems that I will face releasing this small change, and looking at
this change I came to a conclusion that it would be better to have those tests inside dvclive. That leads to the need of having fixtures reused between projects. Similar problem in dvcx was solved by copying the fixtures. It seems that official way of handling this kind of issue is to create a pytest-plugin. I think we should prepare the plugin. It could already be used in dvcx, dvclive and potentially in dvc-bench, once we find a capacity to get back to making the benchmarks pytest-like.

@pared pared requested a review from efiop March 1, 2021 16:15
Comment on lines +16 to +25
def create_summary(out):
from dvc.utils.html import write

assert out.live and out.live["html"]

metrics, plots = out.repo.live.show(str(out.path_info))

html_path = out.path_info.with_suffix(".html")
write(html_path, plots, metrics)
logger.info(f"\nfile://{os.path.abspath(html_path)}")
Copy link
Collaborator

@dberenbaum dberenbaum Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like there's much to this HTML summary (either here or in https://github.com/iterative/dvc/blob/master/dvc/utils/html.py). Would it make sense to build out summary capability in dvclive? That would allow for complete dvclive functionality without dvc.

Also, could the summary output format be abstracted in dvclive so that non-HTML outputs could be built? For example, a matplotlib output that auto-updates, or even a very basic cli output. Different output types could enable users to see realtime updates for model performance (similar to a progress bar) without opening a separate page that needs to be manually refreshed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum
This question has few levels.

  1. Would it make sense to build out summary capability in dvclive? That would allow for complete dvclive functionality without dvc

I guess it would make sense, but it would mean copying not only html module and this particular function, but quite a lot of code from metrics and plots (live.show is actually calling them) - so all the functionality related to vega-js would probably need to be copied too, the templates for example. Also having it in dvc allows us to use a command dvc live to visualize the live outputs. Its not too flashy as of today, but AFAIK, it is supposed to be iterated upon in the future.

non-HTML outputs could be built ?

Probably, but that leaves us with a lot of questions related to plots itself - how would it work on different os-es? What about some pipeline running in a server where we connect via SSH? Also, what to do with plots command? Do we want to keep the HTML functionality? The plots were made in certain way in order to not have to support our own plotting framework, and use something that probably all of the users share - having browser. I think visualizing live using different means will be hard both to maintain and explain to users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including these features in dvclive doesn't need to impact dvc at all. No need to block this PR or the release.

For users to be able to get value out of dvclive without dvc, could we build a basic summary capability in dvclive? I was wondering whether we could adapt what's currently in https://github.com/iterative/dvc/blob/master/dvc/utils/html.py as a starting point, but it doesn't need to be if there's a simpler way. Looking at that file, it looked like the data structures to feed to HTML.write() are reasonably simple, and the data is already being collected by dvclive.

Thoughts? I can open a new issue to discuss if it's worthwhile.

@dberenbaum
Copy link
Collaborator

What is the intended user experience for including dvclive in a dvc stage?

@pared
Copy link
Contributor Author

pared commented Mar 1, 2021

@dberenbaum I guess the main point is that if you use dvclive in your code, you don't need to specify live logs directory inside the code. So, if you run dvc run ... --live logs - the information that logs is supposed to be te live logs dir will be passed to the code. And if code itself does not specify init method, provided directory will be used.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Mar 1, 2021

if you use dvclive in your code, you don't need to specify live logs directory inside the code...

Is this being documented? I had completely missed this until now πŸ˜„ . I see it's in the help for dvc stage add and dvc run, but I don't see it in the command reference docs or anywhere else. CC @jorgeorpinel

As another feature request for the future, would it be possible to simplify how to write code when wanting to use both checkpoints and dvclive? Right now, I'm writing code like:

for _ in epochs:
    train()
    metrics = evaluate()
    with open("metrics.json", "w") as f:
        json.dump(metrics, f)
    for k, v in metrics.items():
        dvclive.log(k, v)
    dvclive.next_step()
    make_checkpoint()

This seems redundant. If I use a dvclive callback, I'm not even sure how I would write out the metrics or define the checkpoints for dvc since my code wouldn't iterate through each epoch.

EDIT: Looks like this already works like I'm requesting. Awesome! Sorry, just playing around with this now. In that case, we just need to document.

@efiop efiop requested a review from pmrowla March 2, 2021 14:13
@pmrowla pmrowla merged commit 4a8cb80 into iterative:master Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants