-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
queue status: will show the current worker status #7903
queue status: will show the current worker status #7903
Conversation
fix: iterative#7950 Currently queue worker status are not show in `queue status` makes it hard to tell if the worker is still living. 1. add a new method `active_worker` to return the currently running worker node name. 2. give every worker a unique name. 3. `queue status` will show how much worker running at present.
b3573ae
to
bdec881
Compare
dvc/repo/experiments/queue/local.py
Outdated
worker_status = self.active_worker() | ||
while node_name in worker_status: | ||
number += 1 | ||
node_name = f"dvc-exp-{wdir_hash}-{number}@localhost" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result celery.control.inspect().active()
and celery.control.ping()
returned is based on node name
, different worker with the same node name will be regarded as the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional. We want to make sure we are reusing node names (so for something like queue start --jobs 4
, we would only use nodes 1 through 4 (and it would never increment the node number past 4).
When the dvc-task TemporaryWorker
starts, it checks whether or not another worker with the same node name exists, and if it does exist, the newly started worker exits immediately. This ensures that we always have the correct number of workers running at a time.
dvc/repo/experiments/queue/local.py
Outdated
if os.name == "nt": | ||
daemonize(cmd) | ||
else: | ||
ManagedProcess.spawn(["dvc"] + cmd, wdir=self.wdir, name=name) | ||
|
||
for _ in range(5): | ||
time.sleep(1) | ||
if node_name in self.active_worker(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly created node can only be detected some time later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this check. We already expect that in some cases the new worker may not start at all (if the node name is already in use it should exit immediately).
dvc/repo/experiments/queue/local.py
Outdated
node_name = f"dvc-exp-{wdir_hash}-{number}@localhost" | ||
worker_status = self.active_worker() | ||
while node_name in worker_status: | ||
number += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to give each worker a new number. The number
field should only be changing when we explicitly use exp queue start --jobs <number>
(which is still disabled to always be 1).
1. No `start -j` will only start worker (1,2,3,...,j). Will not start j new worker. 2. Distinguish active worker and idle worker.
edb845d
to
c37376c
Compare
Looks like the multi-worker is unstable on the
During today's tests, I also noticed several times the queue worker falsely shut down with the reason of no task in the queue, but actually, there are still 3-4 tasks in it. And once or twice that the queue worker didn't shut down after all of the tasks had been succeeded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM, we can address any UI changes in a separate PR when @dberenbaum returns next week
Followed by iterative#7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
Followed by #7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
Followed by #7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
Followed by #7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
Followed by #7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
Followed by #7903 Add a new unit test to increase the coverage of test Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
fix: #7590
Currently queue worker status are not show in
queue status
makes ithard to tell if the worker is still living.
active_worker
to return the currently runningworker node name.
queue status
will show how much worker running at present.❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Some tests are still required for this change.