TaskVine workers are started with an idle-disconnect parameter: if a worker is connected to the manager but receives no work within x seconds, it sends an idle-disconnect request.
The manager is supposed to consider the worker and its files before answering. If the worker holds a temporary file in local storage that is needed by a future task, the manager should refuse the request, since allowing the disconnect would cause the file to be lost.
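To make the expected behavior concrete, here is a minimal Python sketch of the check described above. It is not the TaskVine implementation: `Worker`, `Task`, and `may_idle_disconnect` are hypothetical names used only for illustration, and the real manager logic lives in C.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical model of a worker: just a name and its local temp files.
    name: str
    temp_files: set = field(default_factory=set)


@dataclass
class Task:
    # Hypothetical model of a task that has not run yet and its input files.
    inputs: set = field(default_factory=set)


def may_idle_disconnect(worker, pending_tasks):
    """Allow the disconnect only if no pending task needs a temp file
    stored on this worker."""
    needed = set()
    for task in pending_tasks:
        needed |= task.inputs
    return worker.temp_files.isdisjoint(needed)


w = Worker("worker-1", {"temp-1"})
# A pending task still needs temp-1, so the request should be refused.
refused = not may_idle_disconnect(w, [Task({"temp-1"})])
# With no pending consumers, the request can be granted.
granted = may_idle_disconnect(w, [])
```

Under this model, a granted disconnect followed by a recovery task for the same file means one of the two decisions was wrong.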
However, when running a workflow I found a worker that requested an idle disconnect, which the manager allowed, yet a recovery task was immediately created for the lost file. The attached section of the log shows this.
Therefore, either we should not have allowed this disconnect and there is a flaw in handle_worker_timeout, or the lost file was not needed and we should not have created a recovery task for it.
Given recent experiences with dynamic map reduce, I would prefer that disconnect requests be ignored whenever the worker has the unique copy of a temporary file, regardless of whether the manager knows the file will be needed in the future. The file may be needed by a task that has not been created yet.
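A rough sketch of this stricter rule, in Python for brevity. The `replicas` mapping (temp file name to the set of workers holding a copy) and both function names are assumptions made for illustration, not TaskVine data structures or API.

```python
def holds_unique_temp(worker_name, replicas):
    """True if this worker is the only holder of at least one temp file.

    replicas: hypothetical mapping of temp file name -> set of worker names.
    """
    return any(holders == {worker_name} for holders in replicas.values())


def may_idle_disconnect_strict(worker_name, replicas):
    # Ignore the request whenever a unique copy would be lost, without
    # consulting the set of currently known tasks at all.
    return not holds_unique_temp(worker_name, replicas)


replicas = {
    "temp-a": {"w1", "w2"},  # replicated, safe to lose one copy
    "temp-b": {"w1"},        # only copy lives on w1
}
w1_allowed = may_idle_disconnect_strict("w1", replicas)
w2_allowed = may_idle_disconnect_strict("w2", replicas)
```

Here `w1` stays connected because it holds the only copy of `temp-b`, while `w2` may leave. The rule never looks at pending tasks, which is what covers consumers that do not exist yet.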