Nomad Version
Nomad v0.4.0-rc1 ('e72a64e9f8d55cb3317e6791a9c74a2617e3a02c')
Operating System
CentOS7
Issue
The problem was originally reported in #1324. That issue was closed, a new one was created (#1326), and a fix was merged. I grabbed the latest source and tested it. My batch jobs are progressing further but still not running to completion. Of the 10 jobs that I submitted, 2 went immediately to the dead state without running any allocations, 4 ran to completion, and 4 were stuck in the pending state.
Reproduction steps
The test was run in GCE. I spun up 3 server nodes and 9,600 cores' worth of client nodes (600 16-CPU VMs). These were not preemptible VMs, to ensure that the problem is not being caused by nodes being yanked out from under the cluster. Once all of the nodes were up, I submitted 10 jobs. Each job had a task count of 10,000. Each individual task takes 120 seconds to complete. After submitting the jobs, I waited until the cluster became idle and checked the status of the jobs. Of the 10 jobs that I submitted, 2 went immediately to the dead state, 4 ran to completion, and 4 were stuck in the pending state. I then waited an additional 15 minutes and rechecked the status of the jobs, and they had not changed. This was not a one-time event (it happens every time I run the test).
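For a rough sense of scale, the numbers above can be turned into a best-case runtime estimate. This is only a back-of-the-envelope sketch: the job spec is not included here, so the one-core-per-task figure (`CORES_PER_TASK`) is an assumption, and the function name is made up for illustration. It also ignores scheduling and startup overhead.

```python
import math

# Figures taken from the reproduction steps above.
CLIENT_VMS = 600
CPUS_PER_VM = 16
JOBS = 10
TASKS_PER_JOB = 10_000
TASK_SECONDS = 120

# Assumption (not stated in the report): each task occupies exactly one core.
CORES_PER_TASK = 1


def ideal_runtime_seconds(total_tasks, total_cores, task_seconds, cores_per_task=1):
    """Lower bound on wall-clock time if tasks run in perfectly packed waves."""
    slots = total_cores // cores_per_task      # tasks that can run at once
    waves = math.ceil(total_tasks / slots)     # full passes over the cluster
    return waves * task_seconds


total_cores = CLIENT_VMS * CPUS_PER_VM         # 9600
total_tasks = JOBS * TASKS_PER_JOB             # 100000
estimate = ideal_runtime_seconds(total_tasks, total_cores, TASK_SECONDS, CORES_PER_TASK)

print(f"{total_tasks} tasks on {total_cores} cores: at least {estimate} s "
      f"({estimate / 60:.0f} min)")
```

Under that assumption, all 100,000 tasks would need at least 11 waves, or about 22 minutes, so a cluster that goes idle with 4 jobs still pending has stopped scheduling rather than simply run out of time.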
Nomad Server Logs
svr-logs.tar.gz
Verbose Job Status
job-status.tar.gz
Job Spec