Batch Jobs Still Do Not Work Correctly #1330

ghost · 2016-06-21T19:02:56Z

Nomad Version

Nomad v0.4.0-rc1 ('e72a64e9f8d55cb3317e6791a9c74a2617e3a02c')

Operating System

CentOS7

Issue

The problem was originally reported in #1324. That issue was closed, a new one (#1326) was created, and a fix was merged. I grabbed the latest source and tested it. My batch jobs are progressing further but still not running to completion. Of the 10 jobs that I submitted, 2 went immediately to the dead state without running any allocations, 4 ran to completion, and 4 were stuck in the pending state.

Reproduction steps

The test was run in GCE. I spun up 3 server nodes and 9,600 cores' worth of client nodes (600 16-CPU VMs). These were not preemptible VMs, to ensure that the problem is not being caused by nodes being yanked out from under the cluster. Once all of the nodes were up, I submitted 10 jobs. Each job had a task count of 10,000. Each individual task takes 120 seconds to complete. After submitting the jobs I waited until the cluster became idle and checked the status of the jobs. Of the 10 jobs that I submitted, 2 went immediately to the dead state, 4 ran to completion, and 4 were stuck in the pending state. I then waited an additional 15 minutes and rechecked the status of the jobs, and they had not changed. This was not a one-time event (it happens every time I run the test).
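The "check, wait 15 minutes, recheck" step above could be scripted roughly as follows. This is only a sketch, not the script used in the original test: it assumes the Nomad agent's HTTP API is reachable at `addr` and that each entry returned by `/v1/jobs` carries `ID` and `Status` fields. The helper names (`fetch_job_statuses`, `unchanged_pending`, `summarize`) are hypothetical.

```python
import json
import urllib.request
from collections import Counter


def fetch_job_statuses(addr="http://127.0.0.1:4646"):
    """Return {job ID: status} from the agent's job list.

    Assumption: `addr` points at a reachable Nomad agent and each list
    entry has "ID" and "Status" keys.
    """
    with urllib.request.urlopen(f"{addr}/v1/jobs") as resp:
        jobs = json.load(resp)
    return {job["ID"]: job["Status"] for job in jobs}


def unchanged_pending(first, second):
    """Job IDs that were pending in both snapshots, i.e. apparently stuck."""
    return sorted(
        job_id
        for job_id, status in first.items()
        if status == "pending" and second.get(job_id) == "pending"
    )


def summarize(snapshot):
    """Count how many jobs are in each status."""
    return Counter(snapshot.values())
```

Taking one snapshot once the cluster goes idle and a second one 15 minutes later, then passing both to `unchanged_pending`, would list the jobs that never left the pending state.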

Nomad Server Logs

svr-logs.tar.gz

Verbose Job Status

job-status.tar.gz

Job Spec

{
    "Job": {
        "Region": "global",
        "ID": "XXXXXX",
        "Name": "test-01",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "test-group",
                "Count": 100,
                "Tasks": [
                    {
                        "Name": "hello-world",
                        "Driver": "docker",
                        "Config": {
                            "image": "https://docker-cache.service.consul:5000/cdi/nomad-test:v0.0.9",
                            "command": "/opt/test/bin/test_batch.py",
                            "args": ["-t","120"],
                            "network_mode": "host"
                        },
                        "Resources": {
                            "CPU": 2500,
                            "MemoryMB": 256,
                            "DiskMB": 300,
                            "IOPS": 0
                        },
                        "LogConfig": {
                           "MaxFiles": 10,
                           "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
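For anyone trying to reproduce this, the 10 submissions could be generated from a single spec like the one above. The following is a minimal sketch under stated assumptions: the job ID is redacted in the report, so the `test-NN` naming used for both `ID` and `Name` is hypothetical, and `submit` assumes a Nomad agent whose HTTP API accepts job registration via `PUT /v1/jobs` at `addr`.

```python
import copy
import json
import urllib.request


def make_jobs(template, n=10):
    """Return n copies of the job spec, each with its own ID and Name.

    The "test-NN" naming is an assumption; the original ID was redacted.
    """
    jobs = []
    for i in range(1, n + 1):
        spec = copy.deepcopy(template)  # leave the template untouched
        name = f"test-{i:02d}"
        spec["Job"]["ID"] = name
        spec["Job"]["Name"] = name
        jobs.append(spec)
    return jobs


def submit(spec, addr="http://127.0.0.1:4646"):
    """Register one job with the agent (assumed reachable at `addr`)."""
    req = urllib.request.Request(
        f"{addr}/v1/jobs",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Loading the spec with `json.load`, passing it to `make_jobs`, and calling `submit` on each result would approximate the submission step of the test.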

dadgar · 2016-06-23T00:29:44Z

What machines did you run the servers on?

dadgar · 2016-07-27T17:58:56Z

Worked with the customer, and this has been fixed.

github-actions · 2022-12-20T02:15:27Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added type/bug theme/core labels Jun 21, 2016

dadgar mentioned this issue Jun 22, 2016

Worker waitForIndex uses StateStore index, not Raft Applied Index #1339

Merged

dadgar closed this as completed Jul 27, 2016

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022
