Batch Jobs Still Do Not Work Correctly #1330

Closed
ghost opened this issue Jun 21, 2016 · 3 comments

ghost commented Jun 21, 2016

Nomad Version

Nomad v0.4.0-rc1 ('e72a64e9f8d55cb3317e6791a9c74a2617e3a02c')

Operating System

CentOS7

Issue

The problem was originally reported under #1324. That issue was closed, a new one was created (#1326), and a fix was merged. I grabbed the latest source and tested it. My batch jobs are progressing further but still not running to completion. Of the 10 jobs that I submitted, 2 went immediately to the dead state without running any allocations, 4 ran to completion, and 4 were stuck in the pending state.

Reproduction steps

The test was run in GCE. I spun up 3 server nodes and 9,600 cores' worth of client nodes (600 16-CPU VMs). These were not preemptible VMs, to ensure that the problem is not being caused by nodes being yanked out from under the cluster. Once all of the nodes were up, I submitted 10 jobs. Each job had a task count of 10,000. Each individual task takes 120 seconds to complete. After submitting the jobs, I waited until the cluster became idle and checked the status of the jobs. Of the 10 jobs that I submitted, 2 went immediately to the dead state, 4 ran to completion, and 4 were stuck in the pending state. I then waited an additional 15 minutes and rechecked the status of the jobs, and they had not changed. This was not a one-time event; it happens every time I run the test.
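
For anyone trying to reproduce this, the three outcomes described above could be tallied with something like the sketch below. It is not the script used in this test. It assumes an agent reachable at the default local address, and it assumes that a successfully finished allocation reports a `ClientStatus` of `complete`, which may differ between Nomad versions. The label strings and helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default local agent address
DONE_STATUSES = {"complete"}          # assumption: value may differ by Nomad version


def fetch_json(path, addr=NOMAD_ADDR):
    """GET a Nomad HTTP API path and decode the JSON response body."""
    with urllib.request.urlopen(addr + path) as resp:
        return json.loads(resp.read().decode("utf-8"))


def classify(job_status, alloc_statuses):
    """Map a job Status plus its allocations' ClientStatus values to a label.

    The labels are hypothetical names for the outcomes seen in this report.
    """
    if job_status == "dead":
        if not alloc_statuses:
            return "dead-without-allocations"
        if all(s in DONE_STATUSES for s in alloc_statuses):
            return "completed"
        return "dead-with-unfinished-allocations"
    return job_status  # for example "pending" or "running"


def tally(jobs):
    """Count outcomes for an iterable of (job_status, [ClientStatus, ...])."""
    return Counter(classify(status, allocs) for status, allocs in jobs)


def tally_cluster(addr=NOMAD_ADDR):
    """Fetch every job and its allocations from a live agent and tally them."""
    jobs = []
    for stub in fetch_json("/v1/jobs", addr):
        job_id = urllib.parse.quote(stub["ID"], safe="")
        allocs = fetch_json("/v1/job/%s/allocations" % job_id, addr)
        jobs.append((stub["Status"], [a["ClientStatus"] for a in allocs]))
    return tally(jobs)
```

Calling `tally_cluster()` once the cluster goes idle, and again 15 minutes later, would show whether any job has moved out of the pending state between the two checks.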

Nomad Server Logs

svr-logs.tar.gz

Verbose Job Status

job-status.tar.gz

Job Spec

{
    "Job": {
        "Region": "global",
        "ID": "XXXXXX",
        "Name": "test-01",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "test-group",
                "Count": 100,
                "Tasks": [
                    {
                        "Name": "hello-world",
                        "Driver": "docker",
                        "Config": {
                            "image": "https://docker-cache.service.consul:5000/cdi/nomad-test:v0.0.9",
                            "command": "/opt/test/bin/test_batch.py",
                            "args": ["-t","120"],
                            "network_mode": "host"
                        },
                        "Resources": {
                            "CPU": 2500,
                            "MemoryMB": 256,
                            "DiskMB": 300,
                            "IOPS": 0
                        },
                        "LogConfig": {
                            "MaxFiles": 10,
                            "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
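
As a rough illustration of submitting several copies of a spec like the one above, the sketch below clones a template payload and registers each copy through the HTTP API's `/v1/jobs` endpoint. The agent address, the `test-NN` naming scheme, and the reuse of the name as the job ID are assumptions (the real ID is redacted above), and `build_jobs` and `submit` are hypothetical helpers, not part of Nomad.

```python
import copy
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default local agent address


def build_jobs(template, n, prefix="test"):
    """Return n deep copies of the template payload with unique ID and Name.

    Reusing the name as the ID is an assumption made for this sketch.
    """
    jobs = []
    for i in range(1, n + 1):
        payload = copy.deepcopy(template)
        name = "%s-%02d" % (prefix, i)
        payload["Job"]["ID"] = name
        payload["Job"]["Name"] = name
        jobs.append(payload)
    return jobs


def submit(payload, addr=NOMAD_ADDR):
    """Register one job with the agent and return the decoded JSON response."""
    req = urllib.request.Request(
        addr + "/v1/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the spec saved to a file and loaded through `json.load`, `for payload in build_jobs(template, 10): submit(payload)` would register ten jobs named `test-01` through `test-10`.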

dadgar commented Jun 23, 2016

What machines did you run the servers on?

dadgar commented Jul 27, 2016

Worked with the customer, and this has been fixed.

dadgar closed this as completed Jul 27, 2016

github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2022