Nomad fingerprinting thinks Termina doesn't support bridge, but it does #10902

Open
insanitybit opened this issue Jul 15, 2021 · 12 comments
Labels: stage/accepted (confirmed, and intend to work on; no timeline commitment though), theme/fingerprint, type/bug

Comments

@insanitybit

Nomad version

Output from nomad version
❯ nomad --version
Nomad v1.1.2 (60638a0)

Operating system and Environment details

❯ uname -a
Linux penguin 5.4.109-26094-g381754fbb430 #1 SMP PREEMPT Sat Jun 26 21:31:00 PDT 2021 x86_64 GNU/Linux

❯ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

Issue

ChromeOS provides a Linux VM that runs the Termina operating system, a stripped-down, hardened Linux environment. Termina is perfectly capable of running bridge networks, but Nomad's fingerprint heuristic fails to detect this.

Reproduction steps

sudo nomad agent -dev-connect &
sudo nomad job run testjob

Expected Result

Nomad schedules the job appropriately.

Actual Result

Nomad is unable to schedule the job because it believes that the agents are running on nodes that don't support bridge networks.

Job file (if appropriate)

job "test" {
    datacenters = ["dc1"]
    type = "service"
    group "foo" {
        network { 
            mode = "bridge"
        }
        task "test-task" {
            driver = "docker"
            config {
               image = "dgraph/dgraph:latest"
               args = ["dgraph", "zero", "--my=localhost:5080"]
            }
        }
    }
}

Nomad Server logs (if appropriate)

    2021-07-14T15:38:08.374-0700 [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled: error="3 errors occurred:
	* failed to open /proc/modules: open /proc/modules: no such file or directory
	* failed to open /lib/modules/5.4.109-26094-g381754fbb430/modules.builtin: open /lib/modules/5.4.109-26094-g381754fbb430/modules.builtin: no such file or directory
	* failed to open /lib/modules/5.4.109-26094-g381754fbb430/modules.dep: open /lib/modules/5.4.109-26094-g381754fbb430/modules.dep: no such file or directory

Notes

There's a simple enough workaround:

sudo mkdir -p /lib/modules/$(uname -r)/
sudo echo '_/bridge.ko' > /lib/modules/$(uname -r)/modules.builtin

This "tricks" Nomad into thinking that there's a .ko file registered for bridge networking, so it doesn't bail out of fingerprinting. After this you can use my repro steps and see that it's able to schedule the job just fine. The job even becomes healthy after a short while.

func (f *BridgeFingerprint) Fingerprint(req *FingerprintRequest, resp *FingerprintResponse) error {

This function here is the culprit. It assumes that if the kernel supports bridge networking it must have a module somewhere, but that's not the case. Termina doesn't have a /proc/modules, nor a /lib/modules/. Termina does not support loading kernel modules at all, in fact.

I don't know Nomad's internals intimately, but my suggestion is to not do fingerprinting for this sort of thing. Just try to create a bridge network: if it works, it works, and if it doesn't, it doesn't. That's the easiest way to know that support exists. Alternatively, there could be some way for me to tell Nomad "no, for real, we support it, ignore your fingerprint".

Alternatively, it's certainly a hack, but you could use uname -n and see if it's "penguin", which would solve this very specific instance.

@wimax-grapl

(I'd argue against the uname -n; I'm also on chromeos/termina and I manually renamed my host to a different name, and I'm sure there are other non-cros linux users who have a hostname of 'penguin')

jrasell added the stage/accepted and theme/fingerprint labels Jul 15, 2021
@jrasell
Member

jrasell commented Jul 15, 2021

Hi @insanitybit, and thanks for the detailed report. This is certainly an interesting problem and something we would like to solve.

@shoenig
Member

shoenig commented Jul 15, 2021

"I don't know Nomad's internals intimately, but my suggestion is to not do fingerprinting for this sort of thing. Just try to create a bridge network and if it works it works and if it doesn't it doesn't."

Fingerprinting is important: remember, Nomad is often used to deploy to large, non-homogeneous environments where a given workload may only work on a subset of the available nodes. The Nomad scheduler combines the fingerprinting data with the resources each workload asks for to schedule workloads accordingly.

The "just run it" strategy may work on your laptop, but it is not a good design.
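As a rough illustration of why the scheduler depends on fingerprint data, the sketch below filters a set of nodes down to those whose reported capabilities satisfy a workload's requirements. The `node` type, the `feasibleNodes` function, and the attribute names are hypothetical simplifications for this example, not Nomad's real data structures.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a fingerprinted client.
type node struct {
	Name       string
	Attributes map[string]bool // capabilities reported by fingerprinting
}

// feasibleNodes returns the nodes whose fingerprint includes every
// capability the workload requires.
func feasibleNodes(nodes []node, required []string) []node {
	var out []node
	for _, n := range nodes {
		ok := true
		for _, r := range required {
			if !n.Attributes[r] {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Attribute names here are made up for the example.
	nodes := []node{
		{Name: "big-distro", Attributes: map[string]bool{"network.bridge": true, "driver.docker": true}},
		{Name: "termina", Attributes: map[string]bool{"driver.docker": true}},
	}
	for _, n := range feasibleNodes(nodes, []string{"network.bridge", "driver.docker"}) {
		fmt.Println("feasible:", n.Name)
	}
}
```

In this model a node that fails to report bridge support is never considered for a bridge-mode job, which matches the behaviour described in this issue: a false negative in fingerprinting makes the node ineligible even though the job would run there.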

@d0nut-grapl

I think that this is a misunderstanding of the suggestion, @shoenig. I believe what @insanitybit was suggesting is that the nomad client should attempt to spin up a bridge network at startup to see if bridge networking is enabled, instead of attempting to guess beforehand via fingerprinting.

If it fails when the client starts, then the nomad client knows that it can't support bridge-mode networking. If it succeeds, it can delete the network and mark itself as supporting bridge-mode.

Does that make sense?

@insanitybit
Author

"works on my laptop" is the point

@ghost

ghost commented Aug 24, 2021

For anyone else who runs into this, the workaround should be

sudo mkdir -p /lib/modules/$(uname -r)/
echo '_/bridge.ko' | sudo tee -a /lib/modules/$(uname -r)/modules.builtin

@rcoder

rcoder commented Sep 20, 2021

Realistically, I don't think addressing this for ChromeOS-hosted VMs (or really any distro that fails to disclose what kernel modules/features are available via the standard method used in fingerprinting today) is going to push up to the very top of our priority stack soon. We are doing some other work around improving bridge network fingerprinting (e.g. #11038) and would be happy to look at a PR covering this case as well.

Generally speaking, the less invasive and privileged the fingerprinting check the better. So a reliable and side-effect-free way to probe for bridge support without actually setting up and tearing down NICs at agent startup would be particularly interesting to look through and potentially merge.

@insanitybit
Author

@rcoder Yeah, that's reasonable - I'll take a look and certainly if we find a way to do this in a nice side-effect-free way we'll share.

One thing that might be sort of ideal is just an escape hatch where we can bypass the fingerprinter and "promise" that the capability is there.
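One way such an escape hatch could look, sketched under stated assumptions: an operator-supplied override is consulted first, and the detection heuristic only runs when no valid override is set. The environment variable `EXAMPLE_FORCE_BRIDGE` and the function `bridgeSupported` are invented for this example and are not real Nomad settings or APIs.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// overrideEnv is a hypothetical variable name used only in this sketch.
// It is not a real Nomad setting.
const overrideEnv = "EXAMPLE_FORCE_BRIDGE"

// bridgeSupported consults the operator override first and falls back to
// the detect function when the override is unset or not a valid boolean.
func bridgeSupported(lookup func(string) (string, bool), detect func() bool) bool {
	if v, ok := lookup(overrideEnv); ok {
		if forced, err := strconv.ParseBool(v); err == nil {
			return forced
		}
	}
	return detect()
}

func main() {
	// Stand-in for a heuristic that fails, as it does on Termina.
	detect := func() bool { return false }
	fmt.Println("bridge supported:", bridgeSupported(os.LookupEnv, detect))
}
```

Run with `EXAMPLE_FORCE_BRIDGE=true` set, this sketch reports support even though the stand-in heuristic fails. The trade-off is that the operator's "promise" is trusted without verification.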

@c16a

c16a commented Nov 6, 2021

I see that this happens with Arch Linux on Linode servers as well. I'm using Nomad 1.1.6.

@tgross
Member

tgross commented Nov 8, 2021

I want to toss in that this same problem has popped up previously in #10983 and #9837; the heuristic of looking for the kernel module isn't reliable outside of the "big distros".

We can fix this via a "feature detection" approach, which we have open as #6618 (basically what @insanitybit is suggesting to do).

@tgross
Member

tgross commented Apr 20, 2023

Adding another note here based on some experiments I was working on earlier this week. This also makes it really challenging to run Nomad on something like Firecracker VMs. In this case you provision the kernel separately from the root filesystem, and your typical cloud image you download from, say, Canonical, isn't going to have the right kconfigs needed by Firecracker to boot. So either you end up baking your own monolithic kernel, and then the fingerprinting we're doing here doesn't work, or you end up having to lift the appropriate modules out of your build and into the root filesystem after the fact. This is incredibly tedious (and I still don't have it working yet 😀). The feature detection approach in #6618 would let us bypass most of the pain here.

@needarun

Hi @tgross,
Thanks for keeping this issue alive!
I can confirm this is a problem in other contexts too: it also occurs when running Nomad on Proxmox with the Nomad clients in LXC containers.

And thank you @insanitybit for the trick

Cheers ✌️
