Error running nf-nomad with acl enabled #56

Closed
jhaezebr opened this issue Jul 3, 2024 · 10 comments · Fixed by #57

Comments

Collaborator

jhaezebr commented Jul 3, 2024

Nextflow seems to be unable to submit jobs when ACLs are enabled, even though I can submit a job with the same token using the nomad CLI.

Nextflow log
Jul-03 12:13:27.492 [main] DEBUG nextflow.cli.Launcher - $> nextflow run hello -c nomad.config -w ./work
Jul-03 12:13:27.870 [main] DEBUG nextflow.cli.CmdRun - N E X T F L O W  ~  version 24.04.2
Jul-03 12:13:27.930 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/home/research/.nextflow/plugins; core-plugins: nf-amazon@2.5.2,nf-azure@1.6.0,nf-cloudcache@0.4.1,nf-codecommit@0.2.0,nf-console@1.1.3,nf-ga4gh@1.3.0,nf-google@1.13.2,nf-tower@1.9.1,nf-wave@1.4.2
Jul-03 12:13:28.014 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Jul-03 12:13:28.016 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Jul-03 12:13:28.025 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.10.0 in 'deployment' mode
Jul-03 12:13:28.189 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Jul-03 12:13:28.232 [main] DEBUG nextflow.scm.ProviderConfig - Using SCM config path: /home/research/.nextflow/scm
Jul-03 12:13:28.253 [main] DEBUG nextflow.scm.AssetManager - Listing projects in folder: /home/research/.nextflow/assets
Jul-03 12:13:30.130 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/research/.nextflow/assets/nextflow-io/hello/.git/config; branch: master; remote: origin; url: https://github.com/nextflow-io/hello.git
Jul-03 12:13:30.344 [main] DEBUG nextflow.scm.RepositoryFactory - Found Git repository result: [RepositoryFactory]
Jul-03 12:13:30.389 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/research/.nextflow/assets/nextflow-io/hello/.git/config; branch: master; remote: origin; url: https://github.com/nextflow-io/hello.git
Jul-03 12:13:32.835 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /home/research/.nextflow/config
Jul-03 12:13:32.837 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /home/research/.nextflow/assets/nextflow-io/hello/nextflow.config
Jul-03 12:13:32.849 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /scratch/nf-nomad/nomad.config
Jul-03 12:13:32.852 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/research/.nextflow/config
Jul-03 12:13:32.853 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/research/.nextflow/assets/nextflow-io/hello/nextflow.config
Jul-03 12:13:32.854 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /scratch/nf-nomad/nomad.config
Jul-03 12:13:32.892 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /home/research/.nextflow/secrets/store.json
Jul-03 12:13:32.900 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@2b736fee] - activable => nextflow.secret.LocalSecretsProvider@2b736fee
Jul-03 12:13:32.912 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.202 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.274 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.744 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 by global default
Jul-03 12:13:33.751 [main] DEBUG nextflow.cli.CmdRun - Launching `https://github.com/nextflow-io/hello` [disturbed_shannon] DSL2 - revision: 7588c46ffe [master]
Jul-03 12:13:33.756 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins declared=[nf-nomad@0.1.1]
Jul-03 12:13:33.758 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Jul-03 12:13:33.760 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[nf-nomad@0.1.1]
Jul-03 12:13:33.761 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-nomad version: 0.1.1
Jul-03 12:13:33.798 [main] INFO  org.pf4j.AbstractPluginManager - Plugin 'nf-nomad@0.1.1' resolved
Jul-03 12:13:33.798 [main] INFO  org.pf4j.AbstractPluginManager - Start plugin 'nf-nomad@0.1.1'
Jul-03 12:13:33.862 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-nomad@0.1.1
Jul-03 12:13:34.025 [main] DEBUG nextflow.Session - Session UUID: 52aae5fc-1036-4f86-af10-e5633ac019f5
Jul-03 12:13:34.026 [main] DEBUG nextflow.Session - Run name: disturbed_shannon
Jul-03 12:13:34.026 [main] DEBUG nextflow.Session - Executor pool size: 80
Jul-03 12:13:34.047 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Jul-03 12:13:34.063 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=240; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Jul-03 12:13:34.134 [main] DEBUG nextflow.cli.CmdRun -
  Version: 24.04.2 build 5914
  Created: 29-05-2024 06:19 UTC
  System: Linux 5.4.0-150-generic
  Runtime: Groovy 4.0.21 on OpenJDK 64-Bit Server VM 11.0.23-internal+0-adhoc..src
  Encoding: UTF-8 (UTF-8)
  Process: 59747@compute-87hs7j2 [127.0.1.1]
  CPUs: 80 - Mem: 629.8 GB (13.6 GB) - Swap: 4 GB (3.6 GB)
Jul-03 12:13:34.273 [main] DEBUG nextflow.Session - Work-dir: /scratch/nf-nomad/work [ceph]
Jul-03 12:13:34.274 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/research/.nextflow/assets/nextflow-io/hello/bin
Jul-03 12:13:34.331 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[NomadExecutor]
Jul-03 12:13:34.369 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Jul-03 12:13:34.506 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Jul-03 12:13:34.545 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 81; maxThreads: 1000
Jul-03 12:13:34.749 [main] DEBUG nextflow.Session - Session start
Jul-03 12:13:35.455 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Jul-03 12:13:35.736 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: nomad
Jul-03 12:13:35.736 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'nomad'
Jul-03 12:13:35.744 [main] DEBUG nextflow.executor.Executor - [warm up] executor > nomad
Jul-03 12:13:35.765 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'nomad' > capacity: 100; pollInterval: 5s; dumpInterval: 5m
Jul-03 12:13:35.771 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: nomad)
Jul-03 12:13:36.185 [main] DEBUG n.nomad.executor.NomadService - [NOMAD] Client Address: http://nomad.ops.cmgg.be/v1
Jul-03 12:13:36.186 [main] DEBUG n.nomad.executor.NomadService - [NOMAD] Client Token: 4465a..
Jul-03 12:13:36.549 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: sayHello
Jul-03 12:13:36.550 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Jul-03 12:13:36.552 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > sayHello
Jul-03 12:13:36.564 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_45e06ae60646ee81: /home/research/.nextflow/assets/nextflow-io/hello/main.nf
Jul-03 12:13:36.565 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination
Jul-03 12:13:36.565 [main] DEBUG nextflow.Session - Session await
Jul-03 12:13:38.298 [Actor Thread 8] INFO  nextflow.processor.TaskProcessor - [sayHello (4)] cache hash: 233d257343efe6e16bd7c6104c229955; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String]     """
    echo '$x world!'
    """

  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  f5e76d4e64af0c5d859ff08ab3b720b7 [java.lang.String] Hola
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

Jul-03 12:13:38.275 [Actor Thread 7] INFO  nextflow.processor.TaskProcessor - [sayHello (3)] cache hash: 7121055b03c0817999f33638f4237c5d; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String]     """
    echo '$x world!'
    """

  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  0ab6632d52e811e9ef7c044666ac496a [java.lang.String] Hello
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

Jul-03 12:13:38.357 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [sayHello (1)] cache hash: 5c5ceeed61a78867efbf73384c00380e; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String]     """
    echo '$x world!'
    """

  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  c9273e5a7ac3508ef910437c4bb35a90 [java.lang.String] Bonjour
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

Jul-03 12:13:38.298 [Actor Thread 6] INFO  nextflow.processor.TaskProcessor - [sayHello (2)] cache hash: c607458338b72c0746d6fcac6772aa62; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String]     """
    echo '$x world!'
    """

  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  442e002ddd8b0a2b10ed51352f8c0488 [java.lang.String] Ciao
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

Jul-03 12:13:38.649 [Task submitter] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] Submitting task sayHello (2) - work-dir=/scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa
Jul-03 12:13:39.197 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=sayHello (2); work-dir=/scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa
  error [nextflow.exception.ProcessSubmitException]: [NOMAD] Failed to submit sayHello (2) -- Cause: Forbidden
Jul-03 12:13:39.256 [Task submitter] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa/.command.log
Jul-03 12:13:39.269 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'sayHello (2)'

Caused by:
  Forbidden


Command executed:

  echo 'Ciao world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Jul-03 12:13:39.274 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: [NOMAD] Failed to submit sayHello (2) -- Cause: Forbidden
Jul-03 12:13:39.360 [Task submitter] DEBUG nextflow.Session - The following nodes are still active:
  [operator] view

Jul-03 12:13:39.409 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: nomad) - terminating tasks monitor poll loop
Jul-03 12:13:39.428 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-03 12:13:39.428 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-03 12:13:39.446 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=4; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=0; peakCpus=0; peakMemory=0; ]
Jul-03 12:13:39.697 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jul-03 12:13:39.745 [main] INFO  org.pf4j.AbstractPluginManager - Stop plugin 'nf-nomad@0.1.1'
Jul-03 12:13:39.745 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-nomad
Jul-03 12:13:39.753 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
Nextflow config
dumpHashes = true

plugins {
     id 'nf-nomad@0.1.1'
}

process {
     executor = "nomad"
     docker.enabled = true
}

nomad {
     client {
          address = "http://nomad.example.com"
          token   = "XXXXXXXXXXXXXXXXXXX"
     }

     jobs {
          deleteOnCompletion = false
          namespace = "nextflow"
          datacenters = ['dc']

          volumes = [
               { type "csi" name "nf_scratch_volume" path "/scratch" },
               { type "csi" name "nf_reference_volume" path "/references" }
          ]
     }
}
Nomad log
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=node-pool-mutation warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=vault warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=namespace-constraint-check warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=node-pool-validation warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=memory_oversubscription warnings=[] error=<nil>
2024-07-03T12:13:39.168Z [DEBUG] http: request failed: method=POST path=/v1/jobs?namespace=nextflow error="Permission denied" code=403
2024-07-03T12:13:39.168Z [DEBUG] http: request complete: method=POST path=/v1/jobs?namespace=nextflow duration=1.19574ms
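
The 403 above is returned for `POST /v1/jobs?namespace=nextflow` after the mutators and validators report no errors, which suggests the job definition was accepted and the ACL check rejected it. One way to separate this from Nextflow is to replay the same kind of request by hand. The following is a minimal sketch, assuming Nomad's documented HTTP API (the `X-Nomad-Token` header and a `{"Job": ...}` request body); the function name, address, token, and job payload are placeholders, not what nf-nomad actually sends.

```python
import json
import urllib.parse
import urllib.request


def build_job_register_request(addr, token, namespace, job):
    """Build (but do not send) a job registration request for Nomad's HTTP API."""
    query = urllib.parse.urlencode({"namespace": namespace})
    url = f"{addr.rstrip('/')}/v1/jobs?{query}"
    body = json.dumps({"Job": job}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Nomad-Token": token,  # the ACL token is passed in this header
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only; a real job needs a full definition.
request = build_job_register_request(
    "http://nomad.example.com",
    "XXXXXXXXXXXX",
    "nextflow",
    {"ID": "example", "Datacenters": ["dc"]},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` raises `urllib.error.HTTPError` for a 4xx response, so a rejected token should show up as `code == 403`. A complete job payload can probably be obtained from `nomad job run -output test.hcl`, which prints the JSON the CLI would submit.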

Manual run
$ export NOMAD_TOKEN='XXXXXXXXXXXX'
$ export NOMAD_ADDR="http://nomad.example.com"
$ export NOMAD_NAMESPACE=nextflow
$ export NOMAD_DC=s10
$ nomad job run test.hcl
==> Monitoring evaluation "02b6eef0"
    Evaluation triggered by job "example"
    Evaluation within deployment: "4d4d3f64"
    Allocation "984a1dcb" created: node "57dfcfcd", group "example"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "02b6eef0" finished with status "complete"

$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2024-07-03T12:12:42Z

$ cat test.hcl
job "example" {
  group "example" {
    task "sleep" {
      driver = "docker"
      config {
        image = "busybox:latest"
        entrypoint = ["/bin/sleep", "300"]
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Nomad nextflow ACL
namespace "nextflow" {
  policy = "write"
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "deny"
}
Collaborator

jagedn commented Jul 3, 2024

Can you try mounting a volume in test.hcl, please?

I'm not sure (yet) how ACLs work, but host_volume is set to "deny" in your example, and the Nextflow task needs to mount the volume.

Collaborator

jagedn commented Jul 3, 2024

I've tested against the local cluster created in the validation folder (see #57).

When the --secure flag is provided, the cluster is bootstrapped with ACLs and NOMAD_TOKEN is required to run the pipelines.
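
Before launching a pipeline against an ACL-enabled cluster, it can help to confirm which policies the token actually carries. A minimal sketch, assuming the `NOMAD_ADDR` and `NOMAD_TOKEN` environment variables used in the manual run and Nomad's documented `/v1/acl/token/self` endpoint; the helper names are illustrative and not part of nf-nomad.

```python
import json
import os
import urllib.request


def build_token_self_request(env=os.environ):
    """Build a request asking Nomad to describe the token in NOMAD_TOKEN."""
    addr = env["NOMAD_ADDR"].rstrip("/")
    return urllib.request.Request(
        f"{addr}/v1/acl/token/self",
        headers={"X-Nomad-Token": env["NOMAD_TOKEN"]},
    )


def policies_from_response(payload):
    """Extract policy names from the JSON body of a token-self response."""
    # Management tokens report no policies, so fall back to an empty list.
    return json.loads(payload).get("Policies") or []
```

Sending the request with `urllib.request.urlopen` and passing the response body to `policies_from_response` should list the policy names attached to the token, which can then be compared with the policy that grants access to the `nextflow` namespace.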

Collaborator Author

jhaezebr commented Jul 3, 2024

So to use CSI volumes you need at least read permission on plugin and the csi-list-volume capability.

Updated policy
namespace "nextflow" {
  policy = "write"
  capabilities = [
    "csi-write-volume",
    "csi-read-volume",
    "csi-list-volume",
    "csi-mount-volume"
  ]
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "read"
}

Other than that, there is still a problem with volumes that are read-only:

  capability {
    access_mode     = "multi-node-reader-only"
    attachment_mode = "file-system"
  }

  mount_options {
    mount_flags = [ "ro" ]
  }

Collaborator

jagedn commented Jul 3, 2024

We're mounting all the volumes as writable:

taskDef.config.mount = [
    type    : "volume",
    target  : destinationDir,
    source  : config.jobOpts().dockerVolume,
    readonly: false
]

So we probably need to extend our DSL spec with more features.
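
One possible shape for such an extension is a per-volume read-only flag that is passed through to the mount definition. The sketch below is a language-neutral illustration written in Python, not the plugin's Groovy code; the `VolumeSpec` type and its `read_only` field are hypothetical, and only the four mount keys come from the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeSpec:
    """Hypothetical volume declaration; `read_only` is the proposed addition."""
    name: str
    path: str
    read_only: bool = False  # writable by default, matching current behaviour


def mount_for(volume):
    """Translate a volume declaration into a mount definition."""
    return {
        "type": "volume",
        "target": volume.path,
        "source": volume.name,
        "readonly": volume.read_only,
    }


# Volume names and paths taken from the config earlier in this issue.
scratch = VolumeSpec("nf_scratch_volume", "/scratch")
references = VolumeSpec("nf_reference_volume", "/references", read_only=True)
```

Defaulting the flag to `False` would keep existing configurations working unchanged, while a reference store could opt in to a read-only mount.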

abhi18av linked a pull request on Jul 3, 2024 that will close this issue
Member

abhi18av commented Jul 3, 2024

@jhaezebr what's the overall use-case for read-only file systems in your setup?

Collaborator

matthdsm commented Jul 4, 2024

@jagedn we use a read-only mount for our reference store. This isn't strictly needed, but we want the mount to be read-only so that a rogue process can't delete or change any of the references.

Collaborator Author

jhaezebr commented Jul 4, 2024

I've made a separate issue for the read-only use case: #60
I'll focus on the ACL part here :)

Collaborator Author

jhaezebr commented Jul 4, 2024

For the moment, this ACL policy seems to work for Nextflow:

namespace "nextflow" {
  policy = "write"
  capabilities = [
    "csi-write-volume",
    "csi-read-volume",
    "csi-list-volume",
    "csi-mount-volume"
  ]
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "read"
}
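
For anyone wanting to install a policy like this programmatically, here is a minimal sketch, assuming Nomad's documented ACL policy endpoint (`POST /v1/acl/policy/:name` with `Name`, `Description`, and `Rules` fields) and a management token. The function name, address, and token are placeholders, and the rules string is an abbreviated copy of the policy above.

```python
import json
import urllib.parse
import urllib.request

# Abbreviated rules; the full policy also denies agent, operator, quota, node, and host_volume.
NEXTFLOW_RULES = """
namespace "nextflow" {
  policy = "write"
  capabilities = [
    "csi-write-volume",
    "csi-read-volume",
    "csi-list-volume",
    "csi-mount-volume"
  ]
}

plugin {
  policy = "read"
}
"""


def build_policy_upsert_request(addr, management_token, name, rules, description=""):
    """Build (but do not send) a request that creates or updates an ACL policy."""
    url = f"{addr.rstrip('/')}/v1/acl/policy/{urllib.parse.quote(name, safe='')}"
    body = json.dumps(
        {"Name": name, "Description": description, "Rules": rules}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Nomad-Token": management_token,
            "Content-Type": "application/json",
        },
    )


request = build_policy_upsert_request(
    "http://nomad.example.com",
    "MANAGEMENT-TOKEN-PLACEHOLDER",
    "nextflow",
    NEXTFLOW_RULES,
    description="Job submission and CSI access for Nextflow",
)
```

The CLI equivalent is `nomad acl policy apply`, which takes the policy name and the path to the rules file.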

Member

abhi18av commented Jul 4, 2024

Gotcha - thanks @jhaezebr !

Quick question: did you test with a fusionfs setup or just CSI?

Judging from the following, I think this could be a blocker, since fusionfs requires the use of tmp.

host_volume "*" {
  policy = "deny"
}

Ideally, we want to keep feature parity with both 🤝

Collaborator Author

jhaezebr commented Jul 5, 2024

No, I didn't test fusionfs, just csi. We don't use fusionfs in our cluster and I'm not familiar with it.
