
Conversation

@cjh1 (Contributor) commented Dec 1, 2025

This PR adds some properties to the JobSpec to allow containerized jobs to be run.

@juztas (Contributor) commented Dec 2, 2025

Hi, just my observations and a few comments. For a container runtime, the current JobSpec is minimal (an image and volume mounts), and there may be a need to express more container runtime options, like --mpi, --nv/--gpu, --network host… (just a wild guess based on my experience with physicists, not based on AmSC requirements). Would it be more valuable to have it like this:

class VolumeMount(BaseModel):
    source: str
    target: str
    read_only: bool = True

class ContainerRuntime(BaseModel):
    image: str | None = None
    network_mode: str = "host"
    mpi: bool = False
    gpu: bool = False
    volume_mounts: list[VolumeMount] = []
    # ... expand as needed/required in the future

class JobSpec(BaseModel):
    executable: str | None = None
    container_runtime: ContainerRuntime | None = None

In short, rather than pushing these flags directly into JobSpec, the API could introduce a dedicated ContainerRuntime model and keep container-related configuration separate.

It also raises additional questions for facilities and the IRI Interface implementation (how this would work in practice across all facilities): each facility might use a different container runtime (Docker, Apptainer, Podman…), and not everyone allows fully privileged containers (just my guess). How are these capabilities exposed (container runtime, flags supported), and who does the "heavy lifting" of translating container parameters to the facility's container runtime? Is it the IRI Interface, or is it left to the end user to identify each facility's capabilities and make the changes required to run jobs?

@cjh1 (Contributor, Author) commented Dec 2, 2025

> Hi, just my observations and a few comments. For a container runtime, the current JobSpec is minimal (an image and volume mounts), and there may be a need to express more container runtime options, like --mpi, --nv/--gpu, --network host… (just a wild guess based on my experience with physicists, not based on AmSC requirements). Would it be more valuable to have it like this:
>
> class VolumeMount(BaseModel):
>     source: str
>     target: str
>     read_only: bool = True
>
> class ContainerRuntime(BaseModel):
>     image: str | None = None
>     network_mode: str = "host"
>     mpi: bool = False
>     gpu: bool = False
>     volume_mounts: list[VolumeMount] = []
>     # ... expand as needed/required in the future
>
> class JobSpec(BaseModel):
>     executable: str | None = None
>     container_runtime: ContainerRuntime | None = None
>
> In short, rather than pushing these flags directly into JobSpec, the API could introduce a dedicated ContainerRuntime model and keep container-related configuration separate.

Separating the configuration into a separate container-specific object is a good idea. However, I think we need to be careful to avoid exposing too much, as we need to allow sites to implement the interface, so it really needs to be the lowest common denominator that the container runtimes used across the different sites can support. For example, I didn't expose the network configuration, as I was thinking we should just default to the host network. For MPI and GPU configuration, I would say these options could be enabled if the job spec dictated that they were necessary, to avoid duplicating configuration.
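That idea — deriving MPI/GPU enablement from resource fields the job spec already carries, rather than repeating them in the container config — can be sketched as follows. Field names like gpus_per_node and ranks_per_node are hypothetical and not part of this PR:

```python
def derive_container_flags(job_spec: dict) -> dict:
    """Map resource fields already present in a job spec to container
    runtime needs, so GPU/MPI settings are never configured twice.

    The field names (gpus_per_node, ranks_per_node) are invented for
    illustration; a real implementation would read whatever resource
    fields the JobSpec model actually defines.
    """
    return {
        # GPU support is needed iff the job requested GPUs.
        "gpu": job_spec.get("gpus_per_node", 0) > 0,
        # MPI support is needed iff the job runs more than one rank per node.
        "mpi": job_spec.get("ranks_per_node", 1) > 1,
    }


# A multi-rank GPU job implies both capabilities without extra config.
flags = derive_container_flags({"gpus_per_node": 4, "ranks_per_node": 8})
```

A facility adapter could then translate these booleans into runtime-specific invocations (for example, a GPU flag for its particular container runtime) without the user restating them.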

> It also raises additional questions for facilities and the IRI Interface implementation (how this would work in practice across all facilities): each facility might use a different container runtime (Docker, Apptainer, Podman…), and not everyone allows fully privileged containers (just my guess). How are these capabilities exposed (container runtime, flags supported), and who does the "heavy lifting" of translating container parameters to the facility's container runtime? Is it the IRI Interface, or is it left to the end user to identify each facility's capabilities and make the changes required to run jobs?

Yes, as I said above, we need to expose a very minimal subset of container functionality so it can be implemented successfully across sites. I see this interface as a subset of container functionality rather than a superset of all container runtime options. We could also provide a site-specific "extra container options" property as an escape hatch that would allow sites to support more advanced options, but these would not necessarily be supported across all sites.
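One possible shape for that escape hatch, sketched under the assumption that each site publishes the set of pass-through options it supports (the option names and capability set below are invented examples, not a proposed standard):

```python
def apply_extra_options(base_args: list[str], extras: list[str],
                        supported: set[str]) -> list[str]:
    """Append site-specific extra container options to a runtime command,
    failing loudly on options the site does not support rather than
    passing them through or silently dropping them.

    `supported` stands in for a per-site capability set that the facility
    would publish; how that is exposed is one of the open questions above.
    """
    # Compare on the option name only, so "--bind=/a:/b" matches "--bind".
    unsupported = [opt for opt in extras if opt.split("=")[0] not in supported]
    if unsupported:
        raise ValueError(f"unsupported extra container options: {unsupported}")
    return base_args + extras


# Invented example: a site that allows GPU and bind-mount pass-throughs.
args = apply_extra_options(
    ["run", "docker://ubuntu:24.04"],
    ["--nv"],
    supported={"--nv", "--bind"},
)
```

Rejecting unknown options up front keeps the portable core of the interface honest: a job that relies on the escape hatch fails fast at the sites that cannot honor it, instead of running with different behavior.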

@cjh1 force-pushed the containers branch 2 times, most recently from afd294d to 205b127, on December 4, 2025 at 19:02