Add CWL support for xclim #1955

Open
2 tasks done
SarahG-579462 opened this issue Oct 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@SarahG-579462
Contributor

SarahG-579462 commented Oct 15, 2024

Addressing a Problem?

CWL (Common Workflow Language) is a language for standardizing function inputs and outputs, and is used for creating data workflows, particularly in other geospatial applications. It is a planned addition to pygeoapi and, more generally, OGC-Processes. Adding support for using xclim through this language would be very helpful for people who don't want to dig through Python code and just want a plug-and-play solution to compute indices, apply bias correction, etc.

Potential Solution

  • I have a working prototype for individual indicators in CWL at the moment, see the additional context below for the code snippet. It creates a docker container for the command line tool, which means there is a lag in running any command, but this may be acceptable for some users?

  • I have the beginnings of a prototype for CWL for all commands together, but it is still non-functional. (I don't fully understand the language yet!)

  • In order to avoid the start-up latency, I see a few options:

    • We could propose that CWL add support for attaching to a running container; however, this runs against their philosophy of reproducibility (a running container could, in general, have non-constant state).
    • Perhaps a two-step approach: create a long-running container in the first step of the workflow, then create fast-starting containers for the individual commands that pipe their commands to that first container, and destroy the initial container in the final step of the workflow?
  • Add support for other sections of xclim than just indicator calculation: bias correction, spatial analogues, unit standardization, etc... This could be done by augmenting the CLI for xclim.

Additional context

Example code for the CWL indicator calculations

cwlVersion: v1.2
class: CommandLineTool
id: xclim_tx_max
label: Maximum temperature
doc: |
  Maximum of daily maximum temperature.
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: /app
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: localhost/xclim:latest

baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output

  TX_MAX:
    type: 
      type: record
      fields:
        
        - name: tasmax
          doc: |
            Maximum daily temperature.
            Default : tasmax.
          type: string?
          inputBinding:
            prefix: --tasmax 
        

        - name: freq
          doc: |
            Resampling frequency.
            Default : YS.
          type: string?
          inputBinding:
            prefix: --freq 
        
    name: tx_max
    inputBinding:
      position: 2
      prefix: tx_max



outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]

Code for generating indicators CWL, and beginnings of a master CWL

# Generate CWL files from xclim Indicators
import yaml
from pathlib import Path
from xclim.core.utils import InputKind
from loguru import logger
template = Path("cwl_template.yaml")
template_str = template.read_text()

master_template = Path("cwl_master.yaml")
master_str = master_template.read_text()

step_template = Path("cwl_step.yaml")
step_str = step_template.read_text()

fields_template_str = """
- name: {param}
  doc: |
    {doc}
  type: string{optional_flag}
  inputBinding:
    prefix: --{param} 
"""
fields_template_enum = """
- name: {param}
  doc: |
    {doc}
  type:
    {optional_flag}
    - type: enum
      symbols:
        {symbols}
  inputBinding:
    prefix: "--{param}"
"""
input_template = """
  {indicator_id}:
    type: 
      type: record
      fields:
        {fields}
    name: {indicator}
    inputBinding:
      position: 2
      prefix: {indicator}
"""
docker_path = "/app"
docker_image = "localhost/xclim:latest"

import xclim as xc
param_str = "{indicator_id}.{param}: {indicator_id}.{param}"
# indicators = xc.core.indicator.registry
indicators = {'TX_MAX':xc.core.indicator.registry['TX_MAX']}

steps = []
param_fields = []
for name, ind in indicators.items():
    ind_instance = ind.get_instance()
    logger.info("Processing Indicator: " + ind_instance.identifier)
    field_arr = []
    param_list = []
    for param_name, param in ind_instance.parameters.items():
        if param_name in ["ds"] or param.kind == InputKind.KWARGS:
            continue
        param_list.append(param_str.format(param=param_name, indicator_id=name))

        optional_flag = ""
        doc = [param.description.replace("\n", "\n    ")]
        if param.default:
            doc.append(f"Default : {param.default}.")
        

        if "choices" in param:
            choices = f"\n    Choices: {param.choices}"
            doc.append(choices)

            doc = "\n    ".join(doc)
            if param.default:
                optional_flag = '- type: "null"' 
            field = fields_template_enum.format(
                param=param_name,
                symbols="\n        ".join([f'- "{c}"' for c in param.choices]),
                optional_flag = optional_flag,
                doc = doc,
            )
        else:
            if param.default:
                optional_flag = '?' 
            
            doc = "\n    ".join(doc)
            field = fields_template_str.format(
                param=param_name,
                optional_flag=optional_flag,
                doc=doc,
            )
        field_arr.append(field)
    fields = "\n".join([field.replace("\n", "\n        ") for field in field_arr])
    #param_fields.append("\n".join([field.replace("\n", "\n        ") for field in field_arr]))
    inputs = input_template.format(
        indicator_id=name,
        indicator=ind_instance.identifier,
        fields=fields
    )
    param_fields.append(inputs)
    cwl = template_str.format(
        indicator_id=name,
        indicator=ind_instance.identifier,
        indicator_label=ind_instance.title,
        indicator_doc=ind_instance.abstract.replace("\n", "\n  "),
        docker_path=docker_path,
        docker_image=docker_image,
        indicator_inputs=inputs,
    )
    filename = Path(f"cwl/{name}.yml")
    with open(filename, "w") as f:
        f.write(cwl)

    # For each indicator, also generate a step and add it to the master CWL.
    param_list = '\n    '.join(param_list)
    
    step = step_str.format(
        indicator_id=name,
        indicator=name,
        file = filename.name,
        params=param_list
    )

    steps.append(step.replace("\n", "\n    "))
    break
master_cwl = master_str.format(
    steps="\n    ".join(steps),
    params="\n    ".join([p.replace("\n", "\n  ") for p in param_fields]),
)
logger.info("Writing master CWL")

with open("cwl/master.yml", "w") as f:
    f.write(master_cwl)
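
The generator above indents multi-line snippets with `.replace("\n", "\n        ")`, which relies on the surrounding template to indent the first line. As a possible alternative, the standard-library `textwrap.indent` can indent every line of a block in one call. This is only a sketch: `indent_block` is a hypothetical helper name, and the templates would need their placeholders moved to column 0 before it could be dropped in.

```python
import textwrap


def indent_block(block: str, levels: int, width: int = 2) -> str:
    """Indent every non-blank line of `block` by `levels * width` spaces."""
    # Strip the leading/trailing newlines that triple-quoted templates carry.
    return textwrap.indent(block.strip("\n"), " " * (levels * width))


field = """
- name: freq
  doc: |
    Resampling frequency.
  type: string?
"""

# Four levels of two spaces matches the nesting under `fields:` in the record.
indented = indent_block(field, levels=4)
print(indented)
```

Unlike the `replace` approach, the first line is indented too, so the placeholder in the template must start at the beginning of its line.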

Creating a docker image for xclim:


FROM python:3.10-slim

WORKDIR /app

RUN pip install xclim loguru h5netcdf --no-cache-dir

USER root

COPY cwl.py .
COPY *.yaml .


RUN mkdir /app/cwl

#RUN python -m compileall `python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"`

USER $USER

Templates for the CWL generator:

cwl_template.yaml:

cwlVersion: v1.2
class: CommandLineTool
id: xclim_{indicator}
label: {indicator_label}
doc: |
  {indicator_doc}
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: {docker_path}
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: {docker_image}

baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output
{indicator_inputs}


outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]

cwl_step.yaml:

{indicator}:
  run: {file}
  when: $(inputs.indicator == "{indicator}")
  in:
    input: input
    output: output    
    {params}
  out:
    outdir: outdir

cwl_master.yaml:

cwlVersion: v1.2
$graph:

- class: Workflow
  requirements:
    - MultipleInputFeatureRequirement
    - SubworkflowFeatureRequirement
    - InlineJavascriptRequirement
    - DockerRequirement
  inputs:
    input: 
      type: string
    output: 
      type: string
    indicator: 
      type: string
    {params}
  steps:
    {steps}

  outputs:
    outdir:
      type: File
      outputSource: 
        valueFrom: ${{ inputs.indicator + '/outdir' }}

Commands for docker/podman, running the CWL:

Build the image:
podman build -t localhost/xclim:latest .

Create the CWL files:
podman run -v $(pwd)/cwl/:/app/cwl -v $(pwd)/cwl.py:/app/cwl.py localhost/xclim:latest python /app/cwl.py

Run Indicator calculations:
cwltool --podman --outdir runs cwl/TX_MAX.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --TX_MAX.freq ME

(Not working) Run indicator calculations through the master CWL:
cwltool --podman --outdir runs cwl/master.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --indicator TX_MAX --TX_MAX.freq MS

Related issues: #1949

This idea came up during the CLINT/OGC code sprint in Bonn, this October.

Contribution

  • I would be willing/able to open a Pull Request to contribute this feature.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@SarahG-579462 SarahG-579462 added the enhancement New feature or request label Oct 15, 2024
@fmigneault

I think the TX_MAX could be simplified like so:

inputs:
  indice:
    type: 
      type: enum
      symbols:
      - tx_max
    inputBinding:
      position: 2

  tasmax:
    doc: Maximum daily temperature.
    type: string?
    default: tasmax
    inputBinding:
      # The default position is 0; use 3 so the option follows the tx_max subcommand.
      position: 3
      prefix: --tasmax

  freq:
    doc: Resampling frequency.
    type: 
    - "null"
    - type: enum
      symbols:
      - YS
      - MS
      # ... ?
    default: YS
    inputBinding:
      position: 3
      prefix: --freq

Any input that has a very specific set of values should define a type: enum/symbols.
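
To illustrate how the generator script could emit this flatter form, here is a minimal sketch that builds each input as a plain dict instead of a string template. `cwl_input` is a hypothetical helper, not part of xclim or cwltool, and the rule "has a default, so it is optional" is carried over from the script in the issue as an assumption.

```python
from typing import Any, Optional, Sequence


def cwl_input(
    prefix: str,
    doc: str,
    default: Optional[str] = None,
    choices: Optional[Sequence[str]] = None,
) -> dict[str, Any]:
    """Build one CWL input entry as a plain dict (hypothetical helper)."""
    if choices:
        # A closed set of values becomes an enum with explicit symbols.
        base: Any = {"type": "enum", "symbols": list(choices)}
    else:
        base = "string"
    # Assumption: a parameter with a default is optional, so allow "null" too.
    cwl_type: Any = ["null", base] if default is not None else base
    entry: dict[str, Any] = {"doc": doc, "type": cwl_type}
    if default is not None:
        entry["default"] = default
    entry["inputBinding"] = {"prefix": prefix}
    return entry


freq = cwl_input("--freq", "Resampling frequency.", default="YS", choices=["YS", "MS"])
tasmax = cwl_input("--tasmax", "Maximum daily temperature.", default="tasmax")
print(freq)
print(tasmax)
```

Here `["null", "string"]` is the long form of `string?`. Since the generator already imports PyYAML, something like `yaml.safe_dump({"inputs": {"freq": freq}}, sort_keys=False)` could serialize the result and would avoid the manual indentation handling.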

This should result (with the other inputs I didn't repeat) in a call like:

docker run \
  [...volume mounts, user-id map opts, etc. ...] \
  localhost/xclim:latest \
  xclim \
  --input /tmp/mounted/file.nc \
  --output /tmp/mounted/out.nc \
  tx_max \
  --tasmax tasmax \
  --freq YS 

Defining the inputs this way makes them look more natural and closer to what xclim expects (i.e., using cwltool ... --freq MS rather than cwltool ... --TX_MAX.freq MS).
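
One detail worth checking with flattened inputs is argument order. As far as I understand the CWL v1.2 binding rules, bindings are sorted by `position` (default 0) and then by input name, so options meant for the `tx_max` subcommand need a position greater than the subcommand's own. The sketch below is a simplified model of that ordering, not cwltool's implementation; `Binding` and `build_argv` are hypothetical names, and running `cwltool --debug` remains the way to confirm the real command line.

```python
from typing import NamedTuple, Optional


class Binding(NamedTuple):
    name: str                     # CWL input name
    value: str                    # value supplied by the caller
    prefix: Optional[str] = None  # e.g. "--freq"; None for bare arguments
    position: int = 0             # assumed CWL default position


def build_argv(base_command: list[str], bindings: list[Binding]) -> list[str]:
    """Order bindings by (position, name) and emit prefix/value pairs."""
    argv = list(base_command)
    for b in sorted(bindings, key=lambda b: (b.position, b.name)):
        if b.prefix is not None:
            argv.append(b.prefix)
        argv.append(b.value)
    return argv


bindings = [
    Binding("input", "/tmp/mounted/file.nc", "--input", 0),
    Binding("output", "/tmp/mounted/out.nc", "--output", 1),
    Binding("indice", "tx_max", None, 2),
    Binding("tasmax", "tasmax", "--tasmax", 3),
    Binding("freq", "YS", "--freq", 3),
]
print(" ".join(build_argv(["xclim"], bindings)))
```

In this model, leaving `tasmax` and `freq` at the default position would place them before `tx_max`, where the xclim CLI would presumably not accept them.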

Similarly, using a job file would use names and values that are easier to define:

input:
  class: File
  path: "/path/to/file.nc"
freq: MS
tasmax: tasmax
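
Since CWL job files can also be written as JSON, a job like the one above could be produced programmatically. This is a small sketch under that assumption; `job_file_text` and the file name `tx_max_job.json` are hypothetical.

```python
import json


def job_file_text(input_nc: str, **params: str) -> str:
    """Return a CWL job document as JSON; `params` become top-level inputs."""
    job = {"input": {"class": "File", "path": input_nc}, **params}
    return json.dumps(job, indent=2)


text = job_file_text("/path/to/file.nc", freq="MS", tasmax="tasmax")
print(text)
```

Writing `text` to `tx_max_job.json` and passing that file as the second positional argument to cwltool, after the tool description, would then replace the individual command-line options.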

To me, it sounds odd that there would be any start-up latency from the container if it was prebuilt and nothing triggers a rebuild each time (a modified file in the build context, for example).

I have noticed that calling xclim by itself has a noticeable start-up latency, so I am not sure the container is the cause at all.
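
One way to separate the two sources of latency would be to time the same invocation with and without the container. The sketch below only measures wall-clock time for an arbitrary command; `time_command` is a hypothetical helper, and the commented-out commands assume that xclim and podman are installed and that the `localhost/xclim:latest` image from this issue has been built.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Return the wall-clock seconds taken to run `cmd` to completion."""
    start = time.perf_counter()
    subprocess.run(cmd, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


# Baseline: interpreter start-up only.
baseline = time_command([sys.executable, "-c", "pass"])
print(f"bare interpreter: {baseline:.2f}s")

# Assumed comparisons (uncomment where xclim and podman are available):
# time_command([sys.executable, "-c", "import xclim"])
# time_command(["xclim", "--help"])
# time_command(["podman", "run", "--rm", "localhost/xclim:latest", "xclim", "--help"])
```

If the `import xclim` timing accounts for most of the containerized timing, the container itself is probably not the bottleneck.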


Another thing to consider when defining the CWL:
xclim takes an --output file path as input.
However, from the point of view of the CWL runner and the container, the actual path will be inside mounted volumes backed by temporary directories used for the processing.

Therefore, the path doesn't really matter, only the file name does. The CWL output should glob with this in mind, something along the lines of:

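inputs:
  output: 
    type: string
    inputBinding:
      position: 1
      prefix: --output
      # valueFrom belongs to the inputBinding; self is the string passed as --output
      valueFrom: "$(runtime.outdir)/$(self)"

outputs:
  output:
    type: File
    outputBinding:
      # the input is a plain string (a file name), so it can be globbed directly
      glob: "$(inputs.output)"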

Then, calling CWL with:

cwltool --outdir /tmp xclim-tasmax.cwl --input /path/to/in.nc --output result.nc --freq YS

Should create the file /tmp/result.nc.
But what CWL will have done is actually mount the created temp dirs, retrieve the output from the runtime dir, and stage out the output to the requested --outdir.
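
The stage-out behaviour described above can be pictured with a small simulation. This is only an illustrative model of "glob in the runtime dir, then copy to the requested output dir", not how cwltool is implemented; `stage_out` and the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage_out(runtime_dir: Path, pattern: str, outdir: Path) -> list[Path]:
    """Copy files in `runtime_dir` matching `pattern` into `outdir`."""
    outdir.mkdir(parents=True, exist_ok=True)
    staged = []
    for match in sorted(runtime_dir.glob(pattern)):
        staged.append(Path(shutil.copy2(match, outdir / match.name)))
    return staged


with tempfile.TemporaryDirectory() as tmp:
    runtime_dir = Path(tmp) / "runtime"  # stands in for the mounted temp dir
    outdir = Path(tmp) / "requested"     # stands in for --outdir
    runtime_dir.mkdir()
    # Only the file name of the requested output survives; its directory is ignored.
    requested = "some/ignored/dir/result.nc"
    (runtime_dir / Path(requested).name).write_bytes(b"fake netcdf")
    staged = stage_out(runtime_dir, Path(requested).name, outdir)
    staged_names = [p.name for p in staged]
    print(staged_names)
```

The point of the model is that the tool writes into a directory it does not choose, and the runner later collects the file by name.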
