Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added initial WfExS-backend examples based on toy workflow. #53

Merged
merged 20 commits into from
Oct 26, 2023

Conversation

jmfernandez
Copy link
Contributor

I have just added the generated RO-Crates from the execution of two toy workflows with WfExS-backend.

@jmfernandez jmfernandez requested review from lrodrin and simleo March 16, 2023 17:42
Copy link
Collaborator

@simleo simleo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll start from cosifer-cwl_provenance.

  • The root dataset's conformsTo needs to change to that of Workflow Run Crate (i.e., it should not conform to https://w3id.org/ro/wfrun/provenance/0.1, but only to the other three profiles), since the tool execution is not recorded. The recorded actions are the "consolidation" (what is this, by the way?) and three executions of the workflow.

  • The crate needs to specify a license.

  • The root dataset needs to link to the CreateAction instances via mentions.

  • There are duplicate entries in the workflows' input and output that need to be eliminated. This is not handled by ro-crate-py, so it has to be taken care of in the library user's code. Other lists in the crate also have duplicates, e.g. the workflow's softwareRequirements. BTW, softwareRequirements is a property of SoftwareApplication, so SoftwareApplication should be added to the workflow's type list.

  • The workflow lists itself in hasPart. It should only list the tools (only one in this case). However, since it's a Workflow Run Crate, hasPart can be omitted altogether.

  • The additionalType for CWL string parameters should be Text, not String. See https://www.researchobject.org/workflow-run-crate/cwl_param_mapping

  • Input and output files should either be included in the crate or be retrievable via download. So data_matrix.csv is ok (although it's better to include it in the crate), but the nih: URIs are another matter. What should the crate user do with e.g. nih:sha-256;95fe-414b-b4e0-e193-6230-ea67-cd55-4ca8-cf30-6281-0870-0748-5d76-2df3-67b7-8bf4;9? How can the file be retrieved?

cosifer-cwl_staged is similar to cosifer-cwl_provenance, but does not have the workflow executions. Remarks are similar to the ones above, with the following differences:

  • Since the only action is the consolidation, which is not a workflow execution, this crate should conform to Process Run Crate and Workflow RO-Crate, but not Workflow Run Crate or Provenance Run Crate.

  • The workflow has input but not output. They could both be omitted though.

The cosifer-nxf* crates have issues similar to the ones above. In addition:

  • The workflow's input lists both an outputsDir and an outputs-dir parameter, but only the former is present in the workflow.

  • The workflow lists nextflow.config in hasPart, but hasPart should only list the tools orchestrated by the workflow. nextflow.config should be listed in the action's "object" (see referencing configuration files). hasPart in the workflow can be omitted altogether.

@simleo
Copy link
Collaborator

simleo commented Mar 30, 2023

The workflow run crates will have to conformTo the 0.2 version of the profile, to be released soon -- see #55.

@simleo simleo mentioned this pull request Jun 7, 2023
@jmfernandez jmfernandez force-pushed the wfexs_backend_examples branch from 49dbf69 to 5a11d94 Compare August 31, 2023 02:44
@jmfernandez
Copy link
Contributor Author

I have been updating previous examples, as well as adding new, real life, ones. New examples in this repo only include the generated ro-crate-metadata.json and either a copy of the workflow or its packed version, as the used inputs and containers need several GBs.

@simleo
Copy link
Collaborator

simleo commented Sep 1, 2023

Thanks for the updates, José María. Looking at the previous examples, progress has been made. However, some issues remain, and the new workflows bring some other issues. I'll list what I've found below.

All (or most) crates

  • Crates needs to specify a license. We've also been discussing this aspect with the Sapporo team here.

  • Some lists still have duplicates. For instance, the workflow's input and output in cosifer-nxf_provenance.

  • softwareRequirements is a property of SoftwareApplication, so SoftwareApplication should be added to the workflow's type list when softwareRequirements is used on a workflow.

  • The additionalType for entities representing CWL string parameters (e.g., for separator) needs to be Text, not String (String is not a term in the RO-Crate context).

  • Consolidation actions have WfExS-backend as their agent. The agent of an action should be a Person or Organization, not a SoftwareApplication. I guess that here there is an attempt to represent the fact that cwltool was launched by WfExS-backend, but elsewhere in the crate this kind of relationship is modeled with softwareRequirements. So I would remove the agent entry, change the instrument of the action to WfExS-backend and list cwltool as a software requirement (this was discussed at the meeting yesterday).

  • docker:// URIs are used as contextual entity @ids in several places. I don't think this is forbidden by the RO-Crate spec, but the examples you find there are usually http(s) and mailto. If less common schemes are used, I think the crate should include a README to help the user understand what they are about and how to interact with them.

  • The sha256 term is not in the https://w3id.org/ro/crate/1.1/context, but we have it in the workflow-run ro-terms context. To bring in the definition, change the @context entry as follows:

"@context": [
    "https://w3id.org/ro/crate/1.1/context",
    "https://w3id.org/ro/terms/workflow-run"
]

cosifer-cwl_staged

  • Since the only action is the consolidation, which is not a workflow execution, this crate should not conformsTo Workflow Run Crate (https://w3id.org/ro/wfrun/workflow/0.2).

  • The root dataset should to link to the CreateAction instance via mentions.

  • The crate contains "dangling" PropertyValue entities, i.e., not listed in any action's object. The fact that they don't appear in the action's object is normal since what's described is not the workflow run but its consolidation. Those PropertyValue entries should be removed. The same goes for the "inputs/data_matrix.csv" File: it should be removed since it's not involved in the described process. Is there any other reason to include them? Does WfExS use such "staged" RO-Crates to run the workflow, reading parameter settings from them? If so, this should be explained in the README and in the paper.

cosifer-nxf_provenance

  • The workflow lists nextflow.config in hasPart, but hasPart should only list the tools orchestrated by the workflow. nextflow.config should be listed in the action's "object" (see referencing configuration files). hasPart in the workflow can be omitted altogether.

cosifer-nxf_staged

  • This crate does not have a CreateAction, so it should not conformsTo https://w3id.org/ro/wfrun/process/0.2 nor to https://w3id.org/ro/wfrun/workflow/0.2. It's a plain Workflow RO-Crate.

  • Same consideration as cosifer-cwl_staged for dangling entities.

wombat-pipelines_provenance

  • Some of the entities listed in the workflow's hasPart seem to be configuration files rather than tools, see note above for cosifer-nxf_provenance.

  • Some of the action's object point to Intangible entities, which we are not part of the model. As discussed at the meeting yesterday, this was an attempt to represent a null value. We agreed to add "valueRequired": "False" in the FormalParameter and omit the value from the object listing instead. Ideally, however, we should be able to explicitly state that a defaultValue is null in the FormalParameter and to express that a value was set to null in a PropertyValue for languages that allow this, but how? @stain can Javascript null be used directly in RO-Crate (e.g., "value": null)?

  • There are some type mismatches between FormalParameters and corresponding PropertyValues. For instance, workflow/main.nf#param:run_statistics has an additionalType of Integer but the corresponding PropertyValue has a boolean value.

nfcore-rnaseq_provenance

Same issues as wombat-pipelines_provenance, plus:

  • This crate uses git+https:// and s3:// schemes for data entities. What's the consumer supposed to do with them? Considerations are similar to those made for docker:// above, but these are data entities, so the consumer should know how to retrieve them.

  • There's a Collection that has an Intangible as its mainEntity. The mainEntity should be an entry point to a composite dataset (e.g. an index file), as explained in Representing multi-file objects. If such a "special" file exists in the collection it should be listed both as the mainEntity and under hasPart.

Wetlab2Variations_CWL_provenance

  • This crate uses trs:// schemes for data entities. Same consideration as those made for git+https:// etc.

  • The #e0b54cc1-05bb-46cd-9f62-b44293f4c053 collection has a mainEntity that's not listed in the collection's hasPart

@jmfernandez
Copy link
Contributor Author

I have revised the main issues. I still have to revise the other ones, and figure out the best way to add and semantically relate a README file describing the meaning and usage of the different schemes to the generated RO-Crates

@simleo
Copy link
Collaborator

simleo commented Sep 21, 2023

The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.

Other issues I've found with the latest changes:

  • The root dataset should not have an "about": {"@id": "README.md"}; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an "about": {"@id": "./"}.

  • The entities used to represent Docker images link to a JSON file via hasPart. I think this should be subjectOf instead, though it's redundant since the entities that represent the JSON files have an about pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?

@jmfernandez
Copy link
Contributor Author

The biggest issue with the recent changes is that CreateAction instances have instrument properties with multiple values. CreateAction instances need to have a single instrument, pointing to the application used to perform the action. In particular, the action that represents the workflow run must have the workflow itself as its instrument. This is crucial to reconstruct the mapping between the action's actual values and the formal parameters. If you wish to list dependencies somehow, it's better to include them in the application's SoftwareRequirements as you have been doing all along. Multiple values for instrument break runcrate report BTW.

I understand it, I have applied the change. The reason I added multiple instruments to the CreateAction is that I realized a workflow can be run in different containerization modes, so fully describing the workflow execution (not the workflow itself) requires either adding all the software requirements under instrument, declaring all the instruments together in a collection and adding the collection under instrument or declaring them in a different place (although not completely correct, under softwareAddOn). For instance, an nf-core workflow can be run either using conda mode (so the dependencies should be linked in some place) or singularity (the used containers should be pointed out), plus other tangential software execution details, like a queue system.

So, my conclusion is that the containers and the container engine should not appear under softwareRequirements of the computational workflow as such, as they could not appear in a different execution. So, in the mean time, I'm declaring them under softwareAddOn.

So, in the long term, I think we should distinguish between the workflow as SoftwareSourceCode (the targetPlatform is the workflow engine) and the instantiated workflow as SoftwareApplication (the softwareRequirements are the workflow engine, the containers, etc...). What do you think?

Other issues I've found with the latest changes:

* The root dataset should not have an `"about": {"@id": "README.md"}`; it's the readme that is about the crate, not the other way around, so the "README.md" entity should have an `"about": {"@id": "./"}`.

Thanks! I didn't realize I put the relation in the wrong way.

* The entities used to represent Docker images link to a JSON file via `hasPart`. I think this should be `subjectOf` instead, though it's redundant since the entities that represent the JSON files have an `about` pointing to the Docker image entities. By the way, what are these JSON files? Do they conform to some standard for describing images / containers?

These JSON are metadata gathered by WfExS-backend, and depending on whether Singularity/Apptainer or Docker/Podman have one or another format. They help WfExS-backend to identify when the original container and the contents from the cache (or RO-Crate) do not match. I have just added a couple of paragraphs to the automatically included README.md files, giving some details.

Following with your last questions, when docker or podman modes have been used to run the workflow, the metadata of the images can be obtained using docker inspect imagetag or podman inspect imagetag. This metadata is an array describing each layer of the image, and is preserved under the manifests key. The other keys keep digests and an id which is stable with docker save + docker load operations.

When Singularity/Apptainer mode is used for the workflow execution, container images can come from either http requests or from docker registries. Any of them are materialized by singularity pull, but that command does not preserve the metadata from docker registries, like the original id and original layers of the container. So, for singularity images created from docker ones the code asks through REST API to get the original container image metadata from the registry.

So, although part from those JSON files contain original metadata, they are augmented with additional details.

@simleo
Copy link
Collaborator

simleo commented Sep 25, 2023

I think distinguishing between "just code" and actual running application would be very hard at this point, since the whole model (and the tooling) is based on code/application being on the prospective part and actions on the retrospective part. Even at the lowest level of Process Run Crate it says that the application's type should include SoftwareApplication, SoftwareSourceCode or ComputationalWorkflow. The main problem with softwareAddOn is that it's still on the prospective side (in SoftwareApplication). We need a way to associate the container image with the action. Can you try using the current ContainerImage proposal? For instance, for cosifer-cwl_provenance it should look like:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "#cosifer-image"
},
{
    "@id": "#cosifer-image",
    "@type": "ContainerImage",
    "additionalType": "DockerImage",
    "registry": "docker.io",
    "name": "tsenit/cosifer",
    "tag": "b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}

We could add URL to the range of containerImage, so you can also do:

{
    "@id": "#e0d55b35-b042-420e-8cf3-c8424644f17b",
    "@type": "CreateAction",
    "containerImage": "docker://tsenit/cosifer:b4d5af45d2fc54b6bff2a9153a8e9054e560302e"
}

Regarding the JSON files about the container images, the container images should not refer to them via hasPart.

@jmfernandez jmfernandez force-pushed the wfexs_backend_examples branch from cd6a83f to 547a5f1 Compare October 9, 2023 23:55
@jmfernandez
Copy link
Contributor Author

I have updated all the examples, so they are now using ContainerImage. Also, they should be reflecting the cases where the original source of the container is a Docker registry, but the tool used to materialize it was singularity/apptainer

@simleo
Copy link
Collaborator

simleo commented Oct 11, 2023

Looking good for the most part. In Wetlab2Variations_CWL_provenance, ContainerImage entities have @ids that are neither URIs nor strings starting with #. This should not happen since these are contextual entities. For some reason it seems that the naming scheme used in other crates is not applied here.

Other remarks (for all crates):

  • I would remove softwareAddOn and put everything under softwareRequirements
  • Since you seem to have the hashes for the container images, you should list them using the sha256 property. E.g.:
{
    "@id": "docker://docker.io/node:slim",
    "@type": [
        "ContainerImage",
        "SoftwareApplication"
    ],
    "additionalType": "DockerImage",
    "applicationCategory": "https://www.wikidata.org/wiki/Q51294208",
    "name": "docker.io/node",
    "operatingSystem": "linux",
    "processorRequirements": "amd64",
    "registry": "docker.io",
    "softwareRequirements": {
        "@id": "https://apptainer.org/"
    },
    "softwareVersion": "library/node@sha256:dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e",
    "tag": "slim",
    "sha256": "dc1906714d1993d291e1e7b5f236291236b0a0b6dfacdb164e4a9ea44d09c52e"
}

@simleo
Copy link
Collaborator

simleo commented Oct 17, 2023

Version 0.3 of the profiles, which include the specs on container images, has been released, so you can change the conformsTo entries accordingly.

ro-crate-py 0.9.0 has also been released and includes ResearchObject/ro-crate-py#162, so you can use it to add the whole workflow-run context.

@jmfernandez
Copy link
Contributor Author

Nice! I'm doing it along these days!

@simleo
Copy link
Collaborator

simleo commented Oct 26, 2023

Merging to give a home to the examples. Remaining updates can be done in a subsequent PR

@simleo simleo merged commit 2b31a8e into ResearchObject:main Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants