Improve External User Documentation #125

Open · wants to merge 3 commits into base: `master`
100 changes: 99 additions & 1 deletion .secrets.baseline
@@ -258,7 +258,105 @@
"line_number": 12,
"is_secret": false
}
],
"subworkflows/main/dockers.json": [
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "2416fefd0c93bc214e888dfc8a288f0635e4d8e8",
"is_verified": false,
"line_number": 2,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "1c7d94661f7b215d70638e4d700634218c40ad8e",
"is_verified": false,
"line_number": 3,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "2f14c08e97808c63b5986a9fb86ce6d13edccfa1",
"is_verified": false,
"line_number": 4,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "6afb3835f0666e4e5079a0e0aab6555f1d6d5e94",
"is_verified": false,
"line_number": 6,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "9ba9785e9d5ad470fab3a3fac4b201f18ac35876",
"is_verified": false,
"line_number": 7,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "f4ead6e323ad7fd3419a29473fd1dec397388022",
"is_verified": false,
"line_number": 8,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "b287314600ba074479581bb919a8fb221af9d367",
"is_verified": false,
"line_number": 9,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "30a720c4c6f44684244443cbf019aacf4c45060e",
"is_verified": false,
"line_number": 11,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "7a675a58377346fe4a9f4e3a3dd404ce2437dab2",
"is_verified": false,
"line_number": 12,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "5ab41bcfbe0bda5dd064758836983708791cf055",
"is_verified": false,
"line_number": 15,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "3d055e679fe09c30807ac3a0b686cfc43674d73b",
"is_verified": false,
"line_number": 16,
"is_secret": false
},
{
"type": "Hex High Entropy String",
"filename": "subworkflows/main/dockers.json",
"hashed_secret": "22b5cf6dab89595dbc5c5e45ca65362130b9f012",
"is_verified": false,
"line_number": 17,
"is_secret": false
}
]
},
-  "generated_at": "2023-04-10T18:42:13Z"
+  "generated_at": "2023-07-17T14:08:39Z"
}
252 changes: 245 additions & 7 deletions README.md
@@ -12,16 +12,110 @@ a harmonized BAM file and an sqlite database of various metrics collected.
This repository is licensed under Apache License Version 2.0. Exceptions are code blocks licensed under CC-BY-SA-4.0. <br>
The CC-BY-SA-4.0 code blocks are denoted by `/begin <AUTHOR> CC-BY-SA-4.0` to `/end <AUTHOR> CC-BY-SA-4.0`.

-## Environment
+---
+## gdc-dnaseq-prealn

-### Pre-requisites
+### Dockers

-Workflow development requires the following programs installed:
+[bio_alpine](https://github.com/NCI-GDC/bio-containers)

-- `docker`
-- `just`
-- `python>=3.8`
-- `cwltool==3.1.20230213100550`
+[bio-client](https://github.com/NCI-GDC/bio-client)

[gatk](https://github.com/NCI-GDC/gatk-docker)

[merge_sqlite](https://github.com/NCI-GDC/merge-sqlite)

[picard](https://github.com/NCI-GDC/picard-docker)

[picard_metrics_sqlite](https://github.com/NCI-GDC/picard_metrics_sqlite)

[readgroup_json_db](https://github.com/NCI-GDC/readgroup_json_db)

[samtools](https://github.com/NCI-GDC/samtools-docker)

[samtools_metrics_sqlite](https://github.com/NCI-GDC/samtools_metrics_sqlite)

## gdc-dnaseq-aln

### Dockers

[bam_readgroup_to_json](https://github.com/NCI-GDC/bam_readgroup_to_json)

[bio_alpine](https://github.com/NCI-GDC/bio-containers)

[bio-client](https://github.com/NCI-GDC/bio-client)

[biobambam](https://github.com/NCI-GDC/biobambam-docker)

[bwa](https://github.com/NCI-GDC/bwa-docker)

[fastq_cleaner](https://github.com/NCI-GDC/fastq_cleaner)

[fastqc](https://github.com/NCI-GDC/fastqc-docker)

[fastqc_db](https://github.com/NCI-GDC/fastqc_db)

[fastqc_to_json](https://github.com/NCI-GDC/fastqc_to_json)

[gatk](https://github.com/NCI-GDC/gatk-docker)

[json_to_sqlite](https://github.com/NCI-GDC/json-to-sqlite)

[merge_sqlite](https://github.com/NCI-GDC/merge-sqlite)

[picard](https://github.com/NCI-GDC/picard-docker)

[picard_metrics_sqlite](https://github.com/NCI-GDC/picard_metrics_sqlite)

[readgroup_json_db](https://github.com/NCI-GDC/readgroup_json_db)

[samtools](https://github.com/NCI-GDC/samtools-docker)

[samtools_metrics_sqlite](https://github.com/NCI-GDC/samtools_metrics_sqlite)

## External Users (`subworkflows/main/gdc_dnaseq_main_workflow.cwl`)

[bam_readgroup_to_json](https://github.com/NCI-GDC/bam_readgroup_to_json)

[biobambam](https://github.com/NCI-GDC/biobambam-docker)

[bwa](https://github.com/NCI-GDC/bwa-docker)

[fastq_cleaner](https://github.com/NCI-GDC/fastq_cleaner)

[fastqc](https://github.com/NCI-GDC/fastqc-docker)

[fastqc_db](https://github.com/NCI-GDC/fastqc_db)

[fastqc_to_json](https://github.com/NCI-GDC/fastqc_to_json)

[gatk](https://github.com/NCI-GDC/gatk-docker)

[json_to_sqlite](https://github.com/NCI-GDC/json-to-sqlite)

[merge_sqlite](https://github.com/NCI-GDC/merge-sqlite)

[picard](https://github.com/NCI-GDC/picard-docker)

[picard_metrics_sqlite](https://github.com/NCI-GDC/picard_metrics_sqlite)

[readgroup_json_db](https://github.com/NCI-GDC/readgroup_json_db)

[samtools](https://github.com/NCI-GDC/samtools-docker)

[samtools_metrics_sqlite](https://github.com/NCI-GDC/samtools_metrics_sqlite)

----

## Pre-requisites
- Docker
- [just](https://github.com/casey/just)
- `jq` (recommended)
- Activated python3.8+ virtualenv

`just init` will install the correct versions of `cwltool`, `jinja-cli`, and `pre-commit`; be sure to have a Python 3.8+ virtual environment active.

`just build-all` will build and validate all workflows in the repo.

These workflows are developed and tested on Ubuntu.

@@ -31,18 +125,61 @@ The CWL is developed for `cwltool` version `3.1.20230213100550`
<br>
https://www.commonwl.org/ <br>

----

## Repository Structure

### Top-level

The top level of the workflow repository should contain a `justfile`, a `build.sh` script, and a `.gitlab-ci.yml` config.

Small CWL scripts not specific to any workflow should be stored in a top-level `tools` directory. These can include general shell commands and `CommandLineTool` wrappers for bioinformatics tools.

__NOTICE__: This `tools` directory is copied to the root of the Docker image, so relative path references to CWL scripts in `tools` remain valid in both the repository filesystem and the Docker image filesystem.

Eventually these tooling CWL scripts will be stored in a common library.

The `build.sh` script is used to automatically build and publish images in a CI environment.

The `justfile` contains commands to locally build workflow images; run `just -l` for a full list of commands.

Individual workflow CWL scripts should be stored within top-level directories named after the workflow.

### Workflow Subdirectory

Each workflow directory should also contain the template `justfile` and `Dockerfile`.

The `ENTRY_CWL` variable should be updated with the path to the main workflow CWL script, relative to the workflow directory.

This CWL script is used as the argument to the `cwltool --pack` command, but can be overridden with `just pack ENTRY_CWL=...`.

The CWL scripts comprising a workflow can be stored under any manner of directory structure.

Ideally, any CWL script referenced by another file lives in the same directory as, or a subdirectory of, the calling script (essentially, never traverse up a directory, only sideways or down).

This allows an entire subdirectory to be moved, if needed, without updating any references it contains.
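As a sketch, this "never traverse up" convention can be checked mechanically by flagging any reference path that starts with `../`. The file names below are invented for illustration and do not come from this repository:

```python
# Hypothetical lint for CWL "run:" references: flag any path that
# traverses up out of the calling script's directory.
# File names here are illustrative, not from this repository.
references = {
    "workflow/subworkflows/align.cwl": ["tools/bwa.cwl", "sort.cwl"],
    "workflow/subworkflows/metrics.cwl": ["../shared/merge.cwl"],
}

violations = [
    (caller, target)
    for caller, targets in references.items()
    for target in targets
    if target.startswith("../")
]

for caller, target in violations:
    print(f"{caller} references {target}, which traverses up a directory")
```

A reference that only goes sideways or down passes; the single upward reference is reported.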

## GDC Users

There are two entry points:
- harmonization of FASTQ and BAM data: `gdc-dnaseq-aln-cwl/gdc_dnaseq.bamfastq_align.workflow.cwl`.
- trusted pre-aligned data: `gdc-dnaseq-prealn-cwl/gdc_dnaseq.aligned_reads.workflow.cwl`.

----

## For external users

The repository has only been tested on GDC data and in the particular environment in which GDC runs. Some of the reference data required to run the workflow in production is hosted in [GDC reference files](https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files "GDC reference files"). For any questions related to GDC data, please contact the GDC Help Desk at support@nci-gdc.datacommons.io.

The entrypoint CWL workflow for external users is `subworkflows/main/gdc_dnaseq_main_workflow.cwl`.

With the updates to the workflow builds, the easiest way to run this workflow externally is to:

1. Pack the workflow into a single JSON file via `cwltool --pack subworkflows/main/gdc_dnaseq_main_workflow.cwl > gdc_dnaseq_main_workflow.tmp`
2. Update the `dockerPull` template strings: `jinja -u 'strict' -d gdc-dnaseq-aln-cwl/dockers.json gdc_dnaseq_main_workflow.tmp > gdc_dnaseq_main_workflow.json`

Alternatively, fork this repo and make any necessary updates locally.

An example input yaml is available here: `example/gdc_dnaseq_main_workflow_example_input.yaml`.

### Inputs
@@ -118,3 +255,104 @@ An example input yaml is available here: `example/gdc_dnaseq_main_workflow_example_input.yaml`.
| `output_bam` | `File` | harmonized and indexed BAM file |
| `sqlite` | `File` | sqlite file containing metrics data |

Further documentation is available [here](https://docs.google.com/document/d/17NFwGvn4vMEXZV9Qmg30BqAcKrdxBYCOB4pFdkkwIIo/edit#)

# CWL Workflow Development

## Pre-requisites

- Docker
- [just](https://github.com/casey/just)
- `jq` (recommended)

`just init` will install the correct versions of `cwltool`, `jinja-cli`, and `pre-commit`; be sure to have a Python 3.8+ virtual environment active.

`just build-all` will build and validate all workflows in the repo.

## Just

The `just` utility is a command runner replacement for `make`.

It has various improvements over `make`, including the ability to list available commands with `just -l`:

### Root Justfile

```
Available recipes:
build WORKFLOW # Builds individual workflow
build-all # Builds all docker images for each directory with a justfile
init
pack WORKFLOW # Builds Docker for WORKFLOW and prints packed JSON
```

The root `justfile` provides recipes for Dockerizing workflows locally, while workflow-level `justfiles` provide recipes for building the workflow.

### Workflow Justfile

The workflow-level `justfile` requires the `ENTRY_CWL` path be updated.

```
# justfile
ENTRY_CWL := "workflow.cwl"
```

Certain recipes check whether the `ENTRY_CWL` file exists and show an error message if it does not.

```
Available recipes:
get-dockers # Formats and prints all Dockers used in workflow
get-dockers-template # Prints all dockerPull declarations in unformatted workflow
inputs # Print template input file for workflow
pack # Pack and apply Jinja templating. Creates cwl.json file
validate # Validates CWL workflow
```

Some important commands for workflow development:

`just validate` will run cwltool's validation and show any errors in the CWL.

`just inputs` will output a template input file for the workflow.

`just get-dockers-template` will pack the workflow to JSON and print all unique dockerPull declarations.

This command is useful for building the `dockers.json` file or finding un-templated image strings.
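As a sketch of what this surfaces, one can walk a packed CWL JSON document, collect every `dockerPull` value, and separate the ones still carrying Jinja template markers. The JSON fragment below is invented for illustration, not taken from the real packed workflow:

```python
import json

def find_docker_pulls(node):
    """Recursively collect dockerPull values from a packed CWL JSON tree."""
    pulls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "dockerPull":
                pulls.append(value)
            else:
                pulls.extend(find_docker_pulls(value))
    elif isinstance(node, list):
        for item in node:
            pulls.extend(find_docker_pulls(item))
    return pulls

# Invented fragment resembling a packed workflow, for illustration only.
packed = json.loads("""
{"$graph": [
  {"requirements": [{"class": "DockerRequirement",
                     "dockerPull": "{{ docker_repository }}/bwa:{{ bwa }}"}]},
  {"requirements": [{"class": "DockerRequirement",
                     "dockerPull": "docker.osdc.io/ncigdc/samtools:1.1"}]}
]}
""")

pulls = sorted(set(find_docker_pulls(packed)))
templated = [p for p in pulls if "{{" in p]      # still need dockers.json entries
untemplated = [p for p in pulls if "{{" not in p]  # already-rendered images
print(templated)
# ['{{ docker_repository }}/bwa:{{ bwa }}']
```

Templated entries map onto keys that `dockers.json` must provide; anything left untemplated is a hard-coded image string worth reviewing.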

`just get-dockers` will pack the workflow to JSON and render the Docker template strings.

No template strings should remain after rendering.

## dockerPull Jinja Templates

For `CommandLineTool` workflows utilizing `dockerPull`, the Docker image should be specified in CWL as a Jinja-compatible template string:

`dockerPull: "{{ docker_repository }}/image_name:{{ image_name }}"`

__NOTICE__: Double quotes are required.

Within each workflow's `justfile` is the `just pack` recipe, which:

1. Packs the CWL workflow into a temporary JSON file
2. Uses `jinja-cli` and the `dockers.json` file to replace each template string
3. Saves the result to `cwl.json`

------

### `dockers.json`

```json
{
"docker_repository": "docker.osdc.io/ncigdc",
"image_name": "abcdef"
}
```

This JSON file combined with the example string above will result in a final string:

`dockerPull: "docker.osdc.io/ncigdc/image_name:abcdef"`

in the packed cwl.json file.
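The substitution itself is ordinary Jinja string templating. A minimal stand-in, using a regex instead of the real `jinja-cli` just to show the mechanic, looks like:

```python
import re

def render(template: str, values: dict) -> str:
    # Minimal stand-in for Jinja rendering: replace {{ key }} with its value.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)

dockers = {"docker_repository": "docker.osdc.io/ncigdc", "image_name": "abcdef"}
print(render("{{ docker_repository }}/image_name:{{ image_name }}", dockers))
# docker.osdc.io/ncigdc/image_name:abcdef
```

The real pipeline uses `jinja-cli` with `-u 'strict'`, which additionally fails on any placeholder missing from `dockers.json`; this sketch would instead raise a `KeyError`.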

------

While this prevents the CWL from being used directly, it enables easy updating of multiple Docker images for GPAS, and allows external users to supply their own images/tags.
