Merge pull request #3 from IGVF/dev
Dev
emattei authored Sep 22, 2024
2 parents 4eb8247 + 6bd590a commit 1702486
Showing 6 changed files with 98 additions and 193 deletions.
56 changes: 56 additions & 0 deletions README.md
@@ -7,6 +7,7 @@ Welcome to the IGVF Single Cell Pipeline repository.
- [Introduction](#introduction)
- [Documentation](#documentation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)

@@ -31,6 +32,61 @@ To run the pipeline on Terra/Anvil go to the [dockstore page](https://dockstore.

To run the pipeline locally you can refer to these [instructions](docs/install-pipeline-locally.org).

## Input Files

The following input files are required to run the pipeline:

- **read1_atac** *(Type: Array[File])*: First read file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.
- **read2_atac** *(Type: Array[File])*: Second read file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.
- **fastq_barcode** *(Type: Array[File])*: Barcode file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.

- **read1_rna** *(Type: Array[File])*: First read file for RNA sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing ATAC, use `[]`.
- **read2_rna** *(Type: Array[File])*: Second read file for RNA sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing ATAC, use `[]`.
- **whitelist_atac** *(Type: Array[File])*: Whitelist file for ATAC barcodes. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL. If you are only processing RNA, use `[]`.
- **whitelist_rna** *(Type: Array[File])*: Whitelist file for RNA barcodes. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL. If you are only processing ATAC, use `[]`.
- **seqspecs** *(Type: Array[File])*: Seqspec file describing the sequencing data. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL.
- **chemistry** *(Type: String)*: Chemistry used in the sequencing process.
- **prefix** *(Type: String)*: Prefix for output files.
- **genome_tsv** *(Type: File)*: TSV-formatted file containing the links to chromap and kb inputs and other reference annotations. For human and mouse you can use the following:

| Genome | URL |
|--------|-----|
| hg38 | `gs://broad-buenrostro-pipeline-genome-annotations/IGVF_human_v43/IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv` |
| mm39 | `gs://broad-buenrostro-pipeline-genome-annotations/IGVF_mouse_v32/IGVF_mouse_v32_Mus_musculus_genome_files_mm39_v32.tsv` |
| mm39_cast (beta) | `gs://broad-buenrostro-pipeline-genome-annotations/CAST_EiJ_v1/Mus_musculus_casteij_genome_files.tsv` |
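
As a concrete starting point, here is a minimal sketch of an inputs file for an RNA-only local run. It assumes the workflow name `multiome_pipeline` from `docs/example_input.json`; every path, the prefix, and the chemistry string are hypothetical placeholders to replace with your own values.

```bash
# Sketch of a minimal RNA-only inputs file. All paths, the prefix, and
# the chemistry value below are hypothetical placeholders.
cat > inputs.json <<'EOF'
{
  "multiome_pipeline.chemistry": "10x_v3",
  "multiome_pipeline.prefix": "sample01",
  "multiome_pipeline.genome_tsv": "gs://broad-buenrostro-pipeline-genome-annotations/IGVF_human_v43/IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv",
  "multiome_pipeline.seqspecs": ["sample01_seqspec.yaml"],
  "multiome_pipeline.read1_rna": ["sample01_R1.fastq.gz"],
  "multiome_pipeline.read2_rna": ["sample01_R2.fastq.gz"],
  "multiome_pipeline.whitelist_rna": ["rna_whitelist.txt"],
  "multiome_pipeline.read1_atac": [],
  "multiome_pipeline.read2_atac": [],
  "multiome_pipeline.fastq_barcode": [],
  "multiome_pipeline.whitelist_atac": []
}
EOF
```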

## Output Files

### RNA
| Name | Suffix | Description |
|--------|-----|--------|
| `rna_mtx_tar` | .mtx.tar.gz | [Raw] Tarball containing four separate count matrices in mtx format: Spliced, Unspliced, Ambiguous, and Total |
| `rna_mtxs_h5ad` | .rna.align.kb.{genome}.count_matrix.h5ad | [Raw] h5ad containing four separate count matrices: Spliced, Unspliced, Ambiguous, and Total |
| `rna_aggregated_counts_h5ad` | .cells_x_genes.total.h5ad | [Raw] Aggregated (Ambiguous + Spliced + Unspliced) count matrix in h5ad format |
| `rna_barcode_metadata` | .qc.rna.{genome}.barcode.metadata.tsv | [Raw] Per barcode alignment statistics file |
| `rna_log` | .rna.log.{genome}.txt | [Raw] Log file from kb |
| `rna_kb_output` | .rna.align.kb.{genome}.tar.gz | [Raw] Tarball containing all the logs and bus files generated from kb |

### ATAC

| Name | Suffix | Description |
|--------|-----|--------|
| `atac_bam` | .atac.align.k4.{genome}.bam | [Raw] Aligned bam file from Chromap |
| `atac_bam_log` | .atac.align.k4.{genome}.log.txt | [Raw] Log file from aligner |
| `atac_fragments` | .tsv.gz | [Raw] Fragment file from Chromap |
| `atac_fragments_index` | .tsv.gz.tbi | [Raw] Fragment file index from Chromap |
| `atac_chromap_barcode_metadata` | .chromap.barcode.metadata.tsv | [Raw] Per barcode alignment statistics file from Chromap |
| `atac_snapatac2_barcode_metadata` | .snapatac2.barcode.metadata.tsv | [Filtered] Per barcode statistics file from SnapATAC2 |

### Joint

| Name | Suffix | Description |
|--------|-----|--------|
| `joint_barcode_metadata` | .joint.barcode.metadata.{genome}.csv | [Filtered] Joint per barcode statistics file |
| `csv_summary` | .csv | [Filtered] CSV summary report |
| `html_summary` | .html | [Filtered] HTML summary report |
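
As an example of working with these outputs, the ATAC fragment file is block-gzipped and ships with a tabix index (`.tsv.gz.tbi`), so genomic regions can be queried directly. This is a sketch assuming `tabix` (htslib) is installed; the file name is a hypothetical combination of a prefix and the suffix above.

```bash
# Query ATAC fragments overlapping a region of interest; the file name
# (sample01.fragments.tsv.gz) is a placeholder.
tabix sample01.fragments.tsv.gz chr1:1000000-1100000 | head
```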


## Contributing

We welcome contributions to improve this pipeline. Please fork the repository and submit a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
37 changes: 0 additions & 37 deletions dockerfiles/share_task_correct_fastq.dockerfile

This file was deleted.

13 changes: 13 additions & 0 deletions docs/example_input.json
@@ -0,0 +1,13 @@
{
  "multiome_pipeline.chemistry": "${this.Chemistry}",
  "multiome_pipeline.fastq_barcode": "${this.ATAC_barcode}",
  "multiome_pipeline.genome_tsv": "${this.genome_tsv}",
  "multiome_pipeline.prefix": "${this.WUSTL_coronary_id}",
  "multiome_pipeline.read1_atac": "${this.ATAC_fastq_R1}",
  "multiome_pipeline.read1_rna": "${this.RNA_fastq_R1}",
  "multiome_pipeline.read2_atac": "${this.ATAC_fastq_R2}",
  "multiome_pipeline.read2_rna": "${this.RNA_fastq_R2}",
  "multiome_pipeline.seqspecs": "${this.seqspec}",
  "multiome_pipeline.whitelist_atac": "${this.Barcode_inclusion_list_ATAC}",
  "multiome_pipeline.whitelist_rna": "${this.Barcode_inclusion_list_RNA}"
}
179 changes: 26 additions & 153 deletions docs/install-pipeline-locally.org
@@ -1,41 +1,25 @@
* Introduction
* IGVF Single-Cell Pipeline Setup and Execution Documentation

The default configuration for the IGVF single-cell pipeline is a docker based Terra
cloud environment. Docker tends not to be available on HPC
environments so I wanted to use singularity instead. Though as of yet
I haven't figured out how to adjust the cromwell configuration for the
queuing systems I have access to and have just been running individual
experiments on a single reasonably sized host.
The default configuration for the IGVF single-cell pipeline operates in a Docker-based Terra cloud environment. However, Docker is not always available in high-performance computing (HPC) environments. To accommodate this, Singularity can be used as an alternative. This documentation outlines the setup for running the pipeline on a single host using Singularity. As of now, the Cromwell configuration for specific queuing systems has not been finalized, and individual experiments are being executed on a standalone machine.

* Define some variables
** Define some variables

#+name: default-table
| root_dir | /home/diane/proj |
| root_dir | /home/user/proj |
| single_cell_pipeline_git | https://github.com/IGVF/single-cell-pipeline |
| diane_conda_url | https://woldlab.caltech.edu/~diane/conda/ |
| conda_command | /home/diane/bin/micromamba |
| conda_command | /home/user/bin/micromamba |
| environment | the IGVF single-cell pipeline |

These variables are accessed as ${defaults["variable_name"]} in the shell
scripts below.
These variables can be accessed using ${defaults["variable_name"]} in the shell scripts provided.
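
A minimal sketch of how the table is consumed: org-babel passes a two-column table to a bash block as an associative array keyed on the first column, so the values can be echoed directly. The block name below is illustrative.

#+name: show-defaults
#+begin_src bash :var defaults=default-table
# Print a couple of the defaults to confirm the table is wired up.
echo "root_dir:      ${defaults["root_dir"]}"
echo "conda_command: ${defaults["conda_command"]}"
#+end_src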

* Set up the conda environment
** Set up the conda environment

It is very important to make sure you get a recent version of
cromwell; conda has some quite old versions available, and atomic
workflow needs something relatively recent. (The version 40 of
cromwell I accidentally installed on my first try definitely did not
work).
It is crucial to use a recent version of Cromwell, as older versions available on Conda may not be compatible with the pipeline's requirements. Version 40, for example, is not supported.
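
A quick sanity check after installation (a sketch; the only version requirement stated here is that versions as old as 40 do not work):

#+begin_src bash
# Confirm the installed Cromwell is recent enough; version 40 is known not to work.
cromwell --version
#+end_src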

This block should install most of the dependencies to run
single-cell-pipeline into a [[https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html][conda environment]]. Conda-forge recommends
[[https://mamba.readthedocs.io/en/latest/user_guide/mamba.html][mamba]] and I found [[https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html][micromamba]] to be easiest to install, though it is a
bit limited compared to conda or mamba.

You will need to replace the ${defaults["conda_command"]} and
${defaults["environment"]} with appropriate settings for your
environment.
The following block installs most of the necessary dependencies for the IGVF single-cell pipeline into a Conda environment. Micromamba is recommended for faster installation, though it is more limited than Conda or Mamba.

Make sure to replace ${defaults["conda_command"]} and ${defaults["environment"]} with your own settings.

#+name: create-single-cell-pipeline-environment
#+begin_src bash :var defaults=default-table
@@ -69,8 +53,7 @@ make sure the main components were installed.
| gsutil | version: | 5.3 |
| Synapse | Client | 4.4.0 |

Next we check out the IGVF single-cell pipeline repository. As of the time I am
writing this, we are using release v1.
Check out the IGVF single-cell pipeline repository. As of Sep 2024, this is release v1.

#+name: checkout-atomic-workflow
#+begin_src bash :var defaults=default-table :async yes :results none
@@ -79,30 +62,12 @@
popd
#+end_src
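
Since release v1 is the version in use, it may help to pin the checkout to that tag explicitly (a sketch; the clone directory and the tag name ~v1~ are assumptions based on the repository URL and the release mentioned above):

#+begin_src bash
# Pin the working tree to the assumed v1 release tag.
git -C single-cell-pipeline checkout v1
#+end_src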

* Configuring cromwell
** Configuring cromwell

This configuration file is the magic that tells cromwell to use
singularity instead of docker. This configuration file is for running
locally on a single machine. I need to pass
~-Dconfig.file=cromwell.conf~ when launching the pipeline to get
cromwell to use the configuration file. More adjustments are needed
to integrate with the local queuing system.
To use Singularity instead of Docker with Cromwell, the following configuration is required. The example below is for running the pipeline on a single machine. When launching the pipeline, pass the ~-Dconfig.file=cromwell.conf~ argument to Cromwell to use this configuration. Additional adjustments are needed for integration with local queuing systems.

I found the initial example in the singularity section of the cromwell
[[https://cromwell.readthedocs.io/en/latest/tutorials/Containers/#singularity][Containers]] documentation.

After some testing I discovered that synapse needs access to a
configuration file and possibly a cache directory to work correctly.

Without the configuration file the pipeline aborts when ~synapse get~
fails because of the authentication prompt. This happens because the
default behavior of singularity is to limit the containers to the
directory tree below where the container was started.

To resolve this you need to replace the ${HOME} paths in the
configuration below with your actual paths. Using the ~ to represent
home directories won't work, at least for the destination path as
singularity will then complain about needing an absolute path.
This configuration is based on the [[https://cromwell.readthedocs.io/en/latest/tutorials/Containers/#singularity][Containers]] documentation. Note that Synapse requires access to a configuration file and a cache directory to function properly. Ensure that ${HOME} paths in the configuration are replaced with absolute paths to avoid issues with Singularity.

#+name: cromwell-local-singularity
#+begin_src conf
@@ -129,25 +94,9 @@
}
#+end_src
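
To confirm the Synapse setup independently of Cromwell, the bind mounts can be exercised by hand. This is a sketch: the image name, the Synapse ID, and the /home/user paths are placeholders.

#+begin_src bash
# Bind the Synapse config file and cache directory into the container
# so "synapse get" can authenticate; all paths are illustrative.
singularity exec \
  --bind /home/user/.synapseConfig:/home/user/.synapseConfig \
  --bind /home/user/.synapseCache:/home/user/.synapseCache \
  pipeline.sif synapse get syn00000000
#+end_src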

* Configuring a local IGVF single-cell pipeline run

Terra takes values from the dataset tables and uses them to generate a
metadata json file for cromwell. When running outside of terra, one
will need to manually generate the json file.

I found it could be somewhat confusing to figure out what the proper
name for a variable should be. Since cromwell supports subworkflows,
it seems to use a dotted path to refer to a value in a
particular workflow.

If you have an example .json file for the current workflow and some of
the terra tables, then you should replace the contents of
${this.variable} with what's in the variable column
for the row you're working on.

This example is for a customized 10x multiome experiment. This was
tested using a version of single-cell pipeline from around August 16th.
** Running the IGVF Single-Cell Pipeline Locally

When running outside of Terra, the metadata .json file for Cromwell must be generated manually. Below is an example .json configuration for a customized 10x multiome experiment.

#+name: gm12878.json
#+begin_src json
@@ -206,94 +155,19 @@
}
#+end_src
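
Before launching a long run, the WDL and inputs can be validated together (a sketch assuming womtool, Cromwell's companion tool, is installed and the JSON above has been saved as gm12878.json):

#+begin_src bash
# Validate the workflow and the inputs JSON without running anything.
womtool validate \
  ../single-cell-pipeline/single_cell_pipeline.wdl \
  --inputs gm12878.json
#+end_src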

This example is for a parse splitseq RNA experiment, and comes from
before I figured out how to use synapse, so it uses local file names for
the source data.

I think I'd also made a slight change to the pipeline for it to better
support local files.

#+name: tiny-13a.json
#+begin_src json
{
  "single_cell_pipeline.trim_fastqs": true,
  "single_cell_pipeline.chemistry": "parse",
  "single_cell_pipeline.prefix": "106",
  "single_cell_pipeline.subpool": "13A",
  "single_cell_pipeline.pipeline_modality": "full",

  "single_cell_pipeline.genome_fasta": "/woldlab/loxcyc/home/diane/proj/genome/GRCm39-M32-male/GRCm38-IGVFI282QLXO.fasta.gz",

  "single_cell_pipeline.seqspecs": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/b01_13atiny_seqspec.yaml"],

  "single_cell_pipeline.whitelist_atac": ["737K-arc-v1-ATAC.txt.gz"],
  "single_cell_pipeline.read1_atac": [],
  "single_cell_pipeline.read2_atac": [],
  "single_cell_pipeline.fastq_barcode": [],
  "single_cell_pipeline.count_only": false,
  "single_cell_pipeline.atac_barcode_offset": 8,

  "single_cell_pipeline.whitelist_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/parse-splitseq-v2/CB1.txt", "/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/parse-splitseq-v2/CB23.txt"],
  "single_cell_pipeline.read1_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/igvf_b01/next1/B01_13Atiny_R1.fastq.gz"],
  "single_cell_pipeline.read2_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/igvf_b01/next1/B01_13Atiny_R2.fastq.gz"],
  "single_cell_pipeline.gtf": "/woldlab/loxcyc/home/diane/proj/genome/GRCm39-M32-male/M32-IGVFFI9744VSJF.gtf.gz",
  "single_cell_pipeline.rna.kb_strand": "forward",
  "single_cell_pipeline.rna.replacement_list": "r1_RT_replace.txt",

  "single_cell_pipeline.genome_name": "GRCm39-M32",
  "single_cell_pipeline.genome_tsv": "Mus_musculus_genome_files_mm39_v32.tsv",
  "single_cell_pipeline.whitelists_tsv": "permit_lists_urls.tsv",

  "single_cell_pipeline.check_read1_rna.disk_factor": 1,
  "single_cell_pipeline.check_read2_rna.disk_factor": 1,
  "single_cell_pipeline.check_read1_atac.disk_factor": 1,
  "single_cell_pipeline.check_read2_atac.disk_factor": 1,
  "single_cell_pipeline.check_fastq_barcode.disk_factor": 1
}
#+end_src

* Running cromwell

For me, I needed to manually add the ${atomic_workflows}/src/bash
directory to the path for a script stored in there. The
-Dconfig.file points to the contents of the cromwell-local-singularity
file described above.

My system's default umask is 022 and with cromwell I end up with
directories with the permissions 757 and for some reason that
generates error messages when trying to modify files or delete the
cromwell-execution directory.

I needed to reset my umask to 002 before running cromwell with

#+begin_src bash
umask 002
#+end_src

You will need to pass the path to the wdl script, the cromwell
configuration file, and the json configuration file as shown below.
** Running cromwell

To execute the pipeline, ensure the script and configuration files are properly referenced, including the Cromwell configuration file (cromwell.conf), the WDL script, and the generated JSON file.

#+begin_src bash
PATH=/woldlab/loxcyc/home/diane/proj/single-cell-pipeline/src/bash/:$PATH
PATH=/woldlab/loxcyc/home/user/proj/single-cell-pipeline/src/bash/:$PATH
cromwell -Dconfig.file=cromwell.conf run \
  ../single-cell-pipeline/single_cell_pipeline.wdl \
  -i tiny-13a.json
#+end_src

* Finding results with cromwell.
** Finding results with cromwell.

This is sufficiently annoying that the ENCODE DCC wrote
[[https://github.com/ENCODE-DCC/caper]] to try and make it easier to stage
results in an easier-to-understand directory structure.

Unfortunately I had trouble getting caper to work, and so started with
the simpler goal of getting cromwell to work.

Cromwell makes its own directory hierarchy under cromwell-executions/;
here's an example from a partial run of the atomic_workflows pipeline.

As an example subset, here's a bit of the directory tree from one
failed run.
Cromwell stores results in a nested directory structure under cromwell-executions/. To locate error or output logs, navigate to the appropriate subdirectory based on the workflow step. The example below shows a partial directory tree from a failed run:

- cromwell-executions
- single_cell_pipeline
@@ -325,10 +199,9 @@ failed run.
- call-check_read2_rna
- call-check_seqspec

I was usually looking for the stderr/stdout files in the various
workflow steps to try and find what had gone wrong with a run. It will
probably take some experimentation to get the configuration settings
correct for your environment.
To find specific files, use the following command:
~find cromwell-executions -name ${filename}~

This structure can take some trial and error to configure correctly for your environment. For easier results management, consider using tools like Caper. However, it is also possible to manage this manually as demonstrated above.

I was using ~find cromwell-executions -name ${filename}~ to find any
result files.
We extend our gratitude to [[https://github.com/detrout][@detrout]] for contributing the initial draft of this document.
4 changes: 2 additions & 2 deletions single_cell_pipeline.wdl
@@ -285,8 +285,8 @@ workflow multiome_pipeline {
# ATAC outputs
File? atac_bam = atac.atac_chromap_bam
File? atac_bam_log = atac.atac_chromap_bam_alignement_stats
File? atac_filter_fragments = atac.atac_fragments
File? atac_filter_fragments_index = atac.atac_fragments_index
File? atac_fragments = atac.atac_fragments
File? atac_fragments_index = atac.atac_fragments_index
File? atac_chromap_barcode_metadata = atac.atac_qc_chromap_barcode_metadata
File? atac_snapatac2_barcode_metadata = atac.atac_qc_snapatac2_barcode_metadata
