Merge pull request #3 from IGVF/dev
Dev
emattei authored Sep 22, 2024
2 parents 4eb8247 + 6bd590a commit 1702486
Showing 6 changed files with 98 additions and 193 deletions.
56 changes: 56 additions & 0 deletions README.md
@@ -7,6 +7,7 @@ Welcome to the IGVF Single Cell Pipeline repository.
- [Introduction](#introduction)
- [Documentation](#documentation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)

@@ -31,6 +32,61 @@ To run the pipeline on Terra/Anvil go to the [dockstore page](https://dockstore.

To run the pipeline locally you can refer to these [instructions](docs/install-pipeline-locally.org).

## Input Files

The following input files are required to run the pipeline:

- **read1_atac** *(Type: Array[File])*: First read file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.
- **read2_atac** *(Type: Array[File])*: Second read file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.
- **fastq_barcode** *(Type: Array[File])*: Barcode file for ATAC sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing RNA, use `[]`.

- **read1_rna** *(Type: Array[File])*: First read file for RNA sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing ATAC, use `[]`.
- **read2_rna** *(Type: Array[File])*: Second read file for RNA sequencing data. Accepted input formats are: `local path`, `https` URL, Google Storage (`gs`) URL, or a `Synapse ID`. If you are only processing ATAC, use `[]`.
- **whitelist_atac** *(Type: Array[File])*: Whitelist file for ATAC barcodes. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL. If you are only processing RNA, use `[]`.
- **whitelist_rna** *(Type: Array[File])*: Whitelist file for RNA barcodes. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL. If you are only processing ATAC, use `[]`.
- **seqspecs** *(Type: Array[File])*: Seqspec file describing the sequencing data. Accepted input formats are: `local path`, `https` URL, or a Google Storage (`gs`) URL.
- **chemistry** *(Type: String)*: Chemistry used in the sequencing process.
- **prefix** *(Type: String)*: Prefix for output files.
- **genome_tsv** *(Type: File)*: TSV-formatted file containing the links to chromap and kb inputs and other reference annotations. For human and mouse you can use the following:

| Genome | URL |
|--------|-----|
| hg38 | `gs://broad-buenrostro-pipeline-genome-annotations/IGVF_human_v43/IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv` |
| mm39 | `gs://broad-buenrostro-pipeline-genome-annotations/IGVF_mouse_v32/IGVF_mouse_v32_Mus_musculus_genome_files_mm39_v32.tsv` |
| mm39_cast (beta) | `gs://broad-buenrostro-pipeline-genome-annotations/CAST_EiJ_v1/Mus_musculus_casteij_genome_files.tsv` |
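
As a concrete starting point, here is a minimal sketch of an inputs file for an RNA-only local run. It assumes the workflow name `multiome_pipeline` from `docs/example_input.json`; every path, the prefix, and the chemistry string are hypothetical placeholders to replace with your own values.

```bash
# Sketch of a minimal RNA-only inputs file. All paths, the prefix, and
# the chemistry value below are hypothetical placeholders.
cat > inputs.json <<'EOF'
{
  "multiome_pipeline.chemistry": "10x_v3",
  "multiome_pipeline.prefix": "sample01",
  "multiome_pipeline.genome_tsv": "gs://broad-buenrostro-pipeline-genome-annotations/IGVF_human_v43/IGVF_human_v43_Homo_sapiens_genome_files_hg38_v43.tsv",
  "multiome_pipeline.seqspecs": ["sample01_seqspec.yaml"],
  "multiome_pipeline.read1_rna": ["sample01_R1.fastq.gz"],
  "multiome_pipeline.read2_rna": ["sample01_R2.fastq.gz"],
  "multiome_pipeline.whitelist_rna": ["rna_whitelist.txt"],
  "multiome_pipeline.read1_atac": [],
  "multiome_pipeline.read2_atac": [],
  "multiome_pipeline.fastq_barcode": [],
  "multiome_pipeline.whitelist_atac": []
}
EOF
```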

## Output Files

### RNA
| Name | Suffix | Description |
|--------|-----|--------|
| `rna_mtx_tar` | .mtx.tar.gz | [Raw] Tarball containing four separate count matrices in mtx format: Spliced, Unspliced, Ambiguous, and Total |
| `rna_mtxs_h5ad` | .rna.align.kb.{genome}.count_matrix.h5ad | [Raw] h5ad containing four separate count matrices: Spliced, Unspliced, Ambiguous, and Total |
| `rna_aggregated_counts_h5ad` | .cells_x_genes.total.h5ad | [Raw] Aggregated (Ambiguous + Spliced + Unspliced) count matrix in h5ad format |
| `rna_barcode_metadata` | .qc.rna.{genome}.barcode.metadata.tsv | [Raw] Per barcode alignment statistics file |
| `rna_log` | .rna.log.{genome}.txt | [Raw] Log file from kb |
| `rna_kb_output` | .rna.align.kb.{genome}.tar.gz | [Raw] Tarball containing all the logs and bus files generated from kb |

### ATAC

| Name | Suffix | Description |
|--------|-----|--------|
| `atac_bam` | .atac.align.k4.{genome}.bam | [Raw] Aligned bam file from Chromap |
| `atac_bam_log` | .atac.align.k4.{genome}.log.txt | [Raw] Log file from aligner |
| `atac_fragments` | .tsv.gz | [Raw] Fragment file from Chromap |
| `atac_fragments_index` | .tsv.gz.tbi | [Raw] Fragment file index from Chromap |
| `atac_chromap_barcode_metadata` | .chromap.barcode.metadata.tsv | [Raw] Per barcode alignment statistics file from Chromap |
| `atac_snapatac2_barcode_metadata` | .snapatac2.barcode.metadata.tsv | [Filtered] Per barcode statistics file from SnapATAC2 |

### Joint

| Name | Suffix | Description |
|--------|-----|--------|
| `joint_barcode_metadata` | .joint.barcode.metadata.{genome}.csv | [Filtered] Joint per barcode statistics file |
| `csv_summary` | .csv | [Filtered] CSV summary report |
| `html_summary` | .html | [Filtered] HTML summary report |
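
As an example of working with these outputs, the ATAC fragment file is block-gzipped and ships with a tabix index (`.tsv.gz.tbi`), so genomic regions can be queried directly. This is a sketch assuming `tabix` (htslib) is installed; the file name is a hypothetical combination of a prefix and the suffix above.

```bash
# Query ATAC fragments overlapping a region of interest; the file name
# (sample01.fragments.tsv.gz) is a placeholder.
tabix sample01.fragments.tsv.gz chr1:1000000-1100000 | head
```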


## Contributing

We welcome contributions to improve this pipeline. Please fork the repository and submit a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
37 changes: 0 additions & 37 deletions dockerfiles/share_task_correct_fastq.dockerfile

This file was deleted.

13 changes: 13 additions & 0 deletions docs/example_input.json
@@ -0,0 +1,13 @@
{
  "multiome_pipeline.chemistry": "${this.Chemistry}",
  "multiome_pipeline.fastq_barcode": "${this.ATAC_barcode}",
  "multiome_pipeline.genome_tsv": "${this.genome_tsv}",
  "multiome_pipeline.prefix": "${this.WUSTL_coronary_id}",
  "multiome_pipeline.read1_atac": "${this.ATAC_fastq_R1}",
  "multiome_pipeline.read1_rna": "${this.RNA_fastq_R1}",
  "multiome_pipeline.read2_atac": "${this.ATAC_fastq_R2}",
  "multiome_pipeline.read2_rna": "${this.RNA_fastq_R2}",
  "multiome_pipeline.seqspecs": "${this.seqspec}",
  "multiome_pipeline.whitelist_atac": "${this.Barcode_inclusion_list_ATAC}",
  "multiome_pipeline.whitelist_rna": "${this.Barcode_inclusion_list_RNA}"
}
179 changes: 26 additions & 153 deletions docs/install-pipeline-locally.org
@@ -1,41 +1,25 @@
* Introduction
* IGVF Single-Cell Pipeline Setup and Execution Documentation

The default configuration for the IGVF single-cell pipeline is a docker based Terra
cloud environment. Docker tends not to be available on HPC
environments so I wanted to use singularity instead. Though as of yet
I haven't figured out how to adjust the cromwell configuration for the
queuing systems I have access to and have just been running individual
experiments on a single reasonably sized host.
The default configuration for the IGVF single-cell pipeline operates in a Docker-based Terra cloud environment. However, Docker is not always available in high-performance computing (HPC) environments. To accommodate this, Singularity can be used as an alternative. This documentation outlines the setup for running the pipeline on a single host using Singularity. As of now, the Cromwell configuration for specific queuing systems has not been finalized, and individual experiments are being executed on a standalone machine.

* Define some variables
** Define some variables

#+name: default-table
| root_dir | /home/diane/proj |
| root_dir | /home/user/proj |
| single_cell_pipeline_git | https://github.com/IGVF/single-cell-pipeline |
| diane_conda_url | https://woldlab.caltech.edu/~diane/conda/ |
| conda_command | /home/diane/bin/micromamba |
| conda_command | /home/user/bin/micromamba |
| environment | the IGVF single-cell pipeline |

These variables are accessed as ${defaults["variable_name"]} in the shell
scripts below.
These variables can be accessed using ${defaults["variable_name"]} in the shell scripts provided.
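
A minimal sketch of how the table is consumed: org-babel passes a two-column table to a bash block as an associative array keyed on the first column, so the values can be echoed directly. The block name below is illustrative.

#+name: show-defaults
#+begin_src bash :var defaults=default-table
# Print a couple of the defaults to confirm the table is wired up.
echo "root_dir:      ${defaults["root_dir"]}"
echo "conda_command: ${defaults["conda_command"]}"
#+end_src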

* Set up the conda environment
** Set up the conda environment

It is very important to make sure you get a recent version of
cromwell; conda has some quite old versions available, and atomic
workflow needs something relatively recent. (The version 40 of
cromwell I accidentally installed on my first try definitely did not
work).
It is crucial to use a recent version of Cromwell, as older versions available on Conda may not be compatible with the pipeline's requirements. Version 40, for example, is not supported.
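
A quick sanity check after installation (a sketch; the only version requirement stated here is that versions as old as 40 do not work):

#+begin_src bash
# Confirm the installed Cromwell is recent enough; version 40 is known not to work.
cromwell --version
#+end_src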

This block should install most of the dependencies to run
single-cell-pipeline into a [[https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html][conda environment]]. Conda-forge recommends
[[https://mamba.readthedocs.io/en/latest/user_guide/mamba.html][mamba]] and I found [[https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html][micromamba]] to be easiest to install, though it is a
bit limited compared to conda or mamba.

You will need to replace the ${defaults["conda_command"]} and
${defaults["environment"]} with appropriate settings for your
environment.
The following block installs most of the necessary dependencies for the IGVF single-cell pipeline into a Conda environment. Micromamba is recommended for faster installation, though it is more limited than Conda or Mamba.

Make sure to replace ${defaults["conda_command"]} and ${defaults["environment"]} with your own settings.

#+name: create-single-cell-pipeline-environment
#+begin_src bash :var defaults=default-table
@@ -69,8 +53,7 @@ make sure the main components were installed.
| gsutil | version: | 5.3 |
| Synapse | Client | 4.4.0 |

Next we check out the IGVF single-cell pipeline repository. As of the time I am
writing this, we are using release v1.
Check out the IGVF single-cell pipeline repository. As of Sep 2024, this is release v1.

#+name: checkout-atomic-workflow
#+begin_src bash :var defaults=default-table :async yes :results none
@@ -79,30 +62,12 @@
popd
#+end_src
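
Since release v1 is the version in use, it may help to pin the checkout to that tag explicitly (a sketch; the clone directory and the tag name ~v1~ are assumptions based on the repository URL and the release mentioned above):

#+begin_src bash
# Pin the working tree to the assumed v1 release tag.
git -C single-cell-pipeline checkout v1
#+end_src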

* Configuring cromwell
** Configuring cromwell

This configuration file is the magic that tells cromwell to use
singularity instead of docker. This configuration file is for running
locally on a single machine. I need to pass
~-Dconfig.file=cromwell.conf~ when launching the pipeline to get
cromwell to use the configuration file. More adjustments are needed
to integrate with the local queuing system.
To use Singularity instead of Docker with Cromwell, the following configuration is required. The example below is for running the pipeline on a single machine. When launching the pipeline, pass the ~-Dconfig.file=cromwell.conf~ argument to Cromwell to use this configuration. Additional adjustments are needed for integration with local queuing systems.

I found the initial example in the singularity section of the cromwell
[[https://cromwell.readthedocs.io/en/latest/tutorials/Containers/#singularity][Containers]] documentation.

After some testing I discovered that synapse needs access to a
configuration file and possibly a cache directory to work correctly.

Without the configuration file the pipeline aborts when ~synapse get~
fails because of the authentication prompt. This happens because the
default behavior of singularity is to limit the containers to the
directory tree below where the container was started.

To resolve this you need to replace the ${HOME} paths in the
configuration below with your actual paths. Using the ~ to represent
home directories won't work, at least for the destination path as
singularity will then complain about needing an absolute path.
This configuration is based on the [[https://cromwell.readthedocs.io/en/latest/tutorials/Containers/#singularity][Containers]] documentation. Note that Synapse requires access to a configuration file and a cache directory to function properly. Ensure that ${HOME} paths in the configuration are replaced with absolute paths to avoid issues with Singularity.

#+name: cromwell-local-singularity
#+begin_src conf
@@ -129,25 +94,9 @@
}
#+end_src
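
To confirm the Synapse setup independently of Cromwell, the bind mounts can be exercised by hand. This is a sketch: the image name, the Synapse ID, and the /home/user paths are placeholders.

#+begin_src bash
# Bind the Synapse config file and cache directory into the container
# so "synapse get" can authenticate; all paths are illustrative.
singularity exec \
  --bind /home/user/.synapseConfig:/home/user/.synapseConfig \
  --bind /home/user/.synapseCache:/home/user/.synapseCache \
  pipeline.sif synapse get syn00000000
#+end_src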

* Configuring a local IGVF single-cell pipeline run

Terra takes values from the dataset tables and uses them to generate a
metadata json file for cromwell. When running outside of terra, one
will need to manually generate the json file.

I found it could be somewhat confusing to figure out what the proper
name for a variable should be. Since cromwell supports subworkflows,
it seems to use a dotted path to refer to a value in a
particular workflow.

If you have an example .json file for the current workflow and some of
the terra tables, then you should replace the contents of
${this.variable} with what's in the variable column
for the row you're working on.

This example is for a customized 10x multiome experiment. This was
tested using a version of single-cell pipeline from around August 16th.
** Running the IGVF Single-Cell Pipeline Locally

When running outside of Terra, the metadata .json file for Cromwell must be generated manually. Below is an example .json configuration for a customized 10x multiome experiment.

#+name: gm12878.json
#+begin_src json
@@ -206,94 +155,19 @@
}
#+end_src
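
Before launching a long run, the WDL and inputs can be validated together (a sketch assuming womtool, Cromwell's companion tool, is installed and the JSON above has been saved as gm12878.json):

#+begin_src bash
# Validate the workflow and the inputs JSON without running anything.
womtool validate \
  ../single-cell-pipeline/single_cell_pipeline.wdl \
  --inputs gm12878.json
#+end_src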

This example is for a parse splitseq RNA experiment, and comes from
before I figured out how to use synapse, so it uses local file names for
the source data.

I think I'd also made a slight change to the pipeline for it to better
support local files.

#+name: tiny-13a.json
#+begin_src json
{
  "single_cell_pipeline.trim_fastqs": true,
  "single_cell_pipeline.chemistry": "parse",
  "single_cell_pipeline.prefix": "106",
  "single_cell_pipeline.subpool": "13A",
  "single_cell_pipeline.pipeline_modality": "full",

  "single_cell_pipeline.genome_fasta": "/woldlab/loxcyc/home/diane/proj/genome/GRCm39-M32-male/GRCm38-IGVFI282QLXO.fasta.gz",

  "single_cell_pipeline.seqspecs": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/b01_13atiny_seqspec.yaml"],

  "single_cell_pipeline.whitelist_atac": ["737K-arc-v1-ATAC.txt.gz"],
  "single_cell_pipeline.read1_atac": [],
  "single_cell_pipeline.read2_atac": [],
  "single_cell_pipeline.fastq_barcode": [],
  "single_cell_pipeline.count_only": false,
  "single_cell_pipeline.atac_barcode_offset": 8,

  "single_cell_pipeline.whitelist_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/parse-splitseq-v2/CB1.txt", "/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/parse-splitseq-v2/CB23.txt"],
  "single_cell_pipeline.read1_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/igvf_b01/next1/B01_13Atiny_R1.fastq.gz"],
  "single_cell_pipeline.read2_rna": ["/woldlab/loxcyc/home/diane/proj/igvf-bridge-samples/igvf_b01/next1/B01_13Atiny_R2.fastq.gz"],
  "single_cell_pipeline.gtf": "/woldlab/loxcyc/home/diane/proj/genome/GRCm39-M32-male/M32-IGVFFI9744VSJF.gtf.gz",
  "single_cell_pipeline.rna.kb_strand": "forward",
  "single_cell_pipeline.rna.replacement_list": "r1_RT_replace.txt",

  "single_cell_pipeline.genome_name": "GRCm39-M32",
  "single_cell_pipeline.genome_tsv": "Mus_musculus_genome_files_mm39_v32.tsv",
  "single_cell_pipeline.whitelists_tsv": "permit_lists_urls.tsv",

  "single_cell_pipeline.check_read1_rna.disk_factor": 1,
  "single_cell_pipeline.check_read2_rna.disk_factor": 1,
  "single_cell_pipeline.check_read1_atac.disk_factor": 1,
  "single_cell_pipeline.check_read2_atac.disk_factor": 1,
  "single_cell_pipeline.check_fastq_barcode.disk_factor": 1
}
#+end_src

* Running cromwell

For me, I needed to manually add the ${atomic_workflows}/src/bash
directory to the path for a script stored in there. The
-Dconfig.file points to the contents of the cromwell-local-singularity
file described above.

My system's default umask is 022 and with cromwell I end up with
directories with the permissions 757 and for some reason that
generates error messages when trying to modify files or delete the
cromwell-execution directory.

I needed to reset my umask to 002 before running cromwell with

#+begin_src bash
umask 002
#+end_src

You will need to pass the path to the wdl script, the cromwell
configuration file, and the json configuration file as shown below.
** Running cromwell

To execute the pipeline, ensure the script and configuration files are properly referenced, including the Cromwell configuration file (cromwell.conf), the WDL script, and the generated JSON file.

#+begin_src bash
PATH=/woldlab/loxcyc/home/diane/proj/single-cell-pipeline/src/bash/:$PATH
PATH=/woldlab/loxcyc/home/user/proj/single-cell-pipeline/src/bash/:$PATH
cromwell -Dconfig.file=cromwell.conf run \
  ../single-cell-pipeline/single_cell_pipeline.wdl \
  -i tiny-13a.json
#+end_src

* Finding results with cromwell.
** Finding results with cromwell.

This is sufficiently annoying that the ENCODE DCC wrote
[[https://github.com/ENCODE-DCC/caper]] to try and make it easier to stage
results in an easier-to-understand directory structure.

Unfortunately I had trouble getting caper to work, and so started with
the simpler goal of getting cromwell to work.

Cromwell makes its own directory hierarchy under cromwell-executions/;
here's an example from a partial run of the atomic_workflows pipeline.

As an example subset, here's a bit of the directory tree from one
failed run.
Cromwell stores results in a nested directory structure under cromwell-executions/. To locate error or output logs, navigate to the appropriate subdirectory based on the workflow step. The example below shows a partial directory tree from a failed run:

- cromwell-executions
- single_cell_pipeline
@@ -325,10 +199,9 @@ failed run.
- call-check_read2_rna
- call-check_seqspec

I was usually looking for the stderr/stdout files in the various
workflow steps to try and find what had gone wrong with a run. It will
probably take some experimentation to get the configuration settings
correct for your environment.
To find specific files, use the following command:
~find cromwell-executions -name ${filename}~

This structure can take some trial and error to configure correctly for your environment. For easier results management, consider using tools like Caper. However, it is also possible to manage this manually as demonstrated above.

I was using ~find cromwell-executions -name ${filename}~ to find any
result files.
We extend our gratitude to [[https://github.com/detrout][@detrout]] for contributing the initial draft of this document.
4 changes: 2 additions & 2 deletions single_cell_pipeline.wdl
@@ -285,8 +285,8 @@ workflow multiome_pipeline {
# ATAC outputs
File? atac_bam = atac.atac_chromap_bam
File? atac_bam_log = atac.atac_chromap_bam_alignement_stats
File? atac_filter_fragments = atac.atac_fragments
File? atac_filter_fragments_index = atac.atac_fragments_index
File? atac_fragments = atac.atac_fragments
File? atac_fragments_index = atac.atac_fragments_index
File? atac_chromap_barcode_metadata = atac.atac_qc_chromap_barcode_metadata
File? atac_snapatac2_barcode_metadata = atac.atac_qc_snapatac2_barcode_metadata
