diff --git a/README.md b/README.md index 817996e..b95584e 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,9 @@ [![GitHub Actions CI Status](https://github.com/nf-core/deepmutscan/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/deepmutscan/actions/workflows/ci.yml) -[![GitHub Actions Linting Status](https://github.com/nf-core/deepmutscan/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/deepmutscan/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/deepmutscan/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX) +[![GitHub Actions Linting Status](https://github.com/nf-core/deepmutscan/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/deepmutscan/actions/workflows/linting.yml) +[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/deepmutscan/results) +[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX) [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com) [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A524.04.2-23aa62.svg)](https://www.nextflow.io/) @@ -14,110 +16,85 @@ [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/) [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/) [![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://cloud.seqera.io/launch?pipeline=https://github.com/nf-core/deepmutscan) -[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23deepmutscan-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/deepmutscan)[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core)[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core)[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core) +[![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23deepmutscan-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/deepmutscan) +[![Follow on Twitter](http://img.shields.io/badge/twitter-%40nf__core-1DA1F2?labelColor=000000&logo=twitter)](https://twitter.com/nf_core) +[![Follow on Mastodon](https://img.shields.io/badge/mastodon-nf__core-6364ff?labelColor=FFFFFF&logo=mastodon)](https://mstdn.science/@nf_core) +[![Watch on YouTube](http://img.shields.io/badge/youtube-nf--core-FF0000?labelColor=000000&logo=youtube)](https://www.youtube.com/c/nf-core) -# 1. Overview -**nf-core/deepmutscan** is a reproducible, scalable, and community-curated pipeline for analyzing deep mutational scanning (DMS) data using shotgun DNA sequencing. 
DMS enables researchers to measure the fitness effects of thousands of gene variants simultaneously, helping to classify disease causing mutants in human and animal populations, to learn fundamental rules of virus evolution, protein architecture, splicing or small-molecule interactions.
+## Introduction
-While DNA synthesis and sequencing technologies have advanced substantially, long open reading frame (ORF) targets still present major challenges for DMS studies. Shotgun DNA sequencing can be used to greatly speed up the inference of long ORF mutant fitness landscapes, theoretically at no expense in accuracy. We have designed the **nf-core/deepmutscan** pipeline to unlock the power of shotgun sequencing based DMS studies on long ORFs, to simplify and standardise the complex bioinformatics steps involved in data processing of such experiments – from read alignment to QC reporting and fitness landscape inferences.
+`nf-core/deepmutscan` is a workflow designed for the analysis of deep mutational scanning (DMS) data. DMS enables researchers to experimentally measure the fitness effects of thousands of genes or gene variants simultaneously, helping to classify disease-causing mutants in human and animal populations and to learn the fundamental rules of virus evolution, protein architecture, splicing, small-molecule interactions and many other phenotypes.
-> 📄 Reference: Wehnert et al., _bioRxiv_ preprint (coming soon)
+While DNA synthesis and sequencing technologies have advanced substantially, long open reading frame (ORF) targets still present major challenges for DMS studies. Shotgun DNA sequencing can be used to greatly speed up the inference of long ORF mutant fitness landscapes, theoretically at no expense in accuracy. We have designed the `nf-core/deepmutscan` pipeline to unlock the power of shotgun sequencing-based DMS studies on long ORFs, to simplify and standardise the complex bioinformatics steps involved in data processing of such experiments – from read alignment to QC reporting and fitness landscape inferences.
----
+

+ +

-# 2. Features of nf-core/deepmutscan -- End-to-end analyses of DMS shotgun sequencing data -- Modular, three-stage workflow: alignment → QC → error-aware fitness estimation -- Integrates with popular statistical tools like [DiMSum](https://github.com/lehner-lab/DiMSum), [Enrich2](https://github.com/FowlerLab/Enrich2), [rosace](https://github.com/pimentellab/rosace/) and [mutscan](https://github.com/fmicompbio/mutscan) -- Supports multiple mutagenesis strategies, e.g. nicking by NNK and NNS codons -- Containerized via Docker, Singularity and Apptainer -- Scalable across HPC and Cloud systems -- Monitors CPU, memory, and COâ‚‚ usage - -For details of the pipeline and potential future expansions, please consider reading our [detailed description](docs/pipeline_steps.md). - ---- - -# 3. Installation -**nf-core/deepmutscan** uses [Nextflow](https://nf-co.re/docs/usage/getting_started/installation), which must be installed on your system: - -```bash -java -version # Check that Java v11+ is installed -curl -s https://get.nextflow.io | bash # Download Nextflow -chmod +x nextflow # Make executable -mv nextflow ~/bin/ # Add to user's $PATH -``` - -The pipeline itself requires no installation – Nextflow will fetch it directly from GitHub: +The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community! -```bash -nextflow run nf-core/deepmutscan -profile docker -``` -For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/deepmutscan/usage) and the [parameter documentation](https://nf-co.re/deepmutscan/parameters). +On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/deepmutscan/results). ---- +## Major features -# 4. Usage -Prepare: -- A **sample sheet** CSV to specify input/output labels, replicates, etc. (see [example](assets/samplesheet.csv)) -- A **reference FASTA** file for the gene or region of interest +- End-to-end analyses of various DMS data +- Modular, three-stage workflow: alignment → QC → error-aware fitness estimation +- Integration with popular statistical fitness estimation tools like [DiMSum](https://github.com/lehner-lab/DiMSum), [Enrich2](https://github.com/FowlerLab/Enrich2), [rosace](https://github.com/pimentellab/rosace/) and [mutscan](https://github.com/fmicompbio/mutscan) +- Support of multiple mutagenesis strategies, e.g. 
by nicking with degenerate NNK and NNS codons
+- Containerisation via Docker, Singularity and Apptainer
+- Scalability across HPC and Cloud systems
+- Monitoring of CPU, memory, and CO₂ usage
-To execute **nf-core/deepmutscan**, run the basic command:
+For more details on the pipeline and on potential future expansions, please consider reading our [usage description](https://nf-co.re/deepmutscan/usage).
-```bash
-nextflow run nf-core/deepmutscan \
-  -profile singularity,local \
-  --input ./input.csv \
-  --reading_frame 1-300 \
-  --fasta ./ref.fa \
-  --mutagenesis max_diff_to_wt \
-  --run_seqdepth false \
-  --fitness true \
-  --outdir ./results
-```
+## Step-by-step pipeline summary
-### Required parameters
+The pipeline processes deep mutational scanning (DMS) sequencing data in several stages:
-| Parameter | Description |
-|--------------------|-----------------------------------------------------|
-| `--input` | Path to sample sheet CSV |
-| `--outdir` | Path to output directory |
-| `--fasta` | Reference FASTA file |
-| `--reading_frame` | Start and end nucleotide (e.g. `1-300`) |
+1. Alignment of reads to the reference open reading frame (ORF) (`bwa-mem`)
+2. Filtering of wildtype and erroneous reads (`samtools view`)
+3. Read merging for base error reduction (`vsearch fastq_mergepairs`, `bwa-mem`)
+4. Mutation counting (`GATK AnalyzeSaturationMutagenesis`)
+5. DMS library quality control
+6. Data summarisation across samples
+7. Single nucleotide variant error correction _(in development)_
+8. Fitness estimation _(in development)_
-### Optional parameters *(in development)*
+## Usage
-| Parameter | Default | Description |
-|------------------------|-------------|-------------------------------------------------|
-| `--run_seqdepth` | `false` | Estimate sequencing saturation by rarefaction |
-| `--fitness` | `false` | Default fitness inference module |
-| `--dimsum` | `false` | Optional fitness inference module *(AMD/x86_64 systems only)* |
-| `--mutagenesis` | `max_diff_to_wt` | Deep mutational scanning strategy used *(in development)* |
-| `--error-estimation` | `wt_sequencing` | Error model used to correct 1nt counts *(in development)* |
-| `--read-align` | `bwa-mem` | Read aligner *(in development)* |
+> [!NOTE]
+> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
-More options and advanced configuration: [see vignette](link). For further information or help, don't hesitate to get in touch on the [Slack `#deepmutscan` channel](https://nfcore.slack.com/channels/deepmutscan) (you can join with [this invite](https://nf-co.re/join/slack)).
+First, prepare a samplesheet with your input/output data in which each row represents a pair of fastq files (paired-end). This should look as follows:
----
-
-# 5. Input Data
+```csv title="samplesheet.csv"
+sample,type,replicate,file1,file2
+ORF1,input,1,/reads/forward1.fastq.gz,/reads/reverse1.fastq.gz
+ORF1,input,2,/reads/forward2.fastq.gz,/reads/reverse2.fastq.gz
+ORF1,output,1,/reads/forward3.fastq.gz,/reads/reverse3.fastq.gz
+ORF1,output,2,/reads/forward4.fastq.gz,/reads/reverse4.fastq.gz
+```
-The primary pipeline input is a sample sheet `.csv` file listing:
+Secondly, specify the gene or gene region of interest using a reference FASTA file via `--fasta`. Provide the exact nucleotide coordinates of the coding sequence via `--reading_frame` (e.g. `1-300` for a 100-codon ORF).
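+
+Before launching on real data, it is worth verifying your local setup with the bundled test profile, as suggested in the note above (a minimal sketch, assuming Docker is available):
+
+```bash
+# Runs the pipeline on its small built-in test dataset.
+nextflow run nf-core/deepmutscan -profile test,docker --outdir ./test_results
+```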
-- Paths to paired-end `.fastq.gz` files from shotgun sequencing
-- Their classification as either input or output samples
-- Replicate IDs
-- Associated experimental metadata
+Now, you can run the pipeline using:
-See [sample CSV](assets/samplesheet.csv) for formatting.
+```bash title="example pipeline run"
+nextflow run nf-core/deepmutscan \
+  -profile <docker/singularity/.../institute> \
+  --input ./samplesheet.csv \
+  --fasta ./ref.fa \
+  --reading_frame 1-300 \
+  --outdir ./results
+```
----
+There are several optional [parameters](https://nf-co.re/deepmutscan/parameters), some of which are currently in development. For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/deepmutscan/usage).
-# 6. Outputs
+## Pipeline output
After execution, the pipeline creates the following directory structure:
-```
+```tree title="output folder structure"
results/
├── fastqc/ # Individual HTML reports for specified fastq files, raw sequencing QC
├── fitness/ # Merged variant count tables, fitness and error estimates, replicate correlations and heatmaps
├── intermediate_files/ # Raw alignments, raw and pre-filtered variant count tables, QC reports
├── library_QC/ # Sample-specific PDF visualizations: position-wise sequencing coverage, count heatmaps, etc.
├── multiqc/ # Shared HTML reports for all fastq files, raw sequencing QC
├── pipeline_info/ # Nextflow helper files for timeline and summary report generation
├── timeline.html # Nextflow timeline for all tasks
└── report.html # Nextflow summary report incl. detailed CPU and memory usage for all tasks
```
@@ -129,45 +106,40 @@ results/
└── report.html # Nextflow summary report incl. detailed CPU and memory usage for all tasks
```
----
+For a full overview of the output file types, please refer to the specific [documentation](https://nf-co.re/deepmutscan/output).
-# 7. Citation
-
-If you use this pipeline in your research, please cite:
-> 📄 Wehnert et al., _bioRxiv_ preprint (coming soon)
+## Contributing
-Please also cite the nf-core framework:
-> 📄 Ewels et al., _Nature Biotechnology_, 2020
-> [https://doi.org/10.1038/s41587-020-0439-x](https://doi.org/10.1038/s41587-020-0439-x)
-
----
+We welcome contributions from the community!
-# 8. License
+For technical challenges and feedback on the pipeline, please use our [GitHub repository](https://github.com/nf-core/deepmutscan). Please open an [issue](https://github.com/nf-core/deepmutscan/issues/new) or [pull request](https://github.com/nf-core/deepmutscan/compare) to:
-[MIT License](link)
+- Report bugs or solve data incompatibilities when running `nf-core/deepmutscan`
+- Suggest the implementation of new modules for custom DMS workflows
-© 2025 Benjamin Wehnert, Taylor Mighell, Fei Sang, Ben Lehner, Maximilian Stammnitz
+- Help improve this documentation
----
+If you are interested in getting involved as a developer, please consider joining our interactive [`#deepmutscan` Slack channel](https://nfcore.slack.com/channels/deepmutscan) (via [this invite](https://nf-co.re/join/slack)).
-# 9. Contributing
+## Credits
+`nf-core/deepmutscan` was originally written by [Benjamin Wehnert](https://github.com/BenjaminWehnert1008) and [Max Stammnitz](https://github.com/MaximilianStammnitz) at the [Centre for Genomic Regulation, Barcelona](https://www.crg.eu/), with the generous support of an EMBO Long-term Postdoctoral Fellowship and a Marie Skłodowska-Curie grant by the European Union.
-We welcome contributions from the community!
+If you use `nf-core/deepmutscan` in your analyses, please cite:
-Please open an [issue](../../issues/new) or [pull request](../../compare) via this GitHub page, to:
-- Suggest or help implementing new modules for custom workflows
-- Report bugs and other challenges in running **nf-core/deepmutscan**
-- Help improve this documentation
+> 📄 Wehnert et al., _bioRxiv_ preprint (coming soon)
-You can also reach out to us via the **nf-core Slack**, by use of the `#dms` channel ([join here](https://join.slack.com/share/enQtOTMyMDc3MTA0Mzg0Mi04YmRiNDEwZTBlOTRiN2M2ZGU5ZGVmOWQ3YzA0YjA4NzhiNjFhNTVlNDA4ZTZjOTE2MjE5MmIzYWZjZTljMTE3)).
+Please also cite the `nf-core` framework:
----
+> 📄 Ewels et al., _Nature Biotechnology_, 2020
+> [https://doi.org/10.1038/s41587-020-0439-x](https://doi.org/10.1038/s41587-020-0439-x)
-# 10. Contact
+## Scientific contact
-For detailled scientific or technical questions, feedback and experimental discussions, feel free to contact us directly:
+For scientific discussions around the use of this pipeline (e.g. on experimental design or sequencing data requirements), please feel free to get in touch with us directly:
-- Benjamin Wehnert — wehnertbenjamin@gmail.com
+- Benjamin Wehnert — wehnertbenjamin@gmail.com
- Maximilian Stammnitz — maximilian.stammnitz@crg.eu
----
+## CHANGELOG
+
+- [CHANGELOG](CHANGELOG.md)
diff --git a/docs/output.md b/docs/output.md
index 2db7739..bfe748d 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -2,21 +2,25 @@
## Introduction
-This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
+The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory:
-
-## Pipeline overview
-
-The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
-
-- [FastQC](#fastqc) - Raw read QC- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
-- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+```tree
+results/
+├── fastqc/ # Individual HTML reports for specified fastq files, raw sequencing QC
+├── fitness/ # Merged variant count tables, fitness and error estimates, replicate correlations and heatmaps
+├── intermediate_files/ # Raw alignments, raw and pre-filtered variant count tables, QC reports
+├── library_QC/ # Sample-specific PDF visualizations: position-wise sequencing coverage, count heatmaps, etc.
+├── multiqc/ # Shared HTML reports for all fastq files, raw sequencing QC
+├── pipeline_info/ # Nextflow helper files for timeline and summary report generation
+├── timeline.html # Nextflow timeline for all tasks
+└── report.html # Nextflow summary report incl. detailed CPU and memory usage for all tasks
+```
### FastQC
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+
Output files @@ -26,7 +30,9 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).### MultiQC
+### MultiQC
+
+[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
Output files @@ -38,19 +44,66 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. +### Intermediate files + +This directory is created during the first series of steps of the pipeline, featuring raw read alignments, filtering and variant counting. + +
+Output files + +- `intermediate_files/` + - `aa_seq.txt`: + - `bam_files/bwa`: + - `bam_files/filtered`: + - `bam_files/premerged`: + - `gatk`: + - `possible_mutations.csv`: + - `processed_gatk_files`: + +
+ +### Library QC -Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see .### Pipeline information +This directory is created during the second series of steps of the pipeline, featuring various QC visualisations for each sample.
Output files +- `library_QC/` + - `counts_heatmap.pdf`: + - `counts_per_cov_heatmap.pdf`: + - `logdiff_plot.pdf`: + - `logdiff_varying_bases.pdf`: + - `rolling_counts_per_cov.pdf`: + - `rolling_counts.pdf`: + - `rolling_coverage.pdf`: + - `SeqDepth.pdf` (optional): + +
+ +### Fitness + +This directory is created during the final series of steps of the pipeline, featuring fitness and fitness error estimates (when DMS input/output sample groups are specified). + +
+Output files + +- `fitness/` + - `counts_merged.tsv`: summarised gene variant counts across all input and output samples. + - `default_results/fitness_estimation_count_correlation.pdf`: + - `default_results/fitness_estimation_fitness_correlation.pdf`: + - `default_results/fitness_heatmap.pdf`: + - `default_results/fitness_estimation.tsv`: + - `DiMSum_results/dimsum_results` (optional): + +
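+
+For a quick first look at the merged count table from the command line, something like the following works (illustrative only; the exact column layout may change while these modules are in development):
+
+```bash
+# Show the header and the first few variant rows of the merged counts.
+head -n 5 results/fitness/counts_merged.tsv | column -t
+```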
+ +### Pipeline Info + - `pipeline_info/` - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline. - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. - Parameters used by the pipeline run: `params.json`. - - [Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. diff --git a/docs/pipeline_steps.md b/docs/pipeline_steps.md index 670a7dc..9573527 100644 --- a/docs/pipeline_steps.md +++ b/docs/pipeline_steps.md @@ -2,33 +2,34 @@ This page provides in-depth descriptions of the data processing modules implemented in **nf-core/deepmutscan**. It is mainly intended for advanced users and developers who want to understand the rationale behind design choices, explore implementation details, and consider potential future extensions. ------------------------------------------------------------------------- +--- ## Overview The pipeline processes deep mutational scanning (DMS) sequencing data in several stages: + 1. Alignment of reads to the reference open reading frame (ORF) 2. Filtering of wildtype and erroneous reads 3. Read merging for base error reduction 4. Mutation counting 5. DMS library quality control 6. Data summarisation across samples -7. Single nucleotide variant error correction *(in development)* -8. Fitness estimation *(in development)* +7. Single nucleotide variant error correction _(in development)_ +8. Fitness estimation _(in development)_ ![pipeline](/docs/pipeline.png) Each step is explained below. Links are provided to the primary tools and libraries used, where applicable. ------------------------------------------------------------------------- +--- ## 1. Alignment All paired-end raw reads are first aligned to the provided reference ORF using [**bwa-mem**](http://bio-bwa.sourceforge.net/). This is a highly efficient mapping algorithm for reads ≥100 bp, with its multi-threading support automatically handled by nf-core. -In future versions of nf-core/deepmutscan, we consider the use of [**bwa-mem2**](https://github.com/bwa-mem2/bwa-mem2), which provides similar alignment rates with a moderate speed increase ([Vasimuddin et al., *IPDPS* 2019](https://ieeexplore.ieee.org/document/8820962)). With the increasing diversity of sequencing platforms for DMS, new throughput, read length, and error profiles may require further alignment options to be implemented. +In future versions of nf-core/deepmutscan, we consider the use of [**bwa-mem2**](https://github.com/bwa-mem2/bwa-mem2), which provides similar alignment rates with a moderate speed increase ([Vasimuddin et al., _IPDPS_ 2019](https://ieeexplore.ieee.org/document/8820962)). With the increasing diversity of sequencing platforms for DMS, new throughput, read length, and error profiles may require further alignment options to be implemented. ------------------------------------------------------------------------- +--- ## 2. 
Filtering @@ -36,7 +37,7 @@ For long ORF site-saturation mutagenesis libraries, most aligned shotgun sequenc To this end, we use [**samtools view**](https://www.htslib.org/doc/samtools.html). ------------------------------------------------------------------------- +--- ## 3. Read Merging @@ -47,7 +48,7 @@ Even the highest-accuracy next-generation sequencing platforms do not have perfe Future versions may offer additional options depending on sequencing type and error profiles. ------------------------------------------------------------------------- +--- ## 4. Variant Counting @@ -55,7 +56,7 @@ Aligned, non-wildtype consensus reads are screened for exact, base-level mismatc We are currently working on the nf-core/deepmutscan implementation of a much lighter, alternative Python implementation for mutation counting. In this script, users will be allowed to specify a minimum base quality cutoff for mutations to be included in the final count table (default: Q30) – an option not available in GATK. ------------------------------------------------------------------------- +--- ## 5. DMS Library Quality Control @@ -63,46 +64,48 @@ By integrating the reference ORF coordinates and the chosen DMS library type (de Custom visualisations allow for inspection of (1) mutation efficiency along the ORF, (2) position-specific recovery of amino acid diversity, and (3) overall sequencing coverage evenness and saturation. ------------------------------------------------------------------------- +--- ## 6. Data Summarisation for Fitness Estimation Steps 1-5 are iteratively run across all samples defined in the `.csv` spreadsheet. Once read alignment, merging, mutation counting, and library QC have been completed for the full list of samples, users can opt to proceed with fitness estimation. To this end, the pipeline generates all the necessary input files by merging mutation counts across samples. ------------------------------------------------------------------------- +--- -## 7. Single Nucleotide Variant Error Correction *(in development)* +## 7. Single Nucleotide Variant Error Correction _(in development)_ This module will implement strategies to distinguish true single nucleotide variants from sequencing artefacts. There are two options to perform this: + - Empirical error rate modelling based on wildtype sequencing -- Empirical error rate modelling based on false double mutants *(in development)* +- Empirical error rate modelling based on false double mutants _(in development)_ ------------------------------------------------------------------------- +--- -## 8. Fitness Estimation *(in development)* +## 8. Fitness Estimation _(in development)_ The final step of the pipeline will perform fitness estimation based on mutation counts. By default, we calculate fitness scores as the logarithm of variants' output to input ratio, normalised to that of the provided wildtype sequence. Future expansions may include: + - Integration of other popular fitness inference tools, including [DiMSum](https://github.com/lehner-lab/DiMSum), [Enrich2](https://github.com/FowlerLab/Enrich2), [rosace](https://github.com/pimentellab/rosace/) and [mutscan](https://github.com/fmicompbio/mutscan) - Standardised output formats for downstream analyses and comparison > [!IMPORTANT] > We note that exact wildtype sequence reads are filtered out in stage 2. Including synonymous wildtype codons in the original mutagenesis design is therefore essential when it comes to calibrating the fitness calculations. 
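+
+Written out, the default score sketched in this section compares each variant's output/input count ratio to that of the wildtype proxy (a conceptual summary of the paragraph above, not the exact implementation):
+
+```math
+\mathrm{fitness}(v) = \log\left(\frac{N_{\mathrm{output}}(v)}{N_{\mathrm{input}}(v)}\right) - \log\left(\frac{N_{\mathrm{output}}(\mathrm{wt})}{N_{\mathrm{input}}(\mathrm{wt})}\right)
+```
+
+Here, N denotes raw variant counts before (input) and after (output) selection, and "wt" refers to the synonymous wildtype variants highlighted in the note above.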
-------------------------------------------------------------------------
+---
## Notes for Developers
-- Custom scripts used in filtering and mutation counting are available in the `bin/` directory of the repository.
-- Modules are implemented in Nextflow DSL2 and follow the nf-core community guidelines.
-- Contributions, optimisations, and additional analysis modules are welcome - please open a pull request or GitHub issue to discuss ideas.
+- Custom scripts used in filtering and mutation counting are available in the `bin/` directory of the repository.
+- Modules are implemented in Nextflow DSL2 and follow the nf-core community guidelines.
+- Contributions, optimisations, and additional analysis modules are welcome - please open a pull request or GitHub issue to discuss ideas.
-*This document is meant as a living reference. As the pipeline evolves, the descriptions of steps 7 and 8 will be expanded with concrete implementation details.*
+_This document is meant as a living reference. As the pipeline evolves, the descriptions of steps 7 and 8 will be expanded with concrete implementation details._
-------------------------------------------------------------------------
+---
## Contact
For detailled scientific or technical questions, feedback and experimental discussions, feel free to contact us directly:
-- Benjamin Wehnert — wehnertbenjamin@gmail.com
+- Benjamin Wehnert — wehnertbenjamin@gmail.com
- Maximilian Stammnitz — maximilian.stammnitz@crg.eu
diff --git a/docs/usage.md b/docs/usage.md
index 6c1fddc..55eb737 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -6,69 +6,33 @@
## Introduction
-
+`nf-core/deepmutscan` is a workflow designed for the analysis of deep mutational scanning (DMS) data. DMS enables researchers to experimentally measure the fitness effects of thousands of genes or gene variants simultaneously, helping to classify disease-causing mutants in human and animal populations and to learn the fundamental rules of virus evolution, protein architecture, splicing, small-molecule interactions and many other phenotypes.
-## Samplesheet input
-
-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
-
-```bash
---input '[path to samplesheet file]'
-```
-
-### Multiple runs of the same sample
-
-The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes:
-
-```csv title="samplesheet.csv"
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
-```
-
-### Full samplesheet
-
-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below.
-
-A final samplesheet file consisting of both single- and paired-end data may look something like the one below.
This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
-
-```csv title="samplesheet.csv"
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
-CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
-TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,
-TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,
-```
-
-| Column | Description |
-| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
-| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-
-An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
+This page provides in-depth descriptions of the data processing modules implemented in `nf-core/deepmutscan`. It is intended for both advanced users and developers who want to understand the rationale behind certain design choices, explore implementation details, and consider potential future extensions.
## Running the pipeline
-The typical command for running the pipeline is as follows:
+The typical command for running the pipeline (on an example protein-coding gene with 100 amino acids) is as follows:
```bash
-nextflow run nf-core/deepmutscan --input ./samplesheet.csv --outdir ./results --genome GRCh37 -profile docker
+nextflow run nf-core/deepmutscan \
+  -profile <docker/singularity/.../institute> \
+  --input ./samplesheet.csv \
+  --fasta ./ref.fa \
+  --reading_frame 1-300 \
+  --outdir ./results
```
-This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
+The `-profile` specification is mandatory and should reflect either your own institutional profile or any pipeline profile specified in the [profile section](#-profile).
+
+This will launch the pipeline, performing sequencing read alignments, various raw data QC analyses and, optionally, fitness and fitness-error estimation.
Note that the pipeline will create the following files in your working directory:
```bash
-work # Directory containing the nextflow working files
- # Finished results in specified location (defined with --outdir)
-.nextflow_log # Log file from Nextflow
-# Other nextflow hidden files, eg. history of pipeline runs and old logs.
+work # Directory containing the nextflow working files
+results # Finished results in specified location (defined with --outdir)
+.nextflow_log # Log file from Nextflow
```
If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
@@ -81,7 +45,7 @@ Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <
The above pipeline run specified with a params file in yaml format:
```bash
-nextflow run nf-core/deepmutscan -profile docker -params-file params.yaml
+nextflow run nf-core/deepmutscan -profile <docker/singularity/.../institute> -params-file params.yaml
```
with:
@@ -89,13 +53,120 @@ with:
```yaml title="params.yaml"
input: './samplesheet.csv'
outdir: './results/'
-genome: 'GRCh37'
+fasta: './ref.fa'
<...>
```
You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
-### Updating the pipeline
+## Inputs
+
+Users first need to prepare a samplesheet with their input/output data in which each row represents a pair of fastq files (paired-end). This should look as follows:
+
+```csv title="samplesheet.csv"
+sample,type,replicate,file1,file2
+ORF1,input,1,/reads/forward1.fastq.gz,/reads/reverse1.fastq.gz
+ORF1,input,2,/reads/forward2.fastq.gz,/reads/reverse2.fastq.gz
+ORF1,output,1,/reads/forward3.fastq.gz,/reads/reverse3.fastq.gz
+ORF1,output,2,/reads/forward4.fastq.gz,/reads/reverse4.fastq.gz
+```
+
+Secondly, users need to specify the gene or gene region of interest using a reference FASTA file via `--fasta`. Provide the exact nucleotide coordinates of the coding sequence via `--reading_frame` (e.g. `1-300` for a 100-codon ORF).
+
+## Optional parameters
+
+Several optional parameters are available for `nf-core/deepmutscan`, some of which are currently in development.
+
+| Parameter | Default | Description |
+| -------------------- | --------------- | ------------------------------------------------------------- |
+| `--run_seqdepth` | `false` | Estimate sequencing saturation by rarefaction |
+| `--fitness` | `false` | Default fitness inference module |
+| `--dimsum` | `false` | Optional fitness inference module _(AMD/x86_64 systems only)_ |
+| `--mutagenesis_type` | `nnk` | Deep mutational scanning strategy used _(in development)_ |
+| `--error-estimation` | `wt_sequencing` | Error model used to correct 1nt counts _(in development)_ |
+| `--read-align` | `bwa-mem` | Customised read aligner _(in development)_ |
+
+## Pipeline output
+
+After execution, the pipeline creates the following directory structure:
+
+```
+results/
+├── fastqc/ # Individual HTML reports for specified fastq files, raw sequencing QC
+├── fitness/ # Merged variant count tables, fitness and error estimates, replicate correlations and heatmaps
+├── intermediate_files/ # Raw alignments, raw and pre-filtered variant count tables, QC reports
+├── library_QC/ # Sample-specific PDF visualizations: position-wise sequencing coverage, count heatmaps, etc.
+├── multiqc/ # Shared HTML reports for all fastq files, raw sequencing QC
+├── pipeline_info/ # Nextflow helper files for timeline and summary report generation
+├── timeline.html # Nextflow timeline for all tasks
+└── report.html # Nextflow summary report incl. detailed CPU and memory usage for all tasks
+```
+
+## Detailed steps
+
+### 1. Alignment
+
+All paired-end raw reads are first aligned to the provided reference ORF using [**bwa-mem**](http://bio-bwa.sourceforge.net/). This is a highly efficient mapping algorithm for reads ≥100 bp, with its multi-threading support automatically handled by nf-core.
+
+In future versions of `nf-core/deepmutscan`, we consider the use of [**bwa-mem2**](https://github.com/bwa-mem2/bwa-mem2), which provides similar alignment rates with a moderate speed increase ([Vasimuddin et al., _IPDPS_ 2019](https://ieeexplore.ieee.org/document/8820962)). With the increasing diversity of sequencing platforms for DMS, new throughput, read length, and error profiles may require further alignment options to be implemented.
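+
+To make this stage concrete, the following standalone commands sketch an alignment of one read pair from the samplesheet example (file names are placeholders; the pipeline wraps equivalent calls in Nextflow modules, so these never need to be run by hand):
+
+```bash
+# Index the reference ORF once, then align and coordinate-sort one sample.
+bwa index ref.fa
+bwa mem -t 4 ref.fa /reads/forward1.fastq.gz /reads/reverse1.fastq.gz \
+  | samtools sort -o ORF1_input_1.bam -
+samtools index ORF1_input_1.bam
+```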
+
+### 2. Filtering
+
+For long ORF site-saturation mutagenesis libraries, most aligned shotgun sequencing reads contain exact matches against the reference. It is not possible to infer which of these stem from mutant versus wildtype DNA molecules prior to fragmentation, hence they are filtered out. Similarly, erroneous reads with unexpected indels are also removed.
+
+To this end, we use [**samtools view**](https://www.htslib.org/doc/samtools.html).
+
+### 3. Read Merging
+
+Even the highest-accuracy next-generation sequencing platforms do not have perfect base accuracy. To minimise the effect of base errors (which would otherwise be counted as "false mutations"), `nf-core/deepmutscan` uses the overlap of each aligned read pair. With base errors on the forward and reverse read being independent, the pipeline applies the [**vsearch fastq_mergepairs**](https://github.com/torognes/vsearch) function to convert each read pair into a single consensus molecule with adjusted base error scores.
+
+> [!TIP]
+> Optimal merging performance is usually obtained if the average DNA fragment size matches the read size. For example, libraries sequenced with 150 bp paired-end reads should ideally also be sheared/tagmented to a mean size of 150 bp.
+
+Future versions may offer additional options depending on sequencing type and error profiles.
+
+### 4. Variant Counting
+
+Aligned, non-wildtype consensus reads are screened for exact, base-level mismatches. `nf-core/deepmutscan` currently uses the popular [**GATK AnalyzeSaturationMutagenesis**](https://gatk.broadinstitute.org/hc/en-us/articles/360037594771-AnalyzeSaturationMutagenesis-BETA) function to count occurrences of all single, double, triple, and higher-order nucleotide changes between each read and the reference ORF.
+
+We are currently working on the `nf-core/deepmutscan` implementation of a much lighter, alternative Python implementation for mutation counting. In this script, users will be allowed to specify a minimum base quality cutoff for mutations to be included in the final count table (default: Q30) – an option not available in GATK.
+
+### 5. DMS Library Quality Control
+
+By integrating the reference ORF coordinates and the chosen DMS library type (default: NNK/NNS degenerate codon-based nicking), `nf-core/deepmutscan` calculates a number of mutation count summary statistics.
+
+Custom visualisations allow for inspection of (1) mutation efficiency along the ORF, (2) position-specific recovery of amino acid diversity, and (3) overall sequencing coverage evenness and saturation.
+
+### 6. Data Summarisation for Fitness Estimation
+
+Steps 1-5 are iteratively run across all samples defined in the `.csv` spreadsheet. Once read alignment, merging, mutation counting, and library QC have been completed for the full list of samples, users can opt to proceed with fitness estimation. To this end, the pipeline generates all the necessary input files by merging mutation counts across samples.
+
+### 7. Single Nucleotide Variant Error Correction _(in development)_
+
+This module will implement strategies to distinguish true single nucleotide variants from sequencing artefacts. There are two options to perform this:
+
+- Empirical error rate modelling based on wildtype sequencing
+- Empirical error rate modelling based on false double mutants _(in development)_
+
+### 8. Fitness Estimation _(in development)_
+
+The final step of the pipeline will perform fitness estimation based on mutation counts. By default, we calculate fitness scores as the logarithm of variants' output to input ratio, normalised to that of the provided wildtype sequence. Future expansions may include:
+
+- Integration of other popular fitness inference tools, including [DiMSum](https://github.com/lehner-lab/DiMSum), [Enrich2](https://github.com/FowlerLab/Enrich2), [rosace](https://github.com/pimentellab/rosace/) and [mutscan](https://github.com/fmicompbio/mutscan)
+- Standardised output formats for downstream analyses and comparison
+
+> [!IMPORTANT]
+> We note that exact wildtype sequence reads are filtered out in stage 2. Including synonymous wildtype codons in the original mutagenesis design is therefore essential when it comes to calibrating the fitness calculations.
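+
+To try these fitness stages on the example data above, the boolean flags from the parameter table can be switched on (a sketch based on the parameters listed above, subject to change while these modules are in development; `--dimsum` additionally requires an AMD/x86_64 host):
+
+```bash
+nextflow run nf-core/deepmutscan \
+  -profile <docker/singularity/.../institute> \
+  --input ./samplesheet.csv \
+  --fasta ./ref.fa \
+  --reading_frame 1-300 \
+  --fitness true \
+  --dimsum true \
+  --outdir ./results
+```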
+
+## Notes for Developers
+
+- Custom scripts used in filtering and mutation counting are available in the `bin/` directory of the repository.
+- Modules are implemented in Nextflow DSL2 and follow the nf-core community guidelines.
+- Contributions, optimisations, and additional analysis modules are welcome - please open a pull request or GitHub issue to discuss ideas.
+
+_This document is meant as a living reference. As the pipeline evolves, the descriptions of steps 7 and 8 will be expanded with concrete implementation details._
+
+## Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
@@ -103,11 +174,11 @@ When you run the above command, Nextflow automatically pulls the pipeline code f
```bash
nextflow pull nf-core/deepmutscan
```
-### Reproducibility
+## Reproducibility
It is a good idea to specify the pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
-First, go to the [nf-core/deepmutscan releases page](https://github.com/nf-core/deepmutscan/releases) and find the latest pipeline version - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. Of course, you can switch to another version by changing the number after the `-r` flag.
+First, go to the [nf-core/deepmutscan releases page](https://github.com/nf-core/deepmutscan/releases) and find the latest pipeline version - numeric only (eg. `1.0.0`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.0.0`. Of course, you can switch to another version by changing the number after the `-r` flag.
This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports.
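+
+For example, a fully version-pinned run could look as follows (a sketch; substitute the profile placeholder and release tag as appropriate):
+
+```bash
+# Refresh the cached pipeline, then run a fixed release for reproducibility.
+nextflow pull nf-core/deepmutscan
+nextflow run nf-core/deepmutscan -r 1.0.0 \
+  -profile <docker/singularity/.../institute> \
+  --input ./samplesheet.csv \
+  --fasta ./ref.fa \
+  --reading_frame 1-300 \
+  --outdir ./results
+```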
diff --git a/nextflow_schema.json b/nextflow_schema.json index ccd9a85..b71860f 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -42,41 +42,41 @@ "fa_icon": "fas fa-file-signature" }, "reading_frame": { - "type": "string", - "description": "Start and stop codon positions in the format 'start-stop', e.g., '352-1383'.", - "pattern": "^\\d+-\\d+$" - }, - "min_counts": { - "type": "integer", - "description": "Minimum number of variant observations required.", - "minimum": 1, - "default": 3 - }, - "mutagenesis_type": { - "type": "string", - "description": "Type of mutagenic primers. Choose from nnk, nns, max_diff_to_wt, custom. When using 'custom', also provide the parameter 'custom_codon_library'", - "default": "nnk" - }, - "custom_codon_library": { - "type": "string", - "format": "file-path", - "description": "Path to a comma-separated .txt file listing custom codons (e.g., 'AAA,AAC,AAG,...'). Required when mutagenesis_type is 'custom'." - }, - "dimsum": { - "type": "boolean", - "default": false, - "description": "Enable DiMSum execution." - }, - "fitness": { - "type": "boolean", - "default": false, - "description": "Enable basic fitness calculation and preceded data preparation (group and merge counts per sample into counts_merged.tsv, create a specific experimental design file and find a high-count 2nt synonymous mutation). All data prepared serves also as input for the dimsum fitness calculator." - }, - "run_seqdepth": { - "type": "boolean", - "default": false, - "description": "Whether to run the SeqDepth simulation module. This is computationally intensive and optional." - } + "type": "string", + "description": "Start and stop codon positions in the format 'start-stop', e.g., '352-1383'.", + "pattern": "^\\d+-\\d+$" + }, + "min_counts": { + "type": "integer", + "description": "Minimum number of variant observations required.", + "minimum": 1, + "default": 3 + }, + "mutagenesis_type": { + "type": "string", + "description": "Type of mutagenic primers. Choose from nnk, nns, max_diff_to_wt, custom. When using 'custom', also provide the parameter 'custom_codon_library'", + "default": "nnk" + }, + "custom_codon_library": { + "type": "string", + "format": "file-path", + "description": "Path to a comma-separated .txt file listing custom codons (e.g., 'AAA,AAC,AAG,...'). Required when mutagenesis_type is 'custom'." + }, + "dimsum": { + "type": "boolean", + "default": false, + "description": "Enable DiMSum execution." + }, + "fitness": { + "type": "boolean", + "default": false, + "description": "Enable basic fitness calculation and preceded data preparation (group and merge counts per sample into counts_merged.tsv, create a specific experimental design file and find a high-count 2nt synonymous mutation). All data prepared serves also as input for the dimsum fitness calculator." + }, + "run_seqdepth": { + "type": "boolean", + "default": false, + "description": "Whether to run the SeqDepth simulation module. This is computationally intensive and optional." + } } }, "reference_genome_options": {