Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix somatic snappy (#477) #491

Merged
merged 29 commits into from
Mar 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
442a92e
fix: snakemake silently requires numpy, pandas & cffi (issue #477)
ericblanc20 Feb 1, 2024
de49a4e
fix: remove the job dispache to critical, used medium (default queue)…
ericblanc20 Feb 1, 2024
a8a7eed
fix: current (2024-02-01) sequenza incompatible with recent iotools R…
ericblanc20 Feb 1, 2024
40eec67
fix: increase resource usage for PureCN
ericblanc20 Feb 1, 2024
fb112bd
feat: allow cbioportal export to use the most recent variant filtrati…
ericblanc20 Feb 1, 2024
2da149d
fix: fix bug in adapter trimmin of unpaired samples
ericblanc20 Feb 1, 2024
717d3d0
test: sync the PureCN resources with test
ericblanc20 Feb 1, 2024
3e55ab5
fix: fix bug in old filtration method (using filter_sets)
ericblanc20 Feb 2, 2024
25ed5e0
allow longer time for EB filter
Feb 12, 2024
0c2b7c2
fix: naming of path to baits reconciled between workflow & wrapper
ericblanc20 Feb 15, 2024
019a2c1
docs: clarified filters usage in cbioportal_export
ericblanc20 Feb 15, 2024
2b5224f
feat: add junctions table to STAR output
ericblanc20 Feb 15, 2024
05f0559
Merge branch '477-fix-somatic-snappy' of github.com:bihealth/snappy-p…
ericblanc20 Feb 15, 2024
dfa1c02
test: fix eb_filter time resource
ericblanc20 Feb 15, 2024
f9cd81c
feat: Add tumor mutational burden after filtration
ericblanc20 Feb 15, 2024
9472f31
fix: indexing & overlaps between files with variants considered for E…
ericblanc20 Feb 23, 2024
8d28ff8
feat: add protection filter to the filtration step (whitelist some lo…
ericblanc20 Feb 23, 2024
5e242d0
feat: add purecn & sequenza to the choice of CNV callers for cBioPortal
ericblanc20 Feb 23, 2024
d5c4022
docs: First draft of the somatic pipeline overview
ericblanc20 Feb 28, 2024
42e2155
fix: recent iotools package incompatible with latest sequenza
ericblanc20 Mar 1, 2024
3242fed
feat: Add conda info & list to logs
ericblanc20 Mar 1, 2024
4034a4f
feat: add logs of individual coverage files generation to the set of …
ericblanc20 Mar 1, 2024
1054cb2
docs: Improve documentation of selected steps
ericblanc20 Mar 1, 2024
12b4513
fix: re-enable logs for parallel jobs triggered by the ParallelMutect…
ericblanc20 Mar 1, 2024
67d798b
feat: collect logs and merge logs of individual filters, and improve …
ericblanc20 Mar 1, 2024
fbc06f3
docs: updated & improves somatic pipeline documentation
ericblanc20 Mar 11, 2024
fc10dc2
style: make linting happy
ericblanc20 Mar 14, 2024
55c49d1
test: synchronize tests with the new logs
ericblanc20 Mar 14, 2024
ed3a744
docs: fixed bunch of typos
ericblanc20 Mar 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions docs/cbioportal_export.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
.. _cbioportal_export:

--------------------
Export to cBioPortal
--------------------

The step ensures that the following operations are carried out:

- For each sample, the somatic variant ``vcf`` file must be converted to the ``maf`` format.
- For each sample, the CNV segments results must be applied to gene locii, for the coverage log ratio, and for the copy number count.
- The individual sample files must then be merged to create single files for the whole cohort. This must be done for somatic variants, coverage log ratio & copy number counts, and for expression levels when present.
- Clinical data must be gathered at the patient & sample level.
- List of cases must be created, for mutations, copy number changes & expression levels.
- Meta files required by cBioPortal must be created.

The ``vcf`` to ``maf`` conversion is done by a bespoke script, which requires a mapping between gene symbols and NCBI (ENTREZ) gene ids.
This mapping is independent of the genome release, and can be obtained directly from `HGNC <https://www.genenames.org/download/archive/>`_.
There is a relatively recent version on the cluster, in ``/fast/work/groups/cubi/projects/biotools/static_data/annotation/hgnc_complete_set_2023-10-01.tsv``.

Enriching clinical data
=======================

The pipeline can produce several quantities of interest, that can be added to the sample data uploaded to cBioPortal.
Currently, only the tumor mutational burden computed by the ``tumor_mutational_burden`` step can be added to the sample table.
In the near future, the Micro Satellite Instability (MSI) status and the homologous recombination deficiency will be added.
The tumor purity and ploidy might also be considered, although as these results are generally hidden in the output of the somatic copy number steps, the situation is less clear.
Finally, the patient HLA alleles and the sex could be included, either at the patient or sample level.

Functional scores based on expression data will also be included, as well as gene fusions, but structural variants might remain a more distant goal, as they are notoriously diffict to infer from exome data.

15 changes: 15 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@ Pipeline User Docs
Start here to learn about the pipeline.
This section starts at :ref:`quickstart`.

The somatic pipeline in detail
Documentation for the **users of the somatic pipeline**.
Details advanced usage of the somatic pipeline, for example the generation & usage of panel of normals.
This section starts at :ref:`somatic_ngs`.

Pipeline Step Docs
Documentation for the individual pipeline steps.
This includes a general description, description of the related configuration settings, and a documentation of generated output files and input workflow steps.
Expand Down Expand Up @@ -40,6 +45,16 @@ Project Info
usage
overview

.. toctree::
:caption: Somatic pipeline
:hidden:

somatic_ngs
panel_of_normals
somatic_variant_filtration
somatic_cnv
cbioportal_export

.. toctree::
:caption: Pipeline Step Docs
:hidden:
Expand Down
14 changes: 3 additions & 11 deletions docs/overview.rst
Original file line number Diff line number Diff line change
Expand Up @@ -92,9 +92,9 @@ working directory

.. _pipeline_projects:

------------------
An Example Project
------------------
------------------------
A Simple Example Project
------------------------

The above part of this chapter is quite abstract.
Let us draw some pictures and go from the abstract description to a concrete example.
Expand Down Expand Up @@ -167,14 +167,6 @@ In its ``data_sets`` section, the project-level configuration file provides sear

The search will loop over provided search paths & search patterns. Paired reads files are coupled by similarity of their path. Note that when the ``Folder`` entry is absent from the sample sheet, the library name is used instead.

However, this default behaviour can be overriden using the ``path_link_in`` option (which is available only for steps that use FASTQ files as input). When this configuration option is not empty, ``snappy`` will use it instead of the list of search paths defined in the ``data_set`` part. It will also ignore the folder information, and rely instead on the library names to search FASTQ files. The search path becomes:

::

<path_link_in>/<library_name>/../<search_pattern>

This mechanism enables steps that generate FASTQ files on output, for example adapter trimming. In that case, the input of the mapping step can be redirected towards the ouput of the adapter trimming step using this method.


Overview of the Somatic Variant Pipeline
========================================
Expand Down
106 changes: 106 additions & 0 deletions docs/panel_of_normals.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
.. _panel_of_normals:

----------------
Panel of normals
----------------

The purpose of panel of normals is to use the normal samples to identify technical biases present in the data.
Thes biases can be (to a certain extent) be then corrected when processing the tumor samples.
It is therefore **very** important to make sure that all normal samples have been collected, processed & sequenced in the same way.
Downstream processing of tumor samples must also ensure that the sample collection, processing & sequencing match the panel.

Finally, the panels require a certain number of normal samples to be effective.
Below that number, the particulars of any normal sample included in the panel will be considered as bias, and corrected for later in processing.

Panel of normals can be generated for multiple applications: somatic variant calling (``mutect2``) & somatic copy number calling (exome data).

Somatic variant calling
=======================

The panel of normals for ``mutect2`` is straightforward to generate: only a germline resource is needed, to locate known germline variants (SNVs & indels).
This germline resource is provided in the `GATK best practice bundle <https://console.cloud.google.com/storage/browser/gatk-best-practices>`_.
On the cluster, it can be found in ``/fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK/``, search for ``vcf`` files with ``af-only-gnomad`` in their name.

It is `recommended <https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON>`_ to have at least 40 normals in the panel, but it is not uncommon to build a panel with fewer samples.
When the number of samples in too small, GATK provides a panel of normals in the same `best practice bundle <https://console.cloud.google.com/storage/browser/gatk-best-practices>`_.
It can also be found on the cluster, in the same location as above.

In case samples should be omitted from the panel of normals, the user can create a file with the names of the libraries to be included in the panel (one line per library).
This file can be provided using the ``path_normals_list`` configuration option.

Finally, it is recommended to restrict the generation of the panel to the sequences of interest.
This is done by excluding unwanted sequences in the following way:

.. code-block:: yaml

panel_of_normals:
tools: [mutect2]
mutect2:
germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
path_normals_list: "" # Use all samples when empty, otherwise use only samples in the list file.
window_length: 300000000 # For exome data, it is sufficient to split the genome by chromosomes
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2

Copy numbers (exome data)
=========================

.. note::

The current implementations of panel of normals for ``cnvkit`` & ``PureCN`` rely on files generated by the pipeline, but that are **not** properly maintained by the pipeline.
This is suboptimal, complicates considerably the

``cnvkit``
----------

``cnvkit`` requires a file describing regions unsuitable for coverage analysis (repeats, difficult to map, extreme GC content, ...).
It provides it for ``GRCh37``, but not for ``GRCh38``, but the ``access`` tool can create such a file from mappability regions in ``bed`` format.

This ``access`` tool **must** be run before creating the ``cnvkit`` panel of normals.
In other words, the ``cnvkit`` panel of normals is created in two steps:

First, the generation of the access file:

.. code-block:: yaml

panel_of_normals:
tools: [access]
access:
exclude:
- <path to mappability file>

This create an access file in ``output/cnvkit.access/out/cnvkit.access.bed``.

Then, the panel can be created:

.. code-block:: yaml

panel_of_normals:
tools: [access, cnvkit]
access:
exclude:
- <path to mappability file>
cnvkit:
access: <absolute path to access file>
path_target_regions: <path to exome baits file>

``PureCN``
----------

``PureCN`` requires the ``mutect2`` panel of normals to be able to create one for itself.
Like ``cnvkit``, the ``PureCN`` panel of normals creation involves two steps:

First, the ``mutect2`` panel of normals must be created, and second, it can be used to create ``PureCN``'s panel:

.. code-block:: yaml

panel_of_normals:
tools: [mutect2, purecn]
mutect2:
...
purecn:
path_genomeDB: <absolute path to output/<mapper>.mutect2/out/<mapper>.mutect2.genomicsDB.tar.gz>
path_bait_regions: <path to exome baits file>
genome_name: hg19 # Must be either "hg19" or "hg38"
enrichment_kit_name: "exome" # Used to name output files and for CNV processing

96 changes: 96 additions & 0 deletions docs/somatic_cnv.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
.. _somatic_cnv:

-------------------
Somatic CNV calling
-------------------

Somatic variant calling is implemented differently for exome and whole genome data.

The whole genome data "branch" is currently under review, as GRCh38 support in ``Control-FREEC`` (the main workhorse for WGS CNV calling) is not complete.
CNV calling in WGS data can also be done using ``cnvkit``, but its pipeline implementation is also incomplete.

The following documentation is restricted to the tools currently implemented to process exome data: ``cnvkit``, ``purecn`` & ``sequenza``.

Output & performance
====================

The 3 methods generally broadly agree on the log ratio of coverage between tumor & normal samples.

However, the segmentation and the number of copies assigned to a segment can be quite different between the algorithms.

Output files
------------

There is no widely used standard to report copy number alterations.
In absence of a better solution, all CNV tools implemented in somatic pipeline output the segmentation table loosely following the `DNAcopy format <https://bioconductor.org/packages/devel/bioc/manuals/DNAcopy/man/DNAcopy.pdf>`_.`
The copy number call may or may not be present, and the chromosome number is replaced by its name.
The segmentation output is in file ``output/<mapper>.<cnv caller>.<lib name>/out/<mapper>.<cnv caller>.<lib name>_dnacopy.seg``.

Genome support
--------------

Both ``purecn`` & ``sequenza`` have better support for ``GRCh37`` than for ``GRCh38``.
In both cases, it stems from the fact that the segmentation packages used by the methods haven not been updated for ``GRCh38``.
Both ``purecn`` & ``sequenza`` provide remedial solutions, however they are not available from ``bioconda``.
For that reason, the pipeline implementation relies on Docker containers for ``purecn`` and on github forks of R packages for ``sequenza``.

The pipeline currently cannot handle male & female samples differently, even though the sex has a strong influence on the results.
It is therefore useful to avoid calling CNVs on sex chromosomes.
This can be easily achieved for ``sequenza``, using the ``ignore_chroms`` option.
``purecn`` will handle the sample sex internally, but ``cnvkit`` requires the user to remove the sex chromosomes from the access file.

Segmentation and CNV calls
--------------------------

``cnvkit`` generally produces more segments than ``purecn``, and shorter ones.
However, with the default settings, ``cnvkit`` is very conservative with the calls.
Most segments are called diploid, even when the coverage log ratio suggests otherwise.
For best results, it is advisable to experiment with different segmentation algorithms and different sensitivity thresholds.

Purity & ploidy
---------------

``purecn`` attempts to model tumor purity and ploidy, and the results are found in ``<mapper>.purecn.<lib name>/out/<mapper>.purecn.<lib name>.csv``.
``sequenza`` can also offer these estimates, but they are currently only stored in ``R`` objects in the ``work`` directory, at ``<mapper>.sequenza.<lib name>/report/<lib name>_sequenza_cp_table.RData``.

Getting results for ``PureCN``
==============================

The pipeline implementation of ``PureCN`` requires a panel of normals (which in turns requires building a panel of normals for ``mutect2``).
Moreover, it also needs ``vcf`` files containing both germline & somatic variants.
For these files, ``mutect2`` needs to be run with different parameters than those used for somatic variant calling.
Therefore, the ``somatic_variant_calling`` step must be run *twice*, with different arguments, in different locations.

For the second ``mutect2`` run executed in the ``somatic_variant_calling_for_purecn`` directory, the configuration would look like:

.. code-block:: yaml

somatic_mutation_calling:
tools: [mutect2]
mutect2:
extra_arguments: [ "--genotype-germline-sites true", "--genotype-pon-sites true" ] # These arguments must be added
panel_of_normals: <absolute path to the panel of normal output>
germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
common_variants: <path to small_exac_common_3.vcf.gz>
window_length: 300000000 # For exome data, it is sufficient to split the genome by chromosomes
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2

somatic_targeted_seq_cnv_calling:
tools: [purecn]
purecn:
genome_name: "hg19" # This must match the names given while building the panel of normals
enrichment_kit_name: "exome" # This must match the names given while building the panel of normals
path_somatic_variants: ../somatic_variant_calling_for_purecn
somatic_variant_caller: mutect2
path_panels_of_normals: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.panel_of_normals.rds>
path_mapping_bias: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.mapping_bias.rds>
path_intervals: <absolute path to panel_of_normals/output/purecn/out/exome_hg19.list>
path_container: <absolute path to panel_of_normals/work/containers/out/purecn.simg>

From the ``panel_of_normals`` directory, ``purecn`` requires 3 types of files:

- the ``panel_of_normals`` itself, and the ``mapping_bias`` objects are taken from ``<mapper>.purecn/out``. This is because they might change with different mapping tools.
- the ``intervals`` taken from ``purecn/out``, as the definition of intervals depend only on the genome & the exome kit, but not on the mapping tool.
- the ``container`` taken from ``work/containers/out``, to ensure that the ``PureCN`` version used to compute copy number variants is identical to that used to compute the panel of normals.

Loading
Loading