bihealth · ericblanc20 · Mar 20, 2024 · Feb 1, 2024 · Feb 1, 2024 · Feb 1, 2024
diff --git a/docs/cbioportal_export.rst b/docs/cbioportal_export.rst
@@ -0,0 +1,30 @@
+.. _cbioportal_export:
+
+--------------------
+Export to cBioPortal
+--------------------
+
+The step ensures that the following operations are carried out:
+
+- For each sample, the somatic variant ``vcf`` file must be converted to the ``maf`` format.
+- For each sample, the CNV segments results must be applied to gene locii, for the coverage log ratio, and for the copy number count.
+- The individual sample files must then be merged to create single files for the whole cohort. This must be done for somatic variants, coverage log ratio & copy number counts, and for expression levels when present.
+- Clinical data must be gathered at the patient & sample level.
+- List of cases must be created, for mutations, copy number changes & expression levels.
+- Meta files required by cBioPortal must be created.
+
+The ``vcf`` to ``maf`` conversion is done by a bespoke script, which requires a mapping between gene symbols and NCBI (ENTREZ) gene ids.
+This mapping is independent of the genome release, and can be obtained directly from `HGNC <https://www.genenames.org/download/archive/>`_. 
+There is a relatively recent version on the cluster, in ``/fast/work/groups/cubi/projects/biotools/static_data/annotation/hgnc_complete_set_2023-10-01.tsv``.
+
+Enriching clinical data
+=======================
+
+The pipeline can produce several quantities of interest, that can be added to the sample data uploaded to cBioPortal.
+Currently, only the tumor mutational burden computed by the ``tumor_mutational_burden`` step can be added to the sample table.
+In the near future, the Micro Satellite Instability (MSI) status and the homologous recombination deficiency will be added.
+The tumor purity and ploidy might also be considered, although as these results are generally hidden in the output of the somatic copy number steps, the situation is less clear.
+Finally, the patient HLA alleles and the sex could be included, either at the patient or sample level.
+
+Functional scores based on expression data will also be included, as well as gene fusions, but structural variants might remain a more distant goal, as they are notoriously diffict to infer from exome data.
+
diff --git a/docs/index.rst b/docs/index.rst
@@ -10,6 +10,11 @@ Pipeline User Docs
     Start here to learn about the pipeline.
     This section starts at :ref:`quickstart`.
 
+The somatic pipeline in detail
+    Documentation for the **users of the somatic pipeline**.
+    Details advanced usage of the somatic pipeline, for example the generation & usage of panel of normals.
+    This section starts at :ref:`somatic_ngs`.
+
 Pipeline Step Docs
     Documentation for the individual pipeline steps.
     This includes a general description, description of the related configuration settings, and a documentation of generated output files and input workflow steps.
@@ -40,6 +45,16 @@ Project Info
     usage
     overview
 
+.. toctree::
+    :caption: Somatic pipeline
+    :hidden:
+
+    somatic_ngs
+    panel_of_normals
+    somatic_variant_filtration
+    somatic_cnv
+    cbioportal_export
+
 .. toctree::
     :caption: Pipeline Step Docs
     :hidden:

diff --git a/docs/overview.rst b/docs/overview.rst
@@ -92,9 +92,9 @@ working directory
 
 .. _pipeline_projects:
 
-------------------
-An Example Project
-------------------
+------------------------
+A Simple Example Project
+------------------------
 
 The above part of this chapter is quite abstract.
 Let us draw some pictures and go from the abstract description to a concrete example.
@@ -167,14 +167,6 @@ In its ``data_sets`` section, the project-level configuration file provides sear
 
 The search will loop over provided search paths & search patterns. Paired reads files are coupled by similarity of their path. Note that when the ``Folder`` entry is absent from the sample sheet, the library name is used instead.
 
-However, this default behaviour can be overriden using the ``path_link_in`` option (which is available only for steps that use FASTQ files as input). When this configuration option is not empty, ``snappy`` will use it instead of the list of search paths defined in the ``data_set`` part. It will also ignore the folder information, and rely instead on the library names to search FASTQ files. The search path becomes:
-
-::
-
-    <path_link_in>/<library_name>/../<search_pattern>
-
-This mechanism enables steps that generate FASTQ files on output, for example adapter trimming. In that case, the input of the mapping step can be redirected towards the ouput of the adapter trimming step using this method.
-
 
 Overview of the Somatic Variant Pipeline
 ========================================

diff --git a/docs/panel_of_normals.rst b/docs/panel_of_normals.rst
@@ -0,0 +1,106 @@
+.. _panel_of_normals:
+
+----------------
+Panel of normals
+----------------
+
+The purpose of panel of normals is to use the normal samples to identify technical biases present in the data.
+Thes biases can be (to a certain extent) be then corrected when processing the tumor samples.
+It is therefore **very** important to make sure that all normal samples have been collected, processed & sequenced in the same way.
+Downstream processing of tumor samples must also ensure that the sample collection, processing & sequencing match the panel.
+
+Finally, the panels require a certain number of normal samples to be effective.
+Below that number, the particulars of any normal sample included in the panel will be considered as bias, and corrected for later in processing.
+
+Panel of normals can be generated for multiple applications: somatic variant calling (``mutect2``) & somatic copy number calling (exome data).
+
+Somatic variant calling
+=======================
+
+The panel of normals for ``mutect2`` is straightforward to generate: only a germline resource is needed, to locate known germline variants (SNVs & indels).
+This germline resource is provided in the `GATK best practice bundle <https://console.cloud.google.com/storage/browser/gatk-best-practices>`_.
+On the cluster, it can be found in ``/fast/work/groups/cubi/projects/biotools/static_data/app_support/GATK/``, search for ``vcf`` files with ``af-only-gnomad`` in their name.
+
+It is `recommended <https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON>`_ to have at least 40 normals in the panel, but it is not uncommon to build a panel with fewer samples.
+When the number of samples in too small, GATK provides a panel of normals in the same `best practice bundle <https://console.cloud.google.com/storage/browser/gatk-best-practices>`_.
+It can also be found on the cluster, in the same location as above.
+
+In case samples should be omitted from the panel of normals, the user can create a file with the names of the libraries to be included in the panel (one line per library).
+This file can be provided using the ``path_normals_list`` configuration option.
+
+Finally, it is recommended to restrict the generation of the panel to the sequences of interest. 
+This is done by excluding unwanted sequences in the following way:
+
+.. code-block:: yaml
+
+    panel_of_normals:
+      tools: [mutect2]
+      mutect2:
+        germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
+        path_normals_list: ""          # Use all samples when empty, otherwise use only samples in the list file.
+        window_length: 300000000       # For exome data, it is sufficient to split the genome by chromosomes
+        ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
+      ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2
+
+Copy numbers (exome data)
+=========================
+
+.. note::
+
+    The current implementations of panel of normals for ``cnvkit`` & ``PureCN`` rely on files generated by the pipeline, but that are **not** properly maintained by the pipeline.
+    This is suboptimal, complicates considerably the 
+
+``cnvkit``
+----------
+
+``cnvkit`` requires a file describing regions unsuitable for coverage analysis (repeats, difficult to map, extreme GC content, ...).
+It provides it for ``GRCh37``, but not for ``GRCh38``, but the ``access`` tool can create such a file from mappability regions in ``bed`` format.
+
+This ``access`` tool **must** be run before creating the ``cnvkit`` panel of normals. 
+In other words, the ``cnvkit`` panel of normals is created in two steps:
+
+First, the generation of the access file:
+
+.. code-block:: yaml
+
+    panel_of_normals:
+      tools: [access]
+      access:
+        exclude:
+        - <path to mappability file>
+
+This create an access file in ``output/cnvkit.access/out/cnvkit.access.bed``.
+
+Then, the panel can be created:
+
+.. code-block:: yaml
+
+    panel_of_normals:
+      tools: [access, cnvkit]
+      access:
+        exclude:
+        - <path to mappability file>
+      cnvkit:
+        access: <absolute path to access file>
+        path_target_regions: <path to exome baits file>
+
+``PureCN``
+----------
+
+``PureCN`` requires the ``mutect2`` panel of normals to be able to create one for itself.
+Like ``cnvkit``, the ``PureCN`` panel of normals creation involves two steps:
+
+First, the ``mutect2`` panel of normals must be created, and second, it can be used to create ``PureCN``'s panel:
+
+.. code-block:: yaml
+
+    panel_of_normals:
+      tools: [mutect2, purecn]
+      mutect2:
+        ...
+      purecn:
+        path_genomeDB: <absolute path to output/<mapper>.mutect2/out/<mapper>.mutect2.genomicsDB.tar.gz>
+        path_bait_regions: <path to exome baits file>
+        genome_name: hg19               # Must be either "hg19" or "hg38"
+        enrichment_kit_name: "exome"    # Used to name output files and for CNV processing
+
diff --git a/docs/somatic_cnv.rst b/docs/somatic_cnv.rst
@@ -0,0 +1,96 @@
+.. _somatic_cnv:
+
+-------------------
+Somatic CNV calling
+-------------------
+
+Somatic variant calling is implemented differently for exome and whole genome data.
+
+The whole genome data "branch" is currently under review, as GRCh38 support in ``Control-FREEC`` (the main workhorse for WGS CNV calling) is not complete.
+CNV calling in WGS data can also be done using ``cnvkit``, but its pipeline implementation is also incomplete.
+
+The following documentation is restricted to the tools currently implemented to process exome data: ``cnvkit``, ``purecn`` & ``sequenza``.
+
+Output & performance
+====================
+
+The 3 methods generally broadly agree on the log ratio of coverage between tumor & normal samples. 
+
+However, the segmentation and the number of copies assigned to a segment can be quite different between the algorithms.
+
+Output files
+------------
+
+There is no widely used standard to report copy number alterations. 
+In absence of a better solution, all CNV tools implemented in somatic pipeline output the segmentation table loosely following the `DNAcopy format <https://bioconductor.org/packages/devel/bioc/manuals/DNAcopy/man/DNAcopy.pdf>`_.`
+The copy number call may or may not be present, and the chromosome number is replaced by its name.
+The segmentation output is in file ``output/<mapper>.<cnv caller>.<lib name>/out/<mapper>.<cnv caller>.<lib name>_dnacopy.seg``.
+
+Genome support
+--------------
+
+Both ``purecn`` & ``sequenza`` have better support for ``GRCh37`` than for ``GRCh38``.
+In both cases, it stems from the fact that the segmentation packages used by the methods haven not been updated for ``GRCh38``.
+Both ``purecn`` & ``sequenza`` provide remedial solutions, however they are not available from ``bioconda``. 
+For that reason, the pipeline implementation relies on Docker containers for ``purecn`` and on github forks of R packages for ``sequenza``.
+
+The pipeline currently cannot handle male & female samples differently, even though the sex has a strong influence on the results.
+It is therefore useful to avoid calling CNVs on sex chromosomes.
+This can be easily achieved for ``sequenza``, using the ``ignore_chroms`` option. 
+``purecn`` will handle the sample sex internally, but ``cnvkit`` requires the user to remove the sex chromosomes from the access file.
+
+Segmentation and CNV calls
+--------------------------
+
+``cnvkit`` generally produces more segments than ``purecn``, and shorter ones. 
+However, with the default settings, ``cnvkit`` is very conservative with the calls.
+Most segments are called diploid, even when the coverage log ratio suggests otherwise.
+For best results, it is advisable to experiment with different segmentation algorithms and different sensitivity thresholds.
+
+Purity & ploidy
+---------------
+
+``purecn`` attempts to model tumor purity and ploidy, and the results are found in ``<mapper>.purecn.<lib name>/out/<mapper>.purecn.<lib name>.csv``.
+``sequenza`` can also offer these estimates, but they are currently only stored in ``R`` objects in the ``work`` directory, at ``<mapper>.sequenza.<lib name>/report/<lib name>_sequenza_cp_table.RData``.
+
+Getting results for ``PureCN``
+==============================
+
+The pipeline implementation of ``PureCN`` requires a panel of normals (which in turns requires building a panel of normals for ``mutect2``).
+Moreover, it also needs ``vcf`` files containing both germline & somatic variants.
+For these files, ``mutect2`` needs to be run with different parameters than those used for somatic variant calling.
+Therefore, the ``somatic_variant_calling`` step must be run *twice*, with different arguments, in different locations.
+
+For the second ``mutect2`` run executed in the ``somatic_variant_calling_for_purecn`` directory, the configuration would look like:
+
+.. code-block:: yaml
+
+    somatic_mutation_calling:
+      tools: [mutect2]
+      mutect2:
+        extra_arguments: [ "--genotype-germline-sites true", "--genotype-pon-sites true" ] # These arguments must be added
+        panel_of_normals: <absolute path to the panel of normal output>
+        germline_resource: <path to af-only-gnomad.raw.sites.vcf.gz>
+        common_variants: <path to small_exac_common_3.vcf.gz>
+        window_length: 300000000       # For exome data, it is sufficient to split the genome by chromosomes
+        ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # For hs37d5
+      ignore_chroms: [NC_007605, hs37d5, chrEBV, '*_decoy', 'HLA-*', 'GL000220.*'] # Must be repeated at the level above mutect2
+
+    somatic_targeted_seq_cnv_calling:
+      tools: [purecn]
+      purecn:
+        genome_name: "hg19"             # This must match the names given while building the panel of normals
+        enrichment_kit_name: "exome"    # This must match the names given while building the panel of normals
+        path_somatic_variants: ../somatic_variant_calling_for_purecn
+        somatic_variant_caller: mutect2
+        path_panels_of_normals: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.panel_of_normals.rds>
+        path_mapping_bias: <absolute path to panel_of_normals/output/<mapper>.purecn/out/<mapper>.purecn.mapping_bias.rds>
+        path_intervals: <absolute path to panel_of_normals/output/purecn/out/exome_hg19.list>
+        path_container: <absolute path to panel_of_normals/work/containers/out/purecn.simg>
+
+From the ``panel_of_normals`` directory, ``purecn`` requires 3 types of files:
+
+- the ``panel_of_normals`` itself, and the ``mapping_bias`` objects are taken from ``<mapper>.purecn/out``. This is because they might change with different mapping tools.
+- the ``intervals`` taken from ``purecn/out``, as the definition of intervals depend only on the genome & the exome kit, but not on the mapping tool.
+- the ``container`` taken from ``work/containers/out``, to ensure that the ``PureCN`` version used to compute copy number variants is identical to that used to compute the panel of normals.
+