feat: Initial implementation of Molecular bar codes handling using AGeNT #462

Merged
merged 5 commits into from
Dec 20, 2023

Conversation

ericblanc20
Contributor

Prototype implementation of mapping data generated with Molecular BarCodes (MBC) or UMIs.

Background

MBCs are typically used on FFPE data, where library complexity may be low. These data are also often compromised by FFPE or oxo-G artifacts, which require careful analysis & filtration of somatic variants. Base quality score recalibration (BQSR) should be used in these cases.

Design

The implementation has 5 steps:

  1. Trimming the MBC sequence at the read's 5' end and inserting it in the read name for further processing.
  2. Mapping the reads
  3. Merging separate libraries
  4. Marking the duplicates using the information in read names
  5. Performing BQSR

Steps 1 & 2 must be run on each library separately, so that the read group information required by BQSR can be inserted easily. The per-library BAM files must be merged before marking duplicates.
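
Step 1 can be illustrated with a minimal Python sketch. It does not reproduce what AGeNT does internally: the barcode length, the `RX:Z:` comment tag, and the function name `move_mbc_to_header` are all assumptions made for illustration.

```python
def move_mbc_to_header(record, mbc_length=3):
    """Move the first `mbc_length` bases of a read into its header line.

    `record` is a (header, sequence, separator, qualities) tuple for one
    FASTQ entry. The barcode length and the RX:Z: tag are assumptions.
    """
    header, sequence, separator, qualities = record
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    if len(sequence) <= mbc_length:
        raise ValueError("read is too short to carry a barcode")
    barcode = sequence[:mbc_length]
    # Keep the barcode as a header comment so later steps can read it.
    new_header = f"{header} RX:Z:{barcode}"
    return (new_header, sequence[mbc_length:], separator, qualities[mbc_length:])
```

With the barcode stored next to the read name, the duplicate-marking step can tell apart reads that map to the same position but come from different molecules.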

Implementation

Because multiple operations are required to produce the final result, I opted to create a meta sub-step that generates a Snakefile handling all necessary steps. This is similar to the parallel wrapper, except that the steps are not chunks of the same operation on smaller regions, but logically different operations.
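
To make the idea concrete, here is a rough sketch of a wrapper that generates such a Snakefile. It is not the code from this PR: the rule names, file paths, and the functions `render_snakefile` and `write_snakefile` are hypothetical, every shell command is a placeholder, and only single-end input is modelled.

```python
import tempfile
from pathlib import Path


def render_snakefile(libraries):
    """Return Snakefile text chaining the five steps for the given libraries."""
    mapped = ", ".join(f'"mapped/{lib}.bam"' for lib in libraries)
    return f'''# All shell commands are placeholders, not real tool calls.
rule all:
    input: "final/recalibrated.bam"

rule trim:
    input: "fastq/{{lib}}.fastq.gz"
    output: "trimmed/{{lib}}.fastq.gz"
    shell: "cp {{input}} {{output}}"

rule map:
    input: "trimmed/{{lib}}.fastq.gz"
    output: "mapped/{{lib}}.bam"
    shell: "cp {{input}} {{output}}"

rule merge:
    input: {mapped}
    output: "merged/all.bam"
    shell: "cat {{input}} > {{output}}"

rule markdup:
    input: "merged/all.bam"
    output: "markdup/all.bam"
    shell: "cp {{input}} {{output}}"

rule bqsr:
    input: "markdup/all.bam"
    output: "final/recalibrated.bam"
    shell: "cp {{input}} {{output}}"
'''


def write_snakefile(workdir, libraries):
    """Write the generated Snakefile into `workdir` and return its path."""
    path = Path(workdir) / "Snakefile"
    path.write_text(render_snakefile(libraries))
    return path


# The temporary directory keeps large intermediate files out of the work tree.
with tempfile.TemporaryDirectory() as tmpdir:
    snakefile = write_snakefile(tmpdir, ["lib1", "lib2"])
    # A real wrapper would now run snakemake on this file inside tmpdir.
```

The `trim` and `map` rules use a `{lib}` wildcard, so each library is processed independently until the `merge` rule collects them.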

Benefits

  1. The wrapper creating and running the Snakefile creates a temporary directory where all the disk-intensive operations occur. This avoids cluttering the work/<tool>.<library> directory with large files which are not final results.
  2. The code is relatively straightforward, and blends well with the other features of the step (coverage analysis, ...)
  3. There is natural parallelisation of libraries sequenced on multiple lanes.

Drawbacks

  1. The code (as it is now) is inflexible: it is hard to see how another MBC tool could be added, and the mapper must be either bwa or bwa-mem2 (as they share the same input parameters). It is also currently not possible to opt out of BQSR.
  2. The notion of meta sub-step may go against the whole snappy design.
  3. The MBC tool AGeNT is commercial software from Agilent, and I don't think it is available on Bioconda. Because of time pressure, I have not been able to look for alternatives (such as umi_tools).

Notes

The current implementation must be viewed as a prototype. If the meta sub-step concept is deemed acceptable for snappy, I have considered a few options to improve on the current implementation and make it more flexible.

  1. The BQSR can easily be put under user control.
  2. The mapper wrappers could be modified to adapt their parameters for compatibility with the MBC tool: AGeNT requires the -C option (append FASTA/FASTQ comment to SAM output) to be set, read groups to be added, and mapping to be run on separate libraries.
  3. The current wrapper could be abstract, with mbc-tool specific concrete classes.

However, in my opinion, all changes and improvements must be weighed against making the ngs_mapping step unduly complex.
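
As a rough illustration of options 2 and 3 combined, the sketch below lets each barcode tool declare the extra mapper arguments it needs. The class and function names (`MbcTool`, `AgentMbcTool`, `NoMbcTool`, `build_bwa_mem_command`) are hypothetical and do not exist in snappy; only the `bwa mem` options `-C` and `-R` are real.

```python
from abc import ABC, abstractmethod
from typing import List


class MbcTool(ABC):
    """Hypothetical base class describing what a barcode tool needs."""

    @abstractmethod
    def mapper_extra_args(self) -> List[str]:
        """Extra command-line arguments the mapper must receive."""


class AgentMbcTool(MbcTool):
    def mapper_extra_args(self) -> List[str]:
        # -C appends the FASTA/FASTQ comment to the SAM output.
        return ["-C"]


class NoMbcTool(MbcTool):
    def mapper_extra_args(self) -> List[str]:
        # Without barcodes the mapper needs no extra arguments.
        return []


def build_bwa_mem_command(
    tool: MbcTool, reference: str, reads: List[str], read_group: str
) -> List[str]:
    """Assemble a bwa mem argument list for one library."""
    # -R sets the read group header line for this library.
    return ["bwa", "mem", *tool.mapper_extra_args(), "-R", read_group, reference, *reads]
```

Supporting another tool would then mean adding one more subclass, which is also where the extra complexity would accumulate.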

@ericblanc20 ericblanc20 linked an issue Nov 1, 2023 that may be closed by this pull request
@coveralls

coveralls commented Nov 1, 2023

Coverage Status

coverage: 85.736% (-0.1%) from 85.866%
when pulling 11d5464 on 461-support-for-molecular-barcodes
into 4874074 on main.

Contributor

@mbenary mbenary left a comment

One quick question, otherwise looks good to me.

# Input fastqs are passed through snakemake.params.
# snakemake.input is a .done file touched after linking files in.
input_left = snakemake.params.args["input"]["reads_left"]
input_right = snakemake.params.args["input"].get("reads_right", "")
Contributor

Why are you using get for one input and [ ] for the other?

Contributor Author

Because the first is always present and should raise an exception if it is missing, while the second is optional and defaults to the empty string.
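
A small standalone sketch of the difference, using a plain dictionary in place of `snakemake.params.args`. The function name `read_inputs` and the file names are made up for illustration.

```python
def read_inputs(args):
    """Return (reads_left, reads_right) from a params-like dictionary."""
    # Required: a missing key raises KeyError immediately.
    reads_left = args["input"]["reads_left"]
    # Optional: fall back to "" when no right reads are given.
    reads_right = args["input"].get("reads_right", "")
    return reads_left, reads_right


# Only left reads are given: the right reads fall back to the default.
left, right = read_inputs({"input": {"reads_left": ["sample_R1.fastq.gz"]}})
```

Calling `read_inputs({"input": {}})` fails with a `KeyError`, which is the intended behaviour for a mandatory input.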

TODO: 1. rename the ugly 'mcbs' to 'somatic' or 'accurate'
      2. implement the 'extra_args' in the mapping tools to enable mapper control parameters specific for barcodes (-C)
      3. make the mapping operation generic (not restricted to bwa & bwa-mem2)
      4. implement umi_tools for barcodes/umis processing
      5. rename bqsr statistics so they can be collected by multiqc
@ericblanc20 ericblanc20 merged commit 768dded into main Dec 20, 2023
7 checks passed
@ericblanc20 ericblanc20 deleted the 461-support-for-molecular-barcodes branch December 20, 2023 14:21
@tedil tedil mentioned this pull request Jun 28, 2024
This was referenced Dec 9, 2024
Successfully merging this pull request may close these issues.

Support for molecular barcodes
3 participants