Replace mapping with de novo assembly #442

Closed
59 tasks done
donkirkby opened this issue Jun 12, 2018 · 1 comment

donkirkby commented Jun 12, 2018

We currently use bowtie2 to map reads to a large set of reference sequences. For most samples, it works well. However, we have had some problems with reference drift (#290), calling HCV subtypes (#436), insertion and deletion positions (#398), and samples that produce different results when you rerun the mapping (#405).

We'd like to experiment with using de novo assembly instead of mapping.

  • use de novo assembly in micall pipeline
  • soft clipping is lost, possibly because of contig names
  • run one iteration of the remap step to map all the reads onto all the contigs
  • use Smith-Waterman Gotoh to align all the contigs onto the references
  • display contigs aligned with references
  • display contigs that extend beyond the references, with 1693P1Y04608-1E10-HIV_S22 as an example
  • display labels on short contigs by using an uncoloured track
  • display contigs grouped by reference, with 2569P1Y02945-1F6-HIV_S28 as an example
  • sort contigs by reference, then by descending coverage
  • add a dashed banner to the "Partial Blast Results" section, with dashes 100 bases long
  • smooth out coverage plots to reduce file sizes
  • show deletions as gaps in tracks
  • show insertions as green blocks
  • show partial segments in the banner above partial blast results, with 90314A-HCV_S49 as an example
  • use Smith-Waterman to align all the primers onto all the contigs (moved to "Display primers on contig coverage diagram" #478)
  • combine counts from all the contigs onto the coordinate references
  • V3LOOP doesn't build a contig, because the amplicon is shorter than 540 bases (HLA will probably have the same problem.)
  • dashes for partial blast results should begin at zero, not left border, with 61832A-HCV-43562A-GRrerun-HLA-B-POSA1211-5-2-V3LOOP_S2 as an example
  • some insertions show up to the left of the contig (not considering the offset for negative positions?), with 61832A-HCV-43562A-GRrerun-HLA-B-POSA1211-5-2-V3LOOP_S2 as an example
  • add HLA gene map
  • support mapping and assembly versions in a single Singularity image
  • make MiCall watcher call resistance after denovo steps
  • let MiCall watcher run denovo without standard pipeline
  • update columns in cascade.csv
  • use denovo pipeline as a backup for denovo combined pipeline in Kive watcher
  • download contig coverage files
  • add nuc_detail.csv
  • combine amino_detail.csv and nuc_detail.csv by seed groups, not by seeds. For example, sample 1693-1IN2C2-HIV_S16 from the 09-Aug-2019.M01841 run. Not needed after embedding contigs in ref.
  • check if assembly is stable - sample 1693-1IN2C2-HIV_S16 from the 09-Aug-2019.M01841 run had two very different sets of contigs. Difference between workstation and cluster? Basespace and Kive?
  • rename contig_coverage files to genome_coverage, and produce them from both the denovo version and the mapped version
  • collect genome_coverage.svg files in a separate folder from the other coverage maps
  • combine deletions, partial deletions, insertions, and low quality counts from the detail files
  • try aligning contigs in the reverse direction to see which aligns better, or look at blast result. HIV0836-P2K10-HIV_S26 in the 30-Aug-2019.M04401 run is an example.
  • embed contigs into a complete reference, to try and map regions that didn't assemble
  • lots of samples have reversed sections near the ends - report reversed contigs with partials
  • de novo assembly is very slow for some samples. IVA seems better than Savage.
  • bring back G2P analysis
  • should G2P continue to use merged reads, or should it switch to aligned reads? (moved to "Adapt G2P to de novo assembly?" #481)
  • should we try to report V3LOOP overlap again? (moved to "Adapt G2P to de novo assembly?" #481)
  • remove the built index of genotypes from source control
  • add other viruses to index
  • don't show separate contigs for -partial or -reversed suffixes.
  • show deletions in contigs - 1497DEL-HIV_S119 from 7 Dec 2018 is an example
  • make microtest samples work with de novo assembly
  • add tests for microtest samples, or an option to verify their results in micall_basespace.py
  • should we make the contig coverage diagram match what we used to cut up the gene regions? 73051ANS5A1-HCV-NS5a_S89 from 15-Jul-2016.M01841 is an example where they don't match. (moved to "Make coverage maps consistent with contigs coverage plot" #479)
  • denovo assembly sometimes uses more than one thread, specifically smalt uses up to 8. Either reduce CPU usage, or increase number of threads requested from Slurm.
  • use BLAST results to assemble contigs into a full reference? Haven't found any clear cases where they should be combined. HIV3428P100IN200-C19-HIV-S51 from 20 Sep 2019 run is the closest, but it looks like one contig has primer at the end. Samples HIV0887-P2D21-HIV_S3 and HIV0887-P2C12-HIV_S32 from 30 Aug 2019 look even better, but have very little overlap. Some of the HCV samples look more promising: 73060A-HCV_S46 from 15 Jul 2016, for example. (moved to "Merge assembled contigs?" #484)
  • display BLAST results on genome coverage diagram: find the top scoring result, then display all others on the same reference, numbering each section
  • put arrow on top of BLAST results in reference landmarks
  • adjust BLAST arrows when diagram has a horizontal offset, as in JRCCC3-GP160NEF-HIV_S83 from 27 Sep 2019
  • bring back amino_details.csv and combination into amino.csv
  • look at BLAST results for recombinant samples from 27 Sep 2019 - are there any where the best BLAST hit for one region matches a different reference from the best BLAST hit for another region? HCV example: 73087A-HCV_S2 from 15 Jul 2016 has a gap in BLAST matches from C to NS2 in HCV-1b, but has a big match to HCV-2k from 5' to NS2, and 73088A-HCV_S15 is similar.
  • Reference arrows aren't aligned right vertically (B5-VIR5POL-HIV_S12) or horizontally (B4-VIR4POL-HIV_S8), examples are from 27 Sep 2019
  • compare assembled and remapped versions with previous release
  • deal with HIV references that don't reach 5' and 3' ends or bring back the refs we removed (moved to "Merge assembled contigs?" #484)
  • check for similar problems with other seed groups (moved to "Merge assembled contigs?" #484)
  • remove Savage?
  • remove second version of consensus, with lower minimum coverage?
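
As an illustration of the "sort contigs by reference, then by descending coverage" task in the list above, here is a minimal sketch. The `Contig` record and its field names are hypothetical, chosen only for this example, and are not MiCall's actual data structures.

```python
from typing import List, NamedTuple


class Contig(NamedTuple):
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    ref: str
    coverage: int


def sort_contigs(contigs: List[Contig]) -> List[Contig]:
    """Order contigs by reference name, then by descending coverage."""
    # Negating coverage gives descending order within each reference.
    return sorted(contigs, key=lambda c: (c.ref, -c.coverage))
```

Using a single composite key keeps the sort stable and avoids a second pass over the list.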

donkirkby commented Jul 10, 2018

@jeff-k has been using de novo assembly to look at several samples that got strange results with the current MiCall pipeline. It looks like one advantage of the technique will be that we can distinguish between these two scenarios:

  • high error rates (generates many, unrelated contigs)
  • unknown sequence like a chimera or new strain (generates a small number of related contigs, but the contigs don't map well to known references)

With the current MiCall pipeline, both of those scenarios just look like lousy mapping with gaps in coverage.
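
To make the two scenarios above concrete, here is a rough sketch of how they might be told apart. Everything in it is an assumption for illustration: each contig is summarised as a (best reference, match fraction) pair, "related" is approximated as "all contigs share the same best reference", and the thresholds are placeholders. It is not something the pipeline does.

```python
from collections import Counter
from typing import List, Tuple


def describe_assembly(contigs: List[Tuple[str, float]],
                      max_related: int = 3,
                      min_match: float = 0.9) -> str:
    """Classify an assembly from (best_ref, match_fraction) per contig.

    Both thresholds are hypothetical placeholders, not tuned values.
    """
    if not contigs:
        return 'no contigs'
    ref_counts = Counter(ref for ref, _ in contigs)
    _, top_count = ref_counts.most_common(1)[0]
    # Assumption: contigs are "related" when they all hit the same reference.
    related = top_count == len(contigs)
    if len(contigs) > max_related and not related:
        return 'possible high error rate'
    if related and min(match for _, match in contigs) < min_match:
        return 'possible unknown sequence'
    return 'typical'
```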

We propose this plan for a full MiCall pipeline that includes de novo assembly:

  • trim adapter sequences
  • use de novo assembly to generate a set of contigs
  • combine the contigs into one or more consensus sequences using prelim_map and remap as we combine the reads today
  • do a final remap with all of the reads against the consensus sequences
  • report the reads aligned against the consensus sequences
  • count the coverage and report the mixtures at each position
  • generate coverage maps
  • generate resistance interpretations

As you can see, this just affects the prelim_map and remap steps. Because they're only running on a small number of contigs, they should be much faster.

One risk is that the de novo assembly step might be much slower in some cases than the remap step is currently.
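
The "count the coverage and report the mixtures at each position" step in the plan above can be sketched as follows. This assumes reads are already aligned to the consensus without insertions or deletions and are given as (offset, sequence) pairs. The function names and the mixture cutoff are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_positions(reads: Iterable[Tuple[int, str]]) -> Dict[int, Counter]:
    """Count nucleotides at each zero-based consensus position."""
    counts = defaultdict(Counter)
    for offset, seq in reads:
        for i, nuc in enumerate(seq):
            counts[offset + i][nuc] += 1
    return dict(counts)


def report_mixtures(counts: Dict[int, Counter],
                    cutoff: float = 0.2) -> List[Tuple[int, int, str]]:
    """Return (position, coverage, nucleotides at or above cutoff) rows."""
    rows = []
    for pos in sorted(counts):
        coverage = sum(counts[pos].values())
        # Keep every nucleotide whose share of the coverage meets the cutoff.
        mixture = sorted(nuc for nuc, n in counts[pos].items()
                         if n / coverage >= cutoff)
        rows.append((pos, coverage, ''.join(mixture)))
    return rows
```

A real implementation would also need to handle soft clipping, insertions, deletions, and quality scores, which this sketch leaves out.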

@donkirkby donkirkby added this to the 7.10 - De Novo Assembly milestone Jul 18, 2018
donkirkby added a commit that referenced this issue Oct 11, 2018
donkirkby added a commit that referenced this issue Oct 18, 2018
Remove G2P step for now. It needs to be added back.
Also switch from Python 3.4 to 3.6.
donkirkby added a commit that referenced this issue Oct 26, 2018
This version is just a copy of the original from the iva project. That should make it easier to apply patches if the iva project makes changes.
donkirkby added a commit that referenced this issue Oct 26, 2018
For now, just pick the most common length of merged pairs, and calculate a simple consensus sequence from all of those.
Import IVA, instead of making a subprocess call.
Move the Blast database to a new folder, and construct it out of the projects configuration data.
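
The approach described in this commit message (pick the most common length of merged pairs, then calculate a simple consensus from those) might look roughly like the sketch below. The function name is hypothetical, and this is an illustration rather than the committed code.

```python
from collections import Counter
from typing import Iterable


def simple_consensus(merged_reads: Iterable[str]) -> str:
    """Build a majority consensus from reads of the most common length."""
    reads = list(merged_reads)
    if not reads:
        return ''
    # Find the most common read length and keep only reads of that length.
    common_length, _ = Counter(len(r) for r in reads).most_common(1)[0]
    chosen = [r for r in reads if len(r) == common_length]
    # Take the most frequent nucleotide in each column.
    return ''.join(Counter(column).most_common(1)[0][0]
                   for column in zip(*chosen))
```

Restricting to one length means the reads line up column by column without needing an alignment step.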
donkirkby added a commit that referenced this issue Oct 29, 2018
They caused a TypeError in iva assembly.
donkirkby added a commit that referenced this issue Nov 6, 2018
Also add merge-mates to Singularity.
donkirkby added a commit that referenced this issue Nov 6, 2018
Seems to use a lot of memory, and needs split count set carefully.
@donkirkby donkirkby pinned this issue Jun 21, 2019
donkirkby added a commit that referenced this issue Oct 25, 2019
Add an image comparison to the unit tests.
donkirkby added a commit that referenced this issue Oct 29, 2019
Also stop merging contigs into the full-genome reference.
donkirkby added a commit that referenced this issue Oct 29, 2019
donkirkby added a commit that referenced this issue Oct 30, 2019
Include contigs with merged seeds in genome coverage diagram.
donkirkby added a commit that referenced this issue Oct 31, 2019
donkirkby added a commit that referenced this issue Nov 1, 2019
donkirkby added a commit that referenced this issue Nov 1, 2019
donkirkby added a commit that referenced this issue Nov 21, 2019
donkirkby added a commit that referenced this issue Nov 21, 2019
Some of them stopped working when we stopped merging contigs into a full-genome reference.
donkirkby added a commit that referenced this issue Nov 21, 2019
Also change consensus comparison in release tests. Merging multiple contigs loses the original nucleotide positions.
donkirkby added a commit that referenced this issue Nov 22, 2019
@donkirkby donkirkby self-assigned this Dec 11, 2019
@donkirkby donkirkby unpinned this issue Jan 22, 2020