Insufficient long read depth
In this example we look at an assembly where we don't have many long reads. The sample is a Klebsiella genome of about 5.5 Mbp, and we only have 19.8 Mbp of long reads. That's about 3.6x depth of coverage - not very good. There could be a number of reasons for low depth - in this case it was due to an uneven barcode distribution (i.e. other samples on the same run had plenty of reads).
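The depth figure is simply the total long-read bases divided by the genome size. Here is a minimal sketch of that arithmetic in Python, using the numbers quoted above (the function name is only illustrative, not part of Unicycler):

```python
def depth_of_coverage(total_read_bases, genome_size):
    """Mean depth: total sequenced bases divided by genome length."""
    return total_read_bases / genome_size

long_read_bases = 19.8e6  # 19.8 Mbp of long reads (from this example)
genome_size = 5.5e6       # approximate Klebsiella genome size (from this example)

depth = depth_of_coverage(long_read_bases, genome_size)
print(f"{depth:.1f}x long-read depth")
```

The same calculation works for any read set if you know the total number of bases and have a rough estimate of the genome size.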
After running Unicycler with default settings, here is the resulting graph, visualised in Bandage:
It's not too bad - there is a 3.9 Mbp contig - but it's not complete. In particular, there seems to be a large plasmid in a pretty messy state.
We can try the assembly again using Unicycler's bold mode:
unicycler-runner.py -1 short_1.fastq.gz -2 short_2.fastq.gz -l long.fastq.gz -o assembly --mode bold
This may help to complete the assembly but at an increased risk of misassembly (read more here). Here is the resulting graph:
It's better, but only a little. The main difference from the first assembly is that contigs were more aggressively merged together.
The remaining incomplete part of the chromosome is the rRNA operon - not surprising since it is often the longest repeat in the genome. I then grabbed long reads which aligned to this region (see Read extraction) and BLASTed them in Bandage.
Unfortunately, there's nothing to be gained. Some long reads look like this, entering the repeat but not spanning it:
Others look like this, completely contained in the repeat:
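The distinction between these cases can also be expressed in terms of alignment coordinates. The sketch below is not how the reads were inspected here (that was done visually in Bandage). It is a hypothetical helper that assumes you already have each read's alignment start and end on a linear reference containing the repeat, and the `min_anchor` threshold is an arbitrary assumption:

```python
def classify_alignment(aln_start, aln_end, repeat_start, repeat_end, min_anchor=100):
    """Classify a read alignment relative to a repeat region.

    Coordinates are positions on the same linear reference.
    min_anchor is the amount of unique flanking sequence (in bp) required
    on each side before a read counts as spanning (an assumed threshold).
    """
    # No overlap with the repeat at all
    if aln_end <= repeat_start or aln_start >= repeat_end:
        return "outside"
    # Covers the whole repeat plus unique sequence on both sides
    if aln_start <= repeat_start - min_anchor and aln_end >= repeat_end + min_anchor:
        return "spans"
    # Lies entirely within the repeat, so it cannot be placed uniquely
    if aln_start >= repeat_start and aln_end <= repeat_end:
        return "contained"
    # Overlaps the repeat but does not reach well past both ends
    return "enters"

# Hypothetical coordinates, for illustration only
repeat = (10_000, 15_500)
examples = {"read_a": (2_000, 12_000), "read_b": (10_500, 14_000), "read_c": (8_000, 18_000)}
for name, (start, end) in examples.items():
    print(name, classify_alignment(start, end, *repeat))
```

Only reads in the "spans" class can resolve a repeat. In this dataset all the relevant reads fell into the other classes.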
Will we have more luck in the plasmid? Increasing Bandage's 'Node length per Megabase' setting shows that there are two unresolved spots:
The left one is a big loop where a 4.5 kbp sequence seems to be repeated many times:
The Illumina read depth suggests about 6 or 7 copies of the repeat. So to span it completely would take a long read of more than 30 kbp - not impossible, but maybe too much to hope for considering our low long-read depth. And indeed, when we align relevant long reads we fail to find any which span the entire thing:
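The 30 kbp figure follows from multiplying the repeat unit length by the copy count and allowing some unique sequence on either side. A rough sketch of that arithmetic, where the 1 kbp flank is an assumed value chosen only for illustration:

```python
def min_spanning_read_length(unit_length, copies, flank=0):
    """Shortest read that covers all tandem copies plus a flank on each side."""
    return unit_length * copies + 2 * flank

unit_length = 4500  # 4.5 kbp repeat unit (from this example)
flank = 1000        # assumed unique anchor per side, not a measured value

for copies in (6, 7):
    needed = min_spanning_read_length(unit_length, copies, flank)
    print(f"{copies} copies: read of at least {needed} bp needed")
```

With six copies the repeat array alone is 27 kbp, and with seven it is 31.5 kbp, so once flanking sequence is included a spanning read needs to be around 30 kbp or longer.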
The story is much the same for the other incomplete part of the plasmid. This tangle is more than 10 kbp long, and no reads span the entire thing:
In a way this all makes sense - if there were long reads which spanned the unresolved repeats, Unicycler should have used them to complete the assembly. I'm afraid that with only 3.6x long-read depth, a confident complete assembly may not be possible 😢
However, if all you need is a plausible genome assembly, then you could manually complete it with a bit of guesswork. You could guess the rRNA operon resolution, perhaps by comparing to reference genomes. You could guess the loop count in the plasmid based on the Illumina read depth (e.g. six consecutive copies of the repeat). And you could guess a path through the plasmid tangle, again perhaps based on a reference genome with a similar sequence. Depending on what you're planning to do with your assembly, a mostly-complete-but-with-some-guesswork result could be good enough.
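Guessing the loop count from read depth amounts to dividing the depth of the repeat segment by the depth of neighbouring single-copy sequence in the same replicon. A minimal sketch follows. The depth values are placeholders, not measurements from this assembly, and the function is hypothetical:

```python
def estimate_copy_number(repeat_depth, single_copy_depth):
    """Estimate tandem copy count as the ratio of repeat to single-copy depth."""
    if single_copy_depth <= 0:
        raise ValueError("single-copy depth must be positive")
    return repeat_depth / single_copy_depth

# Placeholder depths for illustration only (NOT values from this assembly)
repeat_depth = 13.0    # depth of the repeated segment in the graph
flanking_depth = 2.0   # depth of adjacent single-copy plasmid sequence

estimate = estimate_copy_number(repeat_depth, flanking_depth)
print(f"estimated copies: {estimate:.1f}")
```

A non-integer estimate like this is why the count stays a guess: short-read depth can narrow it to a small range, but only a spanning long read would confirm the exact number.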