Skip to content

Commit

Permalink
Document how IVA assembles contigs, for #442.
Browse files Browse the repository at this point in the history
  • Loading branch information
donkirkby committed Nov 13, 2019
1 parent 283f9e5 commit 7ce984d
Showing 1 changed file with 323 additions and 0 deletions.
323 changes: 323 additions & 0 deletions docs/design/assembly.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,323 @@
# Notes on de novo assembly
So far, we've had the best assembly results with [IVA]. It's not perfect, though,
so this is a summary of what we've learned about how it works.

The best way to see the details is to change the [`iva` command] to write its
details to a log file like this:

with open(os.path.join(tmp_dir, 'iva.log'), 'w') as log_file:
run([IVA, '-vv', '--fr', joined_path, '-t', '2', iva_out_path],
check=True,
stdout=log_file,
stderr=STDOUT)

That will show you the contig sequence as it gets assembled throughout the
process by writing its progress to the `iva.log` file. Below, I'll walk through
a typical assembly, and explain what you see in the log file.

[IVA]: https://sanger-pathogens.github.io/iva
[`iva` command]: https://github.com/cfe-lab/MiCall/blob/283f9e50d6429f6a0d69714afa84b9b91862a6fa/micall/core/denovo.py#L213-L216


## Building the Seed
IVA runs [KMC], the k-mer counter to find the most common k-mer of the requested
length (usually 95). It logs the most common k-mer, and how many times it was
found.

run kmc: ...
Made new seed. kmer coverage 1998 and seed is ACGT...

Then it tries to extend the seed by going through each read to see if the last
95 bases match some part of the seed. If so, it stores the part of that read
that would extend the seed to the left. It does the same thing to the start of
all the reads and stores the part that would extend to the right.

Then it goes through all the possible extensions looking at the two most common
extensions to the left of each length from 50 down. If it finds an extension
that occurs at least 10 times, and at least 4 times more often than second place,
it uses it. Then it does the same thing to the right. Note that the "new length"
at the bottom of this section is actually the old length before the extensions
on left and right.

Seed extension iteration 1
k = 50 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (343, 567)
k = 49 commonest two kmers: ('ACGTACG...', 'CAGTACG...') have frequency: (576, 609)
k = 48 commonest two kmers: ('ACGTAC...', 'CAGTAC...') have frequency: (579, 620)
k = 47 commonest two kmers: ('ACGTA...', 'CAGTA...') have frequency: (658, 696)
k = 46 commonest two kmers: ('ACGT...', 'CAGT...') have frequency: (199, 1071)
k = 50 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (132, 1111)
Extend seed. new length=95. Bases added left:46. Bases added right:50
new seed: ACGTACGT...

It continues to extend the seed until it reaches the seed stop length (usually
720), or it stops making progress on both ends.

Seed extension iteration 2
...
Seed extension iteration 3
...
Seed extension iteration 4
...
Seed extension iteration 5
...
Seed extension iteration 6
...
Seed extension iteration 7
k = 50 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (132, 1111)
k = 50 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (132, 1111)
Extend seed. new length=680. Bases added left:50. Bases added right:50
new seed: ACGTACGT...
Extended seed OK.

If the seed extended to at least three quarters of the seed stop length, it's a
success and IVA will move on to the next step. If not, it records the failed
seed and starts again with the next most common k-mer as a new seed.

[KMC]: https://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=about


## Extending with Read Pairs
Once the seed is big enough, it becomes a contig, and IVA switches to mapping
read pairs instead of looking for exact matches of the read ends.

______________________________ START ITERATION 1 ______________________________
---------------- iteration.1 start extension subiteration 0001 ----------------
map reads reads_1.fa reads_2.fa
map reads. Index: smalt index -k 19 -s 11 iteration.1.1.1.map.map_index iteration.1.1.1.map.ref.fa
map reads. Mapping: smalt map ...

The mapper is [smalt], and it runs in two steps: building an index of features
in the contig, and then mapping all the read pairs to find where they best align
with the contig. For this mapping step, the reads have to match at least half of
their bases to the contig. The most useful thing in this section is the file
name that shows which iteration, subiteration, and retry you're on.

After trying to map all the reads to the contig, IVA goes through all the
results looking for pairs that matched either end of the contig, and had a
soft-clipped section hanging off the end. It records the soft-clipped portions
in a collection at each end of the contig, and it writes all the useful read
pairs into temporary FASTA files. Useful read pairs fit one of the following:

* One read mapped, the other didn't. Hopefully, the other will map as the contig
gets extended.
* Both reads mapped, and at least one of the pair has a soft-clipped portion off
the end of the contig.

After going through all the mapped read pairs, IVA then tries to extend the left
end of the contig, similar to the way it extended the seed. Starting at length
100 or the longest soft-clipped region that was found, IVA counts up all the
possible extension sequences of that length. If the most common extension was
seen at least ten times, and at least four times more often than the second most
common extension, it gets selected. Otherwise, reduce the length by one, and try
again. Do the same thing for the right end.

trying to extend left end ...
k = 100 commonest two kmers: ('TCCTAGAG...', 'TCCTTGAG...') have frequency: (8, 33)
trying to extend right end ...
k = 100 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (40, 104)
k = 99 commonest two kmers: ('ACGTACG...', 'CAGTACG...') have frequency: (41, 107)
k = 98 commonest two kmers: ('ACGTAC...', 'CAGTAC...') have frequency: (41, 111)
k = 97 commonest two kmers: ('ACGTA...', 'CAGTA...') have frequency: (31, 125)
new left sequence: TCCTTGAG...
new right sequence: CAGTA...
extend contig seeded.00001 new_length:977 added_left:100 added_right:97

Now the new length really is the new length.

This step repeats up to four times, but it only tries mapping the useful read
pairs saved from the previous step.

map reads /.../tmp.extend.abc123/reads.0_1.fa /.../tmp.extend.abc123/reads.0_2.fa
map reads. Index: smalt index -k 19 -s 11 iteration.1.1.2.map.map_index iteration.1.1.2.map.ref.fa
map reads. Mapping: smalt map ...

You can see that the FASTA files are in a `tmp.extend` folder when IVA is
retrying the mapping with the useful read pairs from the previous step. It
retries up to five times with the same set of read pairs, or until it adds fewer
than 20 bases to the contig.

When it's finished its five retries or given up, IVA decides if the contig is
worth extending. A contig is worth extending if it made some progress in the
previous set of retries, or if it only retried once. If it's worth extending,
IVA starts a new subiteration.

---------------- iteration.1 start extension subiteration 0002 ----------------
map reads reads_1.fa reads_2.fa
map reads. Index: smalt index -k 19 -s 11 iteration.1.2.1.map.map_index iteration.1.2.1.map.ref.fa
map reads. Mapping: smalt map ...

You can see that it goes back to the full set of read pairs. It continues
extending the contig with subiterations and retries.

...
---------------- iteration.1 start extension subiteration 0003 ----------------
...
---------------- iteration.1 start extension subiteration 0004 ----------------
...

When it gets to subiteration 5, IVA does an extra mapping of all the read pairs
to the contig. This time, it requires the reads to match at least 90%, instead
of the 50% it uses when extending. Any reads that map as proper pairs are
discarded, because they match the contig that's already assembled. All the
remaining read pairs are used for the following subiterations. This filtering
gets repeated after every five subiterations.

---------------- iteration.1 start extension subiteration 0005 ----------------
map reads reads_1.fa reads_2.fa
map reads. Index: smalt index -k 19 -s 11 /.../tmp.filter_reads.abc123/out.map_index /.../tmp.filter_reads.abc123/out.ref.fa
map reads. Mapping: smalt map ...
map reads reads.subiter.5.reads_1.fa reads.subiter.5.reads_2.fa
map reads. Index: smalt index -k 19 -s 11 iteration.1.5.1.map.map_index iteration.1.5.1.map.ref.fa
map reads. Mapping: smalt map ...
...

This continues until there is no progress in a subiteration.

---------------- iteration.1 start extension subiteration 0014 ----------------
...
new left sequence:
new right sequence:
extend contig seeded.00001 new_length:5839 added_left:0 added_right:0
No bases added. Try trimming contigs

[smalt]: https://sourceforge.net/projects/smalt/

## Trimming
Now, IVA tries trimming the ends of the contig. It maps the reads and requires
90% again, then sorts, indexes, and builds a pileup with [samtools]. The pileup
reports how many reads are mapped to each position in the contig, or the
coverage. The coverage is tracked separately for forward and reverse reads. The
coverage is checked from the ends inward, until it meets the minimum coverage
(default 10) and the strand bias (minimum fraction from either direction,
default 0). The ends are trimmed and reported.

map reads reads_1.fa reads_2.fa
map reads. Index: smalt index -k 19 -s 11 /.../tmp.trim_strand_biased_ends.abc123/out.map_index /.../tmp.trim_strand_biased_ends.abc123/out.ref.fa
map reads. Mapping: smalt map ...
map reads. sort: samtools sort ...
map reads. index: samtools index /.../tmp.trim_strand_biased_ends.abc123/out.bam
Trimming strand biased ends of contig seeded.00001 - good base range is 26 to 5839 from 5839 bases

Then another subiteration is tried.

---------------- iteration.1 start extension subiteration 0015 ----------------
...
new left sequence: TCCTTGAG...
new right sequence:
extend contig seeded.00001 new_length:5826 added_left:12 added_right:0
map reads reads_1.fa reads_2.fa
map reads. Index: smalt index -k 19 -s 11 /.../tmp.trim_strand_biased_ends.abc123/out.map_index /.../tmp.trim_strand_biased_ends.abc123/out.ref.fa
map reads. Mapping: smalt map ...
map reads. sort: samtools sort ...
map reads. index: samtools index /.../tmp.trim_strand_biased_ends.abc123/out.bam
Trimming strand biased ends of contig seeded.00001 - good base range is 13 to 5826 from 5826 bases

If that doesn't extend at least as far as the section that was trimmed at the
end of the previous subiteration, it's not worth extending.

[samtools]: https://www.htslib.org/doc/samtools.html

## Making a New Seed
The reads are mapped against the contig with a 50% match required, and any
unmapped reads are used to make a new seed.

_____________________________ Try making new seed _____________________________
map reads iteration.1.filtered_1.fa iteration.1.filtered_2.fa
map reads. Index: smalt index -k 19 -s 11 /.../tmp.make_seed.abc123/out.map_index /.../tmp.make_seed.abc123/out.ref.fa
map reads. Mapping: smalt map ...

IVA runs the k-mer counter to find the most common k-mer of the seed length, but
this time, it maps all the k-mers against the first seed and the contig. It then
chooses the most common k-mer that didn't map against the previous seed or
contig.

run kmc: /.../tmp.run_kmc.abc123 kmc -fa -m32 -k95 ...
First try of running kmc failed. Trying again with -m4 instead of -m32...
run kmc: /.../tmp.run_kmc.abc123 kmc -fa -m4 -k95 ...
map reads. Index: smalt index -k 9 -s 1 /.../tmp.common_kmers.abc123/map.map_index /.../tmp.common_kmers.abc123/ref.fa
map reads. Mapping: smalt map ...
Made new seed. kmer coverage 1625 and seed is GACCTT...

The seed is extended in the same way as the first one. If the seed isn't at
least three quarters of the seed stop length, it is discarded, and a new seed
is built.

Seed extension iteration 1
...
Seed extension iteration 2
...
Couldn't extend seed enough. That was attempt 1 of 10
run kmc: ...
First try of running kmc failed. Trying again with -m4 instead of -m32...
run kmc: ...
map reads. Index: smalt index ...
map reads. Mapping: smalt map ...
Made new seed. kmer coverage 1554 and seed is CATG...
Seed extension iteration 1
...
Seed extension iteration 13
...
Extend seed. new length=634. Bases added left:0. Bases added right:0
new seed: ACCTT...
Extended seed OK.

## Extending Multiple Contigs
Now, IVA switches back to mapping reads, but onto both the contigs. It then uses
the soft clipped endings to try extending both contigs on the left and right
ends.

______________________________ START ITERATION 2 ______________________________
---------------- iteration.2 start extension subiteration 0001 ----------------
map reads iteration.1.filtered_1.fa iteration.1.filtered_2.fa
map reads. Index: smalt index -k 19 -s 11 iteration.2.1.1.map.map_index iteration.2.1.1.map.ref.fa
map reads. Mapping: smalt map ...
trying to extend left end ...
k = 100 commonest two kmers: ('TCCTAGAG...', 'TCCTTGAG...') have frequency: (8, 33)
...
k = 13 commonest two kmers: ('TCCTA...', 'TCCTT...') have frequency: (2, 10)
trying to extend right end ...
k = 100 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (40, 104)
...
k = 1 commonest two kmers: ('A', 'C') have frequency: (351, 551)
new left sequence: TCCTT...
new right sequence:
extend contig seeded.00001 new_length:5827 added_left:13 added_right:0
trying to extend left end ...
k = 100 commonest two kmers: ('TCCTAGAG...', 'TCCTTGAG...') have frequency: (8, 33)
...
k = 47 commonest two kmers: ('TCCTA...', 'TCCTT...') have frequency: (3, 546)
trying to extend right end ...
k = 100 commonest two kmers: ('ACGTACGT...', 'CAGTACGT...') have frequency: (40, 104)
...
k = 41 commonest two kmers: ('ACGTA...', 'CAGTA...') have frequency: (3, 223)
new left sequence: TCCTT...
new right sequence: CAGTA...
extend contig seeded.00002 new_length:722 added_left:47 added_right:41

This continues with subiterations and sets of five retries, filtering out fully
mapped pairs after every five subiterations. When it stops making progress, IVA
trims the ends of the new contig for low coverage, and tries extending again.

If that makes no progress, it's time to see if the new contig is related to the
old contig.

## Combining Assembled Contigs
The contigs are compared to each other using [NUCmer] to find identical regions,
including reverse-complemented matches. If one contig is completely contained
inside the other, the smaller one is discarded. If two contigs have matching
ends, they are combined. If the matching ends are reverse complements of each
other, one of the contigs will be reverse complemented before combining the two.

[NUCmer]: http://mummer.sourceforge.net/manual/#nucmer

## More Seeds
The process continues with finding a seed, extending that seed, extending with
mapped reads, trimming the contigs, and combining the contigs, until it fails
to find or extend another seed.

## Reporting Contigs
IVA looks through all three reading frames in both directions for each contig,
and finds the longest open reading frame (without stop codons) for each contig.
It then names the contigs in order, with the longer open reading frames first.
It will also reverse complement contigs, if their longest open reading frame is
in the reverse direction.

0 comments on commit 7ce984d

Please sign in to comment.