Avoid duplicate work #4

Closed
lczech opened this issue Oct 5, 2017 · 11 comments

@lczech
Collaborator

lczech commented Oct 5, 2017

Unlike pplacer and the old RAxML-EPA, epa-ng re-computes the placements of identical sequences. This is not necessary.

Possible solution: Store hashes of the sequences that have already been processed. If a new sequence has a hash that was seen before, add its name to the list of names of the pquery of the previous sequence (or, if that name already exists there, increment its multiplicity). This assumes that hash collisions don't occur, so the hash function should be good enough (SHA1?).
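For concreteness, a minimal sketch of that bookkeeping in C++. The `Pquery`/`PqueryName` types, the `add_query` function, and the way the digest is obtained are hypothetical stand-ins for illustration, not epa-ng's actual data structures:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the per-query result structures.
struct PqueryName {
    std::string name;
    double      multiplicity;
};

struct Pquery {
    std::vector<PqueryName> names;
    // ... the placements of the sequence would also live here ...
};

// Maps a sequence digest (e.g., a SHA1 hex string) to the index of the
// pquery created for the first occurrence of that sequence.
std::unordered_map<std::string, std::size_t> seen;
std::vector<Pquery> pqueries;

// Called once per query sequence; `digest` is assumed to be the hash of
// the sequence characters, computed elsewhere.
void add_query( std::string const& name, std::string const& digest )
{
    auto const it = seen.find( digest );
    if( it == seen.end() ) {
        // First occurrence: run the (expensive) placement and remember it.
        Pquery pq;
        pq.names.push_back({ name, 1.0 });
        // ... compute and store the placements for this sequence ...
        pqueries.push_back( std::move( pq ));
        seen[ digest ] = pqueries.size() - 1;
        return;
    }
    // Duplicate: attach the name to the existing pquery, or increment the
    // multiplicity if that name is already listed.
    auto& names = pqueries[ it->second ].names;
    for( auto& entry : names ) {
        if( entry.name == name ) {
            entry.multiplicity += 1.0;
            return;
        }
    }
    names.push_back({ name, 1.0 });
}
```

With a streaming reader, only the digests and name lists would need to stay resident, not the sequences themselves.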

@pierrebarbera pierrebarbera self-assigned this Oct 5, 2017
@pierrebarbera pierrebarbera added this to the v0.2.0.beta milestone Oct 5, 2017
@pierrebarbera
Owner

As a bare minimum, this issue can be partly resolved by including multiplicities in the bfast file, then setting them correctly in the output jplace.

@amkozlov
Collaborator

amkozlov commented Oct 5, 2017

Please note that you don't need a cryptographically strong hash function to do this; any simple and fast-to-compute function would do: if a collision happens, you just compare the sequences character by character to double-check. A fast function with 1 collision per 1M sequences will actually be better than a slow one with 1 collision per 1G sequences.

That's what I do in RAxML, and it's implemented in pll-modules here:
https://github.com/ddarriba/pll-modules/blob/dev/src/msa/pll_msa.c#L382
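A minimal sketch of this verify-on-collision scheme, using FNV-1a as a stand-in for the "simple and fast" function (the linked pll-modules code is the actual implementation; the names below are illustrative). Note that the character-by-character check requires keeping the already-seen sequences in memory, which is the caveat raised in the next comment:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simple 64-bit FNV-1a hash over the sequence characters.
uint64_t fnv1a( std::string const& s )
{
    uint64_t h = 14695981039346656037ull;
    for( unsigned char c : s ) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Returns the index of the first sequence identical to `seq`, inserting it
// if it has not been seen before. Hash collisions are resolved by comparing
// the actual sequences, so the hash only has to be decent, not perfect.
std::size_t find_or_insert(
    std::string const& seq,
    std::vector<std::string>& sequences,
    std::unordered_multimap<uint64_t, std::size_t>& index
) {
    auto const h     = fnv1a( seq );
    auto const range = index.equal_range( h );
    for( auto it = range.first; it != range.second; ++it ) {
        if( sequences[ it->second ] == seq ) {
            return it->second;  // verified duplicate
        }
    }
    sequences.push_back( seq );
    index.emplace( h, sequences.size() - 1 );
    return sequences.size() - 1;
}
```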

@lczech
Collaborator Author

lczech commented Oct 5, 2017

@amkozlov: That requires re-visiting colliding sequences. The current implementation, however, reads the sequences as a stream in order to keep memory usage low. It seems reasonable to keep it that way if possible. Thus, the hash function should be strong.

Of course, the hash then needs more bits, so the hash list will have a significant memory footprint of its own. The user should hence probably be able to deactivate this, particularly if they have already ensured that there are no duplicates. See #5 for a way to still keep abundance information. Still, I think it would be good to activate it by default, to keep the average use case simple.

Edit: We could also play around with simple yet fast hashing based on the sequence data itself. I implemented a prototype for nucleotides in 2-bit representation that simply xors the 64-bit words of this representation to get the hash. Not sure whether this is collision-free enough for this use case, though.
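For reference, a rough sketch of what such an xor-based hash over the 2-bit representation could look like; this is an illustration of the idea (with a made-up function name), not the actual prototype code:

```cpp
#include <cstdint>
#include <string>

// Encode each nucleotide in 2 bits, pack 32 bases per 64-bit word, and xor
// the words together. As noted above, it is unclear whether this is
// collision-resistant enough: e.g., swapping two whole 32-base blocks
// yields the same hash.
uint64_t twobit_xor_hash( std::string const& seq )
{
    uint64_t hash = 0;
    uint64_t word = 0;
    int bases_in_word = 0;

    for( char c : seq ) {
        uint64_t code;
        switch( c ) {
            case 'A': case 'a': code = 0; break;
            case 'C': case 'c': code = 1; break;
            case 'G': case 'g': code = 2; break;
            default:            code = 3; break;  // T/U and everything else
        }
        word |= code << ( 2 * bases_in_word );
        if( ++bases_in_word == 32 ) {
            hash ^= word;
            word = 0;
            bases_in_word = 0;
        }
    }
    return hash ^ word;  // fold in the last, possibly partial, word
}
```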

@stamatak
Collaborator

stamatak commented Oct 6, 2017 via email

@lczech
Collaborator Author

lczech commented Oct 6, 2017

We thought about that solution, but it means the user always has to create the binary file first. For normal users with not too many sequences, the binary file is not necessary. Thus, a solution that can work directly on a stream of sequences makes life simpler for the average user. However, in terms of implementation, @stamatak's solution is simpler, so it could be done first, and the stream solution implemented later for additional user comfort. Thoughts?

@pierrebarbera
Owner

@stamatak @lczech yes, that's what I meant in my initial reply.

In general, I want to add that if there are a lot of duplicates in the EPA input, the user has likely already done something wrong: they used duplicate query sequences in the much more expensive alignment step!

Somewhere down the road, perhaps in an overarching project that covers all the necessary work from raw query to jplace, this could be handled in a much cleaner way, with better file formats incorporating things like:

  • which sample the sequence belongs to
  • how many duplicates of the sequence there were in each sample
  • metadata associated with each sample

@amkozlov
Collaborator

amkozlov commented Oct 6, 2017

@lczech fair point! I forgot about the streaming.

Regarding the pre-processing: I guess the binary file is also essential for enabling parallel I/O, right?

I really like the solution Andre implemented in ExaBayes (roughly sketched below), namely:

  • both the binary file and FASTA are accepted as input
  • if a FASTA file is provided, it is automatically pre-processed and converted to the binary format -> no overhead for the "small dataset" use case
  • there is a separate option/program to explicitly convert a FASTA file to the binary format -> serves the "big dataset" use case
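A hypothetical sketch of that dispatch logic; all function names here are made up for illustration and do not correspond to ExaBayes or EPA-ng code:

```cpp
#include <fstream>
#include <string>

// Does the file start with a '>' header line, i.e. look like FASTA?
bool looks_like_fasta( std::string const& path )
{
    std::ifstream in( path );
    char first = 0;
    in.get( first );
    return first == '>';
}

// Stand-ins for the real loader and converter; bodies omitted for brevity.
void load_binary( std::string const& path ) { (void) path; /* read binary queries */ }
std::string convert_fasta_to_binary( std::string const& fasta_path )
{
    // Write a binary version next to the FASTA file and return its path.
    return fasta_path + ".bin";
}

void load_queries( std::string const& path )
{
    if( looks_like_fasta( path )) {
        // "Small dataset" case: convert transparently, then load as usual.
        load_binary( convert_fasta_to_binary( path ));
    } else {
        // "Big dataset" case: the user already ran the explicit converter.
        load_binary( path );
    }
}
```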

@pierrebarbera
Owner

@amkozlov that is how it works in v0.1.0.beta :)

... except that there is no deduplication yet in the explicit conversion, but that should be very simple to add.

@amkozlov
Collaborator

amkozlov commented Oct 6, 2017

@Pbdas ah ok, great :)

@pierrebarbera pierrebarbera modified the milestones: v0.2.0-beta, v0.3.0-beta Feb 20, 2018
@pierrebarbera pierrebarbera removed this from the v0.3.0-beta milestone Nov 28, 2018
@pierrebarbera
Owner

Revisiting this, I still think that the issue really lies upstream with the user. It's essentially solved by using gappa prepare chunkify, which also solves some additional data management issues. EPA-ng has evolved in the direction of (or perhaps has always been) a data processing core, rather than a data management tool.

Rather than spending time on this, I think it would be more useful to expand the documentation to give a more holistic view of placement and how it fits in with all the other tools.

So I'm closing this one :)

@lczech
Collaborator Author

lczech commented Mar 18, 2019

agreed ;-)
