Avoid duplicate work #4
As a bare minimum, this issue can be partly resolved by including multiplicities in the |
Please note that you don't need a cryptographically strong hash function to do this; any simple, fast-to-compute function would do: if a collision happens, you just compare the sequences char-by-char to double-check. A fast function with 1 collision per 1M seqs will actually be better than a slow one with 1 collision per 1G seqs. That's what I do in RAxML, and it's implemented in pll-modules here: |
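To make the "cheap hash plus char-by-char check" idea concrete, here is a minimal sketch. It is not the RAxML or pll-modules code; `deduplicate` is a hypothetical helper, and `zlib.adler32` merely stands in for "any simple and fast-to-compute function".

```python
import zlib
from collections import defaultdict


def deduplicate(records):
    """Group (name, sequence) records by sequence content.

    A cheap 32-bit checksum selects a bucket; a full string comparison
    inside the bucket guards against hash collisions.
    """
    buckets = defaultdict(list)  # checksum -> list of [sequence, names]
    for name, seq in records:
        key = zlib.adler32(seq.encode("ascii"))
        for entry in buckets[key]:
            if entry[0] == seq:  # char-by-char check on a hash hit
                entry[1].append(name)
                break
        else:
            # No identical sequence in this bucket yet: start a new entry.
            buckets[key].append([seq, [name]])
    return [(seq, names) for entries in buckets.values() for seq, names in entries]
```

Note that the double-check requires every distinct sequence to stay in memory, since a later sequence may need to be compared against it.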
@amkozlov: That makes it necessary to revisit colliding sequences. The current implementation, however, reads the sequences as a stream in order to keep memory usage low. It seems reasonable to keep it that way, if possible. Thus, the hash function should be strong. Of course, the hash then needs more bits, so the hash list will have a significant memory footprint of its own. The user should therefore probably be able to deactivate this, particularly if they have already ensured that there are no duplicates. See #5 for a way to still keep abundance information. Still, I think it would be good to activate it by default, to keep the average use case simple. Edit: We could also play around with simple yet fast hashing based on the sequence data itself. I implemented a prototype for nucleotides in 2-bit representation that simply XORs the 64-bit words of this representation to get the hash. Not sure if this is collision-free enough for this use case, though. |
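A rough sketch of the XOR idea, to make the collision concern tangible. This is not the prototype mentioned above: the packing layout, the `CODE` table, and the function names are assumptions, and only plain A/C/G/T input (no gaps or ambiguity codes) is handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
MASK64 = (1 << 64) - 1


def pack_2bit(seq):
    """Pack a nucleotide string into 64-bit words, 32 bases per word."""
    words = []
    for start in range(0, len(seq), 32):
        word = 0
        for base in seq[start:start + 32]:
            word = (word << 2) | CODE[base]
        words.append(word & MASK64)
    return words


def xor_hash(seq):
    """XOR all 64-bit words of the 2-bit representation together."""
    h = 0
    for word in pack_2bit(seq):
        h ^= word
    return h
```

Because XOR is commutative, any two sequences that differ only in the order of their 32-base blocks get the same hash under this layout; for instance, `"A" * 32 + "C" * 32` and `"C" * 32 + "A" * 32` collide. Mixing in the word index or a rotation per word would be one way to reduce this, at some extra cost.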
why don't you just include this in the pre-processing step of the MSA?
you could remove duplicate seqs there and then store the multiplicities
internally in binary format?
alexis
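The pre-processing suggestion above could look roughly like the following sketch. `collapse_duplicates` is a hypothetical helper, not part of epa-ng, and the binary output format is deliberately left out; the sketch only shows how duplicates might be collapsed into multiplicities before the file is written.

```python
def collapse_duplicates(records):
    """Return one (name, sequence, multiplicity) tuple per distinct sequence.

    The name of the first occurrence is kept as the representative.
    """
    first_name = {}
    counts = {}
    for name, seq in records:
        if seq not in counts:
            first_name[seq] = name
            counts[seq] = 0
        counts[seq] += 1
    return [(first_name[seq], seq, n) for seq, n in counts.items()]
```

Since this runs once during conversion, holding all distinct sequences in a dictionary keyed by the sequence itself avoids any collision handling.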
…On 05.10.2017 16:10, Lucas Czech wrote:
Unlike pplace and old RAxML-EPA, epa-ng re-computes the placements of
identical sequences. This is not necessary.
Possible solution: Store hashes of the sequences that have already been
processed. If a new sequence has a hash that was seen before, add the
name to the list of names for the |pquery| of the previous sequence (or,
if that name also already exists, increment its multiplicity). This
assumes that hash collisions don't occur, so the hash function should be
good enough (SHA1?).
--
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
|
We thought about that solution, but it means the user always has to create the binary file first. For normal users with not too many sequences, the binary file is not necessary. Thus, a solution that can work directly on a stream of sequences makes the life of the average user simpler. However, in terms of implementation, @stamatak's solution is simpler, so it could be done first, and the stream solution could be implemented later for additional user comfort. Thoughts? |
@stamatak @lczech yes that's what I meant with my initial reply. In general I want to add that if there are a lot of duplicates in the EPA input, the user has likely done something wrong already: used duplicate query sequences for the much more expensive alignment step! Somewhere down the road, perhaps in an overarching project that deals with all the necessary work from raw query to
|
@lczech fair point! I forgot about the streaming. Regarding the pre-processing: I guess the binary file is also essential to enable parallel I/O, right? I really like the solution Andre implemented in ExaBayes, namely:
|
@amkozlov that is how it works in v0.1.0.beta :) ... except there is no deduplication yet with the explicit conversion, but that should be very simple to add. |
@Pbdas ah ok, great :) |
Revisiting this, I still think that the issue really lies upstream with the user. It's essentially solved by using Rather than spending time on this, I think it would be more useful to expand the documentation to include a more holistic view of placement and how it relates to all the other tools. So I'm closing this one :) |
agreed ;-) |
Unlike pplace and old RAxML-EPA, epa-ng re-computes the placements of identical sequences. This is not necessary.
Possible solution: Store hashes of the sequences that have already been processed. If a new sequence has a hash that was seen before, add the name to the list of names for the `pquery` of the previous sequence (or, if that name also already exists, increment its multiplicity). This assumes that hash collisions don't occur, so the hash function should be good enough (SHA1?).
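As a minimal sketch of the proposal above, assuming SHA-1 is accepted as collision-free in practice: only the digests are kept, so sequences can still be read as a stream. `place_stream` and the `place` callback are hypothetical names for illustration, not epa-ng functions, and the dictionary stands in for whatever structure holds a `pquery`.

```python
import hashlib


def place_stream(records, place):
    """Call `place` once per distinct sequence; duplicates only add names.

    `place` is a hypothetical callback standing in for the expensive
    placement computation.
    """
    seen = {}  # SHA-1 digest -> {"result": ..., "names": {name: multiplicity}}
    for name, seq in records:
        digest = hashlib.sha1(seq.encode("ascii")).digest()
        entry = seen.get(digest)
        if entry is None:
            # First time this sequence is seen: do the expensive work once.
            entry = {"result": place(seq), "names": {}}
            seen[digest] = entry
        # New name: add it with multiplicity 1; known name: increment.
        entry["names"][name] = entry["names"].get(name, 0) + 1
    return list(seen.values())
```

Each stored digest costs 20 bytes regardless of sequence length, which is the memory footprint of the hash list discussed in the thread.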