Avoid duplicate work #4

Closed
lczech opened this issue Oct 5, 2017 · 11 comments

@lczech
Collaborator

lczech commented Oct 5, 2017

Unlike pplacer and the old RAxML-EPA, epa-ng re-computes the placements of identical sequences. This is not necessary.

Possible solution: Store hashes of the sequences that have already been processed. If a new sequence has a hash that was seen before, add its name to the list of names of the pquery of the previous sequence (or, if that name already exists there, increment its multiplicity). This assumes that hash collisions don't occur, so the hash function should be good enough (SHA1?).
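For concreteness, a minimal sketch of that bookkeeping in C++. The `Pquery`/`PqueryName` types, the `add_query` function, and the way the digest is obtained are hypothetical stand-ins for illustration, not epa-ng's actual data structures:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the per-query result structures.
struct PqueryName {
    std::string name;
    double      multiplicity;
};

struct Pquery {
    std::vector<PqueryName> names;
    // ... the placements of the sequence would also live here ...
};

// Maps a sequence digest (e.g., a SHA1 hex string) to the index of the
// pquery created for the first occurrence of that sequence.
std::unordered_map<std::string, std::size_t> seen;
std::vector<Pquery> pqueries;

// Called once per query sequence; `digest` is assumed to be the hash of
// the sequence characters, computed elsewhere.
void add_query( std::string const& name, std::string const& digest )
{
    auto const it = seen.find( digest );
    if( it == seen.end() ) {
        // First occurrence: run the (expensive) placement and remember it.
        Pquery pq;
        pq.names.push_back({ name, 1.0 });
        // ... compute and store the placements for this sequence ...
        pqueries.push_back( std::move( pq ));
        seen[ digest ] = pqueries.size() - 1;
        return;
    }
    // Duplicate: attach the name to the existing pquery, or increment the
    // multiplicity if that name is already listed.
    auto& names = pqueries[ it->second ].names;
    for( auto& entry : names ) {
        if( entry.name == name ) {
            entry.multiplicity += 1.0;
            return;
        }
    }
    names.push_back({ name, 1.0 });
}
```

With a streaming reader, only the digests and name lists would need to stay resident, not the sequences themselves.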

@pierrebarbera pierrebarbera self-assigned this Oct 5, 2017
@pierrebarbera pierrebarbera added this to the v0.2.0.beta milestone Oct 5, 2017
@pierrebarbera
Owner

As a bare minimum, this issue can be partly resolved by including multiplicities in the bfast file, then setting them correctly in the output jplace.

@amkozlov
Collaborator

amkozlov commented Oct 5, 2017

Please note that you don't need a cryptographically strong hash function to do this; any simple and fast-to-compute function would do: if a collision happens, you just compare the sequences character by character to double-check. A fast function with 1 collision per 1M sequences will actually be better than a slow one with 1 collision per 1G sequences.

That's what I do in RAxML, and it's implemented in pll-modules here:
https://github.com/ddarriba/pll-modules/blob/dev/src/msa/pll_msa.c#L382
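A minimal sketch of this verify-on-collision scheme, using FNV-1a as a stand-in for the "simple and fast" function (the linked pll-modules code is the actual implementation; the names below are illustrative). Note that the character-by-character check requires keeping the already-seen sequences in memory, which is the caveat raised in the next comment:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simple 64-bit FNV-1a hash over the sequence characters.
uint64_t fnv1a( std::string const& s )
{
    uint64_t h = 14695981039346656037ull;
    for( unsigned char c : s ) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Returns the index of the first sequence identical to `seq`, inserting it
// if it has not been seen before. Hash collisions are resolved by comparing
// the actual sequences, so the hash only has to be decent, not perfect.
std::size_t find_or_insert(
    std::string const& seq,
    std::vector<std::string>& sequences,
    std::unordered_multimap<uint64_t, std::size_t>& index
) {
    auto const h     = fnv1a( seq );
    auto const range = index.equal_range( h );
    for( auto it = range.first; it != range.second; ++it ) {
        if( sequences[ it->second ] == seq ) {
            return it->second;  // verified duplicate
        }
    }
    sequences.push_back( seq );
    index.emplace( h, sequences.size() - 1 );
    return sequences.size() - 1;
}
```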

@lczech
Collaborator Author

lczech commented Oct 5, 2017

@amkozlov: That requires re-visiting colliding sequences. The current implementation, however, reads the sequences as a stream in order to keep memory usage low. It seems reasonable to keep it that way if possible. Thus, the hash function should be strong.

Of course, the hash then needs more bits, so the hash list will have a significant memory footprint of its own. The user should hence probably be able to deactivate this, particularly if they have already ensured that there are no duplicates. See #5 for a way to still keep abundance information. Still, I think it would be good to activate it by default, to keep the average use case simple.

Edit: We could also play around with simple yet fast hashing based on the sequence data itself. I implemented a prototype for nucleotides in 2-bit representation that simply xors the 64-bit words of this representation to get the hash. Not sure whether this is collision-free enough for this use case, though.
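For reference, a rough sketch of what such an xor-based hash over the 2-bit representation could look like; this is an illustration of the idea (with a made-up function name), not the actual prototype code:

```cpp
#include <cstdint>
#include <string>

// Encode each nucleotide in 2 bits, pack 32 bases per 64-bit word, and xor
// the words together. As noted above, it is unclear whether this is
// collision-resistant enough: e.g., swapping two whole 32-base blocks
// yields the same hash.
uint64_t twobit_xor_hash( std::string const& seq )
{
    uint64_t hash = 0;
    uint64_t word = 0;
    int bases_in_word = 0;

    for( char c : seq ) {
        uint64_t code;
        switch( c ) {
            case 'A': case 'a': code = 0; break;
            case 'C': case 'c': code = 1; break;
            case 'G': case 'g': code = 2; break;
            default:            code = 3; break;  // T/U and everything else
        }
        word |= code << ( 2 * bases_in_word );
        if( ++bases_in_word == 32 ) {
            hash ^= word;
            word = 0;
            bases_in_word = 0;
        }
    }
    return hash ^ word;  // fold in the last, possibly partial, word
}
```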

@stamatak
Collaborator

stamatak commented Oct 6, 2017 via email

@lczech
Collaborator Author

lczech commented Oct 6, 2017

We thought about that solution, but it means the user always has to create the binary file first. For normal users with not too many sequences, the binary file is not necessary. Thus, a solution that can work directly on a stream of sequences makes life simpler for the average user. However, in terms of implementation, @stamatak's solution is simpler, so it could be done first, and the stream solution implemented later for additional user comfort. Thoughts?

@pierrebarbera
Owner

@stamatak @lczech yes, that's what I meant in my initial reply.

In general, I want to add that if there are a lot of duplicates in the EPA input, the user has likely already done something wrong: they used duplicate query sequences in the much more expensive alignment step!

Somewhere down the road, perhaps in an overarching project that covers all the necessary work from raw query to jplace, this could be handled in a much cleaner way, with better file formats incorporating things like:

  • which sample the sequence belongs to
  • how many duplicates of the sequence there were in each sample
  • metadata associated with each sample

@amkozlov
Collaborator

amkozlov commented Oct 6, 2017

@lczech fair point! I forgot about the streaming.

Regarding the pre-processing: I guess the binary file is also essential for enabling parallel I/O, right?

I really like the solution Andre implemented in ExaBayes (roughly sketched below), namely:

  • both the binary file and FASTA are accepted as input
  • if a FASTA file is provided, it is automatically pre-processed and converted to the binary format -> no overhead for the "small dataset" use case
  • there is a separate option/program to explicitly convert a FASTA file to the binary format -> serves the "big dataset" use case
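A hypothetical sketch of that dispatch logic; all function names here are made up for illustration and do not correspond to ExaBayes or EPA-ng code:

```cpp
#include <fstream>
#include <string>

// Does the file start with a '>' header line, i.e. look like FASTA?
bool looks_like_fasta( std::string const& path )
{
    std::ifstream in( path );
    char first = 0;
    in.get( first );
    return first == '>';
}

// Stand-ins for the real loader and converter; bodies omitted for brevity.
void load_binary( std::string const& path ) { (void) path; /* read binary queries */ }
std::string convert_fasta_to_binary( std::string const& fasta_path )
{
    // Write a binary version next to the FASTA file and return its path.
    return fasta_path + ".bin";
}

void load_queries( std::string const& path )
{
    if( looks_like_fasta( path )) {
        // "Small dataset" case: convert transparently, then load as usual.
        load_binary( convert_fasta_to_binary( path ));
    } else {
        // "Big dataset" case: the user already ran the explicit converter.
        load_binary( path );
    }
}
```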

@pierrebarbera
Owner

@amkozlov that is how it works in v0.1.0.beta :)

... except that there is no deduplication yet in the explicit conversion, but that should be very simple to add.

@amkozlov
Collaborator

amkozlov commented Oct 6, 2017

@Pbdas ah ok, great :)

@pierrebarbera pierrebarbera modified the milestones: v0.2.0-beta, v0.3.0-beta Feb 20, 2018
@pierrebarbera pierrebarbera removed this from the v0.3.0-beta milestone Nov 28, 2018
@pierrebarbera
Owner

Revisiting this, I still think that the issue really lies upstream with the user. It's essentially solved by using gappa prepare chunkify, which also solves some additional data management issues. EPA-ng has evolved in the direction of (or perhaps has always been) a data processing core, rather than a data management tool.

Rather than spending time on this, I think it would be more useful to expand the documentation to give a more holistic view of placement and how it fits in with all the other tools.

So I'm closing this one :)

@lczech
Collaborator Author

lczech commented Mar 18, 2019

agreed ;-)
