-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix umi-tools dependency on python hash function values #470
Conversation
aeb4441
to
6747626
Compare
Hi TyberiousPrime, Thanks for your contribution. I'm afraid I'm not comfortable with the addition of an additional dependency without some research first, especially one that is not on conda as conda is our preferred installation route. As we want a pure conda install, this might mean convincing someone to add siphashc to bioconda or conda-forge, or even maintaining a recipe ourselves. We've worked hard to try to keep the number of dependencies as low as possible. @TomSmithCGAT Thoughts? |
Possibly, you can sort by the actual string in the return of breadth_first_search, Or, to avoid a dependency, rip the siphashc code into the cython you have already. |
Thanks for the PR @TyberiusPrime. It's always a relief when an issue turns into a PR rather than another item on the never ending to-do list! I agree with @IanSudbery that we don't want to add a dependency outside conda. On that basis that the licence permits it, I would have no objection to copying the siphashc code over (with clear accreditation in the code) if we decide to go down that route. I'm in two minds as to whether sorting so it's completely determinative is a good idea. It feels like it should remain non-determinative by default since there are cases where multiple reads are all equally 'good' representative reads for a UMI group. On the other hand, if there's no downside for making it determinative, I can't really see any reason not too? I can't imagine any cases where sorting would detriment downstream analysis, though maybe that's a lack of imagination on my part. @IanSudbery? |
In other places where two reads are equivalent, like which read to keep as representative of a read bundle, the decsision is explicitly random (and therefore can be determined by the random seed). |
Good point. In that case, any objections to adding a I had another quick look into whether it can be set within the script. Appears I may have been mistaken before. Needs to be set before the interpreter starts (https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module) so it's either 1. add I favour option 1. |
I guess we'll need to do some performance benchmarks to see what the performance hit is, but I can believe its much. |
Resolved by #550 |
The umi-tools algorithm has an unfortunate dependency
on the actual hash values python generates on strings
With the anti-DOS mitigations from python 3.3,
these change with every run.
This means, unless the user set's PYTHON_HASHSEED=0,
in addition to --random-seed, umi-tools output is non deterministic.
That is surprising, and also, undocumented except in github issues.
This PR forces the set to use the equivalent of PYTHON_HASHSEED=0
in the one place that matters to the test cases.
I have not performed performance benchmarks - couldn't find such a facility
in the repo.
There is one remaining test case failing 'group_adjacency',
but it also fails with PYTHON_HASHSEED=0 on a clean checkout
of either master or 1.1.1 - there is something else going on there.