
fix umi-tools dependency on python hash function values #470

Closed
wants to merge 1 commit

Conversation

TyberiusPrime

The umi-tools algorithm has an unfortunate dependency
on the actual hash values Python generates for strings.
With the anti-DoS mitigations introduced in Python 3.3,
these change with every run.

This means that unless the user sets PYTHONHASHSEED=0,
in addition to --random-seed, umi-tools output is non-deterministic.

That is surprising, and also undocumented except in GitHub issues.
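The randomization described above is easy to demonstrate. This sketch (not part of the PR; the UMI string 'ACGT' is just an example) spawns fresh interpreters to show that hash() of a string is only stable across runs when PYTHONHASHSEED is pinned:

```python
import os
import subprocess
import sys

def string_hash(seed_env: str) -> int:
    """Run hash('ACGT') in a fresh interpreter under the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed_env)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('ACGT'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With PYTHONHASHSEED=0, two separate runs agree:
assert string_hash("0") == string_hash("0")

# With randomization enabled, two runs almost always disagree:
print(string_hash("random"), string_hash("random"))
```

Anything that iterates over a set or dict of strings inherits this per-run ordering, which is how it leaks into umi-tools' output.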

This PR forces the equivalent of PYTHONHASHSEED=0
in the one place that matters to the test cases.

I have not run performance benchmarks; I couldn't find a benchmarking
facility in the repo.

There is one remaining failing test case, 'group_adjacency',
but it also fails with PYTHONHASHSEED=0 on a clean checkout
of either master or 1.1.1, so something else is going on there.

@IanSudbery
Member

Hi TyberiusPrime,

Thanks for your contribution. I'm afraid I'm not comfortable adding another dependency without some research first, especially one that is not on conda, since conda is our preferred installation route. As we want a pure conda install, this might mean convincing someone to add siphashc to bioconda or conda-forge, or even maintaining a recipe ourselves. We've worked hard to keep the number of dependencies as low as possible.

@TomSmithCGAT Thoughts?

@TyberiusPrime
Author

Possibly, you could sort by the actual string in the return value of breadth_first_search,
fix up the test cases, and declare the new filtering canonical.

Or, to avoid a dependency, rip the siphashc code into the Cython you already have.
Its license does permit that; it's only 80 lines of very straightforward C.
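A run-independent replacement for the built-in hash() is also available from the standard library. This is only an illustration of the idea, not the siphashc code referred to above:

```python
import hashlib

def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash of a string: identical across runs and
    unaffected by PYTHONHASHSEED, unlike the built-in hash()."""
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

# Same input, same value, every run:
print(stable_hash("ACGT"))
```

The trade-off is speed: a cryptographic digest is slower than the built-in hash, which is why a keyed non-cryptographic hash like SipHash is the more attractive drop-in.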

@TomSmithCGAT
Member

Thanks for the PR @TyberiusPrime. It's always a relief when an issue turns into a PR rather than another item on the never-ending to-do list!

I agree with @IanSudbery that we don't want to add a dependency outside conda. On the basis that the licence permits it, I would have no objection to copying the siphashc code over (with clear attribution in the code) if we decide to go down that route.

I'm in two minds as to whether sorting so that the output is completely deterministic is a good idea. It feels like it should remain non-deterministic by default, since there are cases where multiple reads are all equally 'good' representative reads for a UMI group. On the other hand, if there's no downside to making it deterministic, I can't really see any reason not to. I can't imagine any cases where sorting would harm downstream analysis, though maybe that's a lack of imagination on my part. @IanSudbery?

@IanSudbery
Member

In other places where two reads are equivalent, such as which read to keep as representative of a read bundle, the decision is explicitly random (and can therefore be controlled by the random seed).

@TomSmithCGAT
Member

TomSmithCGAT commented Apr 30, 2021

Good point. In that case, any objections to adding a sort to get around this PYTHONHASHSEED stuff?

I had another quick look into whether it can be set within the script. It appears I was mistaken before: it needs to be set before the interpreter starts (https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module), so the options are: 1. add a sort, 2. change the hash function (e.g. siphashc), or 3. update the documentation and leave it as it is.

I favour option 1.
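Option 1 amounts to something like the following. Here breadth_first_search is a stand-in for a search that returns a set of UMI strings, as in the discussion above; the toy bfs and adjacency list are invented for illustration:

```python
def deterministic_component(search, node, adj_list):
    """Wrap a search that returns a set of UMIs so that downstream
    iteration order no longer depends on the per-run string hash."""
    found = search(node, adj_list)  # a set: iteration order is hash-dependent
    return sorted(found)            # lexicographic order is stable across runs

def bfs(node, adj):
    """Toy connected-component search over an adjacency list."""
    seen, stack = {node}, [node]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

adj = {"AAT": ["ATT"], "ATT": ["AAT", "AGT"], "AGT": ["ATT"]}
print(deterministic_component(bfs, "AAT", adj))  # ['AAT', 'AGT', 'ATT']
```

The single sorted() call is the entire fix: the set's contents were already deterministic, only the order in which they were visited varied between runs.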

@IanSudbery
Member

I guess we'll need to run some performance benchmarks to see what the performance hit is, but I can't believe it's much.

@TomSmithCGAT
Member

Resolved by #550
