fix umi-tools dependency on python hash function values #470

TyberiusPrime · 2021-04-28T10:28:46Z

The umi-tools algorithm has an unfortunate dependency
on the actual hash values python generates on strings
With the anti-DOS mitigations from python 3.3,
these change with every run.

This means, unless the user set's PYTHON_HASHSEED=0,
in addition to --random-seed, umi-tools output is non deterministic.

That is surprising, and also, undocumented except in github issues.

This PR forces the set to use the equivalent of PYTHON_HASHSEED=0
in the one place that matters to the test cases.

I have not performed performance benchmarks - couldn't find such a facility
in the repo.

There is one remaining test case failing 'group_adjacency',
but it also fails with PYTHON_HASHSEED=0 on a clean checkout
of either master or 1.1.1 - there is something else going on there.

IanSudbery · 2021-04-28T14:03:40Z

Hi TyberiousPrime,

Thanks for your contribution. I'm afraid I'm not comfortable with the addition of an additional dependency without some research first, especially one that is not on conda as conda is our preferred installation route. As we want a pure conda install, this might mean convincing someone to add siphashc to bioconda or conda-forge, or even maintaining a recipe ourselves. We've worked hard to try to keep the number of dependencies as low as possible.

@TomSmithCGAT Thoughts?

TyberiusPrime · 2021-04-28T14:30:59Z

Possibly, you can sort by the actual string in the return of breadth_first_search,
fix up the test cases, and declare the new filtering canonical.

Or, to avoid a dependency, rip the siphashc code into the cython you have already.
Its license does permit that. It's only 80 lines of very straight c.

TomSmithCGAT · 2021-04-30T08:59:41Z

Thanks for the PR @TyberiusPrime. It's always a relief when an issue turns into a PR rather than another item on the never ending to-do list!

I agree with @IanSudbery that we don't want to add a dependency outside conda. On that basis that the licence permits it, I would have no objection to copying the siphashc code over (with clear accreditation in the code) if we decide to go down that route.

I'm in two minds as to whether sorting so it's completely determinative is a good idea. It feels like it should remain non-determinative by default since there are cases where multiple reads are all equally 'good' representative reads for a UMI group. On the other hand, if there's no downside for making it determinative, I can't really see any reason not too? I can't imagine any cases where sorting would detriment downstream analysis, though maybe that's a lack of imagination on my part. @IanSudbery?

IanSudbery · 2021-04-30T09:09:30Z

In other places where two reads are equivalent, like which read to keep as representative of a read bundle, the decsision is explicitly random (and therefore can be determined by the random seed).

TomSmithCGAT · 2021-04-30T09:25:16Z

Good point. In that case, any objections to adding a sort to get around this PYTHONHASHSEED stuff?

I had another quick look into whether it can be set within the script. Appears I may have been mistaken before. Needs to be set before the interpreter starts (https://stackoverflow.com/questions/32538764/unable-to-see-or-modify-value-of-pythonhashseed-through-a-module) so it's either 1. add sort, 2. change hash function (e.g siphashc), or 3. update documentation and leave it as it is.

I favour option 1.

IanSudbery · 2021-04-30T09:31:41Z

I guess we'll need to do some performance benchmarks to see what the performance hit is, but I can believe its much.

TomSmithCGAT · 2024-10-03T08:38:13Z

Resolved by #550

fix umi-tools dependency on python hash function values

6747626

TyberiusPrime force-pushed the fix_hashing branch from aeb4441 to 6747626 Compare April 28, 2021 10:52

This was referenced Aug 1, 2022

Representative sequences after deduplication not consistent between different runs #440

Closed

makes umi_tools deterministic with --random-seed #550

Merged

TomSmithCGAT closed this Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix umi-tools dependency on python hash function values #470

fix umi-tools dependency on python hash function values #470

TyberiusPrime commented Apr 28, 2021

IanSudbery commented Apr 28, 2021

TyberiusPrime commented Apr 28, 2021

TomSmithCGAT commented Apr 30, 2021

IanSudbery commented Apr 30, 2021

TomSmithCGAT commented Apr 30, 2021 •

edited

Loading

IanSudbery commented Apr 30, 2021

TomSmithCGAT commented Oct 3, 2024

fix umi-tools dependency on python hash function values #470

fix umi-tools dependency on python hash function values #470

Conversation

TyberiusPrime commented Apr 28, 2021

IanSudbery commented Apr 28, 2021

TyberiusPrime commented Apr 28, 2021

TomSmithCGAT commented Apr 30, 2021

IanSudbery commented Apr 30, 2021

TomSmithCGAT commented Apr 30, 2021 • edited Loading

IanSudbery commented Apr 30, 2021

TomSmithCGAT commented Oct 3, 2024

TomSmithCGAT commented Apr 30, 2021 •

edited

Loading