Remove dependence upon PYTHONHASHSEED #541
Comments
what would that look like? |
Something like orderedset would be sufficient — it just uses insertion order. Of course the predicate selection algorithm still shuffles it, but the initial state is now deterministic. It's not a big deal, but it'd make reproducible runs a little easier since it wouldn't require starting a new python instance. |
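To illustrate the idea in the comment above, here is a minimal sketch of insertion-ordered deduplication followed by a seeded shuffle. It is not dedupe's code: the predicate names and the `ordered_unique` and `shuffled` helpers are hypothetical, and it uses a plain `dict` (insertion-ordered in Python 3.7+) as a stand-in for an ordered-set package.

```python
import random

def ordered_unique(items):
    """Drop duplicates but keep first-seen order (hypothetical helper)."""
    # dict preserves insertion order, so its keys act as an ordered set.
    return list(dict.fromkeys(items))

def shuffled(candidates, seed):
    """Return a shuffled copy using a private, seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)
    out = list(candidates)
    rng.shuffle(out)
    return out

# Made-up predicate names, for illustration only.
predicates = ["whole_field", "first_token", "whole_field", "common_integer"]

candidates = ordered_unique(predicates)

# The starting order no longer depends on string hashing, so the same
# seed gives the same shuffle within a single interpreter session.
first = shuffled(candidates, 42)
second = shuffled(candidates, 42)
print(candidates, first == second)
```

With a built-in `set` of strings, the starting order passed to the shuffle could differ between interpreter launches unless `PYTHONHASHSEED` is fixed.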
what's the performance cost for that? |
I'm not sure; I've not evaluated any potential solutions yet. Theoretically it may not have a major cost — in fact Python's builtin dictionaries recently moved to an ordered implementation since they found its performance to be advantageous. |
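As a rough starting point for the cost question, a micro-benchmark along these lines could compare building a `set` against building an insertion-ordered `dict`. This is only a sketch: the item names and sizes are arbitrary assumptions, and it says nothing about the run time of a full deduplication.

```python
import timeit

# Arbitrary made-up workload: 1000 distinct string keys.
items = [f"predicate_{i}" for i in range(1000)]

def build_set():
    # Unordered container, iteration order depends on string hashes.
    return set(items)

def build_ordered():
    # Insertion-ordered stand-in for an ordered set (Python 3.7+).
    return dict.fromkeys(items)

# Time 200 constructions of each container.
set_time = timeit.timeit(build_set, number=200)
ordered_time = timeit.timeit(build_ordered, number=200)

print(f"set:  {set_time:.4f}s")
print(f"dict: {ordered_time:.4f}s")
```

Timings will vary by machine and Python version, so any conclusion would need to be checked against a real workload.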
If you provide a PR with benchmarks, this can move forward.
On Wed, Apr 12, 2017 at 1:16 PM, Matt Bauman wrote:
I'm not sure; I've not evaluated any potential solutions yet.
Theoretically it may not have a major cost — in fact Python's builtin
dictionaries recently moved to an ordered implementation since they found
its performance to be advantageous.
|
Do you have a benchmark suite? Or any micro-benchmarks? Or would you just be interested in the overall run time for a complete deduplication run? |
dedupe-examples would be fine.
On Wed, Apr 12, 2017 at 1:59 PM, Matt Bauman wrote:
Do you have a benchmark suite? Or any micro-benchmarks? Or would you just
be interested in the overall run time for a complete deduplication run?
|
no response on this, so I'm closing |
The result from the search for predicates depends upon the initial order of the set of predicates. That order depends upon the value of PYTHONHASHSEED since it's using a set. I've not thoroughly looked for other places where this causes a dependency, but it'd be nice to support fully reproducible runs with just random.seed and numpy.random.seed.
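The dependency described in this issue can be checked with a small script like the one below. It is a sketch, not part of dedupe: the `order_under_seed` helper and the sample names are hypothetical. It launches fresh interpreters under different `PYTHONHASHSEED` values and records the iteration order of a `set` of strings next to that of an insertion-ordered `dict`.

```python
import os
import subprocess
import sys

def order_under_seed(expr, seed):
    """Run a fresh interpreter with the given PYTHONHASHSEED and
    return the printed iteration order of `expr` (hypothetical helper)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = (
        "names = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']; "
        "print(list(%s))" % expr
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

seeds = range(5)

# Distinct orderings observed across seeds for each container type.
set_orders = {order_under_seed("set(names)", s) for s in seeds}
dict_orders = {order_under_seed("dict.fromkeys(names)", s) for s in seeds}

print("distinct set orders: ", len(set_orders))
print("distinct dict orders:", len(dict_orders))
```

The `dict` ordering is the same under every seed. The `set` ordering will usually vary between seeds, though that is not guaranteed for any particular pair of seeds.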