Remove dependence upon PYTHONHASHSEED #541

mbauman · 2017-04-12T17:47:09Z

The result from the search for predicates depends upon the initial order of the set of predicates. That order depends upon the value of PYTHONHASHSEED since it's using a set. I've not thoroughly looked for other places where this causes a dependency, but it'd be nice to support fully reproducible runs with just random.seed and numpy.random.seed.
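
To illustrate the dependency, here is a minimal sketch that prints the iteration order of a set of strings in fresh interpreters started with different `PYTHONHASHSEED` values. The predicate names and the `set_order` helper are made up for illustration and are not taken from the dedupe code base.

```python
import os
import subprocess
import sys

# Hypothetical predicate names, used only for illustration.
SNIPPET = (
    "print(list({'wholeField', 'tokenField', 'firstToken', "
    "'sameThreeChar', 'commonInteger', 'nearInteger'}))"
)


def set_order(seed):
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same seed: same order. Different seeds: the order usually differs,
# even though random.seed and numpy.random.seed are never involved.
orders = {seed: set_order(seed) for seed in range(10)}
for seed, order in orders.items():
    print(seed, order)
```

Because string hashing is fixed when the interpreter starts, calling `random.seed` inside the process cannot change this order; only a new process with a different `PYTHONHASHSEED` can.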

fgregg · 2017-04-12T17:57:14Z

what would that look like?

mbauman · 2017-04-12T18:04:48Z

Something like orderedset would be sufficient, since it just uses insertion order. Of course the predicate selection algorithm still shuffles it, but the initial state would then be deterministic. It's not a big deal, but it'd make reproducible runs a little easier since it wouldn't require starting a new Python interpreter with a fixed PYTHONHASHSEED.
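
As a rough sketch of the idea (not the dedupe implementation), the snippet below removes duplicates while keeping insertion order, then shuffles with an explicitly seeded generator. It uses the built-in `dict.fromkeys` as a stand-in for an ordered set; the candidate names and both helper functions are hypothetical.

```python
import random


def ordered_unique(items):
    """Drop duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (guaranteed from Python 3.7;
    on older versions collections.OrderedDict.fromkeys does the same job).
    """
    return list(dict.fromkeys(items))


# Hypothetical predicate names, used only for illustration.
candidates = ["wholeField", "tokenField", "wholeField", "firstToken", "tokenField"]


def shuffled_predicates(seed):
    """Shuffle the de-duplicated predicates with a seeded generator."""
    rng = random.Random(seed)
    predicates = ordered_unique(candidates)
    rng.shuffle(predicates)
    return predicates


# The starting order no longer depends on PYTHONHASHSEED, so the same seed
# yields the same shuffled order in any interpreter session.
print(ordered_unique(candidates))
print(shuffled_predicates(42) == shuffled_predicates(42))
```

With a plain `set`, the list handed to the shuffle could differ between processes, so the same seed could still produce different results.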

fgregg · 2017-04-12T18:05:58Z

what's the performance cost for that?

mbauman · 2017-04-12T18:16:54Z

I'm not sure; I've not evaluated any potential solutions yet. Theoretically it may not have a major cost. In fact, Python's built-in dictionaries moved to an insertion-ordered implementation in CPython 3.6, in part because the new layout turned out to be more compact and efficient.

fgregg · 2017-04-12T18:21:17Z

If you provide a PR with benchmarks, this can move forward.

mbauman · 2017-04-12T18:59:58Z

Do you have a benchmark suite? Or any micro-benchmarks? Or would you just be interested in the overall run time for a complete deduplication run?
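
For reference, a micro-benchmark could look roughly like the sketch below. It only compares building a `set` against building an insertion-ordered `dict` from the same items using `timeit`; the data and function names are invented, and it says nothing about the run time of a complete deduplication run.

```python
import timeit

# Hypothetical workload: 5000 items with 500 distinct values.
items = [f"predicate_{i % 500}" for i in range(5000)]


def with_set():
    """Deduplicate with a hash-ordered set."""
    return set(items)


def with_ordered():
    """Deduplicate with an insertion-ordered dict used as an ordered set."""
    return dict.fromkeys(items)


# Take the best of several repeats to reduce noise from other processes.
set_time = min(timeit.repeat(with_set, number=200, repeat=5))
ordered_time = min(timeit.repeat(with_ordered, number=200, repeat=5))

print(f"set:           {set_time:.4f} s for 200 runs")
print(f"dict.fromkeys: {ordered_time:.4f} s for 200 runs")
```

The absolute numbers depend on the machine and Python version, so they would need to be measured rather than assumed.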

fgregg · 2017-04-12T19:26:11Z

dedupe-examples would be fine.

fgregg · 2017-12-28T01:54:42Z

no response on this, so I'm closing

fgregg closed this as completed Dec 28, 2017

fgregg mentioned this issue Mar 26, 2018: Setting seed value #643 (Closed)

jdarling mentioned this issue Feb 26, 2020: A way to set the seed and/or pickup where you left off dedupeio/csvdedupe#96 (Open)

github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022
