Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashIdentity.Structural is a bad choice for Dictionary comparer #574

Closed
manofstick opened this issue Aug 8, 2015 · 5 comments
Closed

HashIdentity.Structural is a bad choice for Dictionary comparer #574

manofstick opened this issue Aug 8, 2015 · 5 comments

Comments

@manofstick
Copy link
Contributor

dict, groupBy & dict all use HashIdentity.Structural for the equality comparer, but it is not a good choice as it only supports partial equality relationship. i.e.

let nans = [| nan; nan |]

let dict1 = System.Collections.Generic.Dictionary System.Collections.Generic.EqualityComparer.Default
nans |> Seq.iter (fun n -> dict1.[n] <- n)

let dict2 = System.Collections.Generic.Dictionary HashIdentity.Structural
nans |> Seq.iter (fun n -> dict2.[n] <- n)

printfn "dict1.Count=%d" dict1.Count
printfn "dict2.Count=%d" dict2.Count
printfn "dict1.ContainsKey nan=%b" (dict1.ContainsKey nan)
printfn "dict2.ContainsKey nan=%b" (dict2.ContainsKey nan)

Which outputs the following

dict1.Count=1
dict2.Count=2
dict1.ContainsKey nan=true
dict2.ContainsKey nan=false

Now I'm not recommending the use of System.Collections.Generic.EqualityComparer.Default as this is obviously deficient in other ways, but rather an ER compatible version.

This would be a breaking change, although I think the current version is confusing, as I think most users would think it would act in the same way as the Default EqualityComparer, and most people would be using it in that fashion.

@dsyme
Copy link
Contributor

dsyme commented Aug 14, 2015

It's good to have this documented. However the results you're seeing from dict2 are not entirely unreasonable. NaN's are not equal, so if you store 100 NaNs you get a 100 NaNs - they all count as different. So I don't think a breaking change would be warranted for this at this point.

@manofstick
Copy link
Contributor Author

I wouldn't care either way I guess if this didn't have a performance impact, but it does. It means that PER comparison is used which means that we can't do back to the Iequatable.equals method. And even that I could tolerate if there was a valid reason. But floating point numbers ashould never be used for equality operations (and can and do have strange results on Intel processors given that the underlying calculations occur on 80 bit precision numbers, although they can have equivalent 64 but representations are not equal! Yes this had bitten me in a real system!) So I feel that we are supporting a configuration that is invalid from the beginning...

@dsyme
Copy link
Contributor

dsyme commented Aug 14, 2015

Hi @manofstick - could you add some perf figures for the performance difference here, especially for non-floating point key types? I need to get a picture for how it compares to the other costs for dict, groupBy and distinctBy. thanks

@manofstick
Copy link
Contributor Author

@dsyme

OK, first example. Trying to be a little bit realistic, I have based this a stackoverflow question by ttsiodras from from 2011. His gist is here. My modified version is here.

x64_new x64_old x86_new x86_old
custom dynamic 183.17 359.33 168.00 680.33
custom structural 212.00 359.67 200.33 586.50
custom default 181.00 183.00 165.00 164.67
value dynamic 1147.17 1900.50 1294.17 4892.17
value structural 1034.33 1016.00 1099.50 1072.00
value default 702.17 715.50 674.00 668.83
gen value dynamic 2178.00 4107.83 2010.83 9573.00
gen value structural 1829.17 3263.33 2068.67 6922.00
gen value default 1032.17 2509.67 1158.00 4000.50
ref dynamic 2245.17 2287.50 1565.17 4713.17
ref structural 1350.17 1339.50 1243.17 1242.00
ref default 1407.33 1399.50 1356.83 1408.50
gen ref dynamic 3148.67 4307.17 2661.33 9393.33
gen ref structural 2636.50 3984.17 2090.00 7000.33
gen ref default 2058.50 3600.50 1775.50 4789.83
tuple dynamic 498.50 1074.17 462.00 1955.83
tuple structural 428.83 419.33 404.83 405.83
tuple default 1226.17 1200.00 1237.00 1259.00

The figures to not here are the comparison of the "default" lines in comparison to the "structural" lines (or "dynamic"). The reason for this is that "default" is using the EqualityComparers,Default implementation which is calling IEquatable<> which would be available if we had ER equality symantics rather than PER equality semantics.

What should also be noted is that this example is quite dependent on the GetHashCode () implementation (as shown bty the difference between the "Custom" and the standard "Value" type times). And GetHashCode is unaffected by the PER symantics, so the results would be better than those noted.

(The "tuple default" is bad due to their custom treatment by the f# compiler)

So the results for the various types of ER vs PER is between about 70%-50% of the time, which is to my mind quite significant (although the whole things shows that really the right choice of data type is still the best way to get the best results!)

@dsyme
Copy link
Contributor

dsyme commented Jul 12, 2016

Closing old discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants