consistency.get_best_set has unexpected results #5
Comments
Also, separately,

    def get_best_set(scores):
        Qmax = max(score["Q"] for score in scores.values())
        return {d for d, score in scores.items() if score["Q"] == Qmax}
Note that if you simply change from
Also note that I didn't delve too deeply into this (or into the associated paper), so I'm not sure if
Thanks for your detailed bug report @Andrew-Sheridan! This kind of report makes it really easy to figure out what the problem is, so is much appreciated! I was not aware of the unexpected behaviour of the max function, so thanks for alerting me to that. I've switched the code over to using

To briefly expand on why these cases occur: we detect the best dialect by computing what we call a "data consistency measure", Q, which is the product of a pattern score (P) that looks at how consistent the rows in the resulting table are, and the type score (T) that computes the ratio of cells with a known data type. Since T is between 0 and 1 we can skip dialects for which P is lower than the current maximum Q score we've seen already. These dialects get a Q score of

I'm preparing a fix now and will release an updated version of the package asap. Thanks again for reporting this problem!
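The pruning described in this reply can be sketched roughly as follows. This is only an illustration, not the package's actual code: `score_dialects`, `pattern_score` and `type_score` are hypothetical names, the numbers are made up, P is assumed to be non-negative, and marking skipped dialects with `None` is a choice made for this example only.

```python
def score_dialects(dialects, pattern_score, type_score):
    """Compute Q = P * T per dialect, skipping T when P cannot beat the best Q.

    Hypothetical helper: assumes P >= 0 and 0 <= T <= 1.
    """
    scores = {}
    q_best = float("-inf")
    for dialect in dialects:
        p = pattern_score(dialect)
        if p < q_best:
            # T <= 1 implies Q = P * T <= P < q_best, so this dialect cannot win.
            # None is this sketch's marker for "not computed".
            scores[dialect] = {"P": p, "T": None, "Q": None}
            continue
        t = type_score(dialect)
        q = p * t
        scores[dialect] = {"P": p, "T": t, "Q": q}
        q_best = max(q_best, q)
    return scores


# Made-up pattern and type scores for three illustrative dialect names.
P = {"comma": 0.8, "semicolon": 0.3, "tab": 0.6}
T = {"comma": 0.5, "semicolon": 1.0, "tab": 1.0}
result = score_dialects(P.keys(), P.__getitem__, T.__getitem__)
```

Here "semicolon" is skipped because its P (0.3) is already below the best Q seen so far (0.4), so its type score is never computed. Whatever marker is used for skipped dialects, any later step that takes a maximum over Q has to account for it.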
Summary
In short, if one of the scores has Q = nan, then the max score could be nan, which is weird.
Details
I was trying to get the dialect for a CSV file and was getting dialects = None. I dug around through the code and found two functions which may be part of the issue: get_best_set and consistency_scores.
I got some dialects using clevercsv.potential_dialects.get_dialects, and then some scores using clevercsv.consistency.consistency_scores.
I had a set of scores that looked like this:
There were many other scores; I have just grabbed the first three here.
I would expect the first dialect to be the "best" one, but that is not the output :(
Passing those scores into get_best_set returns an empty set.
get_best_set currently looks like this:
It just picks out the item which has the best Q score.
The line Qmax = max((score["Q"] for score in scores.values())) depends on the builtin max function. That function produces unexpected results when some of the values it is checking include nan. See: https://stackoverflow.com/questions/4237914/python-max-min-builtin-functions-depend-on-parameter-order
Because of that, Qmax could equal nan, and then the output of get_best_set will be the empty set. (It will not return the other entries that do have Q = nan, because float("nan") == float("nan") is False...)
Why are there nan values?
consistency_scores has nan as default values.
The defaults are set here:
So if scores has one score with a nan, then nan could be the result. Surely that is not correct.
I think that the defaults should be None, and then checks such as if Q is None could be made elsewhere.
Thoughts?
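To make the behaviour described above concrete, here is a small sketch that reproduces the order dependence of max and shows one possible nan-safe variant. `get_best_set_nan_safe` is a hypothetical helper written for this illustration, not part of clevercsv, and the scores and dialect names are made up rather than taken from the real output.

```python
import math

nan = float("nan")

# max() keeps the first item unless a later one compares greater, and every
# comparison against nan is False, so the result depends on where nan sits.
first_is_nan = max([nan, 0.5, 0.9])   # stays nan
nan_in_middle = max([0.5, nan, 0.9])  # nan is never selected here


def get_best_set_nan_safe(scores):
    """Return the dialects with the highest non-nan Q (hypothetical helper)."""
    valid = {d: s["Q"] for d, s in scores.items() if not math.isnan(s["Q"])}
    if not valid:
        return set()
    q_max = max(valid.values())
    return {d for d, q in valid.items() if q == q_max}


# Made-up scores; plain strings stand in for dialect objects.
example_scores = {
    "dialect_a": {"Q": nan},
    "dialect_b": {"Q": 0.75},
    "dialect_c": {"Q": 0.25},
}
best = get_best_set_nan_safe(example_scores)
```

With nan first in the input, the plain max call returns nan and an equality filter then matches nothing, which is the empty set seen in this report. Filtering nan out before taking the maximum avoids that. Switching the defaults to None would need an equivalent `is None` filter instead.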
Full set of scores: