FullCaseCitation.group have badly formatted reporter #147
Comments
I'm actually not sure how groups is used these days, so I'm not sure how to respond to this without looking at the code. If groups is used to re-create the linkified text, then you don't want corrections in there. We did that once years ago, and it was a mistake because sometimes the corrections are imperfect, and that can be a real problem if you take a good citation with a bad reporter and convert it to something else (with a valid, but wrong reporter). Can you remind me how groups is used? Depending on that, it's either working correctly, or broken, like you say. |
The Would it work instead for you to change how I.e., instead of making a tuple of |
I'm over my head and explicitly bowing out, if you guys have a sense of the direction to go, but if the conversation goes stale or you need a BDFL, please feel free to flag it for me and I can try to understand what y'all are talking about. :) |
Yeah, I agree it would be technically possible there, but the idea of the |
Noted 👍 By the way, what do you think of the corner cases of corrected_citation? Any chance that we get them fixed in the near future?
Created the PR |
Whoops, it seems we have another problem with the comparison hash. The following code:
    from eyecite import get_citations
    cit_str = "482 S.E.2d 805"
    cit = get_citations(cit_str)[0]
    print(cit.comparison_hash())
will print a new value every time you run it, because Python adds randomness to the hash function (as stated here). What do you think of using the hashlib library instead?
    from hashlib import sha1
    import json
    def comparison_hash(self):
        # json.dumps serializes tuples as lists, so the encoded bytes are stable across runs
        tup = (str(type(self)), (self.groups["volume"], self.corrected_reporter(), self.groups["page"]))
        return sha1(json.dumps(tup).encode("utf-8")).hexdigest()
(Untested, but you get the idea)
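The run-to-run variation described above can be reproduced without eyecite. The following is a minimal sketch: it hashes the same tuple of strings in fresh interpreters started with different PYTHONHASHSEED values. The tuple contents and the helper name hash_in_fresh_interpreter are illustrative assumptions, not eyecite's actual hash inputs.

```python
import os
import subprocess
import sys

# Illustrative tuple only; eyecite's real comparison_hash inputs may differ.
SNIPPET = "print(hash(('FullCaseCitation', '482', 'S.E.2d', '805')))"


def hash_in_fresh_interpreter(seed: str) -> int:
    """Run SNIPPET in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same seed gives the same value; different seeds generally do not.
    print(hash_in_fresh_interpreter("1"), hash_in_fresh_interpreter("1"))
    print(hash_in_fresh_interpreter("2"))
```

Without an explicit PYTHONHASHSEED, each process picks a random seed, which is why the value changes between runs.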
Ah, interesting! It does seem to make sense to me that the hashes should be reproducible across runs. The |
You mean the |
Sure, using the hashlib seems fine to me. We have some simple wrappers for it in CL: https://github.com/freelawproject/courtlistener/blob/main/cl/lib/crypto.py I think I'd suggest SHA256 for this, though md5 would produce shorter hashes and be negligibly faster. It's less secure, but that doesn't matter here. |
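As a rough sketch of the SHA256 suggestion, assuming the hash is built from the citation type name, volume, reporter, and page (the function name stable_comparison_hash and its parameters are hypothetical, not part of eyecite or CourtListener):

```python
import hashlib
import json


def stable_comparison_hash(kind: str, volume: str, reporter: str, page: str) -> str:
    """Return a SHA-256 hex digest that does not depend on PYTHONHASHSEED."""
    # Serialize to a canonical JSON list so the bytes fed to the hash are fixed.
    payload = json.dumps([kind, volume, reporter, page], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(stable_comparison_hash("FullCaseCitation", "482", "S.E.2d", "805"))
```

The digest is a 64-character hex string. Swapping hashlib.sha256 for hashlib.md5 would shorten it to 32 characters, at the cost of weaker collision resistance, which is the trade-off mentioned above.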
Hey,
Thanks again for the library, we're using it every day :)
We're currently trying to come up with a string representation for citations that would be identical for all citations that are semantically the same, but would still be human-readable.
We've tried to use corrected_citation(), but it fails in some cases. Here are a few examples where the output is the same as matched_text(): 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702
Note that the corrected_citation_full function also fails to map these examples to the same string.
Because these limitations seem hard to fix, we gave up on the human-readable requirement and just want a unique representation, which is exactly what comparison_hash does.
This issue is about fixing a corner case that we spotted:
Here the corrected citation for the second string has the corrected reporter, but this was not propagated to the groups attribute, thus the comparison_hash is different.
Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.
I'm happy to open a PR if you think it makes sense
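To make the corner case concrete, here is a self-contained sketch that does not use eyecite itself. ToyCitation, REPORTER_VARIATIONS, and the "S.E. 2d" spelling are hypothetical stand-ins chosen for illustration; only the names groups and corrected_reporter mirror what is discussed in this issue.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical variation table for illustration; eyecite derives real
# corrections from its reporter database.
REPORTER_VARIATIONS = {"S.E. 2d": "S.E.2d"}


def _digest(parts: list) -> str:
    """Hash a list of strings deterministically."""
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


@dataclass
class ToyCitation:
    groups: dict  # raw regex groups, as matched in the text

    def corrected_reporter(self) -> str:
        raw = self.groups["reporter"]
        return REPORTER_VARIATIONS.get(raw, raw)

    def hash_from_groups(self) -> str:
        # Uses the uncorrected reporter, so spelling variants diverge.
        g = self.groups
        return _digest([g["volume"], g["reporter"], g["page"]])

    def hash_from_corrected(self) -> str:
        # Uses the corrected reporter, so spelling variants converge.
        g = self.groups
        return _digest([g["volume"], self.corrected_reporter(), g["page"]])


if __name__ == "__main__":
    a = ToyCitation({"volume": "482", "reporter": "S.E.2d", "page": "805"})
    b = ToyCitation({"volume": "482", "reporter": "S.E. 2d", "page": "805"})
    print(a.hash_from_groups() == b.hash_from_groups())
    print(a.hash_from_corrected() == b.hash_from_corrected())
```

In this toy version the raw groups stay untouched and only the hash input is corrected, which is one way to get matching hashes without rewriting the matched text.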