FullCaseCitation.group have badly formatted reporter #147

overmode · 2023-03-24T18:38:50Z

Hey,
Thanks again for the library, we're using it everyday :)

We're currently trying to come up with a unique string representation for citations, one that would be unique for all citations being semantically the same, but would still be human-readable.

We've tried to use the corrected_citation(), but it fails in some cases, here are a few examples where the output is the same as matched_text() : 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702
Note that the corrected_citation_full function also fails in mapping these examples to the same string.

Because these limitations seem to be hard to fix, we gave up on the human-readable requirement and just want to have a unique representation, so exactly what comparison_hash does.

This issue is about fixing a corner case that we spotted :

Cit1 : FullCaseCitation('506 U.S. 534', groups={'volume': '506', 'reporter': 'U.S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='539', year='1993', court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))
Cit2 : FullCaseCitation('506 U. S. 534', groups={'volume': '506', 'reporter': 'U. S.', 'page': '534'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year=None, court='scotus', plaintiff='STATES', defendant='STINSONStates', extra=None))]
Cit1.corrected_citation() : 506 U.S. 534
Cit2.corrected_citation() : 506 U.S. 534

Here the corrected citation for the second string has the corrected reporter, but this was not propagated to the groups attribute, thus the comparison_hash is different.
Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm happy to open a PR if you think it makes sense

The text was updated successfully, but these errors were encountered:

mlissner · 2023-03-27T13:49:58Z

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

I'm actually not sure how groups is used these days, so I'm not sure how to respond to this without looking at the code. If groups is used to re-create the linkified text, then you don't want corrections in there. We did that once years ago, and it was a mistake because sometimes the corrections are imperfect, and that can be a real problem if you take a good citation with a bad reporter and convert it to something else (with a valid, but wrong reporter).

Can you remind me how groups is used? Depending on that, it's either working correctly, or broken, like you say.

overmode · 2023-03-28T11:06:20Z

The following insights were derived from looking for '.groups' in the repo
- The reporter is not used in resolve for shortcase citations, nor in any other resolve method
- Seems like you explicitly wanted groups['reporter'] to store the reporter_found field, as shown in this update from 2021-06-14 (search for 'groups')
based on a search for 'reporter' in the repo, there seem to be no reason why we would need to store the original reporter in groups
About annotating citations, this is done using the span, which is stored in the Token class and based on what was matched by the regex. it seems like you never reuse the reporter or any other field in groups to recover the span. Thus, changing the content of reporter in groups seems to be safe.

mattdahl · 2023-03-28T12:04:39Z

Ideally, I guess we would expect the reporter to be corrected before writing it in the groups.

The Citation.groups dict is set on init to the content of the Token.groups dict, which is itself generated from calling groupdict() on the matched regex object during parsing. (Hence the variable name "groups" bubbles up, though it is a bit of a misnomer at the Citation object level.) So there is no way to "correct" the reporter prior to writing it into the groups dict; indeed, the correction process presupposes that the matched reporter key is already in the groups dict.

Would it work instead for you to change how CitationBase objects are hashed here? https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L117

I.e., instead of making a tuple of self.groups.items(), make a tuple of self.groups['volume'], self.corrected_reporter(), self.groups['page']? N.b. this might have to be overridden at the ResourceCitation level instead because not all citation object types have a corrected_reporter() method.

overmode · 2023-03-28T12:35:34Z

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

mlissner · 2023-03-28T14:10:17Z

I'm over my head and explicitly bowing out, if you guys have a sense of the direction to go, but if the conversation goes stale or you need a BDFL, please feel free to flag it for me and I can try to understand what y'all are talking about. :)

mattdahl · 2023-03-28T15:38:54Z

@mattdahl thanks for the explanation. I don't see why it would not be possible to "correct" the reporter in the post_init function just after self.groups is set, but modifying the hash should indeed be sufficient to fix this comparison_hash, I can do the change if you think it's the way to go

Yeah, I agree it would be technically possible there, but the idea of the groups dict is to store the actual results from calling m.groupdict(). I can't give a specific example right now of why we need to retain this, but conceptually it seems dangerous to modify the dict contents post-hoc to be something that wasn't actually matched from the regex. So I would suggest changing the hash function instead, if that's all right with you!

overmode · 2023-03-28T16:49:26Z

Noted 👍 By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

overmode · 2023-03-29T10:17:38Z

Created the PR

overmode · 2023-03-29T10:58:53Z

Whoops, it seems like we have another problem with the comparison hash.
My colleague @alessandro-ciffo found that the comparison hash changes every time you run a new script.

The following code :

frm eyecite import get_citations
cit_str = "482 S.E.2d 805"
cit = get_citations(cit_str)[0]
print(cit.comparison_hash())

Will print a new value every time you run it, because python adds randomness to the hash function (as stated here).

What do you think of using the hashlib library instead ?
We could use something like

from hashlib import sha1
import json
def comparison_hash(self) :
    tup = (str(type(self)), tuple(self.groups["volume"], self.corrected_reporter(), self.groups["page"]))
    return sha1(json.dumps(tup).encode('utf-8'))

(Probably not a valid syntax but you get the idea)

mattdahl · 2023-03-29T14:00:14Z

Ah, interesting! It does seem to make sense to me that the hashes should be reproducible across runs. The hashlib module (part of the standard library, right?) seems reasonable to me, but I would summon @mlissner back for this

mattdahl · 2023-03-29T14:10:18Z

By the way, what do you think of the corner cases of the corrected_citation ? Any chance that we get them fixed in the near future ?

You mean the FullLawCitation corner cases with 5 U.S.C. §702, 5 U.S.C. § 702, 5 U.S.C. §§702, 5 U.S.C. §702, etc.? I agree that it does seem like those should all be normalized to a common string (rather than just returning the result of matched_text()). The spacing issue could be fixed easily, but the double § symbol might be more complicated. We would have to make sure that we only normalized §§ to § in the case where there is indeed only one section listed (as in the example). In any case, the trick would be to override the corrected_citation() method on the FullLawCitation class here: https://github.com/freelawproject/eyecite/blob/main/eyecite/models.py#L280

mlissner · 2023-03-29T22:45:29Z

Sure, using the hashlib seems fine to me. We have some simple wrappers for it in CL:

https://github.com/freelawproject/courtlistener/blob/main/cl/lib/crypto.py

I think I'd suggest SHA256 for this, though md5 would produce shorter hashes and be negligibly faster. It's less secure, but that doesn't matter here.

mattdahl mentioned this issue Jul 15, 2023

Issue 147, 154 - Improve citation object hashing behavior #155

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FullCaseCitation.group have badly formatted reporter #147

FullCaseCitation.group have badly formatted reporter #147

overmode commented Mar 24, 2023 •

edited

Loading

mlissner commented Mar 27, 2023

overmode commented Mar 28, 2023

mattdahl commented Mar 28, 2023 •

edited

Loading

overmode commented Mar 28, 2023 •

edited

Loading

mlissner commented Mar 28, 2023

mattdahl commented Mar 28, 2023 •

edited

Loading

overmode commented Mar 28, 2023

overmode commented Mar 29, 2023

overmode commented Mar 29, 2023

mattdahl commented Mar 29, 2023

mattdahl commented Mar 29, 2023

mlissner commented Mar 29, 2023 •

edited

Loading

FullCaseCitation.group have badly formatted reporter #147

FullCaseCitation.group have badly formatted reporter #147

Comments

overmode commented Mar 24, 2023 • edited Loading

mlissner commented Mar 27, 2023

overmode commented Mar 28, 2023

mattdahl commented Mar 28, 2023 • edited Loading

overmode commented Mar 28, 2023 • edited Loading

mlissner commented Mar 28, 2023

mattdahl commented Mar 28, 2023 • edited Loading

overmode commented Mar 28, 2023

overmode commented Mar 29, 2023

overmode commented Mar 29, 2023

mattdahl commented Mar 29, 2023

mattdahl commented Mar 29, 2023

mlissner commented Mar 29, 2023 • edited Loading

overmode commented Mar 24, 2023 •

edited

Loading

mattdahl commented Mar 28, 2023 •

edited

Loading

overmode commented Mar 28, 2023 •

edited

Loading

mattdahl commented Mar 28, 2023 •

edited

Loading

mlissner commented Mar 29, 2023 •

edited

Loading