Getting full citation span #135

overmode · 2023-01-12T15:27:31Z

Hi, thank you for the great library!

Problem description

I am preparing a dataset in which I would like to mask some citations, e.g. by replacing them with "[CITATION]".
I could not find a way to get the full span of the citation. Only the normalized part (volume, reporter, page) is covered by the built-in span() method, as shown below.

import eyecite

citations = [
    'Commonwealth v. Gibson, 561 A.2d 1240 1242',
    'Commonwealth v. Bauer, 604 A.2d 1098 (Pa.Super. 1992)',
]

for citation in citations:
    print('\n', '='*20)
    extracted_citation = eyecite.get_citations(citation)[0]
    print(extracted_citation)

    # span() returns (start, end) offsets of the matched citation text
    start_idx, end_idx = extracted_citation.span()

    before_cit = citation[:start_idx]
    cit_text = citation[start_idx:end_idx]
    after_cit = citation[end_idx:]
    print(f"{before_cit} [BEGIN] {cit_text} [END] {after_cit}")

Output:

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Commonwealth v. Bauer,  [BEGIN] 604 A.2d 1098 [END]  (Pa.Super. 1992)

One can see that the span covers only part of the citation text.
If possible, I would like to avoid using regex to recover the full span.
Adding up the lengths of the citation's attributes (plaintiff, defendant, etc.) does not seem to be a viable solution either, because in the second example the "Pa.Super." text is missing from the metadata (court=None).

Desired behavior

It would be nice to have a full_span() method such that, if I use it instead of span() in the above example, I get:

====================
FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
 [BEGIN]Commonwealth v. Gibson, 561 A.2d 1240 1242[END]

 ====================
FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
[BEGIN]Commonwealth v. Bauer,  604 A.2d 1098 (Pa.Super. 1992)[END]
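
Once full (start, end) offsets are available from whatever source, the masking step itself is simple. The sketch below is only an illustration: mask_spans is a hypothetical helper, not part of eyecite, and the span is located with str.index on a hardcoded string rather than computed by the library. Spans are replaced from right to left so that earlier offsets stay valid.

```python
from typing import List, Tuple


def mask_spans(text: str, spans: List[Tuple[int, int]], token: str = "[CITATION]") -> str:
    """Replace each (start, end) character range in text with token.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    # Process the rightmost span first so earlier offsets are not shifted.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text


text = "See Commonwealth v. Gibson, 561 A.2d 1240 1242 for details."
full = "Commonwealth v. Gibson, 561 A.2d 1240 1242"

# Stand-in for the offsets a full-span method would return.
start = text.index(full)
masked = mask_spans(text, [(start, start + len(full))])
print(masked)  # See [CITATION] for details.
```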

Specs

eyecite version : 2.4.0


flooie · 2023-01-12T16:25:32Z

Hey @overmode

Thanks for the write-up. There is a method on FullCaseCitation called corrected_citation_full.

It returns the full normalized string.

        from eyecite import get_citations

        citations = [
            'the asdf asdf the asdfa sd Commonwealth v. Gibson, 561 A.2d 1240 1242 asdf asdf asdf ',
            'Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992)'
        ]
        for text in citations:
            # Print the normalized full citation for the first match
            print(get_citations(text)[0].corrected_citation_full())

When you run it, it returns the full citation including the party names, but I believe there is a bug in how it handles dates and courts.

If you want to take a look at eyecite.models.FullCaseCitation.corrected_citation_full and fix the bug related to date and court, it would return something like

Commonwealth v. Gibson, 561 A.2d 1240
Commonwealth v. Bauer, 604 A.2d 1098 (pasuperct 1992)

for the example above.

mlissner · 2023-01-12T17:43:13Z

Is the idea, @overmode, to remove all citations to make it better training data?

mlissner · 2023-01-12T17:51:37Z

One other thing to know, @overmode, is that the way we identify the name of the case is very sloppy. It just uses heuristics around where it finds a "v.", if it finds one, and otherwise just grabs the average length of a case name, I think. It's hardcoded at around 30 tokens, IIRC.

overmode · 2023-01-13T09:45:15Z

Hey, thanks for the quick reply.
@mlissner Indeed, the idea is to build a training set for a machine learning application.

I took note of how the case name is identified. It's OK if the recall of citation extraction is not excellent, because I have many documents anyway, but I will need a way to tell whether the parsing went well so that I at least get good precision.

@flooie I tried the eyecite.models.FullCaseCitation.corrected_citation_full method, and it does break on the second example:

====================
EXTRACTED : FullCaseCitation('561 A.2d 1240', groups={'volume': '561', 'reporter': 'A.2d', 'page': '1240'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite='1242', year=None, court=None, plaintiff='Commonwealth', defendant='Gibson', extra=None))
CORRECTED_CITATION_FULL : Commonwealth v. Gibson, 561 A.2d 1240, 1242
CITATION SPAN : Commonwealth v. Gibson,  [BEGIN] 561 A.2d 1240 [END]  1242

 ====================
EXTRACTED : FullCaseCitation('604 A.2d 1098', groups={'volume': '604', 'reporter': 'A.2d', 'page': '1098'}, metadata=FullCaseCitation.Metadata(parenthetical=None, pin_cite=None, year='1992', court=None, plaintiff='Commonwealth', defendant='Bauer', extra=None))
Error executing job with overrides: []
Traceback (most recent call last):
  File "check_samples.py", line 59, in main
    print('CORRECTED_CITATION_FULL :', extracted_citation.corrected_citation_full())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in corrected_citation_full
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/eyecite/models.py", line 361, in <genexpr>
    publisher_date = " ".join(m[i] for i in (m.court, m.year) if i)
TypeError: 'Metadata' object is not subscriptable

This is not exactly what I would like, though, because it is not the exact text that was matched (notice the added comma between the page numbers).
Is there a better way to recover the originally matched text?

Also, is this the bug you pointed out? I'm open to submitting a PR in case there is no better workaround, so I would appreciate any insights you already have.

[UPDATE]
I fixed the bug by replacing the line with publisher_date = " ".join(i for i in (m.court, m.year) if i)
The extracted full citation for the second example becomes
Commonwealth v. Bauer, 604 A.2d 1098 (1992

The parenthesis is not closed because in eyecite.models, line 362, we have

if publisher_date:
    parts.append(f" ({publisher_date}")

I assume that a closing parenthesis is missing at the end of the f-string.
Does it make sense for the Pa. Super. not to be included here?
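
Both problems described above (indexing Metadata with its own values, and the unclosed parenthesis) can be reproduced and fixed in isolation. The following is a minimal sketch with a stand-in Metadata dataclass and a hypothetical publisher_date_suffix function; it is not eyecite's actual code, only the same logic extracted for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Stand-in for FullCaseCitation.Metadata (illustration only)."""
    court: Optional[str] = None
    year: Optional[str] = None


def publisher_date_suffix(m: Metadata) -> str:
    """Build the ' (court year)' suffix, skipping missing fields."""
    # Join the values themselves; m[i] fails because Metadata is not subscriptable.
    publisher_date = " ".join(v for v in (m.court, m.year) if v)
    # Close the parenthesis that the original f-string left open.
    return f" ({publisher_date})" if publisher_date else ""


print(repr(publisher_date_suffix(Metadata(year="1992"))))
print(repr(publisher_date_suffix(Metadata(court="pasuperct", year="1992"))))
```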

mattdahl · 2023-01-18T02:54:25Z

Just chiming in here since I saw your PR (#136) and was surprised that this wasn't already possible! Thanks for implementing it!

Separate from your changes in the PR, I was also curious about the court issue. It seems that the Pa.Super. is not being extracted properly because the citation_string listed for the PA Superior Court is "Pa. Super. Ct." (line 46902 here: https://github.com/freelawproject/courts-db/blob/main/courts_db/data/courts.json). The problem is the space between "Pa." and "Super.". This also seems like something that should be fixed. Would it cause problems to just ignore whitespace when matching court abbreviations here: https://github.com/freelawproject/eyecite/blob/main/eyecite/helpers.py#L52? This may be related to the changes proposed in #129.

overmode · 2023-01-18T09:11:30Z

@mattdahl Thanks!
Maybe we should consider moving away from exact string matching and use a simple regex instead?
For instance, with r'\s*pa\s*super\s*' we would not depend on the spacing, and we could also make it robust to punctuation.
I don't think it would hurt speed much.

flooie · 2023-01-18T15:01:30Z

@overmode every time I see the words simple and regex I get nervous. I'm not sure I see how this relatively simple situation is resolved with regex.

overmode · 2023-01-18T15:36:57Z

I understand, regexes are powerful but scale badly.
Well, the equivalent in plain Python here would be to remove all punctuation and spaces, and then look for 'pasuper'.
I think the question was more whether it would fail in some corner cases, and you are much more knowledgeable than I am.
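
The plain-Python variant could look roughly like the sketch below. The function names normalize_court and matches_court are hypothetical, and the sketch assumes the year has already been separated from the parenthetical, so only the court text is passed in. It lowercases both sides, drops punctuation and whitespace, and then does a prefix comparison.

```python
import string

# Translation table that deletes every punctuation and whitespace character.
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_court(s: str) -> str:
    """Lowercase and remove punctuation and whitespace."""
    return s.translate(_DROP).lower()


def matches_court(paren_text: str, citation_string: str) -> bool:
    """True if the normalized text is a prefix of the normalized citation string."""
    needle = normalize_court(paren_text)
    # An empty needle would be a prefix of everything, so reject it.
    return bool(needle) and normalize_court(citation_string).startswith(needle)


print(normalize_court("Pa.Super."))                  # pasuper
print(matches_court("Pa.Super.", "Pa. Super. Ct."))  # True
```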

mlissner · 2023-01-19T00:35:07Z

For the court issue, the question is essentially, "What bad things will happen if we broaden how we match court strings against the text?"

Honestly, I don't think anybody knows. Right now we do two things. We:

1. Strip the punctuation with strip_punct, and we
2. Use startswith to get around terminal periods, which sometimes seem to interfere

If we went a step further and matched with regexes or by taking out whitespace, would we have false matches? I don't know, but I know how to check!

If we want to run this down, I think the trick is to look at the citation_string values for every court in the courts DB and see what happens if you strip out spaces in addition to stripping out punctuation. I think it might be fine, but what we'd want to watch out for are two courts with nearly identical citation strings that start to overlap because of this. If that analysis turns up no new collisions, I'd say yeah, let's add a third step to how we normalize and compare citation strings.
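
A check along those lines can be sketched as follows. This is an assumption-laden illustration, not the analysis that was actually run: find_collisions is a hypothetical helper, the sample records are made up, and the "id" key is assumed (only citation_string is named in this thread). The idea is to group courts by normalized citation string, once without and once with whitespace removal, and see which groups are new.

```python
import string
from collections import defaultdict
from typing import Dict, List


def find_collisions(courts: List[Dict[str, str]], strip_spaces: bool) -> Dict[str, List[str]]:
    """Map each normalized citation string shared by 2+ courts to their ids."""
    drop = string.punctuation + (string.whitespace if strip_spaces else "")
    table = str.maketrans("", "", drop)
    groups = defaultdict(set)
    for court in courts:
        key = court["citation_string"].translate(table).lower()
        groups[key].add(court["id"])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Made-up sample data; the real input would be the courts DB records.
sample = [
    {"id": "a", "citation_string": "Pa. Super. Ct."},
    {"id": "b", "citation_string": "Pa.Super.Ct."},
    {"id": "c", "citation_string": "N.Y. Cty. Ct."},
    {"id": "d", "citation_string": "N.Y. Cty. Ct."},
]

before = find_collisions(sample, strip_spaces=False)
after = find_collisions(sample, strip_spaces=True)
# Groups that only collide once whitespace is removed.
new_collisions = [ids for ids in after.values() if ids not in before.values()]
print(new_collisions)  # [['a', 'b']]
```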

mattdahl · 2023-01-19T02:33:21Z

Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae

It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.
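
One possible alternative to taking the first match is to collect every candidate and return a court only when the match is unambiguous. The sketch below is just one such policy, offered as an assumption rather than a proposal for eyecite's actual behavior: match_court and normalize are hypothetical names, and all court records except the "pasuperct" id are made up. It prefers a unique exact match, falls back to a unique prefix match, and otherwise returns None.

```python
import string
from typing import Dict, List, Optional

_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize(s: str) -> str:
    """Lowercase and remove punctuation and whitespace."""
    return s.translate(_DROP).lower()


def match_court(paren_text: str, courts: List[Dict[str, str]]) -> Optional[str]:
    """Return a court id only if exactly one court matches; else None."""
    needle = normalize(paren_text)
    if not needle:
        return None
    keyed = [(normalize(c["citation_string"]), c["id"]) for c in courts]
    exact = {cid for key, cid in keyed if key == needle}
    if len(exact) == 1:
        return exact.pop()
    if exact:
        return None  # several courts share this exact string
    prefix = {cid for key, cid in keyed if key.startswith(needle)}
    return prefix.pop() if len(prefix) == 1 else None


courts = [
    {"id": "pa", "citation_string": "Pa."},
    {"id": "pasuperct", "citation_string": "Pa. Super. Ct."},
    {"id": "x1", "citation_string": "N.Y. Cty. Ct."},
    {"id": "x2", "citation_string": "N.Y. Cty. Ct."},
]

print(match_court("Pa.Super.", courts))      # pasuperct
print(match_court("N.Y. Cty. Ct.", courts))  # None (ambiguous)
```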

mlissner · 2023-01-19T15:03:46Z

> However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

Yeah, that jumped out at me too. @flooie what's your take on that?

flooie · 2023-01-25T19:42:51Z

> Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae
>
> It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented.

flooie · 2023-01-25T19:44:29Z

@mattdahl - we had imported a lot of low-level county and town courts, and in NY a few of those courts had been generated with the parent court's citation string.

For example, New York County Court has 50+ county courts, and they were generated with N.Y. Cty. Ct. as the citation string instead of NY Cty. Ct., Suffolk Cty., etc. I went through and fixed the 100 or so collisions.

mattdahl · 2023-01-25T20:11:51Z

Nice!! The only duplicate left is N.Y. Cty. Ct., Nassau Cty. -- is that intentional?

flooie · 2023-01-25T20:25:03Z

No, ha, that's just a duplicate court. I'll strip that in a second. I have a few things to add about courts and citation strings. I'll add them momentarily.


flooie added bug Something isn't working enhancement New feature or request labels Jan 12, 2023

mattdahl added a commit to mattdahl/eyecite that referenced this issue Feb 23, 2023

test(find): Adds failing test for court string without space (freelaw…

7e24369

…project#135).

mattdahl added a commit to mattdahl/eyecite that referenced this issue Feb 23, 2023

test(find): Adds failing test for court string without space (freelaw…

3b2fe09

…project#135).

mattdahl mentioned this issue Feb 23, 2023

Fix court string matching with whitespace #144

Open

mattdahl added a commit to mattdahl/eyecite that referenced this issue Jul 6, 2023

test(find): Adds failing test for court string without space (freelaw…

3fb350d

…project#135).

flooie self-assigned this Jul 6, 2023

flooie added this to @flooie's backlog Jul 6, 2023

github-project-automation bot moved this to 🆕 New in @flooie's backlog Jul 6, 2023

mattdahl added a commit to mattdahl/eyecite that referenced this issue Sep 22, 2023

test(find): Adds failing test for court string without space (freelaw…

1f408cc

…project#135).

mattdahl mentioned this issue Sep 22, 2023

Court name issues #129

Open

flooie added this to Case Law Sprint Nov 18, 2024

flooie moved this to General Backlog in Case Law Sprint Nov 19, 2024
