cl.recap.mergers.find_docket_object sometimes matches dockets when it shouldn't #4256

Open
grossir opened this issue Jul 30, 2024 · 4 comments

Comments

@grossir
Contributor

grossir commented Jul 30, 2024

An example:

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder

The docket in CourtListener that would have been overwritten has docket number "1 CA-CV 23-0297-FC" and case name Van Camp v. Van Camp

The docket number for the Snyder case is "1 CA-CR 23-0297", which is a different case

So, the "docket_number_core" with value "230297" matches, but it shouldn't

This is a single example for Arizona, but on Sentry there are more records.
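
To make the collision concrete, here is a minimal sketch. It assumes a simplified core extraction that keeps only the two-digit year and the serial number; `naive_docket_number_core` is a hypothetical stand-in written for illustration, not the real `make_docket_number_core`.

```python
import re


def naive_docket_number_core(docket_number: str) -> str:
    """Hypothetical simplification: keep only the two-digit year and serial."""
    match = re.search(r"(\d{2})-(\d+)", docket_number)
    if match is None:
        # Unknown format: return a blank core rather than guessing.
        return ""
    year, serial = match.groups()
    return f"{year}{serial}"


civil = "1 CA-CV 23-0297-FC"
criminal = "1 CA-CR 23-0297"

# Both docket numbers reduce to the same core, so a lookup by core alone
# cannot tell the civil case from the criminal one.
print(naive_docket_number_core(civil), naive_docket_number_core(criminal))
```

The case-type tokens (CV, CR) and the "-FC" suffix are exactly the parts that get discarded, which is why the two cases end up looking identical.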

There is another example where the mismatch doesn't have a straightforward solution:

  • The oral argument with case name "In re: NEWMAN" has docket number 21-1228

  • The opinion with case name "Edgar G. C. v. Garland" has the same docket number

In some cases the match is correct, but the case name or another data point is slightly different: ca3


The offending logic is in this function:

async def find_docket_object(
    court_id: str,
    pacer_case_id: str | None,
    docket_number: str,
    using: str = "default",
) -> Docket:
    """Attempt to find the docket based on the parsed docket data. If cannot be
    found, create a new docket. If multiple are found, return the oldest.
    :param court_id: The CourtListener court_id to lookup
    :param pacer_case_id: The PACER case ID for the docket
    :param docket_number: The docket number to lookup.
    :param using: The database to use for the lookup queries.
    :return The docket found or created.
    """
    # Attempt several lookups of decreasing specificity. Note that
    # pacer_case_id is required for Docket and Docket History uploads.
    d = None
    docket_number_core = make_docket_number_core(docket_number)
    lookups = []
    if pacer_case_id:
        # Appellate RSS feeds don't contain a pacer_case_id, avoid lookups by
        # blank pacer_case_id values.
        lookups = [
            {
                "pacer_case_id": pacer_case_id,
                "docket_number_core": docket_number_core,
            },
            {"pacer_case_id": pacer_case_id},
        ]
    if docket_number_core and not pacer_case_id:
        # Sometimes we don't know how to make core docket numbers. If that's
        # the case, we will have a blank value for the field. We must not do
        # lookups by blank values. See: freelawproject/courtlistener#1531
        lookups.extend(
            [
                {
                    "pacer_case_id": None,
                    "docket_number_core": docket_number_core,
                },
                {"docket_number_core": docket_number_core},
            ]
        )
    elif docket_number and not pacer_case_id:
        # Finally, as a last resort, we can try the docket number. It might not
        # match b/c of punctuation or whatever, but we can try. Avoid lookups
        # by blank docket_number values.
        lookups.append(
            {"pacer_case_id": None, "docket_number": docket_number},
        )
    for kwargs in lookups:
        ds = Docket.objects.filter(court_id=court_id, **kwargs).using(using)
        count = await ds.acount()
        if count == 0:
            continue  # Try a looser lookup.
        if count == 1:
            d = await ds.afirst()
            if kwargs.get("pacer_case_id") is None and kwargs.get(
                "docket_number_core"
            ):
                d = confirm_docket_number_core_lookup_match(d, docket_number)
            if d:
                break  # Nailed it!
        elif count > 1:
            # Choose the oldest one and live with it.
            d = await ds.aearliest("date_created")
            if kwargs.get("pacer_case_id") is None and kwargs.get(
                "docket_number_core"
            ):
                d = confirm_docket_number_core_lookup_match(d, docket_number)
            if d:
                break
    if d is None:
        # Couldn't find a docket. Return a new one.
        return Docket(
            source=Docket.RECAP,
            pacer_case_id=pacer_case_id,
            court_id=court_id,
        )
    if using != "default":
        # Get the item from the default DB
        d = await Docket.objects.aget(pk=d.pk)
    return d


Sentry Issue: COURTLISTENER-7XG

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder
@mlissner
Member

So:

  • probably we shouldn't rely on docket_number_core for state court docket numbers we don't understand very well.
  • I don't understand the CA9 example. Their docket numbers should be consistent. Maybe a typo on the court website?
  • CA3, I guess it's reasonable to go with the latest value, maybe?

@grossir
Contributor Author

grossir commented Aug 27, 2024

I downloaded the details for 1236 events / logs from Sentry, which map to 486 unique dockets. I then manually inspected each court group and found that some are indeed mixing up dockets, while others match the correct docket but bring updated values that may not be better than the old ones.

Incorrect docket match

The fladistctapp and az errors come from how docket_number_core is computed: it ignores the parts of the docket number that signal differences between districts or process types.

In the case of ohio, the docket number is exactly the same; the scraper should be fixed to return a more detailed docket number.

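| Court domain              | Dockets with error logs | Reason                                       | Example                                                   |
|---------------------------|-------------------------|----------------------------------------------|-----------------------------------------------------------|
| 1dca.flcourts.gov         | 67                      | Mismatch across districts                    | '5D2023-0888' and '2D2023-0888'                           |
| 4dca.flcourts.gov         | 44                      |                                              |                                                           |
| 5dca.flcourts.gov         | 37                      |                                              |                                                           |
| 2dca.flcourts.gov         | 36                      |                                              |                                                           |
| 6dca.flcourts.gov         | 25                      |                                              |                                                           |
| 3dca.flcourts.gov         | 24                      |                                              |                                                           |
| www.supremecourt.ohio.gov | 20                      | Mismatch across counties                     | Docket number is the same '22CA15' Doc 1, Doc 2           |
| www.azcourts.gov          | 11                      | Mixing up Criminal and Civil docket numbers  | '1 CA-CR 23-0297' and '1 CA-CV 23-0297-FC' are matched    |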
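
As a rough illustration of a stricter check for the Florida and Arizona rows, the sketch below compares full docket numbers after stripping only case and punctuation, so district prefixes and case-type tokens still count. `normalize_docket_number` and `same_docket_number` are hypothetical helpers, not functions from the codebase.

```python
import re


def normalize_docket_number(docket_number: str) -> str:
    """Uppercase and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", docket_number.upper())


def same_docket_number(a: str, b: str) -> bool:
    """True only if both numbers are non-blank and equal after normalizing."""
    norm_a = normalize_docket_number(a)
    return bool(norm_a) and norm_a == normalize_docket_number(b)


print(same_docket_number("5D2023-0888", "2D2023-0888"))
print(same_docket_number("1 CA-CR 23-0297", "1 CA-CV 23-0297-FC"))
print(same_docket_number("1 CA-CR 23-0297", "1 ca-cr 23 0297"))
```

A check like this cannot help with the Ohio row, where the scraped docket number is literally identical across counties; that case needs a more detailed docket number from the scraper.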

There are also some one-off mix-ups. This one is due to a merger with harvard and lawbox:

https://www.courtlistener.com/api/rest/v3/dockets/1558364/
Original:  In Re Pauley
New     :  Travis Norwood v. Jonathan Frame, Superintendent, Mount Olive Correctional Complex and Jail

This one was caused by a typo on the scraped web page, where they put a 22 instead of a 20:

https://www.courtlistener.com/api/rest/v3/dockets/67836231/
Original:  1417 Belmont Community Dev., LLC v. District of Columbia
New     :  Lynch v. Ghaida

After fixing docket matching, we should find a way to separate the clusters mixed up by this error. Hopefully, it is limited to the courts in the table above.

Correct match, updated information

Assuming the matching problem is solved, we could decide whether to update the case name based on the length of the names. Newer case names are sometimes shorter and sometimes longer; I think longer case names carry more information because they include fuller party names.

Examples of updates where the names are worse

https://www.courtlistener.com/api/rest/v3/dockets/68730521/
Original:  Kevin Kulak v. Itshak On
New     :  Kulak v. Itshak On

=========================
https://www.courtlistener.com/api/rest/v3/dockets/2615014/
Original:  State of Delaware v. Hobbs.
New     :  State v. Amir Fatir f/k/a Sterling Hobbs

=========================
https://www.courtlistener.com/api/rest/v3/dockets/66774469/
Original:  Sunil M. Malkani v. Gemma Cunningham
New     :  Malkani v. Cunningham

Examples of updates where the names are better:

https://www.courtlistener.com/api/rest/v3/dockets/68437417/
Original:  Overwell Harvest, Limited v. Trading Technologies Internati
New     :  Overwell Harvest, Limited v. Trading Technologies International, Inc.
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68454533/
Original:  Kalispell v. Diablo Investments
New     :  City of Kalispell v. Diablo Investments

=========================
https://www.courtlistener.com/api/rest/v3/dockets/68941229/
Original:  Matter of M.N., YINC
New     :  Matter of M.N. and M.N., Youths in Need of Care.

Related, we could improve the case name parsing for these courts:

  • publicportal-api.alappeals.gov
https://www.courtlistener.com/api/rest/v3/dockets/68561913/
Original:  Ex parte The Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Civil Appeals: 2210486).
New     :  Ex parte Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Court of Civil Appeals: 2210486).
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68538816/
Original:  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS: CIVIL (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
New     :  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
=========================
https://www.courtlistener.com/api/rest/v3/dockets/68206850/
Original:  State v. Yuen
New     :  State v. Yuen. Dissenting Opinion by Recktenwald, C.J., in Which Ginoza, J., Joins. ICA Order of Correction, filed 09/26/2023 [ada]. ICA s.d.o., filed 09/22/2023 [ada]. Application for Writ of Certiorari, filed 12/18/2023. S.Ct. Order Accepting Application for Writ of Certiorari, filed 01/30/2024 [ada].

Case names end with a code

https://www.courtlistener.com/api/rest/v3/dockets/68979773/
Original:  Riversiders Against Increased Taxes v. City of Riverside CA4/2
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68975608/
Original:  Holguin Family Ventures v. County of Ventura CA2/6

@mlissner
Member

Super helpful analysis. I don't know the solution half as well as you do, but one thing I'll note is that the shorter case names tend to be the better ones, actually, but this is essentially the difference between case_name and case_name_full:

  • case_name: Shows the simplified case name: Lissner v. Fox
  • case_name_full: Shows the full case name: Michael Lissner v. Michael Fox

There's also case_name_short, of course, which is usually just the first party: Lissner.
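
A minimal sketch of how two candidate names for the same, correctly matched docket could be routed to these fields instead of one overwriting the other. The `CaseNames` container and `assign_case_names` helper are hypothetical, and taking the text before " v. " as the short name is an assumption; the real case_name_short logic is likely more involved.

```python
from dataclasses import dataclass


@dataclass
class CaseNames:
    case_name: str
    case_name_full: str
    case_name_short: str


def assign_case_names(old: str, new: str) -> CaseNames:
    """Treat the shorter candidate as case_name and the longer as case_name_full."""
    shorter, longer = sorted((old.strip(), new.strip()), key=len)
    # Assumption: the first party is whatever precedes " v. ".
    first_party = shorter.split(" v. ")[0]
    return CaseNames(
        case_name=shorter,
        case_name_full=longer,
        case_name_short=first_party,
    )


print(assign_case_names("Michael Lissner v. Michael Fox", "Lissner v. Fox"))
```

This only makes sense once the match itself is trusted; for a wrong match such as Van Camp v. Van Camp against State v. Snyder, it would just store two unrelated names.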

grossir added a commit to grossir/courtlistener that referenced this issue Aug 28, 2024
… scrapers

Related to freelawproject#4256

Docket numbers were mismatched for `az`, due to the order of queries in `recap.mergers.find_docket_object`

- created a new function `get_existing_docket` to be used by scrapers
- refactored the logger.error call for update values different than existing values. Now it will
only trigger for case_name, and only when it is too different (less than 50% of words in common).
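
For illustration, a word-overlap check along these lines might look like the sketch below. The tokenization, the decision to ignore "v" / "vs", and the choice of the shorter name as the denominator are all assumptions; the commit message does not specify them, and `word_overlap` and `names_too_different` are hypothetical names.

```python
import re

# Tokens that appear in almost every case name and carry no signal.
IGNORED_TOKENS = {"v", "vs"}


def word_overlap(old_name: str, new_name: str) -> float:
    """Fraction of the shorter name's words that also appear in the other."""
    old_words = set(re.findall(r"[a-z0-9]+", old_name.lower())) - IGNORED_TOKENS
    new_words = set(re.findall(r"[a-z0-9]+", new_name.lower())) - IGNORED_TOKENS
    if not old_words or not new_words:
        return 0.0
    common = old_words & new_words
    return len(common) / min(len(old_words), len(new_words))


def names_too_different(old_name: str, new_name: str, threshold: float = 0.5) -> bool:
    """True when fewer than `threshold` of the words are shared."""
    return word_overlap(old_name, new_name) < threshold


print(names_too_different("Van Camp v. Van Camp", "State v. Snyder"))
print(names_too_different("Kevin Kulak v. Itshak On", "Kulak v. Itshak On"))
```

Under these assumptions the Van Camp / Snyder pair would still be logged, while the Kulak pair, which differs only by a first name, would not.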
grossir added a commit that referenced this issue Sep 16, 2024
…er is an empty string

Related to #4256

Doing a Docket lookup when the docket number is empty will match false positives
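
A guard of this kind could be sketched as a filter over the lookup kwargs that drops any lookup containing an empty or whitespace-only string. `drop_blank_lookups` is a hypothetical helper, not the code from the commit. Note that None is kept, since `"pacer_case_id": None` is an intentional filter in `find_docket_object`.

```python
def drop_blank_lookups(lookups: list[dict]) -> list[dict]:
    """Remove lookups where any value is an empty or whitespace-only string."""

    def is_blank(value: object) -> bool:
        return isinstance(value, str) and not value.strip()

    return [
        kwargs
        for kwargs in lookups
        if not any(is_blank(value) for value in kwargs.values())
    ]


lookups = [
    {"pacer_case_id": None, "docket_number": ""},
    {"pacer_case_id": None, "docket_number": "No. 86"},
]
print(drop_blank_lookups(lookups))
```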
@grossir
Contributor Author

grossir commented Sep 16, 2024

Other courts that need special lookups for docket matching:

  • ny

Example docket, with docket number "No. 86", has 2 clusters assigned, one from 1983, another from 2024
