`cl.recap.mergers.find_docket_object` sometimes matches dockets when it shouldn't #4256

grossir · 2024-07-30T17:44:40Z

An example:

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder

The docket in CourtListener that would have been overwritten has docket number "1 CA-CV 23-0297-FC" and case name Van Camp v. Van Camp

The docket number for the Snyder case is "1 CA-CR 23-0297", which is a different case

So, the "docket_number_core" with value "230297" matches, but it shouldn't

This is a single example for Arizona, but on Sentry there are more records.
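
To make the collision concrete, here is a minimal sketch. `naive_docket_number_core` is a hypothetical stand-in, not the real `make_docket_number_core`; it only assumes that the core keeps the two-digit year and the serial number and drops everything else, which is consistent with the "230297" value above.

```python
import re


def naive_docket_number_core(docket_number: str) -> str:
    """Hypothetical stand-in for make_docket_number_core (not the real code).

    Keeps only the two-digit year and the serial number, dropping the
    division and case-type tokens around them.
    """
    match = re.search(r"(\d{2})-(\d+)", docket_number)
    if match is None:
        return ""  # Unknown format: return a blank core.
    year, serial = match.groups()
    return f"{year}{serial}"


civil = "1 CA-CV 23-0297-FC"  # Van Camp v. Van Camp
criminal = "1 CA-CR 23-0297"  # State v. Snyder

# Both docket numbers reduce to the same core, so a lookup by core alone
# cannot tell the civil case from the criminal one.
assert naive_docket_number_core(civil) == "230297"
assert naive_docket_number_core(criminal) == "230297"
```

The "CR" / "CV" tokens and the "-FC" suffix are exactly the parts that distinguish the two cases, and they are the parts this kind of reduction throws away.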

There is another example where the mismatch doesn't have a straightforward solution:

The oral argument with case name "In re: NEWMAN" has docket number 21-1228
The opinion with case name "Edgar G. C. v. Garland" has the same docket number

There are also cases where the match is correct, but the case name or another data point is slightly different: ca3

The offending logic is in this function

courtlistener/cl/recap/mergers.py

Lines 84 to 169 in 723b7ec

    async def find_docket_object(
        court_id: str,
        pacer_case_id: str | None,
        docket_number: str,
        using: str = "default",
    ) -> Docket:
        """Attempt to find the docket based on the parsed docket data. If cannot be
        found, create a new docket. If multiple are found, return the oldest.

        :param court_id: The CourtListener court_id to lookup
        :param pacer_case_id: The PACER case ID for the docket
        :param docket_number: The docket number to lookup.
        :param using: The database to use for the lookup queries.
        :return The docket found or created.
        """
        # Attempt several lookups of decreasing specificity. Note that
        # pacer_case_id is required for Docket and Docket History uploads.
        d = None
        docket_number_core = make_docket_number_core(docket_number)
        lookups = []
        if pacer_case_id:
            # Appellate RSS feeds don't contain a pacer_case_id, avoid lookups by
            # blank pacer_case_id values.
            lookups = [
                {
                    "pacer_case_id": pacer_case_id,
                    "docket_number_core": docket_number_core,
                },
                {"pacer_case_id": pacer_case_id},
            ]
        if docket_number_core and not pacer_case_id:
            # Sometimes we don't know how to make core docket numbers. If that's
            # the case, we will have a blank value for the field. We must not do
            # lookups by blank values. See: freelawproject/courtlistener#1531
            lookups.extend(
                [
                    {
                        "pacer_case_id": None,
                        "docket_number_core": docket_number_core,
                    },
                    {"docket_number_core": docket_number_core},
                ]
            )
        elif docket_number and not pacer_case_id:
            # Finally, as a last resort, we can try the docket number. It might not
            # match b/c of punctuation or whatever, but we can try. Avoid lookups
            # by blank docket_number values.
            lookups.append(
                {"pacer_case_id": None, "docket_number": docket_number},
            )
        for kwargs in lookups:
            ds = Docket.objects.filter(court_id=court_id, **kwargs).using(using)
            count = await ds.acount()
            if count == 0:
                continue  # Try a looser lookup.
            if count == 1:
                d = await ds.afirst()
                if kwargs.get("pacer_case_id") is None and kwargs.get(
                    "docket_number_core"
                ):
                    d = confirm_docket_number_core_lookup_match(d, docket_number)
                if d:
                    break  # Nailed it!
            elif count > 1:
                # Choose the oldest one and live with it.
                d = await ds.aearliest("date_created")
                if kwargs.get("pacer_case_id") is None and kwargs.get(
                    "docket_number_core"
                ):
                    d = confirm_docket_number_core_lookup_match(d, docket_number)
                if d:
                    break
        if d is None:
            # Couldn't find a docket. Return a new one.
            return Docket(
                source=Docket.RECAP,
                pacer_case_id=pacer_case_id,
                court_id=court_id,
            )
        if using != "default":
            # Get the item from the default DB
            d = await Docket.objects.aget(pk=d.pk)
        return d

Sentry Issue: COURTLISTENER-7XG

Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder


mlissner · 2024-07-30T19:51:39Z

So:

1. Probably we shouldn't rely on docket_number_core for state court docket numbers we don't understand very well.
2. I don't understand the CA9 example. Their docket numbers should be consistent. Maybe a typo on the court website?
3. CA3, I guess it's reasonable to go with the latest value, maybe?

grossir · 2024-08-27T14:38:17Z

I downloaded the details for 1236 events / logs from Sentry, which map to 486 unique dockets. Then, I manually inspected each court group, and found that some are indeed mixing up dockets, and some others are matching the correct docket, but bring updated values which may not be better than the old values

Incorrect docket match

The fladistctapp and az errors are due to an error when computing the docket_number_core, which ignores the parts of the docket number that signal differences between districts or process types

In the case of ohio, the docket number is exactly the same across counties; the scraper should be fixed to return a more detailed docket number

| Court domain | Dockets with error logs | Reason | Example |
| --- | --- | --- | --- |
| 1dca.flcourts.gov | 67 | Mismatch across districts | '5D2023-0888' and '2D2023-0888' |
| 4dca.flcourts.gov | 44 | | |
| 5dca.flcourts.gov | 37 | | |
| 2dca.flcourts.gov | 36 | | |
| 6dca.flcourts.gov | 25 | | |
| 3dca.flcourts.gov | 24 | | |
| www.supremecourt.ohio.gov | 20 | Mismatch across counties | Docket number is the same '22CA15' Doc 1, Doc 2 |
| www.azcourts.gov | 11 | Mixing up Criminal and Civil docket numbers | '1 CA-CR 23-0297' and '1 CA-CV 23-0297-FC' are matched |
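
One possible guard for the fladistctapp and az rows, sketched under assumptions: after a docket_number_core match, compare the full docket numbers once punctuation and case are normalized. The helper names `normalize_docket_number` and `same_docket_number` are hypothetical and are not part of CourtListener.

```python
import re


def normalize_docket_number(docket_number: str) -> str:
    """Uppercase the docket number and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", docket_number.upper())


def same_docket_number(existing: str, incoming: str) -> bool:
    """Hypothetical strict check to run after a docket_number_core match."""
    return normalize_docket_number(existing) == normalize_docket_number(incoming)


# Florida: the district prefix differs, so the match is rejected.
assert not same_docket_number("5D2023-0888", "2D2023-0888")
# Arizona: the case type (CR vs CV) and the -FC suffix differ.
assert not same_docket_number("1 CA-CR 23-0297", "1 CA-CV 23-0297-FC")
# Differences in case or punctuation alone still match.
assert same_docket_number("1 CA-CR 23-0297", "1 ca-cr 23-0297")
# Ohio: identical strings from different counties still match here, so
# this check does not help; the scraper has to return more detail.
assert same_docket_number("22CA15", "22CA15")
```

A check like this is deliberately strict, so it would also reject correct matches where the same case is written with and without a suffix. That trade-off would need to be evaluated per court.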

There are also some one-off mix-ups. This one is due to a merger with harvard and lawbox

https://www.courtlistener.com/api/rest/v3/dockets/1558364/
Original:  In Re Pauley
New     :  Travis Norwood v. Jonathan Frame, Superintendent, Mount Olive Correctional Complex and Jail

This one was caused by a typo on the scraped web page, where they put a 22 instead of a 20

https://www.courtlistener.com/api/rest/v3/dockets/67836231/
Original:  1417 Belmont Community Dev., LLC v. District of Columbia
New     :  Lynch v. Ghaida

After fixing docket matching, we should find a way to separate the clusters mixed up by this error. Hopefully, it is limited to the courts in the table above

Correct match, updated information

Assuming the matching problem is solved, we could decide to update the case name based on the length of the names. Sometimes newer case names are shorter; sometimes longer; and I think longer case names have more information by having the fuller party names

Examples of updates where the names are worse

https://www.courtlistener.com/api/rest/v3/dockets/68730521/
Original:  Kevin Kulak v. Itshak On
New     :  Kulak v. Itshak On

=========================
https://www.courtlistener.com/api/rest/v3/dockets/2615014/
Original:  State of Delaware v. Hobbs.
New     :  State v. Amir Fatir f/k/a Sterling Hobbs

=========================
https://www.courtlistener.com/api/rest/v3/dockets/66774469/
Original:  Sunil M. Malkani v. Gemma Cunningham
New     :  Malkani v. Cunningham

Examples of updates where the names are better:

https://www.courtlistener.com/api/rest/v3/dockets/68437417/
Original:  Overwell Harvest, Limited v. Trading Technologies Internati
New     :  Overwell Harvest, Limited v. Trading Technologies International, Inc.
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68454533/
Original:  Kalispell v. Diablo Investments
New     :  City of Kalispell v. Diablo Investments

=========================
https://www.courtlistener.com/api/rest/v3/dockets/68941229/
Original:  Matter of M.N., YINC
New     :  Matter of M.N. and M.N., Youths in Need of Care.

Related, we could improve the case name parsing for these courts:

publicportal-api.alappeals.gov

https://www.courtlistener.com/api/rest/v3/dockets/68561913/
Original:  Ex parte The Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Civil Appeals: 2210486).
New     :  Ex parte Housing Authority of the City of Talladega. PETITION FOR WRIT OF CERTIORARI TO THE COURT OF CIVIL APPEALS (In re: Harold Wallace v. The Housing Authority of the City of Talladega) (Talladega Circuit Court: CV-18-900509 Court of Civil Appeals: 2210486).
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68538816/
Original:  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS: CIVIL (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
New     :  Ex parte Morgan Stanford and Matthew Hogue. PETITION FOR WRIT OF MANDAMUS (In re: Morgan Stanford and Matthew Hogue v. HCP Properties, LLC)(Jefferson Circuit Court: 22-901106).
=========================

www.courts.state.hi.us

https://www.courtlistener.com/api/rest/v3/dockets/68206850/
Original:  State v. Yuen
New     :  State v. Yuen. Dissenting Opinion by Recktenwald, C.J., in Which Ginoza, J., Joins. ICA Order of Correction, filed 09/26/2023 [ada]. ICA s.d.o., filed 09/22/2023 [ada]. Application for Writ of Certiorari, filed 12/18/2023. S.Ct. Order Accepting Application for Writ of Certiorari, filed 01/30/2024 [ada].

www.courts.ca.gov

Case names end with a code

https://www.courtlistener.com/api/rest/v3/dockets/68979773/
Original:  Riversiders Against Increased Taxes v. City of Riverside CA4/2
=========================

https://www.courtlistener.com/api/rest/v3/dockets/68975608/
Original:  Holguin Family Ventures v. County of Ventura CA2/6
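
A minimal sketch of stripping that trailing code, assuming it always looks like the two examples above ("CA", a district number, a slash, a division number). The pattern is a guess based only on these two case names and has not been checked against other www.courts.ca.gov output; `strip_trailing_court_code` is a hypothetical helper.

```python
import re

# Assumed shape of the trailing code: whitespace, "CA", digits, "/", digits.
TRAILING_COURT_CODE = re.compile(r"\s+CA\d+/\d+$")


def strip_trailing_court_code(case_name: str) -> str:
    """Remove a trailing code such as 'CA4/2' from a scraped case name."""
    return TRAILING_COURT_CODE.sub("", case_name)


assert (
    strip_trailing_court_code(
        "Riversiders Against Increased Taxes v. City of Riverside CA4/2"
    )
    == "Riversiders Against Increased Taxes v. City of Riverside"
)
assert (
    strip_trailing_court_code("Holguin Family Ventures v. County of Ventura CA2/6")
    == "Holguin Family Ventures v. County of Ventura"
)
# Names without the code are left untouched.
assert strip_trailing_court_code("State v. Yuen") == "State v. Yuen"
```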

mlissner · 2024-08-27T18:14:53Z

Super helpful analysis. I don't know the solution half as well as you do, but one thing I'll note is that the shorter case names tend to be the better ones, actually, but this is essentially the difference between case_name and case_name_full:

case_name: Shows the simplified case name: Lissner v. Fox
case_name_full: Shows the full case name: Michael Lissner v. Michael Fox

There's also case_name_short, of course, which is usually just the first party: Lissner.

… scrapers
Related to freelawproject#4256
Docket numbers were mismatched for `az`, due to the order of queries in `recap.mergers.find_docket_object`
- created a new function `get_existing_docket` to be used by scrapers
- refactored the logger.error call for update values different than existing values. Now it will only trigger for case_name, and only when it is too different (less than 50% of words in common).
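
The "less than 50% of words in common" rule could look roughly like the sketch below. This is not the code from the commit: the tokenization, the choice to measure overlap against the shorter name, and the function names are all assumptions made for illustration.

```python
import re


def words(case_name: str) -> set[str]:
    """Lowercase the case name and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", case_name.lower()))


def case_names_too_different(old: str, new: str, threshold: float = 0.5) -> bool:
    """Return True when the two names share less than `threshold` of their words.

    Assumption: overlap is measured against the shorter name, so that
    'Kulak v. Itshak On' still matches 'Kevin Kulak v. Itshak On'.
    """
    old_words, new_words = words(old), words(new)
    if not old_words or not new_words:
        return False  # Nothing to compare, so nothing to report.
    shared = len(old_words & new_words)
    return shared / min(len(old_words), len(new_words)) < threshold


# The mismatch from the original report would still be logged...
assert case_names_too_different("Van Camp v. Van Camp", "State v. Snyder")
# ...while a correct match with a slightly different name would not.
assert not case_names_too_different("Kevin Kulak v. Itshak On", "Kulak v. Itshak On")
```

In this sketch the word "v" counts as a shared word, which inflates the overlap for very short names. Whether to ignore such words is a design choice this example leaves open.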


grossir · 2024-09-16T20:13:23Z

Other courts that need special lookups for docket matching:

ny

Example docket, with docket number "No. 86", has 2 clusters assigned, one from 1983, another from 2024

grossir mentioned this issue Aug 27, 2024

Disambiguate ohioctapp docket numbers freelawproject/juriscraper#1135

Open

grossir mentioned this issue Aug 28, 2024

Disambiguate fladistctapp docket numbers freelawproject/juriscraper#1136

Closed

grossir mentioned this issue Aug 28, 2024

fix(scrapers.utils.update_or_create_docket): correct docket match for scrapers #4365

Merged

grossir added a commit that referenced this issue Sep 16, 2024

fix(scrapers.utils.get_existing_docket): return None when docket_numb…

7f2e950

…er is an empty string Related to #4256 Doing a Docket lookup when the docket number is empty will match false positives
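
The false positive described in this commit can be illustrated with an in-memory sketch. `FakeDocket` and `get_existing_docket_sketch` are hypothetical and do not reflect the real signature of `get_existing_docket`; they only show why a blank docket number should short-circuit the lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDocket:
    """Hypothetical stand-in for the Docket model, for illustration only."""

    court_id: str
    docket_number: str


def get_existing_docket_sketch(
    dockets: list[FakeDocket], court_id: str, docket_number: str
) -> Optional[FakeDocket]:
    """Return a matching docket, or None when the docket number is blank."""
    if not docket_number or not docket_number.strip():
        # A blank value would match any stored docket that is also blank.
        return None
    for docket in dockets:
        if docket.court_id == court_id and docket.docket_number == docket_number:
            return docket
    return None


stored = [FakeDocket("az", ""), FakeDocket("az", "1 CA-CR 23-0297")]

# Without the guard, "" would match the first stored docket.
assert get_existing_docket_sketch(stored, "az", "") is None
assert get_existing_docket_sketch(stored, "az", "1 CA-CR 23-0297") is stored[1]
```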

grossir mentioned this issue Sep 16, 2024

fix(scrapers.utils.get_existing_docket): return None when docket_number is an empty string #4462

Merged

grossir mentioned this issue Oct 9, 2024

Some dcd docket numbers need disambiguation freelawproject/juriscraper#1199

Open
