cl.recap.mergers.find_docket_object
sometimes matches dockets when it shouldn't
#4256
Comments
So:
I downloaded the details for 1236 events / logs from Sentry, which map to 486 unique dockets. Then I manually inspected each court group and found that some are indeed mixing up dockets, while others match the correct docket but bring updated values that may not be better than the old values.

Incorrect docket match

In the case of ohio, the docket number is exactly the same; the scraper should be fixed to return a more detailed docket number.
There are also some one-off mix-ups. This one is due to a merger with harvard and lawbox
This one was caused by a typo on the scraped web page, where they put a 22 instead of a 20
After fixing docket matching, we should find a way to separate the clusters mixed by this error. Hopefully, it is limited to the courts on the above table.

Correct match, updated information

Assuming the matching problem is solved, we could decide whether to update the case name based on the length of the names. Sometimes newer case names are shorter, sometimes longer, and I think longer case names carry more information because they have the fuller party names.

Examples of updates where the names are worse:
Examples of updates where the names are better:
Related, we could improve the case name parsing for these courts:
Case names end with a code
Super helpful analysis. I don't know the solution half as well as you do, but one thing I'll note is that the shorter case names tend to be the better ones, actually, but this is essentially the difference between
There's also |
… scrapers
Related to freelawproject#4256
Docket numbers were mismatched for `az`, due to the order of queries in `recap.mergers.find_docket_object`
- created a new function `get_existing_docket` to be used by scrapers
- refactored the logger.error call for updated values that differ from existing values. Now it will only trigger for case_name, and only when it is too different (less than 50% of words in common).
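The "less than 50% of words in common" rule can be sketched roughly as below. This is only an illustration, not the code from the commit: the function names, the tokenization, and the choice to divide by the word count of the shorter name are all assumptions.

```python
import re


def name_words(case_name: str) -> set[str]:
    """Lowercase the name and keep alphanumeric words only (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", case_name.lower()))


def word_overlap(old_name: str, new_name: str) -> float:
    """Share of words in common, relative to the shorter name (an assumption)."""
    old_words, new_words = name_words(old_name), name_words(new_name)
    if not old_words or not new_words:
        return 0.0
    shared = old_words & new_words
    return len(shared) / min(len(old_words), len(new_words))


def is_too_different(old_name: str, new_name: str, threshold: float = 0.5) -> bool:
    """True when the names share less than `threshold` of their words."""
    return word_overlap(old_name, new_name) < threshold
```

With this definition, "Van Camp v. Van Camp" against "State v. Snyder" shares only the word "v", so it would be flagged, while "Smith v. Jones" against "Smith v. Jones, Inc." would not.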
…er is an empty string
Related to #4256
Doing a Docket lookup when the docket number is empty will match false positives
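A guard against empty docket numbers could look something like the following. It is a minimal sketch over plain dictionaries rather than the Django ORM, and `find_by_docket_number` is a hypothetical helper, not a function from the repository.

```python
from typing import Optional


def find_by_docket_number(
    dockets: list[dict], court_id: str, docket_number: str
) -> Optional[dict]:
    """Return the docket matching court and docket number, or None.

    An empty or whitespace-only docket number never matches, because
    otherwise every docket with a blank number would be a candidate.
    """
    cleaned = docket_number.strip()
    if not cleaned:
        return None
    for docket in dockets:
        if docket["court_id"] == court_id and docket["docket_number"] == cleaned:
            return docket
    return None
```

Returning `None` early leaves it to the caller to create a new docket instead of overwriting an unrelated one.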
An example:
Docket 68295573 already has a case_name Van Camp v. Van Camp, different than new value State v. Snyder
The docket in CourtListener that would have been overwritten has docket number "1 CA-CV 23-0297-FC" and case name Van Camp v. Van Camp
The docket number for the Snyder case is "1 CA-CR 23-0297", which is a different case
So, the "docket_number_core" with value "230297" matches, but it shouldn't
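To make the collision concrete, here is a small sketch. `simplified_core` is a hypothetical stand-in that keeps only the two-digit year and the serial number; it is not the project's actual core function, and the stricter full-number comparison is just one possible approach.

```python
import re


def simplified_core(docket_number: str) -> str:
    """Hypothetical core: first 'YY-NNNN' group with the hyphen removed."""
    match = re.search(r"(\d{2})-(\d+)", docket_number)
    if match is None:
        return ""
    return match.group(1) + match.group(2)


def normalized(docket_number: str) -> str:
    """Uppercase and collapse whitespace, keeping case-type prefixes and suffixes."""
    return " ".join(docket_number.upper().split())


def is_same_docket(first: str, second: str) -> bool:
    """Require the full normalized docket numbers to agree, not only the cores."""
    if simplified_core(first) != simplified_core(second):
        return False
    return normalized(first) == normalized(second)
```

Both Arizona docket numbers above reduce to the same core, "230297", yet `is_same_docket` rejects the pair because the case-type parts ("CA-CV" versus "CA-CR") differ. A comparison this strict would also reject legitimate variants of one docket number, so it is a starting point rather than a fix.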
This is a single example for Arizona, but on Sentry there are more records.
There is another example where the mismatch doesn't have a straightforward solution:
The oral argument with case name "In re: NEWMAN" has docket number 21-1228
The opinion with case name "Edgar G. C. v. Garland" has the same docket number
There are some cases where it is a correct match, but the case name or another data point is slightly different: ca3
The offending logic is in this function
courtlistener/cl/recap/mergers.py
Lines 84 to 169 in 723b7ec
Sentry Issue: COURTLISTENER-7XG