Enhance Juriscraper to Support Bundling of Separate Opinions #883

flooie · 2024-01-24T16:28:23Z

Issue Description:

A handful of courts publish separate opinions as individual entries in their opinion lists, which juriscraper and CourtListener (CL) do not currently support. Without support for bundling separate opinions, case information can end up scraped and processed in incomplete or fragmented form.

Suggested Enhancement:

I propose updating juriscraper to allow for the bundling of separate opinions. This enhancement would ensure that all opinions related to a case are collected and processed together, providing a more comprehensive view of the case proceedings and decisions.

Courts: (in progress list)

mlissner · 2024-01-24T20:57:53Z

To be clear here, what you're proposing is upgrading Juriscraper to return multiple opinion objects under one key, like we have with clusters/opinions in CL itself, right? Assuming so, can you provide a link or screenshot or something as an example?

flooie · 2024-01-24T21:25:39Z

Yes. I was working this through in my head before laying out my vision.

flooie · 2024-01-24T21:32:39Z

{'date': '2/14/2023', 'docket': 'SC20164', 'name': 'State v. Juan A. G.-P.', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20627', 'name': 'CT Freedom Alliance, LLC v. Dept. of Education', 'opinion_type': '010combined'}

{'date': '1/31/2023', 'docket': 'SC20633', 'name': 'Devine v. Fusaro', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20679', 'name': 'Grant v. Commissioner of Correction', 'opinion_type': '010combined'}

{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '010combined'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '030concurrence'}
{'date': '1/24/2023', 'docket': 'SC20371', 'name': 'State v. Brandon', 'opinion_type': '040dissent'}

{'date': '1/24/2023', 'docket': 'SC20597', 'name': 'Solon v. Slater', 'opinion_type': '010combined'}

{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '010combined'}
{'date': '1/17/2023', 'docket': 'SC20453', 'name': 'State v. James A.', 'opinion_type': '030concurrence'}

I fixed and rewrote part of the Connecticut scraper to take advantage of the opinion_type changes. The entries above are excerpts from self.cases.

We can take these results, call a method here to combine the multiple opinions into clusters, and only slightly modify CL to save each opinion together with its cluster.
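A minimal sketch of what such a combining method might look like, assuming opinions that share a docket number and date belong to the same cluster. The function name bundle_opinions, the grouping key, and the output shape are illustrative assumptions, not part of juriscraper.

```python
from collections import defaultdict

# Flat entries in the same shape as the self.cases excerpts above.
cases = [
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "010combined"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "030concurrence"},
    {"date": "1/24/2023", "docket": "SC20371", "name": "State v. Brandon", "opinion_type": "040dissent"},
    {"date": "1/24/2023", "docket": "SC20597", "name": "Solon v. Slater", "opinion_type": "010combined"},
]


def bundle_opinions(cases):
    """Group flat case dicts into clusters keyed by (docket, date).

    Hypothetical helper: the grouping key and the output shape are
    assumptions made for illustration only.
    """
    grouped = defaultdict(list)
    for case in cases:
        grouped[(case["docket"], case["date"])].append(case)

    clusters = []
    for (docket, date), members in grouped.items():
        clusters.append(
            {
                "docket": docket,
                "date": date,
                "name": members[0]["name"],
                # The type codes sort lexicographically: 010 < 030 < 040.
                "opinions": sorted(
                    ({"opinion_type": m["opinion_type"]} for m in members),
                    key=lambda o: o["opinion_type"],
                ),
            }
        )
    return clusters


clusters = bundle_opinions(cases)
```

With the sample entries, State v. Brandon becomes one cluster holding three opinions and Solon v. Slater becomes a cluster with a single opinion.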

mlissner · 2024-01-24T21:53:06Z

I'd expect this to mirror the fields in CL pretty closely. Why not do the joining in JS so that CL has a nice JSON object of clusters with nested opinions?

Supports new juriscraper scraper class and returned objects, and also keeps legacy interface - Supports: freelawproject/juriscraper#883 - Supports: freelawproject/juriscraper#889

grossir · 2024-01-31T17:32:14Z

I checked the changes required on Courtlistener to support this new paradigm, while still supporting the legacy scrapers. I found the following:

We can return a nested object, but we must keep a minimal interface (dict keys) for compatibility with the duplicate-checking tasks in cl_scrape_opinions:

{
    "Docket": {...},
    "OpinionCluster": {...},
    "Opinion": {...},
    "case_names": "",  # for site.hash dup checking
    "download_urls": "",  # for sha1 checking of url content
    "precedential_statuses": "",  # for sha1 checking in case of nev
    "case_dates": "",  # for sorting and dup checking
}

Even if we return objects of the following shape, we would still have to return an item for each opinion (because of dup checking), causing a somewhat ugly duplication:

{
    "OpinionCluster": {
        "Opinion": [
            {...},
            {...}
        ]
    }
}

Returning objects that are ready to be created means we must pass all required values.
This object approach gives us more flexibility to add fields as we find them, without having to modify CL each time.
The returned objects can be validated with a JSON Schema, as discussed in "Add date format validation to test_extract_from_text_properly_implemented on test_ScraperExtractFromTextTest.py" (#838).
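To make the duplication concrete, here is a rough sketch of how one cluster with several opinions could be expanded into one item per opinion that still carries the legacy keys. The helper name to_legacy_items, the inner field names (case_name, date_filed, precedential_status, download_url), and the example URLs are all assumptions for illustration. They are not taken from the branch linked below.

```python
def to_legacy_items(docket, cluster, opinions):
    """Return one item per opinion, repeating Docket and OpinionCluster.

    Hypothetical helper: the top-level keys follow the minimal interface
    described above; the inner field names are assumed.
    """
    items = []
    for opinion in opinions:
        items.append(
            {
                "Docket": docket,
                "OpinionCluster": cluster,
                "Opinion": opinion,
                # Legacy keys kept for duplicate checking and sorting.
                "case_names": cluster["case_name"],
                "download_urls": opinion["download_url"],
                "precedential_statuses": cluster["precedential_status"],
                "case_dates": cluster["date_filed"],
            }
        )
    return items


docket = {"docket_number": "SC20371"}
cluster = {
    "case_name": "State v. Brandon",
    "date_filed": "1/24/2023",
    "precedential_status": "Published",
}
opinions = [
    {"type": "010combined", "download_url": "https://example.com/majority.pdf"},
    {"type": "040dissent", "download_url": "https://example.com/dissent.pdf"},
]

items = to_legacy_items(docket, cluster, opinions)
```

Each item repeats the same Docket and OpinionCluster objects, which is the duplication mentioned above. Only the Opinion and download_urls values differ between items.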

Here is a branch where I show the changes needed in CL, which turned out rather small. This is still a concept, would have to be tested and improved

https://github.com/freelawproject/courtlistener/compare/main...grossir:courtlistener:support_juriscraper_nested_objects?expand=1

mlissner · 2024-01-31T21:38:04Z

Gianfranco, it's very OK to change CL as part of this, if it means making the interface better while hitting our design requirements. I'd rather do this now and have something we like instead of being stuck with half measures. Does that change your thinking about approach?

grossir · 2024-03-06T22:05:59Z

It took quite some time, but I have a working draft integrated with Courtlistener (which will be a separate, parallel PR).
First I will paste some nice screenshots, then I will dive into some problems and opportunities I found while working on this.

Results

I used tex as a working scraper to test the new class. As a useful example, we have this recent Supreme Court case, which has an OpinionCluster of 3 opinions.
This is how the cluster looks on my local docker env:

How it currently looks on Courtlistener

Also, the scraper captures search_originating_court_information

and extra columns for our usual objects

Implementation details

It's better to look at the code, even if there is still pending work. I have written comments extensively.

#952

On Courtlistener:
freelawproject/courtlistener#3864

Besides the "code" code review, I will need some "data" code review, to check whether I am using the nature_of_suit, cause, opinion.type, etc. fields properly.

Of note, I found a way to keep tests for examples of secondary/deferred pages. For tex it was as simple as tweaking the href leading to the secondary page, so that it points to the precise example file.

Pending work

I still have a bunch of bugs to solve and tests to write for this to be mergeable

writing tests for the JSON Validator (I know it is currently not validating nested objects)
writing a custom type checker for python dates
support deferring attributes
adapting texcrimapp and texapp_* to the new tex class
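As a rough illustration of the first two pending items, the sketch below checks nested objects recursively and treats Python dates as their own type. It is a standard-library stand-in, not the JSON Validator used in the PR. The mini-schema format, the field names, and the choice to reject datetime.datetime values are all assumptions.

```python
from datetime import date, datetime

# Hypothetical mini-schema: each key maps to a Python type or a nested schema.
SCHEMA = {
    "Docket": {"docket_number": str},
    "OpinionCluster": {"date_filed": date},
}


def is_python_date(value):
    """True for datetime.date values, excluding datetime.datetime (assumed rule)."""
    return isinstance(value, date) and not isinstance(value, datetime)


def validate(obj, schema, path=""):
    """Return a list of error strings; an empty list means obj is valid."""
    errors = []
    for key, expected in schema.items():
        where = f"{path}.{key}" if path else key
        if key not in obj:
            errors.append(f"{where}: missing")
        elif isinstance(expected, dict):
            # Recurse so that nested objects are validated too.
            if isinstance(obj[key], dict):
                errors.extend(validate(obj[key], expected, where))
            else:
                errors.append(f"{where}: expected object")
        elif expected is date:
            if not is_python_date(obj[key]):
                errors.append(f"{where}: expected date")
        elif not isinstance(obj[key], expected):
            errors.append(f"{where}: expected {expected.__name__}")
    return errors


good = {
    "Docket": {"docket_number": "SC20371"},
    "OpinionCluster": {"date_filed": date(2023, 1, 24)},
}
bad = {
    "Docket": {"docket_number": "SC20371"},
    "OpinionCluster": {"date_filed": "1/24/2023"},
}

good_errors = validate(good, SCHEMA)
bad_errors = validate(bad, SCHEMA)
```

Here good_errors is empty, while bad_errors reports the string date nested inside OpinionCluster, which a validator that only checks top-level keys would miss.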

Further work

There is a clear opportunity to scrape people_db objects, like Person, Party, Attorney, and to support them in cl_scrape_opinions. However, this would take more work and testing, since lookups for these objects have to be used.

Some bugs found on the way

Bugs on OpinionSite[Linear] integration with CL: Attributes that we can return but are never picked up in CL (defined on OpinionSite class)

            "dispositions",
            "causes",
            "divisions",
            "docket_attachment_numbers",
            "docket_document_numbers",
            "lower_courts",
            "lower_court_judges",
            "lower_court_numbers",

These are actually used on some sources, so we are not inserting data we do collect. For example, lower_courts is used in tenn, nev, ind, bap1, etc

grossir mentioned this issue Jan 30, 2024

Texas Supreme and Appellate scrapers enhacements #902

Open

grossir mentioned this issue Feb 1, 2024

Down Scraper Fla 6th District Ct of Appeals #870

Closed

This was referenced Mar 6, 2024

feat(NewOpinionSite): support returning nested data structures #952

Open

feat(cl_scrape_opinions): support nested objects scraper freelawproject/courtlistener#3864

Draft

grossir mentioned this issue May 14, 2024

feat(pa): dynamic backscraper and update to new source #968

Merged
