Support getting multiple fields from a secondary page #889

grossir opened this issue Jan 25, 2024 · 6 comments

@grossir
Contributor

grossir commented Jan 25, 2024

The current method for getting data from a secondary page is to use a DeferringList. That approach is designed for parsing a single field, but we may want to get multiple fields from the same page, as we saw most recently in a nev issue. With the current implementation, that would mean requesting the same page many times, which is not desirable.

I list three possible solutions to this problem. The first is the easiest, the second is hacky, and I would prefer the third, but it takes work beyond the DeferringList proper.

Store DeferringList's HTML

A possible solution would be to store the URL's HTML in the site object. This requires no changes to the current code, just a case-by-case implementation. Then, define one DeferringList for each attribute we want. It may cause memory issues in some backscrapers if the backlog is big, but it is simple, as sketched below:

from typing import List

from juriscraper.DeferringList import DeferringList
from juriscraper.OpinionSiteLinear import OpinionSiteLinear


class Site(OpinionSiteLinear):
    # Cache of secondary-page responses, keyed by URL, shared by every DeferringList
    url_to_deferred_html = {}

    ...

    def _get_case_names(self) -> List[str]:
        def fetcher(case):
            if case["name"] != "":
                # Return the name we extracted without using the fetcher
                return case["name"]
            elif self.test_mode_enabled():
                # If we're in test mode, return a dummy name
                return "Test Name"
            else:
                # Else, query the API and return the name of the case
                self.url = f"https://www.courts.michigan.gov/api/CaseSearch/SearchCaseSearchContent/?searchQuery={case['title']}"

                # Reuse the stored response if another DeferringList already
                # requested this URL; otherwise download and cache it
                if self.url in self.url_to_deferred_html:
                    self.html = self.url_to_deferred_html[self.url]
                else:
                    self.html = self._download()
                    self.url_to_deferred_html[self.url] = self.html

                case["name"] = self.html["opinionResults"]["searchItems"][0][
                    "title"
                ].title()
            return case["name"]

        return DeferringList(seed=self.cases, fetcher=fetcher)

Modify case inside a single DeferringList

A solution that requires no extra storage would be to modify the case object already passed to the fetcher function, as sketched below. This would be awkward at the AbstractSite._clean_attributes step, since the DeferringList is not supposed to have been executed yet; however, there is currently a bug in DeferringList's behavior, so this would work in practice.
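
A rough sketch of this option, reusing the imports from the snippet above (the "disposition" and "judges" keys and the corresponding _get_* methods are my own illustration, not fields of the real API): a single DeferringList downloads the secondary page once and mutates the seeded case dict as a side effect, so the other getters can read the extra values directly.

# Hypothetical sketch: a single DeferringList whose fetcher fills several
# fields on the seeded case dict as a side effect
class Site(OpinionSiteLinear):
    ...

    def _get_case_names(self) -> List[str]:
        def fetcher(case):
            if self.test_mode_enabled():
                return "Test Name"
            self.url = f"https://www.courts.michigan.gov/api/CaseSearch/SearchCaseSearchContent/?searchQuery={case['title']}"
            self.html = self._download()
            item = self.html["opinionResults"]["searchItems"][0]
            # Side effect: write extra fields onto the case object itself so
            # that _get_dispositions / _get_judges can read them without
            # another request ("disposition" and "judges" are illustrative
            # keys, not the real fields of this API)
            case["disposition"] = item.get("disposition", "")
            case["judge"] = item.get("judges", "")
            return item["title"].title()

        return DeferringList(seed=self.cases, fetcher=fetcher)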

Abstract away the AbstractSite lists pattern

In the end, CourtListener iterates over objects or Python dictionaries. We could abstract away the interface of AbstractSite and build a site class that manages a list of objects rather than lists of attributes, while still supporting the list paradigm for current and old scrapers. We could then create a new deferring class that updates many fields at the same time and interacts with the list-of-objects architecture. That way, updating multiple fields of a single object would be trivial.
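
A minimal sketch of the idea, with hypothetical names (DeferredFields and the _deferred key do not exist in juriscraper; they only illustrate a deferring object that fills several fields of one case at once):

# Hypothetical sketch of a deferring object for a list-of-objects site
class DeferredFields:
    """When resolved, fills several fields of a single case dict with one request."""

    def __init__(self, fetcher, fields):
        self.fetcher = fetcher  # callable taking a case dict, returning {field: value}
        self.fields = fields    # names of the fields this placeholder will fill

    def resolve(self, case: dict) -> None:
        values = self.fetcher(case)
        for field in self.fields:
            case[field] = values[field]


# A scraper would attach one placeholder per case:
#   case["_deferred"] = DeferredFields(secondary_page_fetcher, ["name", "judge", "disposition"])
# and the new base class would resolve it once per case when the values are needed.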

@mlissner
Member

Option three makes sense to me. In other words, we have Site objects, which yield Case objects or something like that?

@grossir
Contributor Author

grossir commented Jan 27, 2024

I spent some time writing this; it is not completely tested, but it runs and works as a proof of concept. The code can be seen here:
https://github.com/freelawproject/juriscraper/compare/main...grossir:juriscraper:new_opinion_site_subclass?expand=1

Basically, it is a new OpinionSite class. It inherits from AbstractSite but overrides __iter__, __getitem__, __len__ and parse, taking care to keep the same interface: for example, the ordering function is important because it affects the value of the hash.
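
For illustration only (the class below is a rough approximation of the pattern, not the code on the compare branch, and the "date" key is an assumption), the new class stores one dict per case and keeps the old interface through the dunder methods:

# Illustrative only; not the implementation on the linked branch
from juriscraper.AbstractSite import AbstractSite


class NewOpinionSite(AbstractSite):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cases = []  # one dict per case, filled by the scraper's _process_html

    def __iter__(self):
        return iter(self.cases)

    def __getitem__(self, index):
        return self.cases[index]

    def __len__(self):
        return len(self.cases)

    def parse(self):
        self.html = self._download()
        self._process_html()
        # Keep a stable ordering, since the ordering affects the value of
        # the hash used for duplicate detection
        self.cases.sort(key=lambda case: case["date"], reverse=True)
        return self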

I also tested a new way to get multiple deferred values in a single call, but couldn't manage to keep the convenience of DeferringList's __getitem__ without it being buggy. So I chose an explicit function that must be called to consume the deferred values, if they exist. There is an example of this working with nev, and I think it looks good and has less boilerplate.
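
As a hedged sketch of that design choice (the names are hypothetical and the real branch may differ), the deferred values could live on each case under a sentinel key and be consumed by one explicit call:

# Hypothetical sketch of an explicit "consume deferred values" step,
# reusing the DeferredFields placeholder sketched above
def resolve_deferred_values(self) -> None:
    """Fill any deferred fields on each case with a single secondary request."""
    for case in self.cases:
        deferred = case.pop("_deferred", None)
        if deferred is None:
            continue  # nothing extra to fetch for this case
        deferred.resolve(case)  # one request fills every deferred field at once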

Finally, to replace AbstractSite._check_sanity I took the opportunity to try out a JSON Schema validator, about which I will write a longer comment on #838. To run it, install it with pip install jsonschema==4.21.1. Its value will not be very apparent on this branch, since it raises an exception for the "deferred" fields: they are expected to be strings, but they are functions. This could be solved by writing a custom validator, but that requires more work.
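
For context, this is the general shape of the validation jsonschema enables; the schema and case dict below are a made-up illustration, not the schema on the branch:

# Minimal jsonschema example; the schema and case dict are illustrative only
from jsonschema import Draft202012Validator

case_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string"},
        "url": {"type": "string"},
    },
    "required": ["name", "date", "url"],
}

validator = Draft202012Validator(case_schema)
case = {"name": "Example v. Example", "date": "2024-01-25", "url": "https://example.com/opinion.pdf"}
validator.validate(case)  # raises jsonschema.exceptions.ValidationError on failure

# A deferred field holding a function instead of a string would fail the
# {"type": "string"} check, which is why a custom validator would be needed.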

@mlissner
Member

This looks like a generally good direction to me. One thing I wonder is whether we should go all-in on this and leave all our old parsers behind. It's kind of a bummer that this would mean we have three generations of objects in the codebase:

  • This (gen 3)
  • AbstractSite (gen 1)
  • AbstractSiteLinear (gen 2)

Hm...

@grossir
Contributor Author

grossir commented Jan 29, 2024

I think OpinionSiteLinear sites can be updated to this new base class easily, since the new class follows some of the same usage conventions (using self.cases to store records; converting short names to long names on the base class), as shown in the nev example.

1st Gen would be harder to change and require more testing.

So, we could still keep two classes and update the OpinionSite / 1st-gen sites on a case-by-case basis, as they become outdated.

@mlissner
Member

Sounds good. Down with 1st Gen OpinionSite!

@flooie
Contributor

flooie commented Jan 30, 2024

The rise of the clustersite

grossir added a commit to grossir/courtlistener that referenced this issue Jan 31, 2024
Supports new juriscraper scraper class and returned objects, and also keeps legacy interface

- Supports: freelawproject/juriscraper#883
- Supports: freelawproject/juriscraper#889