Support getting multiple fields from a secondary page #889
Comments
Option three makes sense to me. In other words, we have Site objects, which yield Case objects or something like that?
I took some time writing this, without complete testing, but it runs and works as a concept. The code can be seen here: Basically, it is a new class for I also tested a new way to get multiple deferring values on a single call, but couldn't manage to keep the convenience of Finally, in order to replace
This looks like a generally good direction to me. One thing I wonder is whether we should go all-in on this, and leave all our old parsers behind. It's kind of a bummer that this would mean we have three generations of object in the codebase:
Hm...
I think 1st Gen would be harder to change and require more testing. So, we could still keep 2 classes and update the
Sounds good. Down with 1st Gen OpinionSite!
The rise of the clustersite
Supports new juriscraper scraper class and returned objects, and also keeps legacy interface - Supports: freelawproject/juriscraper#883 - Supports: freelawproject/juriscraper#889
The current method to get data from a secondary page is to use a DeferringList. This method is designed for parsing a single field. However, we may want to get multiple fields from the same page, as we saw most recently in a nev issue. In its current implementation, that would mean requesting the same page many times, which is not desirable.
I list 3 possible solutions to this problem. The first one is the easiest. The second one is hacky. I would prefer the third one, but it takes work beyond the DeferringList proper.
Store DeferringList's HTML
A possible solution would be to store the URL's HTML in the site object. This does not require any changes to the current code, just a case-by-case implementation. Then, define one DeferringList for each attribute we may want. It may cause memory issues in some backscrapers if the backlog is big, since every fetched page stays in memory.
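As a rough illustration of this first option, here is a minimal sketch. It does not use juriscraper's real classes: `LazyFieldList` is a hypothetical stand-in for a deferring list, and `CachingSite`, `fake_download`, and the `judges`/`summaries` fields are assumptions made up for the example. The point is only that two lazy lists can share one cached download per URL.

```python
from typing import Callable, Dict, List


class LazyFieldList:
    """Hypothetical stand-in for a deferring list: resolves an item on first access."""

    def __init__(self, seed: List[str], fetcher: Callable[[str], str]):
        self._seed = seed
        self._fetcher = fetcher
        self._resolved: Dict[int, str] = {}

    def __len__(self) -> int:
        return len(self._seed)

    def __getitem__(self, i: int) -> str:
        if i not in self._resolved:
            self._resolved[i] = self._fetcher(self._seed[i])
        return self._resolved[i]


class CachingSite:
    """Hypothetical site object that stores each secondary page, keyed by URL."""

    def __init__(self, urls: List[str], download: Callable[[str], str]):
        self.urls = urls
        self._download = download
        self._html_cache: Dict[str, str] = {}  # grows with the backlog size
        self.request_count = 0

    def _get_html(self, url: str) -> str:
        # Only hit the network the first time a URL is requested.
        if url not in self._html_cache:
            self.request_count += 1
            self._html_cache[url] = self._download(url)
        return self._html_cache[url]

    def judges(self) -> LazyFieldList:
        return LazyFieldList(self.urls, lambda u: self._get_html(u).split("|")[0])

    def summaries(self) -> LazyFieldList:
        return LazyFieldList(self.urls, lambda u: self._get_html(u).split("|")[1])


def fake_download(url: str) -> str:
    # Placeholder for a real HTTP request; returns "judge|summary" text.
    return f"Judge for {url}|Summary for {url}"


site = CachingSite(["case/1", "case/2"], fake_download)
judges, summaries = site.judges(), site.summaries()
fields = [(judges[i], summaries[i]) for i in range(len(site.urls))]
```

In this sketch, two fields are read for each of two URLs, yet `site.request_count` ends at 2 rather than 4. The cache is never evicted, which is where the memory concern for large backscrapes comes from.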
Modify case inside a single DeferringList
A solution that requires no storing would be to modify the case object already passed to the fetcher function. However, this would be awkward at the AbstractSite._clean_attributes step, since the DeferringList is assumed not to have been executed yet at that point. That said, there is currently a bug in the behavior of DeferringList, so this would work.
Abstract away the AbstractSite lists pattern
In the end, CourtListener iterates over objects or Python dictionaries. We could abstract away the interface of AbstractSite and build an AbstractSite that manages a list of objects, not lists of attributes, while still supporting the list paradigm for current and old scrapers. We could create a new DeferringClass that updates many fields at the same time, and that interacts with the list-of-objects architecture. That way, updating multiple fields of a single object would be trivial.
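To make the third option more concrete, here is a minimal sketch under stated assumptions. None of these names come from juriscraper: `Case`, `ObjectSite`, `fake_fetch_details`, and the `judge`/`summary` fields are hypothetical, and the real AbstractSite interface is not reproduced. It shows a site holding a list of objects, a single deferred request per object that fills several fields, and a list-style property kept for legacy callers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Set


@dataclass
class Case:
    """Hypothetical case record; field names are illustrative only."""

    name: str
    url: str
    judge: Optional[str] = None
    summary: Optional[str] = None


class ObjectSite:
    """Hypothetical site that manages a list of Case objects, not attribute lists."""

    def __init__(self, cases: List[Case], fetch_details: Callable[[str], Dict[str, str]]):
        self.cases = cases
        self._fetch_details = fetch_details
        self._fetched: Set[str] = set()
        self.request_count = 0

    def __iter__(self) -> Iterator[Case]:
        for case in self.cases:
            # Deferred step: one request per case, run only when the case is consumed.
            if case.url not in self._fetched:
                self.request_count += 1
                for key, value in self._fetch_details(case.url).items():
                    setattr(case, key, value)
                self._fetched.add(case.url)
            yield case

    @property
    def case_names(self) -> List[str]:
        # Legacy list-style view derived from the objects.
        return [case.name for case in self.cases]


def fake_fetch_details(url: str) -> Dict[str, str]:
    # Placeholder for downloading and parsing the secondary page once.
    return {"judge": f"Judge for {url}", "summary": f"Summary for {url}"}


site = ObjectSite(
    [Case("A v. B", "case/1"), Case("C v. D", "case/2")],
    fake_fetch_details,
)
resolved = list(site)
```

Each case triggers exactly one request that sets both `judge` and `summary`, and iterating a second time does not fetch again. Older code that expects parallel lists can still read `site.case_names`.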