These are general ideas for improving Juriscraper. Some aim to improve the quality and quantity of the data each scraper returns; others aim to improve the developer experience.
Scrapers
Separate downloading logic from parsing / cleaning logic on AbstractSite
The downloading methods on AbstractSite should live on a separate class; AbstractSite can then inherit from that class to access them.
This is also a good opportunity to refactor those methods, since they are awkward to use and access. I always found the self.request naming confusing, and accessors like self.request["session"] and self.request["request"] could be improved via dot notation. A sketch of what that could look like follows.
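A minimal sketch, assuming a hypothetical DownloaderMixin and RequestState; neither exists in Juriscraper today, and a real refactor would carry over the proxy, cipher and header handling:

from dataclasses import dataclass, field
from typing import Optional

import requests

@dataclass
class RequestState:
    # Replaces the self.request dict; every field is dot-accessible.
    session: requests.Session = field(default_factory=requests.Session)
    response: Optional[requests.Response] = None
    verify: bool = True

class DownloaderMixin:
    def __init__(self):
        self.request = RequestState()

    def _download(self, url: str, **kwargs) -> requests.Response:
        # self.request.session instead of self.request["session"]
        self.request.response = self.request.session.get(
            url, verify=self.request.verify, **kwargs
        )
        self.request.response.raise_for_status()
        return self.request.response

class AbstractSite(DownloaderMixin):
    # Parsing / cleaning logic stays here; downloading lives above.
    ...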
Move get_binary_content into Juriscraper
This is related to the previous point. Downloading the opinion / audio documents is a task that belongs to Juriscraper, and having it live in Courtlistener causes code duplication. The Site object's attributes are already used generously there to handle the download anyway.
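A rough sketch of the moved method, assuming it reuses the downloader sketched above (the error handling and MIME checks in the Courtlistener version are omitted here):

class AbstractSite(DownloaderMixin):
    def get_binary_content(self, download_url: str) -> bytes:
        # Reuse the site's own session, cookies and headers, so the
        # caller no longer reconstructs them from Site attributes.
        response = self._download(download_url)
        return response.content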
Make full scraper cycle testing easier
Sometimes it's a hassle to start up docker compose, uninstall the juriscraper dependency, and install the branch we are working on. A minimal testing environment only requires doctor to be running, so we can see how the extracted documents end up looking and live-test extract_from_text. We could tweak the sample caller in juriscraper to make this straightforward, as in the sketch below.
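A sketch of what such a minimal cycle could look like, assuming doctor is running locally; the endpoint path and port are illustrative, and va is just an example scraper:

import requests
from juriscraper.opinions.united_states.state import va

site = va.Site()
site.parse()
case = site[0]

# Download the first document and send it to the local doctor
# instance to see how the extracted text ends up looking.
pdf = requests.get(case["download_urls"]).content
extracted = requests.post(
    "http://localhost:5050/extract/doc/text/",  # illustrative endpoint
    files={"file": ("opinion.pdf", pdf)},
).json()

# Live-test the scraper's extract_from_text against real content.
metadata = site.extract_from_text(extracted["content"])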
Reduce boilerplate
Some of the advantages of using base classes and inheritance are lost because of the AbstractSite / OpinionSite initialization (juriscraper/juriscraper/AbstractSite.py, lines 60 to 62 in cc2494b).
A bunch of classes may not need to define an __init__ at all. For example, self.url and self.status could be class attributes, and self.court_id could be auto-defined by proper use of base classes, that is, by not setting those attributes to None in the base __init__. That would save two lines (the __init__ def and self.court_id), which does not sound like much but improves developer experience. An example where __init__ is not needed is juriscraper/juriscraper/opinions/united_states/state/va.py, lines 6 to 11 in cc2494b; a before/after sketch follows.
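A before/after sketch (the class body shown is illustrative, not a copy of va.py):

from juriscraper.OpinionSiteLinear import OpinionSiteLinear

# Today: every scraper repeats the same boilerplate.
class Site(OpinionSiteLinear):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.court_id = self.__module__
        self.url = "https://example.court.gov/opinions"  # illustrative
        self.status = "Published"

# With class attributes, and a base __init__ that derives court_id
# from the module name instead of resetting it to None:
class Site(OpinionSiteLinear):
    url = "https://example.court.gov/opinions"  # illustrative
    status = "Published"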
Use explicit Courtlistener objects, and automate naming, cleaning and validation
Changes for this section have explicit examples in PR #952.
We currently return a plain dictionary: {"case_name": "....", "docket_number": "....", "download_url": "..."}. We should instead return a dictionary that maps onto Courtlistener models: {"Docket": {"docket_number": "...", "OpinionCluster": [{"judges": "...", "Opinion": [...]}]}}. This would solve all naming confusion, since we would return the explicit object. However, this does not mean we have to build a complicated nested object in each scraper; it can be automated in the base class following simple conventions (juriscraper/juriscraper/NewOpinionSite.py, line 145 in c52d88b), as sketched below. The result would be auto-validated by a JSON schema (juriscraper/juriscraper/schemas/OpinionCluster.json, lines 86 to 96 in c52d88b).
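A sketch of how a base class could build the nested shape from the flat dicts scrapers already return; the field-to-model mapping shown is illustrative:

# Hypothetical convention: each flat key is declared to belong to a
# Courtlistener model, so scrapers keep returning flat dicts and the
# base class nests them.
FIELD_TO_MODEL = {
    "docket_number": "Docket",
    "case_name": "Docket",
    "judges": "OpinionCluster",
    "date_filed": "OpinionCluster",
    "download_url": "Opinion",
}

def nest_case(flat: dict) -> dict:
    docket: dict = {}
    cluster: dict = {}
    opinion: dict = {}
    targets = {"Docket": docket, "OpinionCluster": cluster, "Opinion": opinion}
    for key, value in flat.items():
        model = FIELD_TO_MODEL.get(key)
        if model:
            targets[model][key] = value
    cluster["Opinion"] = [opinion]
    docket["OpinionCluster"] = [cluster]
    return {"Docket": docket}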
Auto cleaning
We shouldn't be stripping whitespace or formatting case names manually in individual scrapers (as is done in juriscraper/juriscraper/opinions/united_states/state/la.py, line 59 in cc2494b); this should be done automatically by the base class in the cleaning step. An example of auto cleaning that depends on the key name is in juriscraper/juriscraper/NewOpinionSite.py, lines 269 to 273 in c52d88b; a sketch of the idea follows.
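A minimal sketch of key-name-based cleaning, using helpers that already exist in juriscraper.lib.string_utils:

from juriscraper.lib.string_utils import clean_string, harmonize

def clean_value(key: str, value):
    # Pick the cleaning routine from the key name alone, so no
    # scraper ever strips whitespace or formats case names itself.
    if not isinstance(value, str):
        return value
    if "case_name" in key:
        return harmonize(clean_string(value))
    return clean_string(value)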
Make deferred actions explicit
Currently DeferringList is buggy: it doesn't actually defer. Its design also supports returning only one field.
For some scrapers, the opinion's download_url is itself on a secondary page. This datapoint is more important than the others, since we use the hash of the binary content to decide whether a document is a duplicate.
Also, sometimes we could fetch more than one extra page to get richer data.
So, we can separate deferring into at least two stages:
getting the download_url: this would run after the hash of the case names has been computed.
getting other secondary pages: this would run after the binary content hash has been checked for duplicates.
And we can handle this explicitly with functions, instead of the more complicated pattern of a class. These functions would be executed by the caller:
for site in sites:
    if abort_by_url_hash(site):
        continue
    for case in site:
        # Stage 1: get the download_url, possibly from a secondary page
        site.download_deferred_url(case)
        # get the binary content and compute its hash
        ...
        if abort_by_content_hash(case):
            continue
        # Stage 2: the hash is not a duplicate, so get the full data
        site.get_deferred_values(case)
        ...
Compute hash
To check whether we have already scraped an HTML page, we compute a hash of the ordered case names on the current page. This is done in Courtlistener, and the hash is stored keyed by the Site.url attribute.
This should be done in Juriscraper instead, to make explicit that modifying self.url has a side effect on duplicate detection. We could also use a reserved keyword for the hash's url key.
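A minimal sketch of computing that hash inside Juriscraper (the function name and the choice of sha1 are illustrative):

import hashlib

def hash_of_case_names(site) -> str:
    # Hash the ordered case names so the caller can skip pages whose
    # content has not changed since the last scrape.
    names = "".join(case["case_names"] for case in site)
    return hashlib.sha1(names.encode("utf-8")).hexdigest()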
Integration with CL
Support ingestion of OpinionCluster and other objects
This follows from changing the shape of the data we return. Given that we would return explicit model names, this should simplify the ingestion functions.
Can CL dependencies point to Juriscraper's main branch?
This would save cutting a manual release in Juriscraper.
It would also save triggering the auto-update-dependencies action.
I'll just note from our meeting today, we discussed the idea of using main of Juriscraper in CL, but Bill raised the fact that sometimes JS is ahead of CL, so that could be dangerous. Our compromise, for now, is to lock CL to the minor release of JS, and that way it'll pick up new releases whenever we deploy. We'll still need to cut releases of JS, but I'm not sure there's a better way for JS to tell CL that the code is safe to use (or not!). But hopefully this will help and we can look at it again later and see if we have new ideas for handling things.
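For reference, a minor-release lock could look like this in CL's requirements (the version shown is illustrative):

# Picks up new 2.6.x releases automatically on deploy, but never
# jumps to 2.7 without a deliberate version bump.
juriscraper~=2.6.0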