Make backscraper interval size a caller keyword argument #1095
Comments
grossir added a commit to grossir/juriscraper that referenced this issue on Aug 6, 2024:
…rval dynamic
Solves freelawproject#1095
- Update sample_caller to catch `--days-interval` optional keyword argument
- Refactor make_backscrape_iterable that used days_interval as the AbstractSite default; all scrapers that used the same pattern are affected
- Changed default behaviour of make_backscrape_iterable to assume dates are passed in %Y/%m/%d, a more sensible format than %m/%d/%Y
- Also, add logger.info calls for the start and end date of download_backwards to all the scrapers that did not have it
A bunch of backscrapers use the `date_utils.make_date_range_tuples` function to create the `back_scrape_iterable`. The function takes a `gap` value for the size in days of each interval.
juriscraper/juriscraper/lib/date_utils.py, lines 123 to 152 in 01b0309
As of now, we hard code the `gap` value. However, we could make this a dynamic variable taken from the caller keyword arguments, with a sensible default in case it is not passed.
juriscraper/juriscraper/opinions/united_states/state/colo.py, lines 139 to 161 in 01b0309
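
A minimal sketch of that idea, not the project's actual code: the caller parses an optional `--days-interval` flag (the flag name comes from the commit that references this issue), and the scraper falls back to its own default when the flag is absent. `parse_backscrape_args`, `resolve_days_interval` and the 30-day default are hypothetical names and values chosen for illustration.

```python
import argparse
from typing import Optional, Sequence

# Hypothetical per-scraper default; real scrapers would pick their own value.
DEFAULT_DAYS_INTERVAL = 30


def parse_backscrape_args(argv: Sequence[str]) -> argparse.Namespace:
    """Parse the optional interval size from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--days-interval",
        type=int,
        default=None,  # None means "not passed", so the scraper default applies
        help="Size in days of each backscrape interval",
    )
    return parser.parse_args(argv)


def resolve_days_interval(
    passed: Optional[int], default: int = DEFAULT_DAYS_INTERVAL
) -> int:
    """Return the caller's interval if one was given, else the default."""
    if passed is None:
        return default
    if passed < 1:
        raise ValueError("days_interval must be a positive integer")
    return passed
```

Keeping `None` as the parser default lets each scraper keep its own fallback instead of sharing a single global value.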
When running backscrapers, I have found that the `self.days_interval` I defined was too big in some scrapers for some time periods, and the backscraper is not getting all documents due to page size. This would be easily solved by a dynamic argument.

Also, we could take this opportunity to refactor the most common case of creating the `back_scrape_iterable`, which takes two `datetime.date` values as start and end dates and `days_interval: int`, and save it as a function to be reused. This same pattern is being used in 14 scrapers:
https://github.com/search?q=repo%3Afreelawproject%2Fjuriscraper%20self.back_scrape_iterable%20%3D%20make_date_range_tuples&type=code