Make backscraper interval size a caller keyword argument #1095

Open
grossir opened this issue Jul 30, 2024 · 0 comments

grossir commented Jul 30, 2024

Many backscrapers use the date_utils.make_date_range_tuples function to create the back_scrape_iterable. The function takes a gap value that sets the size, in days, of each interval:

def make_date_range_tuples(start, end, gap):
    """Make an iterable of date tuples for use in iterating forms

    For example, a form might allow start and end dates and you want to iterate
    it one week at a time starting on Jan 1 and ending on Feb 3:

    >>> make_date_range_tuples(date(2017, 1, 1), date(2017, 2, 3), 7)
    [(Jan 1, Jan 7), (Jan 8, Jan 14), (Jan 15, Jan 21), (Jan 22, Jan 28),
     (Jan 29, Feb 3)]

    :param start: date when the query should start.
    :param end: date when the query should end.
    :param gap: the number of days, inclusive, that a query should span at a
    time.

    :rtype list(tuple)
    :returns: list of start, end tuples
    """
    # We create a list of start dates and a list of end dates, then zip them
    # together. If end_dates is shorter than start_dates, fill the last value
    # with the original end date.
    start_dates = [
        d.date() for d in rrule(DAILY, interval=gap, dtstart=start, until=end)
    ]
    end_start = start + datetime.timedelta(days=gap - 1)
    end_dates = [
        d.date()
        for d in rrule(DAILY, interval=gap, dtstart=end_start, until=end)
    ]
    return list(zip_longest(start_dates, end_dates, fillvalue=end))

As of now, we hard-code the gap value. However, we could read it from the caller keyword arguments, with a sensible default in case it is not passed:

def make_backscrape_iterable(self, kwargs: dict) -> None:
    """Checks if backscrape start and end arguments have been passed
    by caller, and parses them accordingly

    :param kwargs: passed when initializing the scraper, may or
        may not contain backscrape controlling arguments
    :return None
    """
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")

    if start:
        start = datetime.strptime(start, "%m/%d/%Y")
    else:
        start = self.first_opinion_date
    if end:
        end = datetime.strptime(end, "%m/%d/%Y")
    else:
        end = datetime.now()

    self.back_scrape_iterable = make_date_range_tuples(
        start, end, self.days_interval
    )

When running backscrapers, I have found that the self.days_interval I defined was too big in some scrapers for some time periods, so the backscraper does not get all documents because of the page size. A dynamic argument would easily solve this.
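A minimal sketch of how the interval could be resolved from the caller keyword arguments. The `days_interval` key name, the `resolve_days_interval` helper and the default of 30 days are assumptions made for illustration; they are not taken from the juriscraper code base.

```python
DEFAULT_DAYS_INTERVAL = 30  # hypothetical per-scraper default


def resolve_days_interval(kwargs: dict, default: int = DEFAULT_DAYS_INTERVAL) -> int:
    """Return the caller-supplied interval size in days, or the default.

    `days_interval` is an assumed key name for the caller keyword argument.
    """
    raw = kwargs.get("days_interval")
    if raw is None:
        return default

    # Values coming from a command line usually arrive as strings
    interval = int(raw)
    if interval < 1:
        raise ValueError(f"days_interval must be >= 1, got {interval}")
    return interval


print(resolve_days_interval({}))                      # 30
print(resolve_days_interval({"days_interval": "7"}))  # 7
```

With this shape, a scraper keeps its own default and the caller only overrides it for the time periods where the results exceed the page size.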

Also, we could take this opportunity to refactor the most common way of creating the back_scrape_iterable, which takes two datetime.date values as start and end dates plus days_interval: int, into a reusable function. The same pattern is used in 14 scrapers:
https://github.com/search?q=repo%3Afreelawproject%2Fjuriscraper%20self.back_scrape_iterable%20%3D%20make_date_range_tuples&type=code
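One possible shape for such a shared function, as a sketch only. `build_backscrape_iterable` and `date_range_tuples` are hypothetical names; `date_range_tuples` is a standard-library stand-in for `make_date_range_tuples` so the example runs without `dateutil`, and the `days_interval` key is an assumption.

```python
from datetime import date, datetime, timedelta


def date_range_tuples(start: date, end: date, gap: int) -> list:
    """Stand-in for date_utils.make_date_range_tuples (stdlib only).

    Splits [start, end] into consecutive inclusive ranges of `gap` days;
    the last range is cut short at `end`.
    """
    ranges = []
    current = start
    while current <= end:
        range_end = min(current + timedelta(days=gap - 1), end)
        ranges.append((current, range_end))
        current += timedelta(days=gap)
    return ranges


def build_backscrape_iterable(
    kwargs: dict,
    first_opinion_date: date,
    default_days_interval: int,
    date_fmt: str = "%m/%d/%Y",
) -> list:
    """Hypothetical shared helper: parse caller kwargs, return date ranges."""
    start = kwargs.get("backscrape_start")
    end = kwargs.get("backscrape_end")

    # Fall back to the scraper's first known date and to today
    start = datetime.strptime(start, date_fmt).date() if start else first_opinion_date
    end = datetime.strptime(end, date_fmt).date() if end else date.today()

    # `days_interval` is an assumed key; use the scraper default if absent
    gap = int(kwargs.get("days_interval") or default_days_interval)
    return date_range_tuples(start, end, gap)


ranges = build_backscrape_iterable(
    {
        "backscrape_start": "01/01/2017",
        "backscrape_end": "02/03/2017",
        "days_interval": "7",
    },
    first_opinion_date=date(2000, 1, 1),
    default_days_interval=30,
)
print(len(ranges))  # 5
print(ranges[-1])   # (datetime.date(2017, 1, 29), datetime.date(2017, 2, 3))
```

Each scraper would then only supply its first opinion date and default interval instead of repeating the parsing logic.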

grossir added a commit to grossir/juriscraper that referenced this issue Aug 6, 2024
…rval dynamic

Solves freelawproject#1095

- Update sample_caller to catch `--days-interval` optional keyword argument
- Refactor make_backscrape_iterable that used days_interval as the AbstractSite default;  all scrapers that used the same pattern are affected
- Changed default behaviour of make_backscrape_iterable to assume dates are passed in %Y/%m/%d, a more sensible format than %m/%d/%Y
- Also, add logger.info calls for the start and end date of download_backwards to all the scrapers that did not have it
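As a rough illustration of the first and third points, the sketch below parses a `--days-interval` option and reads dates in the `%Y/%m/%d` format. It uses `argparse` and a hypothetical `parse_backscrape_args` function; it is not the actual `sample_caller` code, and the `--backscrape-start` and `--backscrape-end` option names are assumed here to mirror the `backscrape_start` and `backscrape_end` keyword arguments.

```python
import argparse
from datetime import datetime


def parse_backscrape_args(argv: list) -> dict:
    """Hypothetical parser: turn CLI options into scraper keyword arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--backscrape-start", default=None)
    parser.add_argument("--backscrape-end", default=None)
    parser.add_argument("--days-interval", type=int, default=None)
    args = parser.parse_args(argv)

    # Forward only the options that were given, so scraper defaults
    # still apply to everything the caller left out
    return {key: value for key, value in vars(args).items() if value is not None}


kwargs = parse_backscrape_args(
    ["--backscrape-start", "2024/07/01", "--days-interval", "7"]
)

# Dates are read year-first, as described in the commit message
start = datetime.strptime(kwargs["backscrape_start"], "%Y/%m/%d")
print(kwargs["days_interval"], start.date())
```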