site.request["parameters"] defined by an inheriting scraper is never used #1112

Open
grossir opened this issue Aug 9, 2024 · 2 comments

Comments

@grossir
Contributor

grossir commented Aug 9, 2024

A user of Juriscraper may want to use the interface defined through self.request. However, it does not behave as expected. I was trying to understand why the minn scraper had to redefine _request_url_get instead of just using self.request["parameters"], self.request["headers"] and self.request["verify"]. The explanation is below.

def _request_url_get(self, url):
    """Execute GET request and assign appropriate request dictionary
    values
    """
    self.request["session"].get(self.url, headers=self.headers, timeout=30)
    self.request["response"] = self.request["session"].get(
        self.search_url,
        params=self.parameters,
        headers=self.headers,
        timeout=30,
    )


Somewhat confusingly, there are 2 "parameters" attributes defined on AbstractSite:

self.parameters, which is used only on POST requests, as the data argument:

self.parameters = None

self.request["parameters"], which, if defined explicitly in an inheriting Site, will never be used:

self.request = {
    "verify": certifi.where(),
    "session": requests.session(),
    "headers": {"User-Agent": "Juriscraper"},
    # Disable CDN caching on sites like SCOTUS (ahem)
    "cache-control": "no-cache, no-store, max-age=1",
    "parameters": {},
    "request": None,
    "status": None,
    "url": None,
}

This is because parse calls _download without a request_dict argument:

def parse(self):
    if not self.downloader_executed:
        # Run the downloader if it hasn't been run already
        self.html = self._download()

The effect is that any existing self.request["parameters"] is always overwritten by _process_request_parameters:

def _download(self, request_dict={}):
    """Download the latest version of Site"""
    self.downloader_executed = True
    if self.method == "POST":
        truncated_params = {}
        for k, v in self.parameters.items():
            truncated_params[k] = trunc(v, 50, ellipsis="...[truncated]")
        logger.info(
            "Now downloading case page at: %s (params: %s)"
            % (self.url, truncated_params)
        )
    else:
        logger.info(f"Now downloading case page at: {self.url}")
    self._process_request_parameters(request_dict)

This happens because the default is always an empty dict:

def _process_request_parameters(self, parameters={}):
    """Hook for processing injected parameter overrides"""
    if parameters.get("verify") is not None:
        self.request["verify"] = parameters["verify"]
        del parameters["verify"]
    self.request["parameters"] = parameters
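
The chain described above can be reproduced without Juriscraper installed. This is a minimal sketch: FakeSite and InheritingSite are hypothetical stand-ins that copy only the three methods quoted in this issue, and the parameter values are made up for illustration.

```python
class FakeSite:
    """Hypothetical stand-in for AbstractSite; only the quoted logic is copied."""

    def __init__(self):
        self.request = {"verify": True, "parameters": {}}

    def _process_request_parameters(self, parameters={}):
        if parameters.get("verify") is not None:
            self.request["verify"] = parameters["verify"]
            del parameters["verify"]
        # Unconditional assignment: whatever was stored before is replaced
        self.request["parameters"] = parameters

    def _download(self, request_dict={}):
        self._process_request_parameters(request_dict)

    def parse(self):
        # No request_dict is passed, so the empty default flows through
        self._download()


class InheritingSite(FakeSite):
    def __init__(self):
        super().__init__()
        # What an inheriting scraper would like to be able to do
        self.request["parameters"] = {"q": "opinions"}


site = InheritingSite()
before = dict(site.request["parameters"])
site.parse()
after = site.request["parameters"]
print(before, after)
```

After parse() runs, the value set in the inheriting class is gone, which matches the behavior reported here.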


Related to #1106 and #1064
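
One possible direction for a fix, sketched on a stand-in class rather than on the real AbstractSite: merge injected overrides into the existing self.request["parameters"] instead of replacing it. PatchedSite and the "timeout" key are assumptions made for illustration; this is not necessarily the change that will be made in Juriscraper.

```python
class PatchedSite:
    """Hypothetical stand-in for AbstractSite with a merge-based hook."""

    def __init__(self):
        self.request = {"verify": True, "parameters": {}}

    def _process_request_parameters(self, parameters=None):
        # Copy the overrides so the caller's dict is never mutated;
        # a None default also avoids sharing one dict across calls
        overrides = dict(parameters or {})
        if overrides.get("verify") is not None:
            self.request["verify"] = overrides.pop("verify")
        # Merge instead of replace: values set by an inheriting scraper
        # survive, and injected overrides win on key conflicts
        self.request["parameters"] = {**self.request["parameters"], **overrides}


site = PatchedSite()
site.request["parameters"] = {"q": "opinions"}

site._process_request_parameters()  # nothing injected
kept = site.request["parameters"]

site._process_request_parameters({"verify": False, "timeout": 5})
merged = site.request["parameters"]
```

With this approach, a call with no injected overrides leaves the scraper's own parameters in place, while an explicit request_dict still takes precedence for the keys it sets.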

@mlissner
Member

mlissner commented Aug 9, 2024

Nice. Looks like you've got the fix figured out too, right? I'll put this on your backlog. Thank you.

@flooie
Contributor

flooie commented Aug 12, 2024

this is something I've wanted to get to for a while. thanks for tackling this. this should make things much nicer
