`site.request["parameters"]` defined by an inheriting scraper is never used #1112

grossir · 2024-08-09T15:31:26Z

A user of Juriscraper may want to use the interface defined through self.request. However, it does not behave as expected. I was trying to understand why the minn scraper had to redefine _request_url_get instead of just using self.request["parameters"], self.request["headers"] and self.request["verify"]. The explanation is below.

juriscraper/juriscraper/opinions/united_states/state/minn.py

Lines 43 to 53 in 6a64534

    
    def _request_url_get(self, url):
        """Execute GET request and assign appropriate request dictionary
        values
        """
        self.request["session"].get(self.url, headers=self.headers, timeout=30)
        self.request["response"] = self.request["session"].get(
            self.search_url,
            params=self.parameters,
            headers=self.headers,
            timeout=30,
        )

Somewhat confusingly, there are 2 "parameters" attributes defined on AbstractSite:

self.parameters, which is used only on POST requests as the data argument

juriscraper/juriscraper/AbstractSite.py

Line 68 in 6a64534

self.parameters = None

self.request["parameters"], which will never be used, even if it is defined explicitly in an inheriting Site

juriscraper/juriscraper/AbstractSite.py

Lines 48 to 58 in 6a64534

    
    self.request = {
        "verify": certifi.where(),
        "session": requests.session(),
        "headers": {"User-Agent": "Juriscraper"},
        # Disable CDN caching on sites like SCOTUS (ahem)
        "cache-control": "no-cache, no-store, max-age=1",
        "parameters": {},
        "request": None,
        "status": None,
        "url": None,
    }

This is due to _download being called by parse without a request_dict argument:

juriscraper/juriscraper/AbstractSite.py

Lines 139 to 142 in 6a64534

    
    def parse(self):
        if not self.downloader_executed:
            # Run the downloader if it hasn't been run already
            self.html = self._download()

The effect of this is that any existing self.request["parameters"] will always be replaced with an empty dict by _process_request_parameters:

juriscraper/juriscraper/AbstractSite.py

Lines 322 to 335 in 6a64534

    
    def _download(self, request_dict={}):
        """Download the latest version of Site"""
        self.downloader_executed = True
        if self.method == "POST":
            truncated_params = {}
            for k, v in self.parameters.items():
                truncated_params[k] = trunc(v, 50, ellipsis="...[truncated]")
            logger.info(
                "Now downloading case page at: %s (params: %s)"
                % (self.url, truncated_params)
            )
        else:
            logger.info(f"Now downloading case page at: {self.url}")
        self._process_request_parameters(request_dict)

This happens because the default argument is always an empty dict:

juriscraper/juriscraper/AbstractSite.py

Lines 354 to 359 in 6a64534

    
    def _process_request_parameters(self, parameters={}):
        """Hook for processing injected parameter overrides"""
        if parameters.get("verify") is not None:
            self.request["verify"] = parameters["verify"]
            del parameters["verify"]
        self.request["parameters"] = parameters
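
To make the chain above concrete, here is a minimal sketch that reproduces the overwrite without importing Juriscraper. All class names (FakeAbstractSite, ScraperMixin, FakeScraper, GuardedSite, GuardedScraper) and the {"court": "supreme"} value are hypothetical stand-ins; only the call sequence parse -> _download -> _process_request_parameters mirrors the snippets quoted above. GuardedSite is an assumption about one possible direction for a fix, not necessarily the change that should be made.

```python
class FakeAbstractSite:
    """Stand-in for AbstractSite; models only the request-parameter handling."""

    def __init__(self):
        self.request = {"verify": True, "parameters": {}}
        self.downloader_executed = False

    def parse(self):
        if not self.downloader_executed:
            # As in the quoted parse(): no request_dict is passed.
            self._download()
        return self

    def _download(self, request_dict={}):
        self.downloader_executed = True
        self._process_request_parameters(request_dict)

    def _process_request_parameters(self, parameters={}):
        if parameters.get("verify") is not None:
            self.request["verify"] = parameters["verify"]
            del parameters["verify"]
        # Unconditional assignment: replaces whatever a subclass set earlier.
        self.request["parameters"] = parameters


class GuardedSite(FakeAbstractSite):
    """Hypothetical variant: overwrite only when overrides were actually passed."""

    def _process_request_parameters(self, parameters=None):
        # Copy so the caller's dict (or a shared default) is never mutated.
        parameters = dict(parameters or {})
        verify = parameters.pop("verify", None)
        if verify is not None:
            self.request["verify"] = verify
        if parameters:
            self.request["parameters"] = parameters


class ScraperMixin:
    """Hypothetical inheriting scraper that defines request parameters."""

    def __init__(self):
        super().__init__()
        self.request["parameters"] = {"court": "supreme"}


class FakeScraper(ScraperMixin, FakeAbstractSite):
    pass


class GuardedScraper(ScraperMixin, GuardedSite):
    pass


# Current behavior: the scraper's parameters are lost after parse().
print(FakeScraper().parse().request["parameters"])
# Guarded variant: the scraper's parameters survive.
print(GuardedScraper().parse().request["parameters"])
```

With the stand-in classes, the first print shows an empty dict and the second shows the dict set by the scraper. The guarded variant still lets a caller override the parameters by passing a non-empty request_dict to _download.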

Related to #1106 and #1064


mlissner · 2024-08-09T17:07:36Z

Nice. Looks like you've got the fix figured out too, right? I'll put this on your backlog. Thank you.

flooie · 2024-08-12T09:23:26Z

This is something I've wanted to get to for a while. Thanks for tackling this. This should make things much nicer.
