I strongly suspect that the timestamp comparison is not working in its current state, so the entire sitemap is being crawled on every harvest. Making sure the timestamp comparison function works could significantly cut down on unnecessary traffic.
iannesbitt changed the title from "Test workingness of soscan.spiders.jsonldspider.JsonldSpider.lastmod_filter timestamp comparison" to "Test workingness of timestamp comparison in JsonldSpider" on Oct 27, 2023
Relevant docs are here and describe filtering based on a static value. This could be useful if we want to set the lastmod_filter to the last time we successfully harvested the site.
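For illustration, filtering against a static cutoff could look roughly like the sketch below. This is not the soscan implementation: `parse_lastmod` and `filter_by_lastmod` are hypothetical names, sitemap entries are assumed to be plain dicts with `loc` and an optional `lastmod` key, and only ISO 8601 timestamps are handled. Everything is normalized to UTC first, since comparing naive and timezone-aware datetimes is a common way for this kind of check to fail.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse an ISO 8601 lastmod string into a timezone-aware UTC datetime."""
    value = value.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def filter_by_lastmod(entries, cutoff):
    """Yield only the entries modified after `cutoff` (an aware datetime)."""
    for entry in entries:
        lastmod = entry.get("lastmod")
        if lastmod is None:
            # No timestamp: we cannot rule the entry out, so keep it.
            yield entry
            continue
        try:
            if parse_lastmod(lastmod) > cutoff:
                yield entry
        except ValueError:
            # Unparseable timestamp: keep the entry rather than silently drop it.
            yield entry


entries = [
    {"loc": "https://example.org/a", "lastmod": "2020-01-01"},
    {"loc": "https://example.org/b", "lastmod": "2023-06-01T00:00:00Z"},
    {"loc": "https://example.org/c"},
]
cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
kept = [e["loc"] for e in filter_by_lastmod(entries, cutoff)]
```

Here `kept` contains the `b` and `c` URLs: the first entry is older than the cutoff, and the last has no `lastmod`, so it is kept to be safe.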
Getting this value from the database would be much harder and more resource intensive, but perhaps more useful. However, in the case of redirects, the database would have to store the pre-redirect URL, which it does not at the moment. The pre-redirect URL could then be used in a query to find when the record was last updated and whether its lastmod value has changed since it was stored.
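A rough sketch of what that lookup might involve, using an in-memory SQLite database as a stand-in for the real one. The `harvested` table, its columns, and both helper functions are hypothetical and only meant to show the idea of keying on the pre-redirect (sitemap) URL.

```python
import sqlite3


def create_table(conn):
    """Create a hypothetical table keyed on the pre-redirect sitemap URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS harvested ("
        " sitemap_url TEXT PRIMARY KEY,"  # URL as listed in the sitemap
        " final_url TEXT,"                # URL after any redirects
        " lastmod TEXT)"                  # lastmod value seen at harvest time
    )


def record_harvest(conn, sitemap_url, final_url, lastmod):
    """Store (or overwrite) the lastmod seen for a sitemap URL."""
    conn.execute(
        "INSERT OR REPLACE INTO harvested (sitemap_url, final_url, lastmod)"
        " VALUES (?, ?, ?)",
        (sitemap_url, final_url, lastmod),
    )


def needs_harvest(conn, sitemap_url, lastmod):
    """Return True if the URL is new or its lastmod differs from the stored one."""
    row = conn.execute(
        "SELECT lastmod FROM harvested WHERE sitemap_url = ?",
        (sitemap_url,),
    ).fetchone()
    return row is None or row[0] != lastmod


conn = sqlite3.connect(":memory:")
create_table(conn)
record_harvest(conn, "https://example.org/old", "https://example.org/new", "2023-01-01")

unchanged = needs_harvest(conn, "https://example.org/old", "2023-01-01")
changed = needs_harvest(conn, "https://example.org/old", "2023-10-27")
unseen = needs_harvest(conn, "https://example.org/other", "2023-10-27")
```

With the post-redirect URL alone, the lookup for `https://example.org/old` would find nothing and the record would be requested again on every harvest, which is why the pre-redirect URL has to be stored.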
Tested and working. lastmod_filter can be set to any date and time handled by dateparser. I set "lastmod_filter": "two years ago" in instance/nodes/mnTestOPENTOPO/settings.json and it requested only 71 of the 790 or so datasets.