Add lastmod timestamp comparison in JsonldSpider #41

Closed
iannesbitt opened this issue Oct 27, 2023 · 3 comments
Assignees
Labels
enhancement New feature or request v0.1.1 Version 0.1.1 item
Milestone

Comments

@iannesbitt
Contributor

I strongly suspect that the timestamp comparison is not working in its current state, and that because of this the entire sitemap is being crawled on each harvest. Making sure the timestamp comparison function works could cut down significantly on unnecessary traffic.

Related:

@iannesbitt iannesbitt added the bug (Something isn't working) and v0.1.1 (Version 0.1.1 item) labels Oct 27, 2023
@iannesbitt iannesbitt added this to the 0.1.1 milestone Oct 27, 2023
@iannesbitt iannesbitt self-assigned this Oct 27, 2023
@iannesbitt iannesbitt changed the title from "Test workingness of soscan.spiders.jsonldspider.JsonldSpider.lastmod_filter timestamp comparison" to "Test workingness of timestamp comparison in JsonldSpider" Oct 27, 2023
@iannesbitt
Contributor Author

Relevant docs are here and describe filtering based on a static value. This could be useful if we want to set the lastmod_filter to the last time we successfully harvested the site.
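A minimal sketch of filtering against a static value, assuming sitemap entries arrive as dictionaries with `loc` and `lastmod` keys. The function names `parse_lastmod` and `filter_by_lastmod` are hypothetical and are not taken from the soscan code base:

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def parse_lastmod(value: str) -> datetime:
    """Parse an ISO 8601 lastmod string into a timezone-aware datetime."""
    # Older Python versions reject a trailing "Z", so normalize it first.
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive values such as "2023-10-27" are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_by_lastmod(
    entries: Iterable[Dict[str, str]], cutoff: datetime
) -> Iterator[Dict[str, str]]:
    """Yield only the entries modified after the (timezone-aware) cutoff."""
    for entry in entries:
        lastmod = entry.get("lastmod")
        if lastmod is None:
            # Without a lastmod we cannot tell if the entry changed; keep it.
            yield entry
        elif parse_lastmod(lastmod) > cutoff:
            yield entry
```

Entries with no `lastmod` are kept here on purpose, since skipping them could silently drop datasets. That choice is an assumption and may not match what the spider does.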

Getting this value from the database would be much harder and more resource-intensive, but perhaps more useful. However, to handle redirects, the database would have to store the pre-redirect URL, which it does not do at the moment. The pre-redirect URL could then be used in a query to find the last time the record was updated and whether the lastmod value has changed since it was stored.
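The database idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the `harvested` table, its `original_url` and `lastmod` columns, and the helper names are hypothetical, and SQLite stands in for whatever database the harvester actually uses.

```python
import sqlite3
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table keyed on the pre-redirect (sitemap) URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS harvested ("
        "original_url TEXT PRIMARY KEY, lastmod TEXT)"
    )


def record_harvest(conn: sqlite3.Connection, original_url: str, lastmod: str) -> None:
    # Store the lastmod value seen at harvest time, replacing any older row.
    conn.execute(
        "INSERT OR REPLACE INTO harvested (original_url, lastmod) VALUES (?, ?)",
        (original_url, lastmod),
    )


def needs_harvest(
    conn: sqlite3.Connection, original_url: str, sitemap_lastmod: Optional[str]
) -> bool:
    """Return True if the URL is new or its lastmod differs from the stored one."""
    if sitemap_lastmod is None:
        # No lastmod in the sitemap: we cannot rule out a change.
        return True
    row = conn.execute(
        "SELECT lastmod FROM harvested WHERE original_url = ?", (original_url,)
    ).fetchone()
    return row is None or row[0] != sitemap_lastmod
```

This costs one query per sitemap entry, which is the extra resource use mentioned above.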

@iannesbitt
Contributor Author

At the very least, lastmod_filter should be set to the last successful harvest date and time.

iannesbitt added a commit that referenced this issue Oct 31, 2023
@iannesbitt
Contributor Author

Tested and working. lastmod_filter can be set to any date and time handled by dateparser. I set "lastmod_filter": "two years ago" in instance/nodes/mnTestOPENTOPO/settings.json and it requested only 71 of the 790 or so datasets.
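For reference, a rough sketch of how a `lastmod_filter` string from a node's `settings.json` might be turned into a cutoff datetime. `load_lastmod_cutoff` is a hypothetical helper, not the spider's actual code; `dateparser.parse` is the real third-party call, imported lazily so a different parser can be passed in.

```python
import json
from datetime import datetime
from typing import Callable, Optional


def load_lastmod_cutoff(
    settings_text: str,
    parse: Optional[Callable[[str], Optional[datetime]]] = None,
) -> Optional[datetime]:
    """Read "lastmod_filter" from settings JSON and parse it into a datetime."""
    settings = json.loads(settings_text)
    raw = settings.get("lastmod_filter")
    if raw is None:
        # No filter configured: the caller should crawl everything.
        return None
    if parse is None:
        # Third-party dependency; handles strings such as "two years ago".
        import dateparser

        parse = dateparser.parse
    cutoff = parse(raw)
    if cutoff is None:
        raise ValueError(f"Could not parse lastmod_filter value: {raw!r}")
    return cutoff
```

Called with the contents of `settings.json`, for example `load_lastmod_cutoff('{"lastmod_filter": "two years ago"}')`, this returns a datetime to compare against each sitemap entry's lastmod value.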

@iannesbitt iannesbitt changed the title from "Test workingness of timestamp comparison in JsonldSpider" to "Add lastmod timestamp comparison in JsonldSpider" Oct 31, 2023
@iannesbitt iannesbitt added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Oct 31, 2023

1 participant