Restart a large harvest at a settable number of records through a sitemap #45

Closed
iannesbitt opened this issue Nov 9, 2023 · 0 comments
Assignees: iannesbitt
Labels: enhancement (New feature or request), v0.1.1 (Version 0.1.1 item)
Milestone: 0.1.1

Comments

@iannesbitt
Contributor

Running a large harvest such as Harvard Dataverse often takes days. If the harvest needs to be restarted, it would be much cheaper and easier, both for us and for the server in question, to start a set number of records down the sitemap.

For example, say we notice issues with HD at record 10,000. Rather than spending more than a day re-scraping records 1-10,000, which already exist in mnlite, it would be nice to set a start point so that the scraper could start right at the 10,000th record. Seems like it would be a drop-in addition to settings.json and then easily implemented in soscan.spiders.jsonldspider.JsonldSpider.sitemap_filter.

Setting could be called "start_point" or something similar.
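
A rough sketch of what the skipping logic might look like, kept independent of Scrapy so it can be tried on its own. The class name `StartPointFilter` and the `seen` counter are hypothetical, and how the value would be read from settings.json and wired into `sitemap_filter` is an assumption, not the actual implementation. The counter lives on the object because `sitemap_filter` can be called once per sitemap file, so the count has to carry over between calls. The sketch also assumes entries arrive in a stable order and that only URL entries (not sitemap index entries) are being counted.

```python
from typing import Iterable, Iterator


class StartPointFilter:
    """Skip sitemap entries until the `start_point`-th one is reached.

    Hypothetical helper: the name and interface are illustrative only.
    """

    def __init__(self, start_point: int = 0):
        if start_point < 0:
            raise ValueError("start_point must be >= 0")
        self.start_point = start_point
        self.seen = 0  # entries counted so far, across all sitemap files

    def __call__(self, entries: Iterable[dict]) -> Iterator[dict]:
        for entry in entries:
            self.seen += 1
            # With start_point=10000, the 10,000th entry is the first one yielded.
            if self.seen >= self.start_point:
                yield entry


# Usage: two sitemap files with two entries each, starting at the 3rd record.
skip = StartPointFilter(start_point=3)
first_file = [{"loc": "https://example.org/1"}, {"loc": "https://example.org/2"}]
second_file = [{"loc": "https://example.org/3"}, {"loc": "https://example.org/4"}]
kept = list(skip(first_file)) + list(skip(second_file))
print([entry["loc"] for entry in kept])
```

Inside the spider, an overridden `sitemap_filter(self, entries)` could presumably delegate to an instance like this, with the default of 0 leaving existing harvests unchanged.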

@iannesbitt iannesbitt added the enhancement New feature or request label Nov 9, 2023
@iannesbitt iannesbitt self-assigned this Nov 9, 2023
iannesbitt added a commit that referenced this issue Nov 10, 2023
@iannesbitt iannesbitt added this to the 0.1.1 milestone Nov 10, 2023
@iannesbitt iannesbitt added the v0.1.1 Version 0.1.1 item label Nov 10, 2023
iannesbitt added a commit that referenced this issue Nov 10, 2023
iannesbitt added a commit that referenced this issue Nov 10, 2023