Running a large harvest such as Harvard Dataverse often takes days. If the harvest needs to be restarted, it would be much cheaper, and easier on both our end and the remote server, to resume a certain number of records down the sitemap instead of starting again from the first record.
For example, suppose we notice issues with Harvard Dataverse at record 10,000. Rather than spending more than a day re-scraping records 1-10,000, which already exist in mnlite, it would be nice to set a start point so that the scraper begins right at the 10,000th record. This looks like a drop-in addition to `settings.json` that could then be implemented fairly easily in `soscan.spiders.jsonldspider.JsonldSpider.sitemap_filter`.
The setting could be called `start_point` or something similar.
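A rough sketch of what this might look like, kept independent of Scrapy so it runs on its own. Everything here is an assumption for illustration: the `start_point` key, the helper names `load_start_point` and `StartPointFilter`, and the choice to treat `start_point` as the number of sitemap entries to skip. Because Scrapy calls `sitemap_filter` once per sitemap document, the sketch keeps a running count across calls. It does not distinguish sitemap index entries from record entries, which a real implementation would need to handle.

```python
import json


def load_start_point(settings_text):
    """Read the hypothetical "start_point" key; default to 0 (skip nothing)."""
    return int(json.loads(settings_text).get("start_point", 0))


class StartPointFilter:
    """Skip the first `start_point` sitemap entries, counted across calls."""

    def __init__(self, start_point=0):
        self.start_point = start_point
        self.seen = 0  # entries seen so far, across all sitemap documents

    def __call__(self, entries):
        for entry in entries:
            self.seen += 1
            # Only pass entries through once the skip count is exceeded.
            if self.seen > self.start_point:
                yield entry


# Usage: two sitemap documents, skipping the first three entries overall.
skip = StartPointFilter(load_start_point('{"start_point": 3}'))
first = list(skip([{"loc": "a"}, {"loc": "b"}]))
second = list(skip([{"loc": "c"}, {"loc": "d"}, {"loc": "e"}]))
```

Inside the spider, `sitemap_filter(self, entries)` could delegate to an instance like this, for example `yield from self._skip(entries)`, with the instance created once when the spider starts so the count survives across sitemap files.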