Restart a large harvest at a settable number of records through a sitemap #45

iannesbitt · 2023-11-09T23:14:09Z

Running a large harvest such as Harvard Dataverse often takes days. If the harvest needs to be restarted, it would be much cheaper and easier, both for us and for the server being harvested, to start a set number of records into the sitemap.

For example, suppose we notice issues with Harvard Dataverse (HD) at record 10,000. Rather than spending more than a day scraping records 1-10,000, which already exist in mnlite, it would be nice to set a start point so that the scraper could begin right at the 10,000th record. This seems like it would be a drop-in addition to settings.json, and then easily implemented in soscan.spiders.jsonldspider.JsonldSpider.sitemap_filter.

The setting could be called "start_point" or something similar.
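A minimal sketch of the idea, not the code that was committed for this issue. It assumes "start_point" is the 1-based position of the first sitemap entry to harvest, and that it is read from a JSON settings object. The helper name `skip_to_start_point`, the settings layout, and the example URLs are all hypothetical. In the spider, a generator like this could be called from `sitemap_filter`, which receives the sitemap entries as an iterable.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, Optional


def skip_to_start_point(
    entries: Iterable[dict], start_point: Optional[int] = None
) -> Iterator[dict]:
    """Yield sitemap entries beginning at the 1-based position start_point.

    Assumption: start_point=10000 means "resume at the 10,000th entry",
    so the first 9,999 entries are dropped without being requested.
    """
    if start_point is None or start_point <= 1:
        # No usable start point configured: harvest every entry.
        yield from entries
        return
    # islice consumes and discards the first start_point - 1 entries.
    yield from islice(entries, start_point - 1, None)


# Hypothetical settings content; the real settings.json layout may differ.
settings = json.loads('{"start_point": 4}')

# Hypothetical five-entry sitemap standing in for a real one.
sitemap = [{"loc": f"https://example.org/dataset/{n}"} for n in range(1, 6)]

resumed = [
    entry["loc"]
    for entry in skip_to_start_point(sitemap, settings.get("start_point"))
]
print(resumed)
```

With a start point of 4, only the fourth and fifth entries are yielded. Omitting the setting leaves the harvest unchanged, so existing configurations would keep working.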

iannesbitt added the enhancement New feature or request label Nov 9, 2023

iannesbitt self-assigned this Nov 9, 2023

iannesbitt added a commit that referenced this issue Nov 10, 2023

logic for #45

3e16aac

iannesbitt added this to the 0.1.1 milestone Nov 10, 2023

iannesbitt added the v0.1.1 Version 0.1.1 item label Nov 10, 2023

iannesbitt added a commit that referenced this issue Nov 10, 2023

Added debug logging to sitemap_filter (#45)

a2235cb

iannesbitt added a commit that referenced this issue Nov 10, 2023

adding timestamp to debug message (#45)

08ac602

iannesbitt added a commit that referenced this issue Nov 10, 2023

updating test sitemap (#45, DataONEorg/sonormal#4)

ed7b18b

iannesbitt added a commit that referenced this issue Nov 10, 2023

switching gt/lt signs (#45)

9b2c0c0

iannesbitt added a commit that referenced this issue Nov 10, 2023

adding more verbose logging for #45

afee143

iannesbitt closed this as completed Nov 13, 2023

This was referenced Nov 16, 2023

Release v0.1.1 #44

Merged

Add option to read sitemap entry iterator in reverse order #51

Closed
