This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Resume a crawl for later #500

Closed
JenPho opened this issue Sep 10, 2022 · 0 comments
JenPho commented Sep 10, 2022

I am currently downloading a site using Heritrix, and I don't want to leave my computer on overnight. Can I simply stop the crawl and resume it later?

Looking at https://heritrix.readthedocs.io/en/latest/operating.html#full-recovery, I gathered that if the Java process were to crash ('accidentally' or otherwise), I could copy the /jobs/x/date/logs/frontier.recover.gz file into /jobs/x/action, start the server, and launch the job again to resume the crawl. Is that correct, or are crawls meant to be a do-it-all-right-now thing? I tried it without much success: I killed the server's PID in my terminal and relaunched it, and Heritrix started scraping under a new directory and moved my frontier.recover.gz file to /jobs/x/action/done, so I'm not sure whether the recovery actually worked.
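For reference, the staging step described above (copying the newest frontier recovery journal into the job's action directory before relaunching) could be sketched roughly like this. The helper name and the date-stamped launch-directory layout are my own assumptions for illustration, not part of Heritrix itself:

```python
import shutil
from pathlib import Path

def stage_recovery_journal(job_dir: str) -> Path:
    """Copy the newest frontier.recover.gz into <job_dir>/action.

    Per the full-recovery docs, Heritrix watches the action directory on
    launch, replays the journal to rebuild the frontier, then moves the
    processed file to action/done.
    """
    job = Path(job_dir)
    # Find the newest frontier.recover.gz among the date-stamped launch dirs.
    journals = sorted(job.glob("*/logs/frontier.recover.gz"),
                      key=lambda p: p.stat().st_mtime)
    if not journals:
        raise FileNotFoundError(f"no frontier.recover.gz under {job_dir}")
    action = job / "action"
    action.mkdir(exist_ok=True)
    # Copy rather than move, so the original journal survives if the
    # replay goes wrong.
    return Path(shutil.copy2(journals[-1], action))
```

Heritrix would still need to be fully stopped before staging the file, and the job relaunched afterwards; this sketch only covers the file placement.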

@ato ato added the question label Sep 26, 2022
@internetarchive internetarchive locked and limited conversation to collaborators Sep 30, 2022
@ato ato converted this issue into discussion #506 Sep 30, 2022

