This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Resume a crawl for later #500

Closed
JenPho opened this issue Sep 10, 2022 · 0 comments
JenPho commented Sep 10, 2022

I am currently downloading a site using Heritrix, and I don't want to leave my computer on overnight. Can I simply stop the crawl and resume it later?

Looking at https://heritrix.readthedocs.io/en/latest/operating.html#full-recovery, I gathered that if the Java process were to crash ('accidentally' or otherwise), I could copy the /jobs/x/date/logs/frontier.recover.gz file into /jobs/x/action, start the server, and launch the job again to resume the crawl. Is that correct, or are crawls meant to be a do-it-all-right-now thing? I tried it without much success: I killed the server's PID in my terminal and relaunched it, and Heritrix started scraping under a new directory and moved my frontier.recover.gz file to /jobs/x/action/done, so I'm not sure whether the recovery actually worked.
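For reference, the staging step described above (copying the newest frontier recovery journal into the job's action directory before relaunching) could be sketched roughly like this. The helper name and the date-stamped launch-directory layout are my own assumptions for illustration, not part of Heritrix itself:

```python
import shutil
from pathlib import Path

def stage_recovery_journal(job_dir: str) -> Path:
    """Copy the newest frontier.recover.gz into <job_dir>/action.

    Per the full-recovery docs, Heritrix watches the action directory on
    launch, replays the journal to rebuild the frontier, then moves the
    processed file to action/done.
    """
    job = Path(job_dir)
    # Find the newest frontier.recover.gz among the date-stamped launch dirs.
    journals = sorted(job.glob("*/logs/frontier.recover.gz"),
                      key=lambda p: p.stat().st_mtime)
    if not journals:
        raise FileNotFoundError(f"no frontier.recover.gz under {job_dir}")
    action = job / "action"
    action.mkdir(exist_ok=True)
    # Copy rather than move, so the original journal survives if the
    # replay goes wrong.
    return Path(shutil.copy2(journals[-1], action))
```

Heritrix would still need to be fully stopped before staging the file, and the job relaunched afterwards; this sketch only covers the file placement.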

@ato ato added the question label Sep 26, 2022
@internetarchive internetarchive locked and limited conversation to collaborators Sep 30, 2022
@ato ato converted this issue into discussion #506 Sep 30, 2022

