harvest-agent-h1 does not close je http_cookies database when agent paused, started and stopped. Possibly http_cookies database not used after restart. #104

kurtlenfesty · 2019-03-14T21:15:26Z

When harvest-agent-h1 starts a harvest job, a FetchHTTP crawler is started to process the crawl (one of several, in fact). This particular FetchHTTP crawler is added to the processor chain and creates a cookieDb (http_cookies) as part of its initialTasks(). In a normal crawl without pauses, stopping the crawl causes the CrawlController completeStop() to call finalTasks() on this FetchHTTP crawler, which in turn calls cleanupHttp(), which syncs and closes the http_cookies database.

However, if the crawl job is paused and resumed, the FetchHTTP crawler that was originally in the processor chain is no longer there (and, in fact, initialTasks() is not called on the new FetchHTTP crawler instance). As a result, when the CrawlController completeStop() is called, finalTasks() is not invoked on any FetchHTTP instance as part of CrawlController runProcessorFinalTasks(), so the http_cookies database is never closed.

When the CrawlController completeStop() then finishes its processing and closes the bdbEnvironment (the environment in which the Sleepycat JE databases run), it finds that the http_cookies database is still open and produces the following exception, which appears in the logs:

Mar 15, 2019 7:18:15 AM org.archive.crawler.framework.CrawlController completeStop
WARNING: Problem syncing or closing bdbEnvironment
com.sleepycat.je.DatabaseException: (JE 3.3.74) There is 1 open Database in the Environment.
Closing the following databases:
http_cookies 
    at com.sleepycat.je.Environment.close(Environment.java:378)
    at org.archive.util.bdbje.EnhancedEnvironment.close(EnhancedEnvironment.java:82)
    at org.archive.crawler.framework.CrawlController.completeStop(CrawlController.java:1076)
    at org.archive.crawler.admin.CrawlJob$MBeanCrawlController.completeStop(CrawlJob.java:801)
    at org.archive.crawler.framework.CrawlController.toeEnded(CrawlController.java:1823)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:186)
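
The sequence described above can be modelled with a small, self-contained sketch. It does not use any Heritrix or Sleepycat classes: SketchEnvironment, SketchFetcher and SketchController are hypothetical stand-ins, and the pauseAndResume() behaviour (the chain ends up holding a fresh instance that never ran initialTasks()) is an assumption based on the description in this issue, not a confirmed reading of the Heritrix code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the BDB JE environment; tracks open databases by name. */
class SketchEnvironment {
    private final List<String> openDatabases = new ArrayList<>();

    void open(String name) { openDatabases.add(name); }

    void close(String name) { openDatabases.remove(name); }

    /** Mimics the reported behaviour: closing fails if a database is still open. */
    void closeEnvironment() {
        if (!openDatabases.isEmpty()) {
            throw new IllegalStateException("There is " + openDatabases.size()
                    + " open Database in the Environment: " + openDatabases);
        }
    }
}

/** Hypothetical stand-in for FetchHTTP: opens the cookie database in initialTasks(). */
class SketchFetcher {
    private final SketchEnvironment env;
    private boolean cookieDbOpen = false;

    SketchFetcher(SketchEnvironment env) { this.env = env; }

    void initialTasks() {
        env.open("http_cookies");
        cookieDbOpen = true;
    }

    /** Only closes the database if this particular instance opened it. */
    void finalTasks() {
        if (cookieDbOpen) {
            env.close("http_cookies");
            cookieDbOpen = false;
        }
    }
}

/** Hypothetical stand-in for the CrawlController lifecycle. */
class SketchController {
    private final SketchEnvironment env = new SketchEnvironment();
    private SketchFetcher fetcherInChain;

    void start() {
        fetcherInChain = new SketchFetcher(env);
        fetcherInChain.initialTasks();
    }

    /** ASSUMPTION: after resume the chain holds a new, uninitialised instance. */
    void pauseAndResume() {
        fetcherInChain = new SketchFetcher(env);
    }

    /** Returns null on a clean stop, otherwise the error message. */
    String completeStop() {
        fetcherInChain.finalTasks(); // never reaches the instance that opened the db
        try {
            env.closeEnvironment();
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }
}

class CookieDbLifecycleDemo {
    public static void main(String[] args) {
        SketchController normal = new SketchController();
        normal.start();
        System.out.println("normal stop error: " + normal.completeStop());

        SketchController paused = new SketchController();
        paused.start();
        paused.pauseAndResume();
        System.out.println("paused stop error: " + paused.completeStop());
    }
}
```

In this model the normal run stops cleanly, while the paused run fails on environment close because the instance that owns the open http_cookies handle is no longer reachable from the chain.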

In some sense this error is benign, although its appearance in the logs could cause concern. It does, however, point to a problem with pausing and resuming Heritrix 1 crawls (the FetchHTTP crawlers are not being managed properly through the CrawlController and the harvest-agent-h1), and it implies that the http_cookies database is not being used after a paused crawl is started again.

Given that the Heritrix 1 crawler is being phased out, this is a low priority fix. However, should someone decide to try and fix this issue:

The GitHub project https://github.com/WebCuratorTool/heritrix-1-14-adjust has a branch, DEBUG/fetchhttp-crawlcontroller-http-cookies-database-closing, with debug statements and some slight code changes to help reveal the issue (including a stack dump whenever a FetchHTTP instance is created). A version of the jar is checked in as well:
heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar, together with a pom.
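
For readers unfamiliar with the technique, dumping the stack when an instance is created can be done roughly as follows. This is a generic sketch using java.util.logging, not the code in the debug branch: TracedFetcher and its fields are hypothetical names chosen for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical class that records who constructed it; not the real FetchHTTP. */
class TracedFetcher {
    private static final Logger LOGGER = Logger.getLogger(TracedFetcher.class.getName());

    /** Stack trace captured at construction time, kept for later inspection. */
    final String creationTrace;

    TracedFetcher() {
        // A Throwable captures the current call stack without being thrown.
        Throwable here = new Throwable("TracedFetcher instance created");
        creationTrace = stackToString(here);
        // The identity hash code helps tell separate instances apart in the log.
        LOGGER.log(Level.INFO,
                "created instance " + System.identityHashCode(this), here);
    }

    static String stackToString(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }
}

class TraceDemo {
    public static void main(String[] args) {
        TracedFetcher first = new TracedFetcher();
        TracedFetcher second = new TracedFetcher();
        // Two log entries with different identity hash codes show two instances.
        System.out.println(first.creationTrace.equals(second.creationTrace)
                ? "same construction site" : "different construction sites");
    }
}
```

Comparing the logged construction sites before and after a pause and resume would show which code path creates the replacement instance.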

This version of the jar can be installed by using:
mvn install:install-file -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom

And the harvest-agent-h1 pom would need to be changed to use this debug version:

        <dependency>
            <groupId>org.archive</groupId>
            <artifactId>heritrix</artifactId>
            <version>1.14.3-webcuratortool-2.0.1-SNAPSHOT</version>
            <scope>compile</scope>
        </dependency>

Note that the debug output would appear in the harvest-agent-h1 logs.


Debug on FetchHTTP to help understand why the 'http_cookies' database is not being closed. This issue is discussed in more detail at: WebCuratorTool/webcurator-v2-legacy#104. This particular branch prints debug messages and stack traces to provide insight as to when a FetchHTTP crawler is created and when initialTasks() and cleanupHttp() are being called, as well as the state of the cookieDb instance. Note that a version of the debug jar has been checked in to the branch in the release_archive folder. It can be installed using: mvn install:install-file -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom
