harvest-agent-h1 does not close je http_cookies database when agent paused, started and stopped. Possibly http_cookies database not used after restart.
#104
Open
kurtlenfesty opened this issue
Mar 14, 2019
· 0 comments
When harvest-agent-h1 starts a harvest job, a FetchHTTP crawler is started to process the crawl (one of several, actually). This FetchHTTP crawler is added to the processor chain and creates a cookieDb (http_cookies) as part of its initialTasks(). In a normal crawl without pauses, when the crawl is stopped, CrawlController completeStop() calls finalTasks() on this FetchHTTP crawler; finalTasks() in turn calls cleanupHttp(), which syncs and closes the http_cookies database.
However, if the crawl job is paused and resumed, the original FetchHTTP crawler is no longer in the processor chain (and in fact, initialTasks() is not called on the new FetchHTTP crawler instance). As a result, when CrawlController completeStop() is called, no FetchHTTP instance is invoked as part of CrawlController runProcessorFinalTasks(), so the http_cookies database is not closed.
When CrawlController completeStop() then finishes its processing and closes the bdbEnvironment (the environment in which the Sleepycat JE databases run), it finds that the http_cookies database is still open and produces the following exception, which appears in the logs:
Mar 15, 2019 7:18:15 AM org.archive.crawler.framework.CrawlController completeStop
WARNING: Problem syncing or closing bdbEnvironment
com.sleepycat.je.DatabaseException: (JE 3.3.74) There is 1 open Database in the Environment.
Closing the following databases:
http_cookies
    at com.sleepycat.je.Environment.close(Environment.java:378)
    at org.archive.util.bdbje.EnhancedEnvironment.close(EnhancedEnvironment.java:82)
    at org.archive.crawler.framework.CrawlController.completeStop(CrawlController.java:1076)
    at org.archive.crawler.admin.CrawlJob$MBeanCrawlController.completeStop(CrawlJob.java:801)
    at org.archive.crawler.framework.CrawlController.toeEnded(CrawlController.java:1823)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:186)
In some sense this error is benign, although its appearance in the logs could cause concern. It does, however, point to a problem with pausing and resuming Heritrix 1 crawls (the FetchHTTP crawlers are not being managed properly through the CrawlController and the harvest-agent-h1), and it implies that the http_cookies database is not being used after a paused crawl is resumed.
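The lifecycle mismatch described above can be illustrated with a small, self-contained Java sketch. This is not Heritrix or Berkeley DB JE code: SketchEnvironment, SketchFetcher and openDatabasesAfterStop() are hypothetical stand-ins, and the pause/resume behaviour (a new instance in the chain that never had initialTasks() called) is assumed from the description in this issue rather than taken from the Heritrix source.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins only; none of these types exist in Heritrix or Berkeley DB JE.
class CookieDbLifecycleSketch {

    /** Stand-in for the bdbEnvironment: tracks which databases are still open. */
    static class SketchEnvironment {
        final Set<String> openDatabases = new LinkedHashSet<>();
        void open(String name) { openDatabases.add(name); }
        void close(String name) { openDatabases.remove(name); }
    }

    /** Stand-in for a FetchHTTP-like processor that owns a cookie database. */
    static class SketchFetcher {
        private final SketchEnvironment env;
        private boolean cookieDbOpen = false;

        SketchFetcher(SketchEnvironment env) { this.env = env; }

        void initialTasks() {
            env.open("http_cookies");
            cookieDbOpen = true;
        }

        void finalTasks() {
            // Only the instance that opened the database knows it has to close it.
            if (cookieDbOpen) {
                env.close("http_cookies");
                cookieDbOpen = false;
            }
        }
    }

    /** Runs a simulated crawl and returns the databases still open after the stop. */
    static List<String> openDatabasesAfterStop(boolean pauseAndResume) {
        SketchEnvironment env = new SketchEnvironment();
        List<SketchFetcher> chain = new ArrayList<>();

        SketchFetcher original = new SketchFetcher(env);
        chain.add(original);
        original.initialTasks();

        if (pauseAndResume) {
            // Assumption from the issue text: after resume the chain holds a new
            // instance, and initialTasks() is never called on it.
            chain.clear();
            chain.add(new SketchFetcher(env));
        }

        // Simulated stop: final tasks run only on what is currently in the chain.
        for (SketchFetcher fetcher : chain) {
            fetcher.finalTasks();
        }
        return new ArrayList<>(env.openDatabases);
    }

    public static void main(String[] args) {
        System.out.println("normal stop: " + openDatabasesAfterStop(false));
        System.out.println("pause/resume then stop: " + openDatabasesAfterStop(true));
    }
}
```

In this sketch the normal path leaves no database open, while the pause/resume path leaves http_cookies open, because the instance that opened it is no longer reachable from the chain when the final tasks run. That is the condition the environment complains about when it is closed.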
Given that the Heritrix 1 crawler is being phased out, this is a low priority fix. However, should someone decide to try and fix this issue:
The GitHub project https://github.com/WebCuratorTool/heritrix-1-14-adjust has a branch, DEBUG/fetchhttp-crawlcontroller-http-cookies-database-closing, with debug statements and some slight code changes to help reveal the issue (including a stack dump whenever a FetchHTTP instance is created). A version of the jar is checked in as well: heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar, with a pom.
This version of the jar can be installed by using: mvn install:install-file -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom
And the harvest-agent-h1 pom would need to be changed to use this debug version:
Debug on FetchHTTP to help understand why the 'http_cookies' database is not
being closed. This issue is discussed in more detail at:
WebCuratorTool/webcurator-v2-legacy#104.
This particular branch prints debug messages and stack traces to provide
insight as to when a FetchHTTP crawler is created and when initialTasks()
and cleanupHttp() are being called, as well as the state of the cookieDb
instance.
Note that a version of the debug jar has been checked in to the branch
in the release_archive folder. It can be installed using:
mvn install:install-file -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom
Note that the debug output would appear in the harvest-agent-h1 logs.