harvest-agent-h1 does not close the JE http_cookies database when the agent is paused, started and stopped. Possibly the http_cookies database is not used after restart. #104

Open
kurtlenfesty opened this issue Mar 14, 2019 · 0 comments

kurtlenfesty commented Mar 14, 2019

When harvest-agent-h1 starts a harvest job, a FetchHTTP crawler (one of several, actually) is started to process the crawl. This particular FetchHTTP crawler is added to the processor chain and creates a cookieDb (http_cookies) as part of its initialTasks(). In a normal crawl without pauses, when the crawl is stopped, the CrawlController completeStop() calls finalTasks() on this FetchHTTP crawler, which in turn calls cleanupHttp(), which syncs and closes the http_cookies database.

However, if the crawl job is paused and resumed, the original FetchHTTP crawler is no longer in the processor chain (and in fact, initialTasks() is not called on the new FetchHTTP crawler instance). As a result, when the CrawlController completeStop() is called, finalTasks() is not called on any FetchHTTP instance as part of CrawlController runProcessorFinalTasks(), so the http_cookies database is never closed.

Consequently, when the CrawlController completeStop() finishes its processing and closes the bdbEnvironment (the environment in which the Sleepycat JE databases run), it finds that the http_cookies database is still open and produces the following exception, which appears in the logs:

Mar 15, 2019 7:18:15 AM org.archive.crawler.framework.CrawlController completeStop
WARNING: Problem syncing or closing bdbEnvironment
com.sleepycat.je.DatabaseException: (JE 3.3.74) There is 1 open Database in the Environment.
Closing the following databases:
http_cookies 
    at com.sleepycat.je.Environment.close(Environment.java:378)
    at org.archive.util.bdbje.EnhancedEnvironment.close(EnhancedEnvironment.java:82)
    at org.archive.crawler.framework.CrawlController.completeStop(CrawlController.java:1076)
    at org.archive.crawler.admin.CrawlJob$MBeanCrawlController.completeStop(CrawlJob.java:801)
    at org.archive.crawler.framework.CrawlController.toeEnded(CrawlController.java:1823)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:186)
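
The sequence described above can be reproduced in miniature with a toy model. This is only a sketch for reasoning about the lifecycle, not Heritrix code: FakeEnvironment, FakeFetcher and runCrawl are hypothetical names, and the assumption that a resume leaves no initialised fetcher in the chain is taken from the description in this issue rather than from the Heritrix source.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the lifecycle described in this issue; none of this is Heritrix code. */
public class CookieDbLifecycleSketch {

    /** Hypothetical stand-in for the JE environment: tracks which named databases are open. */
    static class FakeEnvironment {
        final List<String> openDatabases = new ArrayList<>();

        /** Returns the names of the databases that were still open at close time. */
        List<String> close() {
            return new ArrayList<>(openDatabases);
        }
    }

    /** Hypothetical stand-in for a FetchHTTP instance that owns the cookie database. */
    static class FakeFetcher {
        private FakeEnvironment env; // stays null until initialTasks() runs

        void initialTasks(FakeEnvironment env) {
            this.env = env;
            env.openDatabases.add("http_cookies"); // "open" the cookie database
        }

        void finalTasks() {
            if (env != null) {
                env.openDatabases.remove("http_cookies"); // "sync and close" it
                env = null;
            }
        }
    }

    /** Runs one simulated crawl and reports what the environment found open on close. */
    public static List<String> runCrawl(boolean pauseAndResume) {
        FakeEnvironment env = new FakeEnvironment();
        List<FakeFetcher> chain = new ArrayList<>();

        FakeFetcher original = new FakeFetcher();
        original.initialTasks(env);
        chain.add(original);

        if (pauseAndResume) {
            // Assumption: after a resume the original fetcher is gone from the chain,
            // and its replacement never had initialTasks() called, so nothing in the
            // chain owns the open database any more.
            chain.clear();
        }

        // Equivalent of the final-tasks pass that runs before the environment closes.
        for (FakeFetcher fetcher : chain) {
            fetcher.finalTasks();
        }
        return env.close();
    }

    public static void main(String[] args) {
        System.out.println("no pause: " + runCrawl(false));           // prints "no pause: []"
        System.out.println("paused and resumed: " + runCrawl(true)); // prints "paused and resumed: [http_cookies]"
    }
}
```

In this model the database is left open only because the instance that opened it is no longer reachable from the chain when the final-tasks pass runs, which matches the symptom in the log above.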

In some sense this error is benign, although its appearance in the logs could cause concern. It does, however, point to a problem with pausing and resuming Heritrix 1 crawls: the FetchHTTP crawlers are not being managed properly through the CrawlController and the harvest-agent-h1. It also implies that the http_cookies database is not being used after a paused crawl is started again.

Given that the Heritrix 1 crawler is being phased out, this is a low priority fix. However, should someone decide to try and fix this issue:

The GitHub project https://github.com/WebCuratorTool/heritrix-1-14-adjust has a branch, DEBUG/fetchhttp-crawlcontroller-http-cookies-database-closing, with a bunch of debug statements and some slight code changes to help reveal the issue (this includes dumping the stack when a FetchHTTP instance is created). It has a version of the jar checked in as well:
heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar with a pom.

This version of the jar can be installed by using:
mvn install:install-file -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom

And the harvest-agent-h1 pom would need to be changed to use this debug version:

        <dependency>
            <groupId>org.archive</groupId>
            <artifactId>heritrix</artifactId>
            <version>1.14.3-webcuratortool-2.0.1-SNAPSHOT</version>
            <scope>compile</scope>
        </dependency>

Note that the debug output would appear in the harvest-agent-h1 logs.

kurtlenfesty added a commit to WebCuratorTool/heritrix-1-14-adjust that referenced this issue Mar 14, 2019
Debug on FetchHTTP to help understand why the 'http_cookies' database is not
being closed. This issue is discussed in more detail at:
WebCuratorTool/webcurator-v2-legacy#104.

This particular branch prints debug messages and stack traces to provide
insight as to when a FetchHTTP crawler is created and when initialTasks()
and cleanupHttp() are being called, as well as the state of the cookieDb
instance.

Note that a version of the debug jar has been checked in to the branch
in the release_archive folder. It can be installed using:
mvn install:install-file   -Dfile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.jar   -DpomFile=<project-location>/release_archive/heritrix-1.14.3-webcuratortool-2.0.1-SNAPSHOT.pom