Hello. First and foremost, thank you for developing such an amazing tool!
I'm using the Fess Docker image, version 11.0.1, and I was able to replicate the issue with codelibs/fess:latest (11.2) as well.
I can crawl and index several sites without any issues, but when I try this particular site, Fess only fetches the base path and the robots.txt file, and then the job ends.
This is the crawler configuration:
ID AVzIY5P0GSBWSHlT4_Uo
Name www.durect.com
URLs http://www.durect.com/
Included URLs For Crawling http://www.durect.com/.*
Excluded URLs For Crawling
Included URLs For Indexing
Excluded URLs For Indexing
Config Parameters
Depth
Max Access Count
User Agent Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)
The Number of Threads 3
Interval time 1500 ms
Boost 1.0
Permissions {role}www.durect.com
Label
Status Enabled
Description
This is the job:
Name Web Crawler - www.durect.com
Target all
Schedule 10 5 * * 1,3,5
Executor groovy
Script return container.getComponent("crawlJob").logLevel("info").sessionId("AVzIY5P0GSBWSHlT4_Uo").webConfigIds(["AVzIY5P0GSBWSHlT4_Uo"] as String[]).fileConfigIds([] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();
Logging Enabled
Crawler Job Enabled
Status Enabled
Display Order 10
And this is what the logs say (pasting only from the first warning onward):
2017-06-21 02:04:34,675 [main] WARN Failed to find a usable hardware address from the network interfaces; using random bytes: 25:95:84:bc:bf:40:a8:c7
2017-06-21 02:04:38,896 [main] INFO Lasta Di boot successfully.
2017-06-21 02:04:38,898 [main] INFO SmartDeploy Mode: Warm Deploy
2017-06-21 02:04:38,899 [main] INFO Smart Package: org.codelibs.fess.app
2017-06-21 02:04:38,945 [main] INFO Starting Crawler..
2017-06-21 02:04:38,998 [WebFsCrawler] INFO no modules loaded
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.transport.Netty3Plugin]
2017-06-21 02:04:38,999 [WebFsCrawler] INFO loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2017-06-21 02:04:39,078 [WebFsCrawler] INFO Connected to localhost:9301
2017-06-21 02:04:39,163 [WebFsCrawler] INFO Target URL: http://www.durect.com/
2017-06-21 02:04:39,163 [WebFsCrawler] INFO Included URL: http://www.durect.com/.*
2017-06-21 02:04:39,273 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO Crawling URL: http://www.durect.com/
2017-06-21 02:04:39,353 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO Checking URL: http://www.durect.com/robots.txt
2017-06-21 02:04:49,191 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:04:59,184 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:09,185 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:19,186 [IndexUpdater] INFO Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:22,267 [WebFsCrawler] INFO [EXEC TIME] crawling time: 43289ms
2017-06-21 02:05:29,186 [IndexUpdater] INFO Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:29,186 [IndexUpdater] INFO [EXEC TIME] index update time: 19ms
2017-06-21 02:05:29,205 [main] INFO Finished Crawler
2017-06-21 02:05:29,233 [main] INFO [CRAWL INFO] CrawlerEndTime=2017-06-21T02:05:29.205+0000,WebFsCrawlExecTime=43289,CrawlerStatus=true,CrawlerStartTime=2017-06-21T02:04:38.945+0000,WebFsCrawlEndTime=2017-06-21T02:05:29.204+0000,WebFsIndexExecTime=19,WebFsIndexSize=0,CrawlerExecTime=50260,WebFsCrawlStartTime=2017-06-21T02:04:38.963+0000
2017-06-21 02:05:34,255 [main] INFO Disconnected to elasticsearch:localhost:9301
2017-06-21 02:05:35,790 [main] INFO Destroyed LaContainer.
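In the log above the crawler checks http://www.durect.com/robots.txt right after fetching the base URL and then indexes zero documents, so one thing worth ruling out is whether the site's robots.txt rules exclude the crawler. As a sketch, Python's standard `urllib.robotparser` can evaluate a ruleset offline against the user agent configured in Fess (the `Disallow: /` rules below are hypothetical, not necessarily the site's actual file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content -- substitute the site's real file,
# e.g. fetched with: curl http://www.durect.com/robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# The user agent from the crawler configuration above.
ua = "Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)"
print(rp.can_fetch(ua, "http://www.durect.com/"))  # → False: everything is disallowed
```

If `can_fetch` returns False for the start URL, that would explain a crawl that stops immediately after the robots.txt check; Fess also has a `crawler.ignore.robots.txt` property that can be set to true to verify this hypothesis.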
Can you please help me figure out what might be happening?
Thanks in advance,
Enrique