Unable to crawl a site (while others are OK) #1117

Closed
qmaxquique opened this issue Jun 21, 2017 · 2 comments

@qmaxquique

Hello. First and foremost, thank you for developing such an amazing tool!

I'm using the Fess Docker image, version 11.0.1, and I was able to reproduce the issue with codelibs/fess:latest (11.2) as well.

I can crawl and index several sites without any issues, but when I try to crawl this particular site, Fess only fetches the base path and the robots.txt file, and then the job ends.

This is the crawler configuration:

ID	AVzIY5P0GSBWSHlT4_Uo
Name	www.durect.com
URLs	http://www.durect.com/
Included URLs For Crawling	http://www.durect.com/.*
Excluded URLs For Crawling	
Included URLs For Indexing	
Excluded URLs For Indexing	
Config Parameters	
Depth	
Max Access Count	
User Agent	Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)
The number of Thread	3
Interval time	1500 ms
Boost	1.0
Permissions	{role}www.durect.com
Label	
Status	Enabled
Description

This is the job configuration:

Name	Web Crawler - www.durect.com
Target	all
Schedule	10 5 * * 1,3,5
Executor	groovy
Script	return container.getComponent("crawlJob").logLevel("info").sessionId("AVzIY5P0GSBWSHlT4_Uo").webConfigIds(["AVzIY5P0GSBWSHlT4_Uo"] as String[]).fileConfigIds([] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();
Logging	Enabled
Crawler Job	Enabled
Status	Enabled
Display Order	10

And this is what the logs say (pasting only from the first warning onward):

2017-06-21 02:04:34,675 [main] WARN  Failed to find a usable hardware address from the network interfaces; using random bytes: 25:95:84:bc:bf:40:a8:c7
2017-06-21 02:04:38,896 [main] INFO  Lasta Di boot successfully.
2017-06-21 02:04:38,898 [main] INFO    SmartDeploy Mode: Warm Deploy
2017-06-21 02:04:38,899 [main] INFO    Smart Package: org.codelibs.fess.app
2017-06-21 02:04:38,945 [main] INFO  Starting Crawler..
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  no modules loaded
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.transport.Netty3Plugin]
2017-06-21 02:04:38,999 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2017-06-21 02:04:39,078 [WebFsCrawler] INFO  Connected to localhost:9301
2017-06-21 02:04:39,163 [WebFsCrawler] INFO  Target URL: http://www.durect.com/
2017-06-21 02:04:39,163 [WebFsCrawler] INFO  Included URL: http://www.durect.com/.*
2017-06-21 02:04:39,273 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO  Crawling URL: http://www.durect.com/
2017-06-21 02:04:39,353 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO  Checking URL: http://www.durect.com/robots.txt
2017-06-21 02:04:49,191 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:04:59,184 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:09,185 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:19,186 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:22,267 [WebFsCrawler] INFO  [EXEC TIME] crawling time: 43289ms
2017-06-21 02:05:29,186 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:29,186 [IndexUpdater] INFO  [EXEC TIME] index update time: 19ms
2017-06-21 02:05:29,205 [main] INFO  Finished Crawler
2017-06-21 02:05:29,233 [main] INFO  [CRAWL INFO] CrawlerEndTime=2017-06-21T02:05:29.205+0000,WebFsCrawlExecTime=43289,CrawlerStatus=true,CrawlerStartTime=2017-06-21T02:04:38.945+0000,WebFsCrawlEndTime=2017-06-21T02:05:29.204+0000,WebFsIndexExecTime=19,WebFsIndexSize=0,CrawlerExecTime=50260,WebFsCrawlStartTime=2017-06-21T02:04:38.963+0000
2017-06-21 02:05:34,255 [main] INFO  Disconnected to elasticsearch:localhost:9301
2017-06-21 02:05:35,790 [main] INFO  Destroyed LaContainer.

Can you please help me figure out what might be happening?

Thanks in advance,
Enrique

@marevol (Contributor) commented Jun 21, 2017

<link rel="canonical" href="http://http://www.durect.com/" />

The canonical link on that page seems to be broken (note the doubled scheme).
To ignore the canonical tag, set crawler.document.html.canonical.xpath to an empty value in fess_config.properties:

crawler.document.html.canonical.xpath=
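
For reference, a quick way to confirm the malformed canonical value is to pull it out of the page the same way a crawler would. Below is a minimal sketch, not part of Fess itself: it assumes jsoup is on the classpath, and the class name CanonicalCheck is just illustrative. The CSS selector corresponds to the canonical link element that crawler.document.html.canonical.xpath matches.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CanonicalCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the top page, identifying ourselves the way the crawler does.
        Document doc = Jsoup.connect("http://www.durect.com/")
                .userAgent("Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)")
                .get();

        // Extract the canonical link, if present.
        Element canonical = doc.selectFirst("link[rel=canonical]");
        if (canonical != null) {
            // For this site, this should print the broken value quoted above:
            // http://http://www.durect.com/
            System.out.println("canonical: " + canonical.attr("href"));
        } else {
            System.out.println("no canonical link found");
        }
    }
}

Since Fess defers to the canonical URL when it differs from the crawled one, a value like http://http://www.durect.com/ would explain the behavior in the log: the top page is skipped in favor of an invalid URL, so nothing is indexed and no further links are followed.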

@qmaxquique (Author)

Thank you @marevol. I really appreciate your help!

marevol closed this as completed Feb 6, 2019