Unable to crawl a site (while others are OK) #1117

Closed
qmaxquique opened this issue Jun 21, 2017 · 2 comments

@qmaxquique

Hello. First and foremost, thank you for developing such an amazing tool!

I'm using the Fess Docker image, version 11.0.1, and I was able to reproduce the issue with codelibs/fess:latest (11.2) as well.

I can crawl and index several sites without any issues, but when I try to crawl this particular site, Fess only fetches the base path and the robots.txt file, and then the job ends.

This is the crawler configuration:

ID	AVzIY5P0GSBWSHlT4_Uo
Name	www.durect.com
URLs	http://www.durect.com/
Included URLs For Crawling	http://www.durect.com/.*
Excluded URLs For Crawling	
Included URLs For Indexing	
Excluded URLs For Indexing	
Config Parameters	
Depth	
Max Access Count	
User Agent	Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)
The number of Thread	3
Interval time	1500 ms
Boost	1.0
Permissions	{role}www.durect.com
Label	
Status	Enabled
Description

This is the job configuration:

Name	Web Crawler - www.durect.com
Target	all
Schedule	10 5 * * 1,3,5
Executor	groovy
Script	return container.getComponent("crawlJob").logLevel("info").sessionId("AVzIY5P0GSBWSHlT4_Uo").webConfigIds(["AVzIY5P0GSBWSHlT4_Uo"] as String[]).fileConfigIds([] as String[]).dataConfigIds([] as String[]).jobExecutor(executor).execute();
Logging	Enabled
Crawler Job	Enabled
Status	Enabled
Display Order	10

And this is what the logs say (pasting only from the first warning onward):

2017-06-21 02:04:34,675 [main] WARN  Failed to find a usable hardware address from the network interfaces; using random bytes: 25:95:84:bc:bf:40:a8:c7
2017-06-21 02:04:38,896 [main] INFO  Lasta Di boot successfully.
2017-06-21 02:04:38,898 [main] INFO    SmartDeploy Mode: Warm Deploy
2017-06-21 02:04:38,899 [main] INFO    Smart Package: org.codelibs.fess.app
2017-06-21 02:04:38,945 [main] INFO  Starting Crawler..
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  no modules loaded
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
2017-06-21 02:04:38,998 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.transport.Netty3Plugin]
2017-06-21 02:04:38,999 [WebFsCrawler] INFO  loaded plugin [org.elasticsearch.transport.Netty4Plugin]
2017-06-21 02:04:39,078 [WebFsCrawler] INFO  Connected to localhost:9301
2017-06-21 02:04:39,163 [WebFsCrawler] INFO  Target URL: http://www.durect.com/
2017-06-21 02:04:39,163 [WebFsCrawler] INFO  Included URL: http://www.durect.com/.*
2017-06-21 02:04:39,273 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO  Crawling URL: http://www.durect.com/
2017-06-21 02:04:39,353 [Crawler-AVzIY5P0GSBWSHlT4_Uo-1-1] INFO  Checking URL: http://www.durect.com/robots.txt
2017-06-21 02:04:49,191 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:04:59,184 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:09,185 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:19,186 [IndexUpdater] INFO  Processing no docs (Doc:{access 3ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:22,267 [WebFsCrawler] INFO  [EXEC TIME] crawling time: 43289ms
2017-06-21 02:05:29,186 [IndexUpdater] INFO  Processing no docs (Doc:{access 2ms}, Mem:{used 160MB, heap 239MB, max 494MB})
2017-06-21 02:05:29,186 [IndexUpdater] INFO  [EXEC TIME] index update time: 19ms
2017-06-21 02:05:29,205 [main] INFO  Finished Crawler
2017-06-21 02:05:29,233 [main] INFO  [CRAWL INFO] CrawlerEndTime=2017-06-21T02:05:29.205+0000,WebFsCrawlExecTime=43289,CrawlerStatus=true,CrawlerStartTime=2017-06-21T02:04:38.945+0000,WebFsCrawlEndTime=2017-06-21T02:05:29.204+0000,WebFsIndexExecTime=19,WebFsIndexSize=0,CrawlerExecTime=50260,WebFsCrawlStartTime=2017-06-21T02:04:38.963+0000
2017-06-21 02:05:34,255 [main] INFO  Disconnected to elasticsearch:localhost:9301
2017-06-21 02:05:35,790 [main] INFO  Destroyed LaContainer.

Can you please help me figure out what might be happening?

Thanks in advance,
Enrique

@marevol (Contributor) commented Jun 21, 2017

<link rel="canonical" href="http://http://www.durect.com/" />

The canonical link on that page seems to be broken (note the doubled scheme).
To ignore the canonical tag, set crawler.document.html.canonical.xpath to an empty value in fess_config.properties:

crawler.document.html.canonical.xpath=
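
For reference, a quick way to confirm the malformed canonical value is to pull it out of the page the same way a crawler would. Below is a minimal sketch, not part of Fess itself: it assumes jsoup is on the classpath, and the class name CanonicalCheck is just illustrative. The CSS selector corresponds to the canonical link element that crawler.document.html.canonical.xpath matches.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CanonicalCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the top page, identifying ourselves the way the crawler does.
        Document doc = Jsoup.connect("http://www.durect.com/")
                .userAgent("Mozilla/5.0 (compatible; Fess/11.0; +http://fess.codelibs.org/bot.html)")
                .get();

        // Extract the canonical link, if present.
        Element canonical = doc.selectFirst("link[rel=canonical]");
        if (canonical != null) {
            // For this site, this should print the broken value quoted above:
            // http://http://www.durect.com/
            System.out.println("canonical: " + canonical.attr("href"));
        } else {
            System.out.println("no canonical link found");
        }
    }
}

Since Fess defers to the canonical URL when it differs from the crawled one, a value like http://http://www.durect.com/ would explain the behavior in the log: the top page is skipped in favor of an invalid URL, so nothing is indexed and no further links are followed.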

@qmaxquique (Author)

Thank you @marevol. I really appreciate your help!

marevol closed this as completed Feb 6, 2019