Tutorial: Nutch
The following information was contributed by Praful Bagai and J. Gobel.
This tutorial assumes that you are customizing the Reuters tutorial. It has been tested with Solr 3.6 and Nutch 1.6.
In $NUTCH_RUNTIME_HOME/conf/schema.xml, replace:
<field name="content" type="text" stored="false" indexed="true"/>
with:
<field name="content" type="text" stored="true" indexed="true"/>
to make the value of the content field retrievable during a search.
Check the following properties in your nutch-default.xml:
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>
<property>
  <name>parser.caching.forbidden.policy</name>
  <value>content</value>
  <description>If a site (or a page) requests through its robot metatags
  that it should not be shown as cached content, apply this policy. Currently
  three keywords are recognized: "none" ignores any "noarchive" directives.
  "content" doesn't show the content, but shows summaries (snippets).
  "all" doesn't show either content or summaries.</description>
</property>
You may also need to copy fields from your Nutch schema to your Solr schema.
Next, follow the Nutch tutorial up to step 3.1. At step 3.1, run the command below, then continue up to step 6. You should then be able to log in to your Solr server and search for what Nutch crawled.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
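One way to check that the crawl reached Solr, and that the content field is now stored, is to request a few documents as JSON. The sketch below only builds the request URL. The helper name buildCheckUrl is made up for this example, and it assumes the default select handler and that your Solr schema has title, url and content fields.

```javascript
// Build a Solr select URL that returns a few crawled documents as JSON.
// Assumption: solrUrl ends with a slash, as in the crawl command.
function buildCheckUrl(solrUrl, rows) {
  var params = {
    q: '*:*',                 // match every indexed document
    fl: 'title,url,content',  // fields to return; content must be stored
    rows: String(rows),       // number of documents to return
    wt: 'json'                // ask Solr for a JSON response
  };
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return solrUrl + 'select?' + query;
}

var checkUrl = buildCheckUrl('http://localhost:8983/solr/', 5);
console.log(checkUrl);
```

Open the printed URL in a browser. If the documents come back without a content value, recheck the schema change and the Nutch properties before moving on.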
Download AJAX Solr and unpack the ZIP file into its own directory where your web server can find it.
In examples/reuters/js/reuters.js, set solrUrl to point to your Solr server, and update the Solr parameters in var params to reflect the structure of your Solr documents:
- Update facet.field with the fields on which you want to facet, e.g. [ 'title' ]
- Remove f.topics.facet.limit and f.countryCodes.facet.limit unless your Solr documents have topics or countryCodes fields
- Remove all facet.date parameters unless your Solr documents have a date field on which you want to facet
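The changes to var params can be sketched as a small transformation. The starting values in reutersParams are recalled from the Reuters example and may differ in your copy. The adaptParams helper is hypothetical and not part of AJAX Solr. Note that it drops a per-field facet limit whenever you no longer facet on that field, which is a slightly stricter rule than the one in the list above.

```javascript
// Facet parameters as they roughly appear in the Reuters example
// (assumption: check the values in your own reuters.js).
var reutersParams = {
  facet: true,
  'facet.field': [ 'topics', 'organisations', 'exchanges', 'countryCodes' ],
  'facet.limit': 20,
  'facet.mincount': 1,
  'f.topics.facet.limit': 50,
  'f.countryCodes.facet.limit': -1,
  'facet.date': 'date',
  'facet.date.start': '1987-02-26T00:00:00.000Z/DAY',
  'facet.date.end': '1987-10-20T00:00:00.000Z/DAY+1DAY',
  'facet.date.gap': '+1DAY',
  'json.nl': 'map'
};

// Return a copy of params adapted to the given facet fields.
function adaptParams(params, facetFields, hasDateField) {
  var adapted = {};
  for (var name in params) {
    // Drop per-field limits (f.<field>.facet.limit) for fields not faceted on.
    var perField = /^f\.(.+)\.facet\.limit$/.exec(name);
    if (perField && facetFields.indexOf(perField[1]) === -1) continue;
    // Drop every facet.date parameter when there is no date field.
    if (!hasDateField && name.indexOf('facet.date') === 0) continue;
    adapted[name] = params[name];
  }
  adapted['facet.field'] = facetFields;
  return adapted;
}

var nutchParams = adaptParams(reutersParams, [ 'title' ], false);
console.log(Object.keys(nutchParams));
```

In practice you would simply edit the object literal in reuters.js by hand; the function form only makes the three rules explicit.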
Either update or remove the tag cloud, autocomplete, country code and calendar widgets. For the tag cloud, you can set the associated Solr fields by changing the value of var fields, e.g. [ 'title', 'url', 'content' ].
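As a rough illustration of what var fields drives, the sketch below builds one set of tag cloud options per field. It assumes the Reuters example passes id, target and field to each tag cloud widget, so compare it with the loop in your copy of reuters.js. The tagcloudOptions helper is hypothetical.

```javascript
// Solr fields to show as tag clouds, using the example values from the text.
var fields = [ 'title', 'url', 'content' ];

// Build one options object per field (assumed option names: id, target, field).
function tagcloudOptions(fieldNames) {
  return fieldNames.map(function (field) {
    return {
      id: field,            // widget identifier
      target: '#' + field,  // selector of the element that holds the cloud
      field: field          // Solr field whose facet values fill the cloud
    };
  });
}

var options = tagcloudOptions(fields);
console.log(options);
```

If the targets are built this way in your copy, index.html needs an element with a matching id for each field you list.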
Nutch uses a content field, instead of a text field like Reuters. In examples/reuters/widgets/ResultWidget.js, in the template method, replace all occurrences of doc.text with doc.content. Nutch has no dateline field, so remove all occurrences of doc.dateline + ' ' +.
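After both replacements, the template method might look roughly like the standalone function below. The surrounding markup is recalled from the Reuters example and may differ in your copy. The fallback to an empty string for documents without a content value is an addition for this sketch, not part of the original widget.

```javascript
// Standalone sketch of the edited template method: doc.text is replaced
// by doc.content and the doc.dateline prefix is removed.
function template(doc) {
  // Assumption: some documents may lack content, so fall back to ''.
  var content = doc.content || '';
  var snippet = '';
  if (content.length > 300) {
    // Show the first 300 characters and hide the rest behind a "more" link.
    snippet += content.substring(0, 300);
    snippet += '<span style="display:none;">' + content.substring(300);
    snippet += '</span> <a href="#" class="more">more</a>';
  } else {
    snippet += content;
  }

  var output = '<div><h2>' + doc.title + '</h2>';
  output += '<p id="links_' + doc.id + '" class="links"></p>';
  output += '<p>' + snippet + '</p></div>';
  return output;
}

console.log(template({ id: '1', title: 'Example page', content: 'Crawled text.' }));
```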
You should now be able to open examples/reuters/index.html in a browser.