Search engine harvesting
If you want your site to be harvested by search engines, you will need to consider the effect this will have on server load. Excessive crawling can have a negative impact on search performance for all users.
It is a good idea to create a sitemap that tells crawlers which pages you want to be harvested. The SitemapGenerator gem can be used to generate a sitemap periodically and ping search engines to trigger new harvests. Here is an example implementation that creates links to all relevant documents within Solr (developed for the Danish Research database).
SitemapGenerator::Sitemap.create do
  # We set a boolean value in our environment files to prevent generation in staging or development
  break unless Rails.application.config.sitemap[:generate]

  # Add static pages
  # This is quite primitive - it could perhaps be improved by querying the Rails routes in the about namespace
  ['', 'search-and-get', 'data', 'faq'].each do |page|
    add "/about/#{page}"
  end

  # Add single record pages, paging through the full index with a Solr cursor
  cursor_mark = '*'
  loop do
    response = Blacklight.solr.get('/solr/blacklight/select', params: { # you may need to change the path or request handler
      'q' => '*:*',                        # all docs
      'fl' => 'id',                        # we only need the ids
      'fq' => '',                          # optional filter query
      'cursorMark' => cursor_mark,         # the cursor mark handles paging
      'rows' => ENV['BATCH_SIZE'] || 1000,
      'sort' => 'id asc'                   # cursor paging requires a sort on the unique key
    })
    response['response']['docs'].each do |doc|
      # Use whichever field your catalog routes are keyed on (the Danish Research database used cluster_id_ss)
      add "/catalog/#{doc['id']}"
    end
    break if response['nextCursorMark'] == cursor_mark # the result set is exhausted
    cursor_mark = response['nextCursorMark']
  end
end
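Note that Rails.application.config.sitemap is not a Rails built-in: the flag referenced above has to be defined in your environment files, and SitemapGenerator also needs to know the canonical host to use when building URLs. A minimal sketch, assuming a production-only flag and a placeholder hostname:

# config/environments/production.rb
Rails.application.configure do
  config.sitemap = { generate: true }
end

# config/environments/development.rb (and staging) - suppress generation
Rails.application.configure do
  config.sitemap = { generate: false }
end

# config/sitemap.rb - placed before the create block above
SitemapGenerator::Sitemap.default_host = 'https://www.example.org' # placeholder host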
It is a good idea to trigger sitemap generation via cron at times of low activity (for example at the weekend) so that harvesting doesn't impact human users. For example:
0 2 * * 6 cd <app_root> && RAILS_ENV=production /usr/bin/bundle exec rake sitemap:clean sitemap:refresh
If you expose a sitemap listing the pages you do want to be harvested, it is also a good idea to tell crawlers which pages you do not want to be harvested. Some crawlers will construct URLs for search results pages, leading to a potentially infinite number of crawl targets. You should therefore include a robots.txt file that disallows search results pages. Here is an example:
# robots.txt
# Advertise the sitemap if it has been generated
<%- if File.exist? "#{Rails.root}/public/sitemap.xml.gz" -%>
Sitemap: <%= "#{root_url :locale => nil}sitemap.xml.gz" %>
<%- end -%>
User-agent: *
Disallow: /catalog? # blocks search results pages
Disallow: /catalog.html? # sometimes they use .html to get searches, Sneaky Google!
Disallow: /catalog/facet # blocks facet pages
Disallow: /catalog/range_limit
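Because the robots.txt above contains ERB, it cannot be served as a static file straight from public/. One way to serve it dynamically is sketched below; the controller, view path and route are assumptions, not something Blacklight provides. Remember to remove the default public/robots.txt so the request actually reaches the route.

# config/routes.rb
get '/robots.txt' => 'robots#show'

# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  def show
    # Renders app/views/robots/show.text.erb (the template above) as plain text
    render layout: false, content_type: 'text/plain', formats: [:text]
  end
end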
If you also want your records indexed by Google Scholar, note that it uses Highwire Press meta tags to parse academic metadata. For example:
<meta name="citation_title" content="Association between regional cerebral blood flow during hypoglycemia and genetic and phenotypic traits of the renin-angiotensin system" />
<meta name="citation_author" content="Lise Grimmeshave Bie-Olsen" />
<meta name="citation_author" content="Ulrik Pedersen-Bjergaard" />
<meta name="citation_author" content="Troels Wesenberg Kjaer" />
<meta name="citation_author" content="Markus Nowak Lonsdale" />
<meta name="citation_author" content="Ian Law" />
<meta name="citation_author" content="Birger Thorsteinsson" />
<meta name="citation_publication_date" content="2009" />
<meta name="citation_journal_title" content="Journal of Cerebral Blood Flow and Metabolism" />
<meta name="citation_language" content="eng" />
<meta name="citation_doi" content="10.1038/jcbfm.2009.94" />
<meta name="citation_issn" content="1559-7016" />
<meta name="citation_issn" content="0271-678x" />