Spiders (Crawler)

A Spider is a Python class that crawls data from your source. Crawling usually means iterating over the contents of a sitemap or an API result that provides a list of identifiers or entries.

Before you start writing a new spider from scratch, please check which spiders we already offer.

Please note that when overriding any of these base classes, you can of course still provide individual processing for specific attributes or data.

Quick start

All spiders are located in etl/converter/spiders. You may copy our sample_spider.py file to get started.

Make sure to change the class name and filename, as well as the name attribute inside the file.
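
After copying and renaming, the top of your file might look roughly like this (a minimal sketch; all names here are placeholders, and the actual base classes should be taken from sample_spider.py):

from converter.spiders.lom_base import LomBase  # adjust the import path to your checkout

class ChosenNameSpider(LomBase):
  # the value of `name` is what you pass to `scrapy crawl`
  name = 'chosen_name_spider'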

You can then run your spider by calling scrapy crawl chosen_name_spider inside the etl folder. Please refer to the README for the basic setup instructions. For more details, take a look at the sample_spider class.

View SampleSpider class

Integrated spiders in this project

OAI-PMH (LOM)

If your system provides an OAI-PMH API, you can use our OAIBase class. You simply override the baseUrl, metadataPrefix, and set attributes.
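
A minimal sketch, with placeholder values for all three attributes:

from converter.spiders.oai_base import OAIBase  # assumed module path; adjust to your checkout

class MyOaiSpider(OAIBase):
  name = 'my_oai_spider'
  # placeholder endpoint and settings; replace with your repository's values
  baseUrl = 'https://example.org/oai/provider'
  metadataPrefix = 'lom'
  set = 'my_collection'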

Learn more about OAI-PMH

View OAIBase class

LRMI

If your pages embed LRMI data and your site provides a sitemap, you can make use of our LRMIBase class to get started.
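
A minimal sketch (the sitemap URL is a placeholder; check the LRMIBase class itself for the exact attributes it expects):

from converter.spiders.lrmi_base import LRMIBase  # assumed module path and class name

class MyLrmiSpider(LRMIBase):
  name = 'my_lrmi_spider'
  start_urls = ['https://example.org/sitemap.xml']  # placeholder sitemap URL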

Learn more about LRMI

View LRMIBase class

RSS

If you provide data via RSS, we offer two approaches:

If you only have a single RSS feed, we recommend using the RSSBase class. It provides basic handling for the main RSS metadata. Additional metadata should be added by your class, either statically or by crawling additional data from your webpage or a separate endpoint (see the sketch after these two options).

If your source consists of multiple RSS feeds, take a look at the RSSListBase class. You have to provide a CSV file listing all your RSS URLs and can also include default metadata for each feed, as described in the class file.
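
For the single-feed case, a minimal sketch could look like this (the feed URL is a placeholder; check the RSSBase class for the exact attribute it reads the feed from):

from converter.spiders.rss_base import RSSBase  # assumed module path; adjust to your checkout

class MyRssSpider(RSSBase):
  name = 'my_rss_spider'
  start_urls = ['https://example.org/feed.xml']  # placeholder feed URL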

View RSSBase class

View RSSListBase class

Spiders provided by Scrapy

If none of the above spiders matches your source, there's no need to worry: since we use Scrapy as a framework, you can use any of the base spiders it includes to collect your data. You'll find the list here.

We strongly recommend also inheriting from the LomBase class so that you already have all the basic callbacks for the individual metadata groups. Furthermore, if your source makes use of JSON, you may also find the JSONBase class helpful. A sample override could look like this:

from scrapy.spiders import CrawlSpider
from converter.spiders.lom_base import LomBase  # adjust the import path to your checkout

class MySpider(CrawlSpider, LomBase):
  name = 'my_spider'
  start_urls = ['https://edu-sharing.com']
  # more data comes here

  def parse(self, response):
    # hand the response to LomBase, which drives the metadata callbacks
    return LomBase.parse(self, response)

For more details, please take a look at the sample_spider class.

View SampleSpider class