
Spiders (Crawler)

tsimon edited this page Apr 21, 2020 · 18 revisions

A spider is a Python class that crawls data from your source. Crawling is usually done by iterating over a sitemap or an API that provides a list of identifiers or entries.

Before writing a new spider, please check which spiders we already offer.

Please note that when you override one of the base classes, you can still provide individual processing for specific attributes or data.

Integrated spiders in this project

OAI-PMH (LOM)

If your system provides an OAI-PMH API, you can use our OAIBase class for it. You simply override baseUrl, metadataPrefix and set.

Learn more about OAI-PMH

View OAIBase class
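The override described above might look roughly like the following sketch. The `OAIBase` defined here is only a simplified stand-in so the example runs on its own; the real class lives in this project and may differ. The `name` attribute, the `list_identifiers_url` helper, and the example URL, prefix and set values are assumptions for illustration. Only the `ListIdentifiers` request parameters come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


class OAIBase:
    """Simplified stand-in for the project's OAIBase class (hypothetical)."""

    baseUrl = None
    metadataPrefix = None
    set = None

    def list_identifiers_url(self):
        # Hypothetical helper: builds a standard OAI-PMH ListIdentifiers request.
        params = {"verb": "ListIdentifiers", "metadataPrefix": self.metadataPrefix}
        if self.set:
            params["set"] = self.set
        return self.baseUrl + "?" + urlencode(params)


class ExampleOAISpider(OAIBase):
    # The three attributes the page says you override; all values are placeholders.
    name = "example_oai_spider"  # assumed spider name attribute
    baseUrl = "https://example.org/oai"
    metadataPrefix = "oai_lom"
    set = "default"


url = ExampleOAISpider().list_identifiers_url()
print(url)
```

With these placeholder values, the spider would start from a `ListIdentifiers` request against `https://example.org/oai`; check the actual OAIBase class for the attributes and hooks it really exposes.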

LRMI

If your site embeds LRMI data in its pages and provides a sitemap, you can use our LRMIBase class to get started.

Learn more about LRMI

View LRMIBase class
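To give a feel for what "LRMI data in a page" can mean, here is a minimal sketch that assumes the metadata is embedded as a JSON-LD script tag, which is one common way to publish it. It does not use LRMIBase and is not how that class is implemented; the `JsonLdExtractor` class and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of every JSON-LD script tag in a page."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Hypothetical page as a sitemap entry might return it.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "CreativeWork",
 "name": "Example lesson", "learningResourceType": "worksheet"}
</script>
</head><body></body></html>"""

parser = JsonLdExtractor()
parser.feed(PAGE)
lrmi_items = parser.items
print(lrmi_items[0]["name"], lrmi_items[0]["learningResourceType"])
```

If your pages carry this kind of block for every entry in the sitemap, they are likely a good fit for the LRMIBase class.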

RSS

If you provide data via RSS, we offer two approaches:

If you only have a single RSS feed, we recommend using the RSSBase class. It provides basic handling for the main metadata of RSS.

If your source consists of multiple RSS feeds, take a look at the RSSListBase class. You have to provide a CSV file listing all your RSS URLs and can also include default metadata for each feed, as described in the class file.

View RSSBase class

View RSSListBase class
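As a rough illustration of the multi-feed setup, the sketch below reads a CSV with one feed per row and keeps the remaining columns as default metadata for that feed. The column names (`url`, `language`, `keywords`) and the `load_feeds` helper are assumptions made for this example; the format RSSListBase actually expects is described in its class file.

```python
import csv
import io

# Hypothetical CSV content: one feed per row, extra columns are per-feed defaults.
FEEDS_CSV = """url,language,keywords
https://example.org/feed-a.xml,de,physics
https://example.org/feed-b.xml,en,history
"""


def load_feeds(csv_text):
    """Map each feed URL to its default metadata, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feeds = {}
    for row in reader:
        url = row.pop("url").strip()
        feeds[url] = {key: value for key, value in row.items() if value}
    return feeds


feeds = load_feeds(FEEDS_CSV)
for feed_url, defaults in feeds.items():
    print(feed_url, defaults)
```

Keeping the defaults next to each URL means every item crawled from a feed can inherit values the feed itself does not carry.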