Spiders (Crawler)
A spider is a Python class that crawls data from your source. Crawling is usually done by iterating over a sitemap or an API that provides a list of identifiers or entries.
Before you start writing a new spider, please check which base spiders we already offer.
Please note that when you subclass one of these base classes, you can still provide individual processing for specific attributes or data.
If your system provides an OAI API, you can use our OAIBase class for it. You simply override baseUrl, metadataPrefix and set.
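As a rough illustration of what these three values control, the sketch below builds a standard OAI-PMH ListIdentifiers request from them. OAIBaseSketch is a hypothetical stand-in written only for this example, not the project's actual OAIBase; the method name list_identifiers_url, the example URL and the set name are assumptions as well. In a real spider you would subclass OAIBase itself and only set the three attributes.

```python
from urllib.parse import urlencode


class OAIBaseSketch:
    """Hypothetical stand-in for OAIBase, for illustration only."""

    baseUrl = None
    metadataPrefix = None
    set = None

    def list_identifiers_url(self):
        # A standard OAI-PMH request: the verb, the metadataPrefix,
        # and an optional set that restricts the harvest.
        params = {"verb": "ListIdentifiers", "metadataPrefix": self.metadataPrefix}
        if self.set:
            params["set"] = self.set
        return f"{self.baseUrl}?{urlencode(params)}"


class MySourceSpider(OAIBaseSketch):
    # Assumed example values; replace them with those of your OAI endpoint.
    baseUrl = "https://example.org/oai"
    metadataPrefix = "oai_dc"
    set = "open_resources"


print(MySourceSpider().list_identifiers_url())
```

The subclass contains no logic of its own: everything source-specific lives in the three overridden attributes, which is the pattern the base class is meant to support.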
If your site embeds LRMI data in its pages and provides a sitemap, you can use our LRMIBase class to get started.
If you provide data via RSS, we offer two approaches:
If you only have a single RSS feed, we recommend using the RSSBase class. It provides basic handling for the main RSS metadata.
If your source consists of multiple RSS feeds, take a look at the RSSListBase class. You have to provide a CSV file listing all your RSS URLs, and you can also include default metadata for each feed, as described in the class file.
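To make the idea of a feed list with per-feed defaults more concrete, here is a minimal sketch of how such a CSV could be read. The column names (url, language, keywords), the feed URLs and the helper load_feeds are assumptions made for this example; the actual layout expected by RSSListBase is the one documented in its class file.

```python
import csv
import io

# Hypothetical CSV layout; the real columns are defined in the RSSListBase class file.
FEEDS_CSV = """url,language,keywords
https://example.org/feed-a.xml,de,physics
https://example.org/feed-b.xml,en,
"""


def load_feeds(csv_text):
    """Return a dict mapping each feed URL to its non-empty default metadata."""
    feeds = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row.pop("url")
        # Drop empty cells so that a feed only carries defaults it actually sets.
        feeds[url] = {key: value for key, value in row.items() if value}
    return feeds


feeds = load_feeds(FEEDS_CSV)
for url, defaults in feeds.items():
    print(url, defaults)
```

Each row yields one feed to crawl, and the remaining columns act as defaults that apply to every entry of that feed.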