Spiders (Crawler)
A spider is a Python class that crawls data from your source. Crawling usually means iterating over the contents of a sitemap or an API result that provides a list of identifiers or entries.
Before writing a new spider from scratch, please check which base spiders we already offer.
Please note that when extending any of these base classes, you can of course still provide individual processing for specific attributes or data.
All spiders are located in `etl/converter/spiders`. You may copy our `sample_spider.py` file to get started. Make sure to change the class name and filename, as well as the `name` attribute inside the file.
You can then run your spider by calling `scrapy crawl chosen_name_spider` inside the `etl` folder. Please refer to the README for the basic setup instructions.
For more details, please take a look at the `sample_spider` class.
If your system provides an OAI API, you can use our `OAIBase` class. You simply override `baseUrl`, `metadataPrefix`, and `set`.
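A minimal sketch of such a subclass, assuming `OAIBase` reads these values as class attributes; the import path and endpoint values are illustrative, not taken from the repository:

```python
from converter.spiders.oai_base import OAIBase  # adjust to the actual module path


class MyOaiSpider(OAIBase):
    name = 'my_oai_spider'
    baseUrl = 'https://example.org/oai/provider'  # hypothetical OAI-PMH endpoint
    metadataPrefix = 'lom'                        # metadata format served by the endpoint
    set = 'my_collection'                         # OAI set to harvest
```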
If your site embeds LRMI data in its pages and provides a sitemap, you can make use of our `LRMIBase` class to get started.
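A hypothetical subclass might look like the following; the attribute names are assumptions, so check the base class for what it actually expects:

```python
from converter.spiders.lrmi_base import LRMIBase  # adjust to the actual module path


class MyLrmiSpider(LRMIBase):
    name = 'my_lrmi_spider'
    start_urls = ['https://example.org/sitemap.xml']  # hypothetical sitemap of LRMI-annotated pages
```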
If you provide data via RSS, we offer two approaches:

- If you only have a single RSS feed, we recommend using the `RSSBase` class. It provides basic handling for the main RSS metadata. Additional metadata should be added by your class, either statically or by crawling additional data from your web page or a separate endpoint (see the sketch below).
- If your source consists of multiple RSS feeds, take a look at the `RSSListBase` class. You have to provide a CSV file listing all your RSS URLs, and you can also include several default metadata values for each feed, as described in the class file.
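A minimal `RSSBase` sketch, assuming the feed URL is provided via `start_urls`; the import path and feed URL are illustrative:

```python
from converter.spiders.rss_base import RSSBase  # adjust to the actual module path


class MyRssSpider(RSSBase):
    name = 'my_rss_spider'
    start_urls = ['https://example.org/feed.xml']  # the single RSS feed to crawl
```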
If none of the above spiders matches your source, there's no need to worry! Since we use Scrapy as a framework, you can use any of its built-in base spiders to collect your data. You'll find the list here.
We strongly recommend also inheriting from the `LomBase` class so that you already have all the basic callbacks for the individual metadata groups. Furthermore, if your source makes use of JSON, you may also find the `JSONBase` class helpful. A sample override could look like this:
```python
from scrapy.spiders import CrawlSpider
from converter.spiders.lom_base import LomBase  # adjust to the actual module path


class MySpider(CrawlSpider, LomBase):
    name = 'my_spider'
    start_urls = ['https://edu-sharing.com']
    # more data comes here

    def parse(self, response):
        return LomBase.parse(self, response)
```
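You can then run this spider with `scrapy crawl my_spider` from the `etl` folder, as described above.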