Spiders (Crawler)
A Spider represents a Python class which runs and crawls data from your source. Crawling is usually done by iterating over the contents of a sitemap or an API result that provides a list of identifiers or entries.
Before you start writing a new spider from scratch, please check which spiders we already offer.
Please note that when extending any of the base classes, you can of course still provide individual processing for specific attributes or data.
All spiders are located in etl/converter/spiders. You may copy our sample_spider.py file to get started. Make sure to change the class name and filename, as well as the name attribute inside the file.
You can then run your spider by calling scrapy crawl chosen_name_spider inside the etl folder.
Please refer to the README for the basic setup instructions and make sure that you have started all the necessary Docker containers first.
You need to create some class properties so that the crawler is recognized correctly inside the search. Please also check out the SampleSpider class:
name
The internal name of your spider (how it is called and identified); please postfix it with _spider.
friendlyName
The readable / real name of your source. This is how it will be displayed in the frontend.
url
The base URL of your source. This will be linked in the frontend when someone clicks on the source name.
version
The version of your spider. Please include this variable when overriding getHash(). This makes sure that as soon as the version changes, all data will be re-crawled to adapt to changes in your spider.
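Taken together, a minimal spider definition with these properties could look like the following sketch. The class name, attribute values, and import path are placeholders/assumptions; see the SampleSpider class for a complete example.

from converter.spiders.lom_base import LomBase  # assumed import path, based on the etl/converter/spiders layout

class MySourceSpider(LomBase):
    name = "my_source_spider"    # internal name, postfixed with _spider
    friendlyName = "My Source"   # readable name shown in the frontend
    url = "https://example.org"  # base URL of the source, linked in the frontend
    version = "0.1.0"            # bump this to force a full re-crawl

    def getHash(self, response):
        # include self.version so that a version change invalidates previously crawled data;
        # the URL is only a placeholder here, a real spider should use e.g. a modification date
        return response.url + self.version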
If your system provides an OAI API, you can use our OAIBase class for it. You simply override baseUrl, metadataPrefix and set.
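A minimal OAI spider could then look like this sketch; all values and the import path are assumptions, so check the OAIBase class file for the exact attribute names it expects.

from converter.spiders.oai_base import OAIBase  # assumed import path

class MyOAISpider(OAIBase):
    name = "my_oai_spider"
    friendlyName = "My OAI Repository"
    url = "https://repository.example.org"
    version = "0.1.0"

    # the three OAI-specific settings mentioned above (placeholder values)
    baseUrl = "https://repository.example.org/oai/provider"
    metadataPrefix = "lom"
    set = "my_collection_set"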
If your page embeds LRMI data in its sites and provides a sitemap, you can make use of our LRMIBase class to get started.
If you provide data via RSS, we offer two approaches:
If you only have a single RSS feed, we recommend using the RSSBase class. It provides basic handling for the main RSS metadata. Additional metadata should be added by your class, either statically or by crawling additional data from your web page or a separate endpoint; see the sketch below.
If your source consists of multiple RSS feeds, take a look at the RSSListBase class. You have to provide a CSV file with all your RSS URLs and can also include several default metadata values for each feed, as described in the class file.
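As a rough sketch, a single-feed spider based on RSSBase could look like the following. The feed URL, the import path, the use of start_urls as the entry point, and the extra metadata are assumptions; check the RSSBase class file for the actual hooks.

from converter.spiders.rss_base import RSSBase  # assumed import path

class MyRssSpider(RSSBase):
    name = "my_rss_spider"
    friendlyName = "My RSS Source"
    url = "https://example.org"
    version = "0.1.0"
    start_urls = ["https://example.org/feed.xml"]  # the single RSS feed to crawl (assumed entry point)

    def getLOMGeneral(self, response):
        # add static metadata on top of what RSSBase extracts from the feed
        general = RSSBase.getLOMGeneral(self, response)
        general.add_value("language", "de")  # example of a static default value
        return general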
If none of the above spiders matches your API, there is no need to worry! Since we use Scrapy as a framework, you can use any of its included base spiders to collect your data. You'll find the list here.
We strongly recommend also inheriting from the LomBase class so that you already have all the basic callbacks for the individual metadata groups. Furthermore, if your source makes use of JSON, you may also find the JSONBase class helpful. A sample override could look like this:
from scrapy.spiders import CrawlSpider
from converter.spiders.lom_base import LomBase  # assumed import path, adjust to your spider folder

class MySpider(CrawlSpider, LomBase):
    name = 'my_spider'
    start_urls = ['https://edu-sharing.com']
    # more data comes here

    def parse(self, response):
        return LomBase.parse(self, response)
For more details, please take a look at the SampleSpider class in sample_spider.py.
To make debugging easier, the spiders accept several arguments for specific behaviour:
scrapy crawl <spider_name> -a uuid=<specific_uuid>
For this to work, make sure your spider correctly calls hasChanged and returns None if it returns false. Call this method as soon as possible. If your spider does not support the hasChanged method, you cannot use this feature.
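Based on the description above, the check inside a spider could look like this sketch (assuming your spider inherits from LomBase):

def parse(self, response):
    # skip unchanged items as early as possible; this also makes the uuid argument work
    if not self.hasChanged(response):
        return None
    return LomBase.parse(self, response)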
scrapy crawl <spider_name> -a cleanrun=true
Please note that this will not update Elasticsearch automatically; you need to re-sync the index with the database manually in case elements have been deleted.