How To build a crawler for edu sharing (alternative method)
This chapter of our GitHub Wiki covers the topic of building a crawler for edu-sharing. If you haven't done so yet, please take a look at the fantastic Scrapy Tutorial first to get your bearings on how to use this Python framework for web-scraping. Within the Scrapy framework each individual crawler is a `scrapy.Spider` (the most basic spider class), with `CrawlSpider` as one of its subclasses, which is why we're using these terms interchangeably in our Wiki.
Many roads lead to Rome, and this saying holds true for building web-crawlers as well: in `converter/spiders` you'll find a Python file called `sample_spider.py` that acts as a rough starting point to help new users build their own spider implementations on top of our `LomBase` class (see: LOM - Learning Object Metadata). Since `sample_spider.py` makes heavy use of (multiple) inheritance and method overriding, we thought it would be a good idea to also present an alternative approach that might be easier to read and understand for new Scrapy users. Please take a look at our `sample_spider_alternative.py`.
Before we get started, please keep the Scrapy documentation on ItemLoaders handy. To grasp what happens with the data our crawler scrapes from websites, the `items.py` file in the `converter/` folder gives a good overview of which `Item` and `ItemLoader` classes we're working with.
Our Scrapy web-crawler needs to fulfill two conditions to be considered a success:
- it should navigate through a website and gather the desired meta-data from a source
- it should prepare a `BaseItem` within its `parse()` method that holds all the relevant meta-data for one specific URI/URL (for further processing in our ETL procedure) and yield it (see the sketch below)
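A minimal sketch of these two conditions, assuming that the loader classes (e.g. `BaseItemLoader`) can be imported from `converter/items.py`; the start URL and the CSS selector are hypothetical placeholders:

```python
import scrapy

from converter.items import BaseItemLoader  # loader classes live in converter/items.py


class SampleSketchSpider(scrapy.Spider):
    name = "sample_sketch_spider"
    start_urls = ["https://example.org/materials"]  # hypothetical source

    def parse(self, response):
        # condition 1: navigate through the website ...
        for url in response.css("a.material::attr(href)").getall():  # made-up selector
            yield response.follow(url, callback=self.parse_material)

    def parse_material(self, response):
        # condition 2: prepare a BaseItem for this one URL and yield it
        base = BaseItemLoader()
        base.add_value("sourceId", response.url)
        # ... fill in hash, lom, license, valuespaces etc. (see below) ...
        yield base.load_item()
```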
We're taking a peek at the `BaseItem` class definition in `converter/items.py`:
```python
class BaseItem(Item):
    sourceId = Field()
    uuid = Field()
    "explicit uuid of the target element, please only set this if you actually know the uuid of the internal document"
    hash = Field()
    collection = Field(output_processor=JoinMultivalues())
    "id of collections this entry should be placed into"
    type = Field()
    origin = Field()
    "in case it was fetched from a referatorium, the real origin name may be included here"
    response = Field(serializer=ResponseItem)
    ranking = Field()
    fulltext = Field()
    thumbnail = Field()
    lastModified = Field()
    lom = Field(serializer=LomBaseItem)
    valuespaces = Field(serializer=ValuespaceItem)
    permissions = Field(serializer=PermissionItem)
    "permissions (access rights) for this entry"
    license = Field(serializer=LicenseItem)
    publisher = Field()
    # editorial notes
    notes = Field()
    binary = Field()
```
Some of these fields are mandatory while others are optional (see: required metadata).
If we built a crawler by inheriting from both `CrawlSpider` and `LomBase`, we would normally have to override the `getId()` and `getHash()` methods with our own implementations. Since we're building our `BaseItem` "by hand", our crawler itself has to provide values for `sourceId` and `hash`.
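A hedged sketch of how those two values could be provided inside a parse method; the hashing scheme (SHA-1 over the URL plus a last-modified date from a hypothetical `<meta>` tag) is just one plausible choice, not the project's canonical one:

```python
import hashlib

from converter.items import BaseItemLoader


def begin_base_item(response):
    base = BaseItemLoader()
    # sourceId: a stable identifier of the document within this source
    base.add_value("sourceId", response.url)
    # hash: should change whenever the source document changes, so that
    # re-crawls can tell updated items apart from unchanged ones
    date_modified = response.xpath('//meta[@name="date"]/@content').get() or ""
    base.add_value("hash", hashlib.sha1(f"{response.url}{date_modified}".encode()).hexdigest())
    return base
```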
You might have noticed that some fields make use of a serializer:

- `LomBaseItem`
- `LicenseItem`
- `ValuespaceItem`
- `PermissionItem`
- `ResponseItem`
Of course we would like to fill as many of these items as possible with (relevant) meta-data, but the most important field to fill is the `LomBaseItem` because it holds several more categories within its structure, as can be seen in `items.py`:

- `LomGeneralItem`
- `LomEducationalItem`
- `LomClassificationItem`
- `LomLifecycleItem`
- `LomTechnicalItem`
- `LomAgeRangeItem`
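To give a feel for this nesting, here is a sketch that fills two of these sub-items and hangs the result into the `lom` field of the `BaseItem`. It continues the sketches above, where `base` and `response` already exist; the loader class names are assumptions derived from the item names, so please check `converter/items.py` for their exact spelling:

```python
# assumed loader names; verify against converter/items.py
from converter.items import LomBaseItemLoader, LomGeneralItemLoader, LomTechnicalItemLoader

lom = LomBaseItemLoader()

general = LomGeneralItemLoader()
general.add_value("title", response.xpath("//title/text()").get())
general.add_value("description", response.xpath('//meta[@name="description"]/@content').get())
lom.add_value("general", general.load_item())

technical = LomTechnicalItemLoader()
technical.add_value("format", "text/html")
technical.add_value("location", response.url)
lom.add_value("technical", technical.load_item())

# LomEducationalItem, LomClassificationItem, LomLifecycleItem and
# LomAgeRangeItem are nested the same way
base.add_value("lom", lom.load_item())
```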
As you can see in the (simplified) diagram, our desired end result, the `BaseItem`, is built by nesting items within each other. The `sample_spider_alternative.py` should help you understand how nesting these items works with the specific `ItemLoader` classes. The basic gist is:

- Create an `ItemLoader` object
- Populate its fields by using its `.add_value(key, value)` or `.replace_value(key, value)` methods
- Once you're done, create the `scrapy.Item` itself by calling the `.load_item()` method
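Put together, the three steps look roughly like this for the `license` field, again continuing the sketches above inside the parse method (treat the `LicenseItemLoader` name and the field keys as assumptions to be verified against `converter/items.py`):

```python
# 1. create an ItemLoader object
license_loader = LicenseItemLoader()

# 2. populate its fields
license_loader.add_value("url", "https://creativecommons.org/licenses/by/4.0/")
# .replace_value() overwrites whatever was collected for that key before
license_loader.replace_value("url", "https://creativecommons.org/licenses/by-sa/4.0/")

# 3. build the scrapy.Item and nest it into the surrounding BaseItem
base.add_value("license", license_loader.load_item())

# finally, yield the completed BaseItem from the parse method
yield base.load_item()
```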
(last update: 2021-10-22)