# Structured data, POIs that are not shy!

The web is full of information. Much of it you can see: it is rendered by your browser. Some of it you cannot see: it is intended for consumption by machines and forms part of the so-called semantic web. Many POI web pages have embedded data describing attributes of the POI, such as address, latitude/longitude, opening hours and more. Google has an excellent tutorial on how structured data works.
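
To make this concrete, structured data often appears as a JSON-LD block embedded in a page's HTML. The snippet below is a hand-written illustration in the schema.org vocabulary, not taken from any real site:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Restaurant",
  "name": "Example Burger Bar",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Example Way",
    "addressLocality": "Lafayette",
    "addressRegion": "CO",
    "postalCode": "80026"
  },
  "geo": {
    "@type": "GeoCoordinates",
    "latitude": "40.0131850",
    "longitude": "-105.1315620"
  },
  "telephone": "+1-303-555-0100",
  "openingHours": "Mo-Su 10:30-21:30"
}
</script>
```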

More information can be found at schema.org. Various online resources are available, such as the schema validation tool, to help you extract structured data on an ad-hoc basis from a URL. For example, the web page for this smashburger location decodes to yield this structured data.
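
If you prefer to experiment from Python rather than a web tool, the general-purpose extruct library can do the same ad-hoc extraction. This sketch is purely illustrative and is not part of the ATP workflow described below:

```python
import extruct
import requests

# Fetch a POI page and pull out any embedded structured data.
url = "https://smashburger.com/locations/us/co/lafayette/2755-dagny-way/"
html = requests.get(url).text

# JSON-LD and microdata are the most common carriers of schema.org POI data.
data = extruct.extract(html, base_url=url, syntaxes=["json-ld", "microdata"])
print(data["json-ld"])
```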

See also the Open Graph protocol.

## StructuredDataSpider

The ATP project has custom library code to help create spiders for sites that provide structured data in their pages. One key class is StructuredDataSpider.

An ATP spider will typically use CrawlSpider or, more commonly, SitemapSpider to drive the Scrapy engine to call the parse_sd method in StructuredDataSpider for an individual POI page.

The parse_sd method will look for the default wanted_types in the page. The spider can override this value if it is not appropriate. If required, the spider can clean up the source data with pre_process_data, and clean up the item or add extra attributes with post_process_item.

This is best illustrated by reference to some example spiders in the project, such as smashburger.py (discussed below).
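
As a minimal sketch of the general shape (the class name, URLs, patterns and clean-ups below are invented for illustration; the project's real spiders are the authoritative reference):

```python
from scrapy.spiders import SitemapSpider

from locations.structured_data_spider import StructuredDataSpider


class ExampleBrandSpider(SitemapSpider, StructuredDataSpider):
    name = "example_brand"
    # Attributes applied to every item this spider emits (QID is a placeholder).
    item_attributes = {"brand": "Example Brand", "brand_wikidata": "Q000000"}
    sitemap_urls = ["https://www.example.com/sitemap.xml"]
    # Route individual POI pages to StructuredDataSpider's parse_sd.
    sitemap_rules = [(r"/locations/", "parse_sd")]
    # Restrict to these schema.org types if the defaults are not appropriate.
    wanted_types = ["Restaurant"]

    def pre_process_data(self, ld_data, **kwargs):
        # Tidy the raw structured data before it is parsed into an item.
        if ld_data.get("telephone"):
            ld_data["telephone"] = ld_data["telephone"].strip()

    def post_process_item(self, item, response, ld_data, **kwargs):
        # Final clean-up, or add attributes the page does not supply.
        item["ref"] = response.url
        yield item
```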

## `scrapy sd` can help

Following on from the smashburger example in our sitemap tool documentation, we can run the `scrapy sd` custom tool on one of the URLs:

```
$ pipenv run scrapy sd https://smashburger.com/locations/us/co/lafayette/2755-dagny-way/
{'city': 'Lafayette',
 'image': 'https://smashburger.com/wp-content/uploads/2021/03/SB_hori_logo_Lt_NoTag_TM_RGB_XL.png',
 'lat': '40.0131850',
 'lon': '-105.1315620',
 'name': 'Your Neighborhood Smashburger',
 'opening_hours': 'Mo-Th 10:30-21:30; Fr-Sa 10:30-22:00; Su 10:30-21:30',
 'phone': '13039269700',
 'postcode': '80026',
 'ref': '1001',
 'state': 'CO',
 'street_address': '2755 Dagny Way,',
 'website': 'https://smashburger.com/locations/us/co/lafayette/2755-dagny-way/'}
```

This is encouraging, as it shows that smashburger can be processed very easily, as we do in smashburger.py.
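
For a feel of how little code that takes, here is a rough sketch of such a spider's shape; the sitemap URL and rule below are assumptions for illustration, and the real smashburger.py is the authoritative version:

```python
from scrapy.spiders import SitemapSpider

from locations.structured_data_spider import StructuredDataSpider


class SmashburgerSpider(SitemapSpider, StructuredDataSpider):
    name = "smashburger"
    item_attributes = {"brand": "Smashburger", "brand_wikidata": "Q17061332"}
    # Assumed sitemap location and POI URL pattern, for illustration only.
    sitemap_urls = ["https://smashburger.com/sitemap.xml"]
    sitemap_rules = [(r"/locations/", "parse_sd")]
```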

Note that the spider we wrote includes a Wikidata QID. To get an idea of the kind of output this enables, re-run the `scrapy sd` tool, but this time pass a Wikidata QID for it to use:

```
$ pipenv run scrapy sd -O smashburger.geojson --wikidata Q17061332 https://smashburger.com/locations/us/co/lafayette/2755-dagny-way/
```

The single POI has been written to the `smashburger.geojson` file, shown below:

{"type":"FeatureCollection","features":[{"type": "Feature", "id": "d9bWFhLsLRtombN0CN2nyIJozyY=", "properties": {"ref": "1001", "@spider": "my_spider", "nsi_id": "smashburger-d2abf0", "amenity": "fast_food", "cuisine": "burger", "takeaway": "yes", "addr:street_address": "2755 Dagny Way,", "addr:city": "Lafayette", "addr:state": "CO", "addr:postcode": "80026", "name": "Your Neighborhood Smashburger","phone": "13039269700", "website": "https://smashburger.com/locations/us/co/lafayette/2755-dagny-way/", "opening_hours": "Mo-Th 10:30-21:30; Fr-Sa 10:30-22:00; Su 10:30-21:30", "image": "https://smashburger.com/wp-content/uploads/2021/03/SB_hori_logo_Lt_NoTag_TM_RGB_XL.png", "brand": "Smashburger", "brand:wikidata": "Q17061332"}, "geometry": {"type": "Point", "coordinates": [-105.131562, 40.013185]}}]}

Note that the ATP pipeline code has used the QID to apply OSM POI category tags from the NSI (Name Suggestion Index) to the output: `"amenity": "fast_food", "cuisine": "burger", "takeaway": "yes"`.

The `smashburger.geojson` file can be examined in any number of tools; a popular choice is geojson.io.

The ATP tooling described here and on related pages can go a long way towards giving you confidence in your eventual spider before you write any code!

## Code generators

To get started quickly, try either of:

```
$ pipenv run scrapy genspider -t structured_data_crawl your_spider_name https://www.your_spider_name.com/
$ pipenv run scrapy genspider -t structured_data_sitemap your_spider_name https://www.your_spider_name.com/
```

Or, to make an entire branch:

```
$ ./contrib/new_spider structured_data_spider_sitemap your_spider_name https://www.your_spider_name.com/
```