Unofficial but comprehensive BC Ferries GraphQL API & scraper, built with Django.
Includes highly detailed data on locations, terminals, routes, schedules, conditions, ferries, and services.
The GraphQL schema is available at `schema.graphql`. A JSON schema is also available.
- Thanks to Graphene, a GraphiQL endpoint is available at `/graphql`
- JWT auth (optional) with `django-graphql-jwt`
- Extensive filtering options, including across relations

Check out some example queries at `examples.graphql`!
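For a quick smoke test outside GraphiQL, any HTTP client can POST queries to the endpoint. The sketch below assumes a local dev server at `http://localhost:8000` and that JWT auth is enabled; the `allRoutes { id }` selection is purely illustrative, so check `schema.graphql` and `examples.graphql` for the real field names.

```python
# Minimal sketch: POST a GraphQL query with optional JWT auth.
# The host, the example query, and the exposed tokenAuth mutation are
# assumptions; consult schema.graphql / examples.graphql for actual fields.
import requests

ENDPOINT = 'http://localhost:8000/graphql'  # assumed local dev server

# Obtain a token (only needed if JWT auth is enabled); django-graphql-jwt's
# standard ObtainJSONWebToken mutation is assumed to be exposed as tokenAuth.
auth_resp = requests.post(ENDPOINT, json={
    'query': 'mutation ($u: String!, $p: String!) { tokenAuth(username: $u, password: $p) { token } }',
    'variables': {'u': 'me', 'p': 'secret'},
})
token = auth_resp.json()['data']['tokenAuth']['token']

# Run a query; the selection set here is purely illustrative.
resp = requests.post(
    ENDPOINT,
    json={'query': '{ allRoutes { id } }'},
    headers={'Authorization': f'JWT {token}'},  # django-graphql-jwt's default header prefix
)
print(resp.json())
```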
Scraping scripts are integrated with `django-extensions` and should be run through `./manage.py runscript foo`. To initialize the database with all data, run `./manage.py runscript init_scraped_data`. The individual scripts are listed below, followed by a sketch of running them programmatically.
- Routes, terminals, cities, regions: `./manage.py runscript scrape_routes`
- Ferries, services, amenities: `./manage.py runscript scrape_fleet`
- Sailings, scheduled sailings, en-route stops & transfers: `./manage.py runscript scrape_schedule`
- Current sailings / conditions: `./manage.py runscript scrape_current_conditions`
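If the scripts need to be triggered from Python rather than the shell (e.g. in a deployment hook), `runscript` is an ordinary management command, so something like the following sketch works; the loop simply mirrors the CLI calls above.

```python
# Sketch: invoke the scrape scripts via Django's call_command instead of the CLI.
# Assumes DJANGO_SETTINGS_MODULE points at this project's settings module.
import django
from django.core.management import call_command

django.setup()
for script in ('scrape_routes', 'scrape_fleet', 'scrape_schedule', 'scrape_current_conditions'):
    call_command('runscript', script)
```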
Scraping should not be resource-intensive, but by default there is a 10-second delay between HTTP requests to BC Ferries, to abide by its robots.txt. This can be changed via `SCRAPER['PAUSE_SECS']` in settings (shown below).
Scraping operations can also be run as asynchronous tasks via Celery, with Redis as the broker and result backend. By default these are scheduled in Celery Beat and run no more than weekly, with the exception of current-conditions scraping, which runs every 10 minutes. Keep in mind that these tasks are most effective after the other data (e.g. routes and ferries) has been initialized.
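As a rough illustration (not the project's actual task names), a Beat schedule with that cadence could look like the sketch below, assuming the usual `CELERY_`-namespaced Django settings integration; the `scrapers.tasks.*` paths are hypothetical.

```python
# Illustrative sketch of a Celery Beat schedule matching the cadence described
# above. The task paths ('scrapers.tasks.*') are hypothetical placeholders.
from datetime import timedelta
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'scrape-current-conditions': {
        'task': 'scrapers.tasks.scrape_current_conditions',  # hypothetical path
        'schedule': timedelta(minutes=10),                    # every 10 minutes
    },
    'scrape-schedule': {
        'task': 'scrapers.tasks.scrape_schedule',             # hypothetical path
        'schedule': crontab(minute=0, hour=3, day_of_week='mon'),  # weekly
    },
}
```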
Scraper behaviour is configured through the `SCRAPER` dict in settings:

```python
import logging

from datedelta import datedelta

SCRAPER = {
    'PARSER': 'html5lib',  # used by bs4 (BeautifulSoup)
    'PAUSE_SECS': 0 if DEBUG else 10,  # see http://bcferries.com/robots.txt
    'FALLBACK_DATEDELTA': datedelta(months=3),  # how far into the future to create schedules for, if not automatically determined
    'FLEET_PAGE_RANGE': 2,  # the fleet listing ('FLEET' below) is paginated
    'LOG_LEVEL': logging.DEBUG,
    'INIT_SCRIPTS': [  # used by the init_scraped_data script
        'save_sitemap',
        'scrape_routes',
        'scrape_fleet',
        'scrape_schedule',
        'scrape_current_conditions',
    ],
    'URL_PREFIX': 'https://www.bcferries.com',
    'URL_PATHS': {
        'SCHEDULES': '/routes-fares/schedules',
        'CONDITIONS': '/current-conditions',
        'DEPARTURES': '/current-conditions/departures',
        'ROUTE_CONDITIONS': '/current-conditions/{}',
        'ROUTES': '/route-info',
        'CC_ROUTES': '/cc-route-info',
        'FLEET': '/on-the-ferry/our-fleet?page={}',
        'SCHEDULE_SEASONAL': '/routes-fares/schedules/seasonal/{}',
        'SCHEDULE_DATES': '/getDepartureDates?origin={}&destination={}&selectedMonth={}&selectedYear={}',
        'TERMINAL': '/travel-boarding/terminal-directions-parking-food/{}/{}',
        'SHIP': '/on-the-ferry/our-fleet/{}/{}',
        'MISC_SCHEDULES': [
            '/routes-fares/schedules/southern-gulf-islands',
            # '/routes-fares/schedules/gambier-keats',
        ],
    },
}
```
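Request URLs are presumably built by joining `URL_PREFIX` with a formatted entry from `URL_PATHS`, for example:

```python
# Sketch of how a request URL is assembled from the config above; the exact
# helper the scraper uses for this is not shown here.
url = SCRAPER['URL_PREFIX'] + SCRAPER['URL_PATHS']['FLEET'].format(1)
# -> 'https://www.bcferries.com/on-the-ferry/our-fleet?page=1'
```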
The default lookups exposed for each filterable field type are built from Django's field-lookup names:

```python
import itertools
from datetime import date, datetime, time, timedelta

_RANGE_LOOKUPS = ['exact', 'gt', 'lt', 'gte', 'lte']

def _use_range_lookups(prefix):
    # Prefix each range lookup with a transform, e.g. 'date' -> 'date__gte'.
    return [f'{prefix}__{lookup}' for lookup in _RANGE_LOOKUPS]

def _use_unnested_range_lookups(prefixes):
    # Flatten the range lookups for several transforms into one iterable.
    return itertools.chain.from_iterable(_use_range_lookups(prefix) for prefix in prefixes)

_DATE_LOOKUPS = ['year', 'month', 'day', 'week_day']
_TIME_LOOKUPS = ['hour', 'minute']

DEFAULT_LOOKUPS = {
    str: ['exact', 'iexact', 'regex', 'iregex', 'icontains'],
    bool: ['exact'],
    int: _RANGE_LOOKUPS,
    date: [
        *_RANGE_LOOKUPS,
        *_use_unnested_range_lookups(_DATE_LOOKUPS),
    ],
    time: [
        *_RANGE_LOOKUPS,
        *_use_unnested_range_lookups(_TIME_LOOKUPS),
    ],
    datetime: [
        *_RANGE_LOOKUPS,
        *_use_range_lookups('date'),
        *_use_unnested_range_lookups(_DATE_LOOKUPS),
        *_use_range_lookups('time'),
        *_use_unnested_range_lookups(_TIME_LOOKUPS),
    ],
    timedelta: _RANGE_LOOKUPS,
}
```
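For example, the helpers above expand a transform into the following lookup names:

```python
# What the helpers produce for the 'date' transform and the time transforms:
print(_use_range_lookups('date'))
# ['date__exact', 'date__gt', 'date__lt', 'date__gte', 'date__lte']
print(list(_use_unnested_range_lookups(_TIME_LOOKUPS)))
# ['hour__exact', 'hour__gt', 'hour__lt', 'hour__gte', 'hour__lte',
#  'minute__exact', 'minute__gt', 'minute__lt', 'minute__gte', 'minute__lte']
```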