

Contributor guidelines

This document contains guidelines on contributing to the UKBCD project, including how the project works, how to set up the environment, how we use our issue tracker, and how you can develop more scrapers.

Getting Started

You will need to install Python on the system you plan to run the script from. Python 3.8 and 3.9 are officially supported. Python 3.10 and 3.11 should work, but your mileage may vary.

The project uses poetry to manage dependencies and set up the build environment.

Environment Setup

# Install Poetry
pip install poetry

# Clone the Repo
git clone https://github.com/robbrad/UKBinCollectionData
cd UKBinCollectionData

# Install Dependencies 
poetry install
poetry shell

Project Aims

  • To provide a real-world environment to learn Python and/or web scraping
  • To provide UK bin data in a standardised format for use (albeit not exclusively) with HomeAssistant

What can I contribute to?

  • The majority of project work comes from developing new scrapers for requested councils. These can be found on the issue tracker with council request labels.
  • Tasks that require additional input have the help wanted label - these can be trickier requests or may have many smaller tasks.
  • Easier tasks that would be a good fit for people new to the project or the world of web scraping are labelled with the good first issue label.

Claiming an issue

If there is an existing issue you wish to work on, please do the following things:

  • Assign the issue to yourself (or ask someone to assign you) - that way, others know you're working on it
  • Create a new branch - it's recommended to use the 'create a branch' option on the issue page, create it in your forked repo and then check out the branch locally (or in your IDE).

NB: Exploratory work doesn't require claiming an issue - you only need to claim if you plan on developing the full scraper and associated files. If you just want to explore an issue, feel free to do so - and also feel free to post anything helpful in the issue comments.

Pushing your changes

There are guides below on how to add a scraper to the project, along with what files are needed and what tests should be run. When the time comes to push your changes, please be aware that we use conventional commits to provide a clear summary of what a change does. This means that commit messages should start with one of the following:

  • feat: for a new feature (including a new scraper)
  • fix: for when a bug is fixed or an issue is resolved
  • docs: for when changes to documentation are made
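
For example, a commit that adds a new scraper might be titled "feat: add CheshireEastCouncil scraper", while a bug fix to an existing one might be "fix: correct date parsing in CheshireEastCouncil".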

Don't worry if you forget - commit messages are automatically checked by a lint checker when you open a pull request, and can easily be rectified by recommitting or pushing again with the correct prefix.

Adding a scraper

This project uses a design pattern called the Template Method, which allows for a structured class that can be extended. In our case, getting the data from the council and presenting the JSON remain the same via the abstract class, while the scraping of each council differs - this allows us to have a class for each council. You can see this in action here.
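
As a rough sketch of what that pattern looks like (the class and method names below are illustrative stand-ins rather than the project's actual ones):

from abc import ABC, abstractmethod
import json

# Illustrative only - the real abstract class lives in the project under its own name.
class AbstractBinDataClass(ABC):
    def run(self, url: str, **kwargs) -> str:
        page = self.get_data(url)               # shared step: fetch the council's page
        data = self.parse_data(page, **kwargs)  # council-specific step, overridden below
        return json.dumps(data, indent=4)       # shared step: present the JSON

    def get_data(self, url: str):
        # shared fetching logic would live here (e.g. a requests call)
        ...

    @abstractmethod
    def parse_data(self, page, **kwargs) -> dict:
        ...

# Each council file (e.g. CheshireEastCouncil.py) defines one subclass and only
# implements the parsing step.
class ExampleCouncil(AbstractBinDataClass):
    def parse_data(self, page, **kwargs) -> dict:
        return {"bins": []}  # the scraping for this specific council goes here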

There are a few different options for scraping - for example the requests module, Beautiful Soup or Selenium - and you are free to choose whichever best suits the council.

Developing

To get started, first you will need to fork this repository and set up your own working environment before you can start developing.

Once your environment is ready, create a new branch from your master/main branch, then create a new .py file within the uk_bin_collection/councils directory and use the development mode to generate the input.json entry. The new .py file will be used in the CLI to call the parser, so be sure to pick a sensible name - e.g. CheshireEastCouncil.py is called with:

python collect_data.py CheshireEastCouncil <web-url>

To simplify things somewhat, a template file has been created - open this file, copy the contents to your new .py file and start from there. You are pretty much free to approach the scraping however you would like, but please ensure that:

  • Your scraper returns a dictionary made up of the key "bins" and a value that is a list of bin types and collection dates. An example of this can be seen below.
  • Any dates or times are formatted to standard UK formats (see below)
Output Example
{
    "bins": [
        {
            "type": "Empty Standard Mixed Recycling",
            "collectionDate": "29/07/2022"
        },
        {
            "type": "Empty Standard Garden Waste",
            "collectionDate": "29/07/2022"
        },
        {
            "type": "Empty Standard General Waste",
            "collectionDate": "05/08/2022"
        }
    ]
}
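
In the scraper itself, that structure is just a Python dictionary. A minimal sketch of assembling it (the scraping step is omitted, and the values are taken from the example above):

from datetime import datetime

data = {"bins": []}

# each collection found on the council's site becomes one entry in the "bins" list
collection_date = datetime(2022, 7, 29)
data["bins"].append(
    {
        "type": "Empty Standard Mixed Recycling",
        "collectionDate": collection_date.strftime("%d/%m/%Y"),  # standard UK date format
    }
)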

Kwargs

UKBCD has two mandatory parameters when it runs - the name of the parser (sans .py) and the URL from which to scrape. However, developers can also get the following data using kwargs:

| Parameter | Prompt | Notes | kwargs.get |
| --- | --- | --- | --- |
| UPRN (Unique Property Reference Number) | -u or --uprn | | kwargs.get('uprn') |
| USRN (Unique Street Reference Number) | -us or --usrn | | kwargs.get('usrn') |
| House number | -n or --number | Sometimes called PAON | kwargs.get('paon') |
| Postcode | -p or --postcode | Needs to be wrapped in quotes on the CLI | kwargs.get('postcode') |
| Skip Get URL | -s or --skip_get_url | | kwargs.get('skip_get_url') |
| URL for remote Selenium web driver | -w or --web_driver | Needs to be wrapped in quotes on the CLI | kwargs.get('web_driver') |
| Development Mode | -d or --dev_mode | Create/update council's entry in the input.json on each run | kwargs.get('dev_mode') |

These parameters are useful if you're using something like the requests module and need to pass additional user information into the request, for example:

python collect_data.py LeedsCityCouncil https://www.leeds.gov.uk/residents/bins-and-recycling/check-your-bin-day -p "LS1 2JG" -n 41

In the scraper, the following code takes the input parameters and assigns them to two variables:

user_postcode = kwargs.get("postcode")
user_paon = kwargs.get("paon")

Each parameter also has its own validation method that should be called after the kwargs.get:

  • check_uprn()
  • check_paon()
  • check_postcode()

The first two are simple validators - if the parameter is used but no value is given, they will throw an exception. check_postcode() works differently - instead, it makes a call to the postcodes.io API to check whether the postcode exists. An exception will only be thrown here if the response code is not HTTP 200.
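
Continuing the snippet above, here is a sketch of how the validators might be called (the import path and the validators' signatures are assumptions - they live in common.py and are presumed to take the retrieved value as their argument):

# assumed import path - the validators are provided by common.py
from uk_bin_collection.uk_bin_collection.common import check_paon, check_postcode

user_postcode = kwargs.get("postcode")
check_postcode(user_postcode)  # calls postcodes.io; throws unless the response is HTTP 200

user_paon = kwargs.get("paon")
check_paon(user_paon)          # throws if the parameter was used but no value was given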

Common Functions

The project has a small but growing library of functions (and the occasional variable) that are useful when scraping websites or calendars - aptly named common.py.

common.py also contains a standardised date format variable called date_format, which is useful to call when formatting datetimes.

Please feel free to contribute to this library as you see fit - added functions should include the following:

  • a clear, lowercase and underscored name
  • parameter types
  • a return type (if there is one)
  • a docstring describing what the function does, as well as parameter and return type descriptions.
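
For example, a new addition following those conventions might look like this (a purely illustrative function rather than one that already exists in common.py; it assumes date_format holds the day/month/year pattern shown in the output example above):

from datetime import datetime

date_format = "%d/%m/%Y"  # in common.py this is the shared UK date format variable

def convert_date_format(date_string: str, source_format: str) -> str:
    """Convert a date string from a council-specific format into the standard UK format.

    :param date_string: the date as it appears on the council's site
    :param source_format: the strptime pattern describing how the council presents it
    :return: the same date formatted with date_format, e.g. "29/07/2022"
    """
    return datetime.strptime(date_string, source_format).strftime(date_format)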

Additional files

In order for your scraper to work with the project's testing suite, some additional files need to be provided or modified:

Note: from here on, anything containing <council_name> should be replaced with the scraper's name.

Input JSON file

| Type | File location |
| --- | --- |
| Modify | UKBinCollectionData/uk_bin_collection/tests/input.json |

Each council should have a node that matches the scraper's name. The node should include arguments in curly braces - the URL is mandatory, but any additional parameters like UPRN or postcode should also be provided. Councils should be listed in alphabetical order.

A "wiki_name" argument with the council's full name should also be provided.

A "wiki_note" argument should be used where non-standard instructions of just providing UPRN/Postcode/House Number parameters are needed.

A "wiki_command_url_override" argument should be used where parts of the URL need to be replaced by the user to allow a valid URL to be left for the integration tests.

A new Wiki entry will be generated automatically from this file's details.

Note: If you want the integration test to work you must supply real, working data (a business address is recommended - the council's address is usually a good one).

Example
    "CheshireEastCouncil": {
        "uprn": "100012791226",
        "url": "https://online.cheshireeast.gov.uk/MyCollectionDay/SearchByAjax/GetBartecJobList?uprn=100012791226&onelineaddress=3%20COBBLERS%20YARD,%20SK9%207DZ&_=1621149987573",
        "wiki_name": "Cheshire East Council",
        "wiki_command_url_override": "https://online.cheshireeast.gov.uk/MyCollectionDay/SearchByAjax/GetBartecJobList?uprn=XXXXXXXX&onelineaddress=XXXXXXXX&_=1621149987573",
        "wiki_note": "Both the UPRN and a one-line address are passed in the URL, which needs to be wrapped in double quotes. The one-line address is made up of the house number, street name and postcode.\nUse the form [here](https://online.cheshireeast.gov.uk/mycollectionday/) to find them, then take the first line and post code and replace all spaces with `%20`."
    },

Feature file

| Type | File location |
| --- | --- |
| Modify | UKBinCollectionData/uk_bin_collection/tests/features/validate_council_outputs.feature |

The council's name should be added to the feature file's example list. These names are alphabetically sorted, although council should always remain on line 10. The name should be wrapped in pipes.
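
For example, adding the Cheshire East Council scraper would mean inserting a | CheshireEastCouncil | row into the example list, keeping the list in alphabetical order.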

Testing

Behave (Integration Testing)

As with any web scraping project, there's a reliance on the council not changing their website - if this happens, Beautiful Soup will fail to read the site correctly and the expected data will not be returned. To mitigate this and stay on top of "what works and what needs work", we have created a set of integration tests which run a feature file.

Based on the input.json, this does an actual live run against the council's site and validates if the returned data is JSON and conforms to the common format JSON Schema.

Running the Behave tests

cd UKBinCollectionData
poetry shell
poetry run pytest uk_bin_collection/tests/step_defs/ -n logical

GitHub Actions Integration Tests

The GitHub Actions workflow is set to run on push and pull_request events.

It uses a Makefile to run the Behave tests to ensure the councils are all still working.

Test Results

Allure Report

GitHub Actions publishes the Allure Behave test results to GitHub Pages: https://robbrad.github.io/UKBinCollectionData/<python_version>/ (e.g. https://robbrad.github.io/UKBinCollectionData/3.9/). You can check this to see if a council is still working as expected.

CodeCov Report

The CodeCov.io report can be found here.

Pytest (Unit Testing)

As well as integration testing, the repo is set up to unit test some of the static methods to ensure basic core functionality.
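
As a purely illustrative sketch of what such a test looks like (the helper below is hypothetical and only stands in for one of the project's real static methods):

from datetime import datetime

def format_collection_date(date: datetime) -> str:
    # hypothetical helper standing in for one of the project's static methods
    return date.strftime("%d/%m/%Y")

def test_format_collection_date():
    # pytest picks up functions prefixed with test_ and checks the assertion
    assert format_collection_date(datetime(2022, 7, 29)) == "29/07/2022"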

Running the Unittests

cd UKBinCollectionData
poetry shell
poetry run coverage run --omit "*/tests/*" -m pytest uk_bin_collection/tests --ignore=uk_bin_collection/tests/step_defs/
poetry run coverage xml

Contact info

If you have questions or comments, you can reach the project contributors in the following ways:

  • Council requests can be submitted here
  • General questions or comments can be submitted here