Coding Guidelines

Emmanuel Awa edited this page Aug 12, 2019 · 11 revisions

Here are some coding guidelines that we have adopted in this repo.

Test Driven Development

We use Test Driven Development (TDD) in our development. All contributions to the repository should have unit tests. We use pytest for Python files and papermill for notebooks.

Apart from unit tests, we also have nightly builds with smoke and integration tests. For more information about the differences, see a quick introduction to unit, smoke and integration tests.

You can find a guide on how to manually execute all the tests in TESTS.md.
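
As a minimal sketch of what a pytest unit test could look like, assume a hypothetical helper `normalize_ratings` (it is not part of this repo and exists only for illustration). pytest collects functions whose names start with `test_` and reports any failing bare `assert`:

```python
def normalize_ratings(ratings, max_rating=5.0):
    """Hypothetical helper: scale ratings into the range [0, 1]."""
    return [r / max_rating for r in ratings]


def test_normalize_ratings_scales_values():
    # The largest rating maps to 1.0, the midpoint to 0.5
    assert normalize_ratings([0.0, 2.5, 5.0]) == [0.0, 0.5, 1.0]


def test_normalize_ratings_handles_empty_input():
    # An empty input should produce an empty output, not an error
    assert normalize_ratings([]) == []
```

Saved in a file such as `test_normalize.py` (an assumed name), these tests would run with the `pytest` command.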

Do not Repeat Yourself

Follow the Don't Repeat Yourself (DRY) principle by refactoring common code into shared functions.
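
The sketch below illustrates the idea with hypothetical function names that do not come from this repo. The same filtering logic first appears twice, then is refactored into a single function with a named parameter:

```python
# Before: the same logic is duplicated for two datasets.
def count_positive_train(train_ratings):
    return len([r for r in train_ratings if r >= 4])


def count_positive_test(test_ratings):
    return len([r for r in test_ratings if r >= 4])


# After: the shared logic lives in one place, and the magic number
# becomes an explicit, overridable parameter.
def count_positive(ratings, threshold=4):
    """Count ratings at or above the threshold."""
    return sum(1 for r in ratings if r >= threshold)
```

A fix or change to the counting rule now has to be made only once.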

Single Responsibility

Single responsibility is one of the SOLID principles. It states that each module or function should be responsible for a single part of the functionality.

Without single responsibility:

def train_and_test(train_set, test_set):
    # code for training on train_set
    # code for testing on test_set
    ...

With single responsibility:

def train(train_set):
    # code for training on train_set
    ...

def test(test_set):
    # code for testing on test_set
    ...

Python and Docstrings Style

We use the automatic style formatter Black. See the installation guide for VSCode and PyCharm.

We use Google style for formatting the docstrings.
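
As a sketch of the Google docstring layout, here is a hypothetical function (`top_k_precision` is an illustrative name, not this repo's API) documented with `Args`, `Returns` and `Raises` sections:

```python
def top_k_precision(recommended, relevant, k=10):
    """Compute the precision of the top-k recommendations for one user.

    Args:
        recommended (list): Recommended item ids, ordered by rank.
        relevant (set): Item ids the user actually interacted with.
        k (int): Number of top recommendations to consider.

    Returns:
        float: Number of relevant items among the top-k recommendations,
        divided by k.

    Raises:
        ValueError: If k is not a positive integer.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k
```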

Python PEP 8 Style Guide for Python Code

The creators of the main Python distribution (Guido van Rossum et al.) have documented best practices for coding conventions in PEP 8, which we have adopted for this repository. It covers naming conventions, indentation, block versus inline comments, trailing commas, and more.
See PEP 8 for the full guide.

The Zen of Python

We follow the Zen of Python when developing general Python code. For PySpark code, please see note (1) at the end of this section.

Beautiful is better than ugly.
Explicit is better than implicit. 
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those! 

An illustration of "explicit is better than implicit" with a read function:

# Implicit
def read(filename):
    # code for reading a csv or json
    # depending on the file extension
    ...

# Explicit
def read_csv(filename):
    # code for reading a csv
    ...

def read_json(filename):
    # code for reading a json
    ...

(1) Note regarding PySpark development: PySpark software design is highly influenced by Java. Therefore, to follow industry standards and adapt our code to our users' preferences, we don't strictly follow the Zen of Python when developing in PySpark.

Evidence-Based Software Design

When using Evidence-Based Design (EBD), software is developed based on customer inputs, standard libraries in the industry or credible research. For a detailed explanation, see this post about EBD.

When designing the interfaces of the evaluation metrics in Python, we decided to use functions instead of classes, following industry standards like scikit-learn and TensorFlow. See our implementation of Python metrics.
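
To make the function-based design concrete, here is a minimal sketch of a metric written as a plain function that takes `(y_true, y_pred)`, the argument order scikit-learn uses for its metric functions. The `rmse` below is a pure-Python illustration under that assumption, not this repo's implementation:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Usage: a single stateless call, with no object to construct first.
score = rmse([3.0, 4.0, 5.0], [3.0, 4.0, 3.0])
```

A function keeps the interface small: there is no state to manage, and the metric can be passed around like any other callable.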

You are not going to need it

The You aren't going to need it (YAGNI) principle states that we should implement functionality only when we need it, not when we foresee that we might need it.

  • Question: should we start developing computer vision capabilities for the Recommenders project now?
  • Answer: No, we will wait until we see demand for these capabilities.

Minimum Viable Product

We work through Minimum Viable Products (MVPs), which are our milestones. An MVP is the version of a new product that allows a team to collect the maximum amount of validated learning about customers with the least effort. More information about MVPs can be found in the Lean Startup methodology.

Publish Often Publish Early

Even before we have an MVP, get the code base working and doing something, even if it is something trivial that everyone can "run" easily.

Between MVPs, we make sure that all the code that goes into the staging or master branches passes the tests.

If it is going to fail, let it fail fast

Make sure that our code has sanity checks for all input parameters. These checks validate that the data falls within the functional bounds, so that if the code is going to fail, it fails as soon as possible.

Function with no checks:

def division(a, b, c):
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c  # raises a ZeroDivisionError if c == 0, but only after the work above

Function with checks:

def division(a, b, c):
    # Raise early if c == 0 so we don't compute the subsequent functions
    if c == 0:
        raise ValueError("c can't be 0")
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c
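
The same pattern extends to checking several parameters against their functional bounds. The following is a self-contained sketch, assuming a hypothetical `mean_rating` function and a 1 to 5 rating scale (neither comes from this repo):

```python
def mean_rating(ratings, min_rating=1.0, max_rating=5.0):
    """Average a list of ratings, validating all inputs before any work."""
    # All checks come first, so invalid input is rejected immediately.
    if not ratings:
        raise ValueError("ratings must not be empty")
    if min_rating >= max_rating:
        raise ValueError("min_rating must be smaller than max_rating")
    out_of_bounds = [r for r in ratings if not min_rating <= r <= max_rating]
    if out_of_bounds:
        raise ValueError(
            f"ratings outside [{min_rating}, {max_rating}]: {out_of_bounds}"
        )
    return sum(ratings) / len(ratings)
```

Each error message names the offending input, so the caller learns what to fix at the point of failure rather than from a later, less specific exception.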