Coding Guidelines
Here are some coding guidelines that we have adopted in this repo.
- Test Driven Development
- Do not Repeat Yourself
- Single Responsibility
- Python and Docstrings Style
- The Zen of Python
- Evidence-Based Software Design
- You are not going to need it
- Minimum Viable Product
- Publish Often Publish Early
- User feedback before making a release
We use Test Driven Development (TDD) in our development. All contributions to the repository should have unit tests: we use pytest for Python files and papermill for notebooks.
Apart from unit tests, we also have nightly builds with smoke and integration tests. For more information about the differences, see a quick introduction to unit, smoke and integration tests.
You can find a guide on how to manually execute all the tests in TESTS.md.
Click here to see some examples
- Basic asserts with fixtures comparing structures like lists, dictionaries, numpy arrays and pandas dataframes.
- Basic use of common fixtures defined in a conftest file.
- Python unit tests for our evaluation metrics.
- Notebook unit tests for our PySpark notebooks.
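As a rough illustration of the kind of unit test described above, here is a minimal sketch in the style pytest discovers (functions named `test_*` with plain asserts). The `precision_at_k` helper and its signature are hypothetical and only serve the example; they are not the repository's actual metric implementation.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant.

    Hypothetical helper used only to have something to test.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def test_precision_at_k_counts_hits():
    # Only "a" is relevant among the first two recommendations.
    assert precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=2) == 0.5


def test_precision_at_k_rejects_non_positive_k():
    # The function should fail loudly on an invalid k.
    try:
        precision_at_k(["a"], {"a"}, k=0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for k=0")
```

With TDD, tests like these would be written first and the function implemented until they pass.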
Follow the Don't Repeat Yourself (DRY) principle by refactoring common code.
Click here to see some examples
- See how we are using DRY when testing our notebooks.
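A small sketch of what refactoring common code can look like. All function names here are hypothetical: two callers that would otherwise each repeat the same min-max scaling logic share one helper instead.

```python
def _min_max_scale(values):
    """Scale a non-empty list of numbers to the range [0, 1] (shared helper)."""
    low, high = min(values), max(values)
    if high == low:
        # All values are equal, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def scale_ratings(ratings):
    # Reuses the shared helper instead of duplicating the scaling code.
    return _min_max_scale(ratings)


def scale_timestamps(timestamps):
    # Same logic, same helper: a fix in one place benefits both callers.
    return _min_max_scale(timestamps)
```

For example, `scale_ratings([1, 3, 5])` returns `[0.0, 0.5, 1.0]`.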
Single responsibility is one of the SOLID principles. It states that each module or function should have responsibility over a single part of the functionality.
Click here to see some examples
Without single responsibility:
def train_and_test(train_set, test_set):
    # code for training on train_set
    # code for testing on test_set
    ...
With single responsibility:
def train(train_set):
    # code for training on train_set
    ...

def test(test_set):
    # code for testing on test_set
    ...
We use the automatic style formatter Black. See the installation guide for VSCode and PyCharm.
We use Google style for formatting the docstrings.
Click here to see some examples
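A minimal sketch of a Google-style docstring, formatted the way Black would leave it. The `clip_rating` function is a hypothetical example and not taken from the repository.

```python
def clip_rating(rating, low=1.0, high=5.0):
    """Clip a rating to a closed interval.

    Args:
        rating (float): Raw rating value.
        low (float): Smallest allowed rating.
        high (float): Largest allowed rating.

    Returns:
        float: The rating, limited to the range [low, high].

    Raises:
        ValueError: If low is greater than high.
    """
    if low > high:
        raise ValueError("low must not be greater than high")
    return max(low, min(high, rating))
```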
We follow the Zen of Python when developing general Python code. For PySpark code, please see note (1) at the end of this section.
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Click here to see some examples
Implementation of explicit is better than implicit with a read function:
# Implicit
def read(filename):
    # code for reading a csv or json
    # depending on the file extension
    ...

# Explicit
def read_csv(filename):
    # code for reading a csv
    ...

def read_json(filename):
    # code for reading a json
    ...
(1) Note regarding PySpark development: PySpark software design is highly influenced by Java. Therefore, in order to follow the industry standards and adapt our code to our users' preferences, we don't strictly follow the Zen of Python when developing in PySpark.
When using Evidence-Based Design (EBD), software is developed based on customer inputs, standard libraries in the industry or credible research. For a detailed explanation, see this post about EBD.
Click here to see some examples
When designing the interfaces of the evaluation metrics in Python, we decided to use functions instead of classes, following industry standards such as scikit-learn and TensorFlow. See our implementation of Python metrics.
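To make the function-based design concrete, here is a minimal sketch of a metric written as a plain function with a `(y_true, y_pred)` argument order, the convention scikit-learn uses for its metrics. The implementation below is only an illustration in pure Python and is not the repository's code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

A caller only needs the function and the data, for example `rmse([1, 2, 3], [1, 2, 3])`, with no object to construct or state to manage.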
The You Aren't Going to Need It (YAGNI) principle states that we should implement functionality only when we need it, not when we foresee that we might need it.
Click here to see some examples
- Question: Should we start developing computer vision capabilities for the Recommenders project now?
- Answer: No, we will wait until we see demand for these capabilities.
We work through Minimum Viable Products (MVP), which are our milestones. An MVP is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort. More information about MVPs can be found in the Lean Startup methodology.
Click here to see some examples
- Initial MVP of our repo with basic functionality.
- Second MVP to give early access to selected users and customers.
Even before we have an MVP, get the code base working and doing something, even if it is something trivial that everyone can "run" easily.
Click here to see some examples
We make sure that, in between MVPs, all the code that goes to the staging or master branches passes the tests.
A product cycle is not finished until we have received feedback from a user, made changes based on that feedback, and all the tests are passing.
Click here to see some examples
- See our branch merging strategy.
Make sure that our code checks its inputs and other intermediate variables so that, if the code is going to fail, it fails as soon as possible.
Click here to see some examples
Function with no checkers:
def division(a, b, c):
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c  # this will fail and raise a ZeroDivisionError if c=0
Function with checkers:
def division(a, b, c):
    # raise early if c=0, so we don't compute the subsequent functions
    if c == 0:
        raise ValueError("c can't be 0")
    d = some_function(a, b)
    e = some_other_function(a, d)
    return e / c