Create main developer guide for Python (#11235)
This PR adds a primary developer guide for Python. It provides a more complete and informative landing page for new developers. When #11217, #11199, and #11122 are merged, they will all be linked from this page to provide a complete set of developer documentation.

There is one main point of discussion that I would like reviewer comments on, and that is the section on directory and file organization. How do we want that aspect of cuDF to look?

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Lawrence Mitchell (https://github.com/wence-)
  - Ashwin Srinath (https://github.com/shwina)

URL: #11235
vyasr authored Aug 4, 2022
1 parent d8c25a1 commit acadcf2
Showing 5 changed files with 140 additions and 11 deletions.
3 changes: 3 additions & 0 deletions docs/cudf/source/conf.py
@@ -60,6 +60,9 @@

html_use_modindex = True

# Enable automatic generation of systematic, namespaced labels for sections
myst_heading_anchors = 2

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

111 changes: 111 additions & 0 deletions docs/cudf/source/developer_guide/contributing_guide.md
@@ -0,0 +1,111 @@
# Contributing Guide

This document focuses on a high-level overview of best practices in cuDF.

## Directory structure and file naming

cuDF generally presents the same importable modules and subpackages as pandas.
All Cython code is contained in `python/cudf/cudf/_lib`.

## Code style

cuDF employs a number of linters to ensure consistent style across the code base.
We manage our linters using [`pre-commit`](https://pre-commit.com/).
We strongly recommend that developers set up `pre-commit` before doing any development.
The `.pre-commit-config.yaml` file at the root of the repo is the primary source of truth for linting.
Specifically, cuDF uses the following tools:

- [`flake8`](https://github.com/pycqa/flake8) checks for general code formatting compliance.
- [`black`](https://github.com/psf/black) is an automatic code formatter.
- [`isort`](https://pycqa.github.io/isort/) ensures imports are sorted consistently.
- [`mypy`](http://mypy-lang.org/) performs static type checking.
In conjunction with [type hints](https://docs.python.org/3/library/typing.html),
`mypy` can help catch various bugs that are otherwise difficult to find.
- [`pydocstyle`](https://github.com/PyCQA/pydocstyle/) lints docstring style.
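
As a minimal sketch of the kind of bug that type hints let `mypy` catch, consider the function below.
The name `column_width` is hypothetical and is not part of cuDF.
Because the parameter is annotated as `Optional[str]`, `mypy` would flag a direct call to `len(name)` as an error, which prompts the explicit `None` check.

```python
from typing import Optional


def column_width(name: Optional[str]) -> int:
    """Return the display width of a column name (hypothetical helper)."""
    # Without this check, mypy reports that `Optional[str]` is not a valid
    # argument to `len`, catching a potential TypeError before runtime.
    if name is None:
        return 0
    return len(name)


print(column_width("price"))
print(column_width(None))
```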

Linter configuration is stored in a number of files.
We generally prefer `pyproject.toml` over `setup.cfg`, and root-level files over project-specific ones (e.g. `setup.cfg` over `python/cudf/setup.cfg`).
However, differences between tools and the different packages in the repo result in the following caveats:

- `flake8` has no plans to support `pyproject.toml`, so its configuration must live in `setup.cfg`.
- `isort` must be configured per project to set which project is the "first party" project.

Additionally, our use of `versioneer` means that each project must have a `setup.cfg`.
As a result, we currently maintain both root and project-level `pyproject.toml` and `setup.cfg` files.

For more information on how to use pre-commit hooks, see the code formatting section of the
[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md#python--pre-commit-hooks).

## Deprecating and removing code

cuDF follows the policy of deprecating code for one release prior to removal.
For example, if we decide to remove an API during the 22.08 release cycle,
it will be marked as deprecated in the 22.08 release and removed in the 22.10 release.
All internal usage of deprecated APIs in cuDF should be removed when the API is deprecated.
This prevents users from encountering unexpected deprecation warnings when using other (non-deprecated) APIs.
The documentation for the API should also be updated to reflect its deprecation.
When the time comes to remove a deprecated API, make sure to also remove all of its tests and documentation.

Deprecation messages should:
- emit a FutureWarning;
- consist of a single line with no newline characters;
- indicate replacement APIs, if any exist
(deprecation messages are an opportunity to show users better ways to do things);
- not specify a version when removal will occur (this gives us more flexibility).

For example:
```python
import warnings

warnings.warn(
    "`Series.foo` is deprecated and will be removed in a future version of cudf. "
    "Use `Series.new_foo` instead.",
    FutureWarning,
)
```

```{warning}
Deprecations should be signaled using a `FutureWarning`, **not a `DeprecationWarning`**!
`DeprecationWarning` is hidden by default except in code run in the `__main__` module.
```
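
The rules above can be checked mechanically.
The following is a minimal sketch, not cuDF code: the `Series` class here is a hypothetical stand-in with a deprecated `foo` method that forwards to `new_foo`.
It records the warning with the standard library's `warnings.catch_warnings` and verifies the category and the single-line message.

```python
import warnings


class Series:
    """Hypothetical stand-in for illustration; not the real `cudf.Series`."""

    def new_foo(self):
        return "result"

    def foo(self):
        # Single-line message that names the replacement and gives no
        # removal version, as recommended above.
        warnings.warn(
            "`Series.foo` is deprecated and will be removed in a future "
            "version of cudf. Use `Series.new_foo` instead.",
            FutureWarning,
        )
        return self.new_foo()


# Record every warning raised inside the block, regardless of active filters.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = Series().foo()

message = str(caught[0].message)
print(caught[0].category.__name__, "-", message)
```

A test for a deprecated API might assert on `caught` in the same way, so that an accidental switch to `DeprecationWarning` or a multi-line message is noticed.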

## `pandas` compatibility

Maintaining compatibility with the [pandas API](https://pandas.pydata.org/docs/reference/index.html) is a primary goal of cuDF.
Developers should always look at pandas APIs when adding a new feature to cuDF.
When introducing a new cuDF API with a pandas analog, we should match pandas as much as possible.
Since we try to maintain compatibility even with various edge cases (such as null handling),
new pandas releases sometimes require changes that break compatibility with old versions.
As a result, our compatibility target is the latest pandas version.

However, there are occasionally good reasons to deviate from pandas behavior.
The most common reasons center around performance.
Some APIs cannot match pandas behavior exactly without incurring exorbitant runtime costs.
Others may require using additional memory, which is always at a premium in GPU workflows.
If you are developing a feature and believe that perfect pandas compatibility is infeasible or undesirable,
you should consult with other members of the team to assess how to proceed.

When such a deviation from pandas behavior is necessary, it should be documented.
For more information on how to do that, see [our documentation on pandas comparison](./documentation.md#comparing-to-pandas).

## Python vs Cython

cuDF makes substantial use of [Cython](https://cython.org/).
Cython is a powerful tool, but it is less user-friendly than pure Python.
It is also more difficult to debug or profile.
Therefore, developers should generally prefer Python code over Cython where possible.

The primary use-case for Cython in cuDF is to expose libcudf C++ APIs to Python.
This Cython usage is generally composed of two parts:
1. A `pxd` file declaring C++ APIs so that they may be used in Cython, and
2. A `pyx` file containing Cython functions that wrap those C++ APIs so that they can be called from Python.

The latter wrappers should generally be kept as thin as possible to minimize Cython usage.
For more information see [our Cython layer design documentation](./library_design.md#the-cython-layer).

In some cases we may benefit from writing pure Cython code to speed up particular code paths.
Because most numerical computation in cuDF happens in libcudf, however,
such use cases are rare.
Any attempt to write pure Cython code for this purpose should be justified with benchmarks.
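
One way to gather such evidence is a small timing harness built on the standard library's `timeit` module.
The sketch below is illustrative only: `sum_of_squares_loop` and `sum_of_squares_builtin` are hypothetical stand-ins for an existing code path and a proposed replacement, and both are plain Python so that the example runs anywhere.
In practice the candidate would be the compiled Cython function under consideration.

```python
import timeit


def sum_of_squares_loop(values):
    """Hypothetical baseline: the existing pure-Python code path."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_of_squares_builtin(values):
    """Hypothetical candidate: stands in for the proposed replacement."""
    return sum(v * v for v in values)


def benchmark(func, data, repeat=3, number=100):
    """Return the best observed time per call, in seconds."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


data = list(range(1_000))
baseline = benchmark(sum_of_squares_loop, data)
candidate = benchmark(sum_of_squares_builtin, data)
print(f"baseline:  {baseline:.2e} s per call")
print(f"candidate: {candidate:.2e} s per call")
```

Taking the minimum over several repeats reduces noise from other processes; the inputs should be representative of real workloads before any conclusion is drawn.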

## Exception handling

This section is under development; see https://github.com/rapidsai/cudf/pull/7917.
7 changes: 7 additions & 0 deletions docs/cudf/source/developer_guide/documentation.md
@@ -153,6 +153,13 @@ These pages do not conform to any specific style or set of use cases.
However, if you develop any sufficiently complex new features,
consider whether users would benefit from a more complete demonstration of them.

```{note}
We encourage using links between pages.
We enable [MyST auto-generated anchors](https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#auto-generated-header-anchors),
so links should use the appropriately namespaced anchors rather than manually added ones.
```
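
To illustrate how a heading maps to its anchor, the sketch below approximates GitHub-style slug generation.
The function `approximate_anchor` is hypothetical and is not MyST's implementation, which may differ in edge cases; it is only meant to show why a heading such as "Comparing to pandas" is linked as `./documentation.md#comparing-to-pandas`.

```python
import re


def approximate_anchor(heading: str) -> str:
    """Approximate a GitHub-style slug for a heading (illustrative only)."""
    slug = heading.strip().lower()
    # Drop punctuation such as backticks, keeping word characters,
    # hyphens, and spaces.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Spaces become hyphens.
    return slug.replace(" ", "-")


for heading in ["Comparing to pandas", "The Cython layer", "`pandas` compatibility"]:
    print(f"{heading!r} -> #{approximate_anchor(heading)}")
```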

## Building documentation

### Requirements
18 changes: 18 additions & 0 deletions docs/cudf/source/developer_guide/index.md
@@ -1,8 +1,26 @@
# Developer Guide

```{note}
At present, this guide only covers the main cuDF library.
In the future, it may be expanded to also cover dask_cudf, cudf_kafka, and custreamz.
```

cuDF is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.
For more information about the `libcudf` library, a good starting point is the
[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).

This document assumes familiarity with the
[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md).
The goal of this document is to provide more specific guidance for Python developers.
It covers the structure of the Python code and discusses best practices.
Additionally, it includes longer sections on more specific topics like testing and benchmarking.

```{toctree}
:maxdepth: 2
library_design
documentation
options
```
12 changes: 1 addition & 11 deletions docs/cudf/source/developer_guide/library_design.md
@@ -1,14 +1,5 @@
# Library Design

The cuDF library is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.

```{note}
For more information about the `libcudf` library, a good starting point is the
[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).
```

At a high level, cuDF is structured in three layers, each of which serves a distinct purpose:

1. The Frame layer: The user-facing implementation of pandas-like data structures like `DataFrame` and `Series`.
@@ -219,8 +210,7 @@ A `Buffer` constructed from a preexisting device memory allocation (such as a Cu
Conversely, when constructed from a host object,
`Buffer` uses [`rmm.DeviceBuffer`](https://github.com/rapidsai/rmm#devicebuffers) to allocate new memory.
The data is then copied from the host object into the newly allocated device memory.
You can read more about device memory allocation with RMM [here](https://github.com/rapidsai/rmm).

You can read more about [device memory allocation with RMM here](https://github.com/rapidsai/rmm).

## The Cython layer

