Create main developer guide for Python (#11235)

This PR adds a primary developer guide for Python. It provides a more complete and informative landing page for new developers. When #11217, #11199, and #11122 are merged, they will all be linked from this page to provide a complete set of developer documentation.

There is one main point of discussion that I would like reviewer comments on, and that is the section on directory and file organization. How do we want that aspect of cuDF to look?

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)
- Lawrence Mitchell (https://github.com/wence-)
- Ashwin Srinath (https://github.com/shwina)

URL: #11235
rapidsai · Aug 4, 2022 · acadcf2
1 parent d8c25a1
commit acadcf2
Showing 5 changed files with 140 additions and 11 deletions.
diff --git a/docs/cudf/source/conf.py b/docs/cudf/source/conf.py
@@ -60,6 +60,9 @@
 
 html_use_modindex = True
 
+# Enable automatic generation of systematic, namespaced labels for sections
+myst_heading_anchors = 2
+
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ["_templates"]
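As a rough illustration of what `myst_heading_anchors = 2` enables: MyST generates anchors for headings up to the configured depth (here `#` and `##`), with slugs that aim to follow GitHub's conventions. The sketch below approximates that behavior in plain Python. The helper names `github_style_slug` and `heading_anchors` are hypothetical and are not part of MyST or cuDF. The sketch also ignores details the real implementation handles, such as de-duplicating repeated headings and skipping headings inside fenced code blocks.

```python
import re


def github_style_slug(heading: str) -> str:
    """Approximate a GitHub-style anchor slug for a heading (illustrative only)."""
    slug = heading.strip().lower()
    # Drop punctuation, keeping word characters, whitespace, and hyphens.
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Replace each whitespace character with a hyphen.
    return re.sub(r"\s", "-", slug)


def heading_anchors(markdown_text: str, max_level: int = 2) -> list:
    """Collect slugs for Markdown headings at or above max_level."""
    anchors = []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        # Only headings with at most max_level leading '#' get an anchor.
        if match and len(match.group(1)) <= max_level:
            anchors.append(github_style_slug(match.group(2)))
    return anchors


sample = "# Library Design\n\n## The Cython layer\n\n### Requirements\n"
anchors = heading_anchors(sample)
print(anchors)  # ['library-design', 'the-cython-layer']
```

Under these assumptions, a level-3 heading such as `### Requirements` gets no anchor, so cross-page links like `./library_design.md#the-cython-layer` should point at level-1 or level-2 headings.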
 

diff --git a/docs/cudf/source/developer_guide/contributing_guide.md b/docs/cudf/source/developer_guide/contributing_guide.md
@@ -0,0 +1,111 @@
+# Contributing Guide
+
+This document focuses on a high-level overview of best practices in cuDF.
+
+## Directory structure and file naming
+
+cuDF generally presents the same importable modules and subpackages as pandas.
+All Cython code is contained in `python/cudf/cudf/_lib`.
+
+## Code style
+
+cuDF employs a number of linters to ensure consistent style across the code base.
+We manage our linters using [`pre-commit`](https://pre-commit.com/).
+Developers are strongly encouraged to set up `pre-commit` prior to any development.
+The `.pre-commit-config.yaml` file at the root of the repo is the primary source of truth for linting.
+Specifically, cuDF uses the following tools:
+
+- [`flake8`](https://github.com/pycqa/flake8) checks for general code formatting compliance. 
+- [`black`](https://github.com/psf/black) is an automatic code formatter.
+- [`isort`](https://pycqa.github.io/isort/) ensures imports are sorted consistently.
+- [`mypy`](http://mypy-lang.org/) performs static type checking.
+  In conjunction with [type hints](https://docs.python.org/3/library/typing.html),
+  `mypy` can help catch various bugs that are otherwise difficult to find.
+- [`pydocstyle`](https://github.com/PyCQA/pydocstyle/) lints docstring style.
+
+Linter config data is stored in a number of files.
+We generally prefer `pyproject.toml` over `setup.cfg`, and root-level files over project-specific ones (e.g. the root `setup.cfg` over `python/cudf/setup.cfg`).
+However, differences between tools and the different packages in the repo result in the following caveats:
+
+- `flake8` has no plans to support `pyproject.toml`, so its configuration must live in `setup.cfg`.
+- `isort` must be configured per project to set which project is the "first party" project.
+
+Additionally, our use of `versioneer` means that each project must have a `setup.cfg`.
+As a result, we currently maintain both root and project-level `pyproject.toml` and `setup.cfg` files.
+
+For more information on how to use pre-commit hooks, see the code formatting section of the
+[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md#python--pre-commit-hooks).
+
+## Deprecating and removing code
+
+cuDF follows the policy of deprecating code for one release prior to removal.
+For example, if we decide to remove an API during the 22.08 release cycle,
+it will be marked as deprecated in the 22.08 release and removed in the 22.10 release.
+All internal usage of deprecated APIs in cuDF should be removed when the API is deprecated.
+This prevents users from encountering unexpected deprecation warnings when using other (non-deprecated) APIs.
+The documentation for the API should also be updated to reflect its deprecation.
+When the time comes to remove a deprecated API, make sure to remove all associated tests and documentation.
+
+Deprecation messages should:
+- emit a `FutureWarning`;
+- consist of a single line with no newline characters;
+- indicate replacement APIs, if any exist
+  (deprecation messages are an opportunity to show users better ways to do things);
+- not specify a version when removal will occur (this gives us more flexibility).
+
+For example:
+```python
+import warnings
+
+warnings.warn(
+    "`Series.foo` is deprecated and will be removed in a future version of cudf. "
+    "Use `Series.new_foo` instead.",
+    FutureWarning
+)
+```
+
+```{warning}
+Deprecations should be signaled using a `FutureWarning`, **not a `DeprecationWarning`**!
+`DeprecationWarning` is hidden by default except in code run in the `__main__` module.
+```
+
+## `pandas` compatibility
+
+Maintaining compatibility with the [pandas API](https://pandas.pydata.org/docs/reference/index.html) is a primary goal of cuDF.
+Developers should always look at pandas APIs when adding a new feature to cuDF.
+When introducing a new cuDF API with a pandas analog, we should match pandas as much as possible.
+Since we try to maintain compatibility even with various edge cases (such as null handling),
+new pandas releases sometimes require changes that break compatibility with old versions.
+As a result, our compatibility target is the latest pandas version.
+
+However, there are occasionally good reasons to deviate from pandas behavior.
+The most common reasons center around performance.
+Some APIs cannot match pandas behavior exactly without incurring exorbitant runtime costs.
+Others may require using additional memory, which is always at a premium in GPU workflows.
+If you are developing a feature and believe that perfect pandas compatibility is infeasible or undesirable,
+you should consult with other members of the team to assess how to proceed.
+
+When such a deviation from pandas behavior is necessary, it should be documented.
+For more information on how to do that, see [our documentation on pandas comparison](./documentation.md#comparing-to-pandas).
+
+## Python vs Cython
+
+cuDF makes substantial use of [Cython](https://cython.org/).
+Cython is a powerful tool, but it is less user-friendly than pure Python.
+It is also more difficult to debug or profile.
+Therefore, developers should generally prefer Python code over Cython where possible.
+
+The primary use-case for Cython in cuDF is to expose libcudf C++ APIs to Python.
+This Cython usage is generally composed of two parts:
+1. A `pxd` file declaring C++ APIs so that they may be used in Cython, and
+2. A `pyx` file containing Cython functions that wrap those C++ APIs so that they can be called from Python.
+
+The latter wrappers should generally be kept as thin as possible to minimize Cython usage.
+For more information see [our Cython layer design documentation](./library_design.md#the-cython-layer).
+
+In some cases we may benefit from writing pure Cython code to speed up particular code paths.
+Given that most numerical computations in cuDF happen in libcudf, however,
+such use cases are quite rare.
+Any attempt to write pure Cython code for this purpose should be justified with benchmarks.
+
+## Exception handling
+
+This section is under development; see https://github.com/rapidsai/cudf/pull/7917.
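The deprecation-message rules in the contributing guide above can be exercised with a small standard-library sketch. The functions `foo` and `new_foo` are hypothetical stand-ins rather than cuDF APIs, and `call_and_record` is an illustrative helper. The `stacklevel=2` argument is an addition not taken from the guide. It attributes the warning to the caller's line rather than to the deprecated function itself.

```python
import warnings


def new_foo(values):
    """Hypothetical replacement API, used only for illustration."""
    return [v * 2 for v in values]


def foo(values):
    """Hypothetical deprecated API that forwards to its replacement."""
    warnings.warn(
        "`foo` is deprecated and will be removed in a future version of cudf. "
        "Use `new_foo` instead.",
        FutureWarning,
        stacklevel=2,  # assumption: point the warning at the caller
    )
    return new_foo(values)


def call_and_record(func, *args):
    """Call func and return its result plus any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args)
    return result, caught


result, caught = call_and_record(foo, [1, 2, 3])
message = str(caught[0].message)
```

A check along these lines can confirm that a deprecated API emits exactly one `FutureWarning`, that the message is a single line, and that it names the replacement, while the replacement itself stays silent.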
diff --git a/docs/cudf/source/developer_guide/documentation.md b/docs/cudf/source/developer_guide/documentation.md
@@ -153,6 +153,13 @@ These pages do not conform to any specific style or set of use cases.
 However, if you develop any sufficiently complex new features,
 consider whether users would benefit from a more complete demonstration of them.
 
+```{note}
+We encourage using links between pages.
+We enable [MyST auto-generated anchors](https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#auto-generated-header-anchors),
+so links should use the appropriately namespaced anchors rather than manually defined targets.
+
+```
+
 ## Building documentation
 
 ### Requirements

diff --git a/docs/cudf/source/developer_guide/index.md b/docs/cudf/source/developer_guide/index.md
@@ -1,8 +1,26 @@
 # Developer Guide
 
+```{note}
+At present, this guide only covers the main cuDF library.
+In the future, it may be expanded to also cover dask_cudf, cudf_kafka, and custreamz.
+```
+
+cuDF is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
+Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
+Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.
+For more information about the `libcudf` library, a good starting point is the
+[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).
+
+This document assumes familiarity with the
+[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md).
+The goal of this document is to provide more specific guidance for Python developers.
+It covers the structure of the Python code and discusses best practices.
+Additionally, it includes longer sections on more specific topics like testing and benchmarking.
+
 ```{toctree}
 :maxdepth: 2
 
 library_design
 documentation
 options
+```
diff --git a/docs/cudf/source/developer_guide/library_design.md b/docs/cudf/source/developer_guide/library_design.md
@@ -1,14 +1,5 @@
 # Library Design
 
-The cuDF library is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
-Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
-Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.
-
-```{note}
-For more information about the `libcudf` library, a good starting point is the
-[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).
-```
-
 At a high level, cuDF is structured in three layers, each of which serves a distinct purpose:
 
 1. The Frame layer: The user-facing implementation of pandas-like data structures like `DataFrame` and `Series`.
@@ -219,8 +210,7 @@ A `Buffer` constructed from a preexisting device memory allocation (such as a Cu
 Conversely, when constructed from a host object,
 `Buffer` uses [`rmm.DeviceBuffer`](https://github.com/rapidsai/rmm#devicebuffers) to allocate new memory.
 The data is then copied from the host object into the newly allocated device memory.
-You can read more about device memory allocation with RMM [here](https://github.com/rapidsai/rmm).
-
+You can read more about [device memory allocation with RMM here](https://github.com/rapidsai/rmm).
 
 ## The Cython layer