The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready to run Docker container with example notebooks and data, showcasing how you can utilize cuDF.
You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.
You can install and update cuDF using the conda command:
conda install -c numba -c conda-forge -c rapidsai -c defaults cudf=0.2.0
Note: This conda installation only applies to Linux and Python versions 3.5/3.6.
You can create and activate a development environment using the conda command:
conda env create --name cudf --file conda_environments/testing_py35.yml
source activate cudf
For cudf development, use conda—environments/dev_py35.yml
in the above
conda create
command instead.
Support is coming soon, please use conda for the time being.
The following instructions are tested on Linux Ubuntu 16.04 & 18.04, to enable from source builds and development. Other operatings systems may be compatible, but are not currently supported.
Compiler requirements:
g++
5.4cmake
3.12
CUDA/GPU requirements:
- CUDA 9.2+
- NVIDIA driver 396.44+
- Pascal architecture or better
You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
Since cmake
will download and build Apache Arrow (version 0.7.1 or
0.8+) you may need to install Boost C++ (version 1.58+) before running
cmake
:
# Install Boost C++ for Ubuntu 16.04/18.04
$ sudo apt-get install libboost-all-dev
or
# Install Boost C++ for Conda
$ conda install -c conda-forge boost
To install cuDF from source, ensure the dependencies are met and follow the steps below:
- Clone the repository
git clone --recurse-submodules https://github.com/rapidsai/cudf.git
cd cudf
- Create the conda development environment
cudf
as detailed above - Build and install
libgdf
source activate cudf
mkdir -p libgdf/build
cd libgdf/build
cmake .. -DHASH_JOIN=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
make -j install
make copy_python
python setup.py install
- Build and install
cudf
from the root of the repository
cd ../..
python setup.py install
A Dockerfile is provided with a preconfigured conda environment for building and installing cuDF from source based off of the master branch.
- Install nvidia-docker2 for Docker + GPU support
- Verify NVIDIA driver is
396.44
or higher - Ensure CUDA 9.2+ is installed
From cudf project root run the following, to build with defaults:
docker build -t cudf .
After the container is built run the container:
docker run --runtime=nvidia -it cudf bash
Activate the conda environment cudf
to use the newly built cuDF and libgdf libraries:
root@3f689ba9c842:/# source activate cudf
(cudf) root@3f689ba9c842:/# python -c "import cudf"
(cudf) root@3f689ba9c842:/#
Several build arguments are available to customize the build process of the container. These are spcified by using the Docker build-arg flag. Below is a list of the available arguments and their purpose:
Build Argument | Default Value | Other Value(s) | Purpose |
---|---|---|---|
CUDA_VERSION |
9.2 | 10.0 | set CUDA version |
LINUX_VERSION |
ubuntu16.04 | ubuntu18.04 | set Ubuntu version |
CC & CXX |
5 | 7 | set gcc/g++ version; NOTE: gcc7 requires Ubuntu 18.04 |
CUDF_REPO |
This repo | Forks of cuDF | set git URL to use for git clone |
CUDF_BRANCH |
master | Any branch name | set git branch to checkout of CUDF_REPO |
NUMBA_VERSION |
0.40.0 | Not supported | set numba version |
NUMPY_VERSION |
1.14.3 | Not supported | set numpy version |
PANDAS_VERSION |
0.20.3 | Not supported | set pandas version |
PYARROW_VERSION |
0.10.0 | 0.8.0+ | set pyarrow version |
PYTHON_VERSION |
3.5 | 3.6 | set python version |
This project uses py.test
In the source root directory and with the development conda environment activated, run:
py.test --cache-clear --ignore=libgdf
The libgdf
tests require a GPU and CUDA. CUDA can be installed locally or through the conda packages of numba
& cudatoolkit
. For more details on the requirements needed to run these tests see the libgdf README.
libgdf
has two testing frameworks py.test
and GoogleTest:
# Run py.test command inside the /libgdf folder
py.test
# Run GoogleTest command inside the /libgdf/build folder after cmake
make -j test
The RAPIDS suite of open source software libraries aim to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported.