diff --git a/dev/tasks/r/github.devdocs.yml b/dev/tasks/r/github.devdocs.yml index b447dd2144191..17b49ddc4f471 100644 --- a/dev/tasks/r/github.devdocs.yml +++ b/dev/tasks/r/github.devdocs.yml @@ -53,9 +53,9 @@ jobs: DEVDOCS_UBUNTU: {{ "${{contains(matrix.os, 'ubuntu')}}" }} DEVDOCS_WINDOWS: {{ "${{contains(matrix.os, 'windows')}}" }} run: | - # This isn't actually rendering the docs, but will save arrow/r/vignettes/script.sh + # This isn't actually rendering the docs, but will save arrow/r/vignettes/developers/script.sh # which can be sourced to install arrow. - rmarkdown::render("arrow/r/vignettes/developing.Rmd") + rmarkdown::render("arrow/r/vignettes/developers/setup.Rmd") shell: Rscript {0} - name: Install from the devdocs env: @@ -68,9 +68,9 @@ jobs: # from Windows paths (C:\R\bin) to Unix paths (/c/r/bin) echo export PATH=\"$(cygpath --path "$PATH")\" >> ~/.bash_profile # This starts a special mingw64 Git Bash shell - $RTOOLS40_HOME/msys2_shell.cmd -here -mingw64 -no-start -defterm -full-path arrow/r/vignettes/script.sh - else - bash arrow/r/vignettes/script.sh + $RTOOLS40_HOME/msys2_shell.cmd -here -mingw64 -no-start -defterm -full-path arrow/r/vignettes/developers/script.sh + else + bash arrow/r/vignettes/developers/script.sh fi shell: bash - name: Ensure that the Arrow package is loadable and we have the correct one @@ -88,5 +88,5 @@ jobs: uses: actions/upload-artifact@v2 with: name: {{ "devdocs-script_os-${{ matrix.os }}_sysinstall-${{ matrix.system-install }}" }} - path: arrow/r/vignettes/script.sh + path: arrow/r/vignettes/developers/script.sh if: always() diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml index 988717ac14f5e..a76345e1abff8 100644 --- a/r/_pkgdown.yml +++ b/r/_pkgdown.yml @@ -74,6 +74,10 @@ navbar: href: articles/developing.html - text: Developers menu: + - text: Developer Environment Setup + href: articles/developers/setup.html + - text: Common Workflow Tasks + href: articles/developers/workflow.html - text: Debugging href: articles/developers/debugging.html reference: diff --git a/r/vignettes/developers/setup.Rmd b/r/vignettes/developers/setup.Rmd new file mode 100644 index 0000000000000..8fdcb0ac339c3 --- /dev/null +++ b/r/vignettes/developers/setup.Rmd @@ -0,0 +1,471 @@ +# Developer environment setup + +```{r setup-options, include=FALSE} +knitr::opts_chunk$set(error = TRUE, eval = FALSE) +# Get environment variables describing what to evaluate +run <- tolower(Sys.getenv("RUN_DEVDOCS", "false")) == "true" +macos <- tolower(Sys.getenv("DEVDOCS_MACOS", "false")) == "true" +ubuntu <- tolower(Sys.getenv("DEVDOCS_UBUNTU", "false")) == "true" +windows <- tolower(Sys.getenv("DEVDOCS_WINDOWS", "false")) == "true" +sys_install <- tolower(Sys.getenv("DEVDOCS_SYSTEM_INSTALL", "false")) == "true" +# Update the source knit_hook to save the chunk (if it is marked to be saved) +knit_hooks_source <- knitr::knit_hooks$get("source") +knitr::knit_hooks$set(source = function(x, options) { + # Extra paranoia about when this will write the chunks to the script, we will + # only save when: + # * CI is true + # * RUN_DEVDOCS is true + # * options$save is TRUE (and a check that not NULL won't crash it) + if (as.logical(Sys.getenv("CI", FALSE)) && run && !is.null(options$save) && options$save) + cat(x, file = "script.sh", append = TRUE, sep = "\n") + # but hide the blocks we want hidden: + if (!is.null(options$hide) && options$hide) { + return(NULL) + } + knit_hooks_source(x, options) +}) +``` + +```{bash, save=run, hide=TRUE} +# Stop on failure, echo input as we go +set -e +set -x +``` + + +```{bash, save=run & windows, hide=TRUE} +# For some reason CRAN Mirror goes missing in CI +echo 'options(repos=structure(c(CRAN="https://cloud.r-project.org")))' > $HOME/.Rprofile +``` + +Windows and macOS users who wish to contribute to the R package and +don't need to alter libarrow (Arrow's C++ library) may be able to obtain a +recent version of the library without building from source. + +### Linux + +On Linux, you can download a .zip file containing libarrow from the +nightly repository. + +To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: + +``` +nightly <- s3_bucket("arrow-r-nightly") +nightly$ls("libarrow/bin") +``` +Version numbers in that repository correspond to dates. + +You'll need to create a `libarrow` directory inside the R package directory and unzip the zip file containing the compiled libarrow binary files into it. + +### macOS +On macOS, you can install libarrow using [Homebrew](https://brew.sh/): + +```bash +# For the released version: +brew install apache-arrow +# Or for a development version, you can try: +brew install apache-arrow --HEAD +``` + +### Windows + +On Windows, you can download a .zip file containing libarrow from the nightly repository. + +To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: + +``` +nightly <- s3_bucket("arrow-r-nightly") +nightly$ls("libarrow/bin") +``` +Version numbers in that repository correspond to dates. + +You can set the `RWINLIB_LOCAL` environment variable to point to the zip file containing libarrow before installing the arrow R package. + + +## R and C++ + +If you need to alter both libarrow and the R package code, or if you can't get a binary version of the latest libarrow elsewhere, you'll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). + +There are five major steps to the process. + +### Step 1 - Install dependencies {.tabset} + +When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install _outside_ of the build process are [cmake](https://cmake.org/) (for configuring the build) and [openssl](https://www.openssl.org/) if you are building with S3 support. + +For a faster build, you may choose to pre-install more C++ library dependencies (such as [lz4](http://lz4.github.io/lz4/), [zstd](https://facebook.github.io/zstd/), etc.) on the system so that they don't need to be built from source in the libarrow build. + +#### Ubuntu +```{bash, save=run & ubuntu} +sudo apt install -y cmake libcurl4-openssl-dev libssl-dev +``` + +#### macOS +```{bash, save=run & macos} +brew install cmake openssl +``` + +#### Windows + +The package can be built on Windows using [RTools 4](https://cran.r-project.org/bin/windows/Rtools/). It can be built for mingw32 (i386), mingw64 (x64), or ucrt64 (UCRT x64). mingw64 is the recommended 64-bit installation. + +Open the corresponding RTools Bash, for example "Rtools MinGW 64-bit" for mingw64. + +Install CMake and Ninja with: + +```{bash, save=run & windows} +pacman --sync --refresh --noconfirm \ + ${MINGW_PACKAGE_PREFIX}-cmake \ + ${MINGW_PACKAGE_PREFIX}-ninja \ + ${MINGW_PACKAGE_PREFIX}-openssl +export CMAKE_GENERATOR=Ninja +``` + +You will need to add R to your path. For a user-level installation, R will be at something like `~/Documents/R/R-4.1.2/bin`. For a global installation, R will be at something like `/c/Program\ Files/R/R-4.1.2/bin`. The R on your path needs to match the architecture you are compiling for, so if you are compiling on 32-bit specify `.../bin/i386` instead of `.../bin/x64`. + +```{bash} +export PATH=~/Documents/R/R-4.1.2/bin/x64:$PATH +``` + +You can install additional dependencies like so. Note that you are limited to the packages in [the RTools repo](https://github.com/r-windows/rtools-packages), which does not contain every dependency used by Arrow. + +```{bash} +pacman --sync --refresh --noconfirm \ + ${MINGW_PACKAGE_PREFIX}-boost \ + ${MINGW_PACKAGE_PREFIX}-lz4 \ + ${MINGW_PACKAGE_PREFIX}-protobuf \ + ${MINGW_PACKAGE_PREFIX}-snappy \ + ${MINGW_PACKAGE_PREFIX}-thrift \ + ${MINGW_PACKAGE_PREFIX}-zlib \ + ${MINGW_PACKAGE_PREFIX}-zstd \ + ${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp \ + ${MINGW_PACKAGE_PREFIX}-re2 \ + ${MINGW_PACKAGE_PREFIX}-libutf8proc +``` + +### Step 2 - Configure the libarrow build + +We recommend that you configure libarrow to be built to a user-level directory rather than a system directory for your development work. This is so that the development version you are using doesn't overwrite a released version of libarrow you may already have installed, and so that you are also able work with more than one version of libarrow (by using different `ARROW_HOME` directories for the different versions). + +In the example below, libarrow is installed to a directory called `dist` that has the same parent directory as the `arrow` checkout. Your installation of the Arrow R package can point to any directory with any name, though we recommend *not* placing it inside of the `arrow` git checkout directory as unwanted changes could stop it working properly. + +```{bash, save=run & !sys_install} +export ARROW_HOME=$(pwd)/dist +mkdir $ARROW_HOME +``` + +_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where you set `$ARROW_HOME`, before launching R and using arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time. + +```{bash, save=run & ubuntu & !sys_install} +export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH +echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile +``` + +_Special instructions on Windows:_ You will need to add `$ARROW_HOME/bin` to your `PATH` if you are using dynamic libraries (which is recommended). + +```{bash, save=run & windows} +export PATH=$ARROW_HOME/bin:$PATH +echo "export PATH=\"$ARROW_HOME/bin:$PATH\"" >> ~/.bash_profile +``` + +Start by navigating in a terminal to the `arrow` repository. You will need to create a directory into which the C++ build will put its contents. We recommend that you make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). Next, change directories to be inside `cpp/build`: + +```{bash, save=run & !sys_install} +pushd arrow +mkdir -p cpp/build +pushd cpp/build +``` + +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in libarrow using `-D` flags: + +#### {.tabset} + +##### Linux / Mac OS + +```{bash, save=run & !sys_install & !windows} +cmake \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DCMAKE_INSTALL_LIBDIR=lib \ + -DARROW_COMPUTE=ON \ + -DARROW_CSV=ON \ + -DARROW_DATASET=ON \ + -DARROW_EXTRA_ERROR_CONTEXT=ON \ + -DARROW_FILESYSTEM=ON \ + -DARROW_INSTALL_NAME_RPATH=OFF \ + -DARROW_JEMALLOC=ON \ + -DARROW_JSON=ON \ + -DARROW_PARQUET=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZLIB=ON \ + .. +``` + +##### Windows + +```{bash, save=run & !sys_install & windows} +cmake \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DCMAKE_INSTALL_LIBDIR=lib \ + -DARROW_COMPUTE=ON \ + -DARROW_CSV=ON \ + -DARROW_DATASET=ON \ + -DARROW_EXTRA_ERROR_CONTEXT=ON \ + -DARROW_FILESYSTEM=ON \ + -DARROW_INSTALL_NAME_RPATH=OFF \ + -DARROW_JSON=OFF \ + -DARROW_PARQUET=ON \ + -DARROW_WITH_LZ4=OFF \ + -DARROW_WITH_SNAPPY=OFF \ + -DARROW_WITH_ZLIB=ON \ + -DARROW_MIMALLOC=OFF \ + -DARROW_S3=OFF \ + -DARROW_WITH_BROTLI=ON \ + -DARROW_WITH_BZ2=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZSTD=ON \ + .. +``` + +#### {-} + +`..` refers to the C++ source directory: you're in `cpp/build` and the source is in `cpp`. + +**For Windows**: some options, including `-DARROW_JEMALLOC`, are not supported on Windows. + +#### Enabling more Arrow features + +To enable optional features including: S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags to your call to `cmake` (the trailing `\` makes them easier to paste into a bash shell on a new line): + +```bash + -DARROW_MIMALLOC=ON \ + -DARROW_S3=ON \ + -DARROW_WITH_BROTLI=ON \ + -DARROW_WITH_BZ2=ON \ + -DARROW_WITH_LZ4=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZSTD=ON \ +``` + +Other flags that may be useful: + +* `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. + +* `-DCMAKE_BUILD_TYPE=debug` or `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be useful for debugging. You probably don't want to do this generally because a debug build is much slower at runtime than the default `release` build. + +* `-DARROW_BUILD_STATIC=ON` and `-DARROW_BUILD_SHARED=OFF` if you want to use static libraries instead of dynamic libraries. With static libraries there isn't a risk of the R package linking to the wrong library, but it does mean if you change the C++ code you have to recompile both the C++ libraries and the R package. Compilers typically will link to static libraries only if the dynamic ones are not present, which is why we need to set `-DARROW_BUILD_SHARED=OFF`. If you are switching after compiling and installing previously, you may need to remove the `.dll` or `.so` files from `$ARROW_HOME/dist/bin`. + +_Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace. + +### Step 3 - Building libarrow + +You can add `-j#` at the end of the command here too to speed up compilation by running in parallel (where `#` is the number of cores you have available). + +```{bash, save=run & !(sys_install & ubuntu)} +cmake --build . --target install -j8 +``` + +### Step 4 - Build the Arrow R package + +Once you've built libarrow, you can install the R package and its +dependencies, along with additional dev dependencies, from the git +checkout: + +```{bash, save=run} +popd # To go back to the root directory of the project, from cpp/build +pushd r +R -e "install.packages('remotes'); remotes::install_deps(dependencies = TRUE)" +R CMD INSTALL --no-multiarch . +``` + +The `--no-multiarch` flag makes it only compile on the "main" architecture. This will compile for the architecture that the R in your path corresponds to. If you compile on one architecture and then switch to another, make sure to pass the `--preclean` flag so that the R package code is recompiled for the new architecture. Otherwise, you may see errors like `LoadLibrary failure: %1 is not a valid Win32 application`. + +#### Compilation flags + +If you need to set any compilation flags while building the C++ +extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For +example, if you are using `perf` to profile the R extensions, you may +need to set + +```bash +export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer +``` + +#### Recompiling the C++ code + +With the setup described here, you should not need to rebuild the Arrow library or even the C++ source in the R package as you iterate and work on the R package. The only time those should need to be rebuilt is if you have changed the C++ in the R package (and even then, `R CMD INSTALL .` should only need to recompile the files that have changed) _or_ if the libarrow C++ has changed and there is a mismatch between libarrow and the R package. If you find yourself rebuilding either or both each time you install the package or run tests, something is probably wrong with your set up. + +
+For a full build: a `cmake` command with all of the R-relevant optional dependencies turned on. Development with other languages might require different flags as well. For example, to develop Python, you would need to also add `-DARROW_PYTHON=ON` (though all of the other flags used for Python are already included here). +

+ +```bash +cmake \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DCMAKE_INSTALL_LIBDIR=lib \ + -DARROW_COMPUTE=ON \ + -DARROW_CSV=ON \ + -DARROW_DATASET=ON \ + -DARROW_EXTRA_ERROR_CONTEXT=ON \ + -DARROW_FILESYSTEM=ON \ + -DARROW_INSTALL_NAME_RPATH=OFF \ + -DARROW_JEMALLOC=ON \ + -DARROW_JSON=ON \ + -DARROW_MIMALLOC=ON \ + -DARROW_PARQUET=ON \ + -DARROW_S3=ON \ + -DARROW_WITH_BROTLI=ON \ + -DARROW_WITH_BZ2=ON \ + -DARROW_WITH_LZ4=ON \ + -DARROW_WITH_SNAPPY=ON \ + -DARROW_WITH_ZLIB=ON \ + -DARROW_WITH_ZSTD=ON \ + .. +``` +

+
+ +## Installing a version of the R package with a specific git reference + +If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run: + +```{r} +remotes::install_github("apache/arrow/r", build = FALSE) +``` + +The `build = FALSE` argument is important so that the installation can access the +C++ source in the `cpp/` directory in `apache/arrow`. + +As with other installation methods, setting the environment variables `LIBARROW_MINIMAL=false` and `ARROW_R_DEV=true` will provide a more full-featured version of Arrow and provide more verbose output, respectively. + +For example, to install from the (fictional) branch `bugfix` from `apache/arrow` you could run: + +```r +Sys.setenv(LIBARROW_MINIMAL="false") +remotes::install_github("apache/arrow/r@bugfix", build = FALSE) +``` + +Developers may wish to use this method of installing a specific commit +separate from another Arrow development environment or system installation +(e.g. we use this in [arrowbench](https://github.com/ursacomputing/arrowbench) +to install development versions of libarrow isolated from the system install). If +you already have libarrow installed system-wide, you may need to set +some additional variables in order to isolate this build from your system libraries: + +* Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for libarrow and attempt to build from the same source at the repository+ref given. + +* You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so: +```{r} +withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)) +``` + +# Summary of environment variables + +* See the user-facing [Install vignette](install.html) for a large number of + environment variables that determine how the build works and what features + get built. +* `TEST_OFFLINE_BUILD`: When set to `true`, the build script will not download + prebuilt the C++ library binary. + It will turn off any features that require a download, unless they're available + in the `tools/cpp/thirdparty/download/` subfolder of the tar.gz file. + `create_package_with_all_dependencies()` creates that subfolder. + Regardless of this flag's value, `cmake` will be downloaded if it's unavailable. +* `TEST_R_WITHOUT_LIBARROW`: When set to `true`, skip tests that would require + the C++ Arrow library (that is, almost everything). + +# Troubleshooting + +Note that after any change to libarrow, you must reinstall it and +run `make clean` or `git clean -fdx .` to remove any cached object code +in the `r/src/` directory before reinstalling the R package. This is +only necessary if you make changes to libarrow source; you do not +need to manually purge object files if you are only editing R or C++ +code inside `r/`. + +## Arrow library - R package mismatches + +If libarrow and the R package have diverged, you will see errors like: + +``` +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx + Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so + Expected in: flat namespace + in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so +Error: loading failed +Execution halted +ERROR: loading failed +``` + +To resolve this, try [rebuilding the Arrow library](#step-3-building-arrow). + +## Multiple versions of libarrow + +If you are installing from a user-level directory, and you already have a +previous installation of libarrow in a system directory, you get you may get +errors like the following when you install the R package: + +``` +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib + Referenced from: /usr/local/lib/libparquet.400.dylib + Reason: image not found +``` + +If this happens, you need to make sure that you don't let R link to your system +library when building arrow. You can do this a number of different ways: + +* Setting the `MAKEFLAGS` environment variable to `"LDFLAGS="` (see below for an example) this is the recommended way to accomplish this +* Using {withr}'s `with_makevars(list(LDFLAGS = ""), ...)` +* adding `LDFLAGS=` to your `~/.R/Makevars` file (the least recommended way, though it is a common debugging approach suggested online) + +```{bash, save=run & !sys_install & macos, hide=TRUE} +# Setup troubleshooting section +# install a system-level arrow on macOS +brew install apache-arrow +``` + + +```{bash, save=run & !sys_install & ubuntu, hide=TRUE} +# Setup troubleshooting section +# install a system-level arrow on Ubuntu +sudo apt update +sudo apt install -y -V ca-certificates lsb-release wget +wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb +sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb +sudo apt update +sudo apt install -y -V libarrow-dev +``` + +```{bash, save=run & !sys_install & macos} +MAKEFLAGS="LDFLAGS=" R CMD INSTALL . +``` + +## `rpath` issues + +If the package fails to install/load with an error like this: + +``` + ** testing if installed package can be loaded from temporary location + Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): + unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': + dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib +``` + +ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on +macOS to prevent problems at link time and is a no-op on other platforms). +Alternatively, try setting the environment variable `R_LD_LIBRARY_PATH` to +wherever Arrow C++ was put in `make install`, e.g. `export +R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. + +When installing from source, if the R and C++ library versions do not +match, installation may fail. If you've previously installed the +libraries and want to upgrade the R package, you'll need to update the +Arrow C++ library first. + +For any other build/configuration challenges, see the [C++ developer +guide](https://arrow.apache.org/docs/developers/cpp/building.html). + +## Other installation issues + +There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See [the installation vignette](./install.html#how-dependencies-are-resolved) for more information. diff --git a/r/vignettes/developers/workflow.Rmd b/r/vignettes/developers/workflow.Rmd new file mode 100644 index 0000000000000..e316dd8fe56be --- /dev/null +++ b/r/vignettes/developers/workflow.Rmd @@ -0,0 +1,198 @@ +# Common developer workflow tasks + +```{r setup-options, include=FALSE} +knitr::opts_chunk$set(error = TRUE, eval = FALSE) +``` + +The `arrow/r` directory contains a `Makefile` to help with some common tasks from the command line (e.g. `make test`, `make doc`, `make clean`, etc.). + +## Loading arrow + +You can load the R package via `devtools::load_all()`. + +## Rebuilding the documentation + +The R documentation uses the [`@examplesIf`](https://roxygen2.r-lib.org/articles/rd.html#functions) tag introduced in `roxygen2` version 7.1.2. + +```{r, run=FALSE} +remotes::install_github("r-lib/roxygen2") +``` + +You can use `devtools::document()` and `pkgdown::build_site()` to rebuild the documentation and preview the results. + +```r +# Update roxygen documentation +devtools::document() + +# To preview the documentation website +pkgdown::build_site(preview=TRUE) +``` + +## Styling and linting + +### R code + +The R code in the package follows [the tidyverse style](https://style.tidyverse.org/). On PR submission (and on pushes) our CI will run linting and will flag possible errors on the pull request with annotations. + +To run the [lintr](https://github.com/jimhester/lintr) locally, install the lintr package (note, we currently use a fork that includes fixes not yet accepted upstream, see how lintr is being installed in the file `ci/docker/linux-apt-lint.dockerfile` for the current status) and then run + +```{r} +lintr::lint_package("arrow/r") +``` + +You can automatically change the formatting of the code in the package using the [styler](https://styler.r-lib.org/) package. There are two ways to do this: + +1. Use the comment bot to do this automatically with the command `@github-actions autotune` on a PR, and commit it back to the branch. + +2. Run the styler locally either via Makefile commands: + +```bash +make style # (for only the files changed) +make style-all # (for all files) +``` + +or in R: + +```{r} +# note the two excluded files which should not be styled +styler::style_pkg(exclude_files = c("tests/testthat/latin1.R", "data-raw/codegen.R")) +``` + +The styler package will fix many styling errors, thought not all lintr errors are automatically fixable with styler. The list of files we intentionally do not style is in `r/.styler_excludes.R`. + +### C++ code + +The arrow package uses some customized tools on top of [cpp11](https://cpp11.r-lib.org/) to prepare its +C++ code in `src/`. This is because there are some features that are only enabled +and built conditionally during build time. If you change C++ code in the R +package, you will need to set the `ARROW_R_DEV` environment variable to `true` +(optionally, add it to your `~/.Renviron` file to persist across sessions) so +that the `data-raw/codegen.R` file is used for code generation. The `Makefile` +commands also handles this automatically. + +We use Google C++ style in our C++ code. The easiest way to accomplish this is +use an editors/IDE that formats your code for you. Many popular editors/IDEs +have support for running `clang-format` on C++ files when you save them. +Installing/enabling the appropriate plugin may save you much frustration. + +Check for style errors with + +```bash +./lint.sh +``` + +Fix any style issues before committing with + +```bash +./lint.sh --fix +``` + +The lint script requires Python 3 and `clang-format-8`. If the command +isn't found, you can explicitly provide the path to it like: + +```bash +CLANG_FORMAT=$(which clang-format-8) ./lint.sh +``` + +On macOS, you can get this by installing LLVM via Homebrew and running the script as: +```bash +CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh +``` + +_Note_ that the lint script requires Python 3 and the Python dependencies +(note that `cmake_format is pinned to a specific version): + +* autopep8 +* flake8 +* cmake_format==0.5.2 + +## Running tests + +Tests can be run either using `devtools::test()` or the Makefile alternative. + +```r +# Run the test suite, optionally filtering file names +devtools::test(filter="^regexp$") + +# or the Makefile alternative from the arrow/r directory in a shell: +make test file=regexp +``` + +Some tests are conditionally enabled based on the availability of certain +features in the package build (S3 support, compression libraries, etc.). +Others are generally skipped by default but can be enabled with environment +variables or other settings: + +* All tests are skipped on Linux if the package builds without the C++ libarrow. + To make the build fail if libarrow is not available (as in, to test that + the C++ build was successful), set `TEST_R_WITH_ARROW=true` + +* Some tests are disabled unless `ARROW_R_DEV=true` + +* Tests that require allocating >2GB of memory to test Large types are disabled + unless `ARROW_LARGE_MEMORY_TESTS=true` + +* Integration tests against a real S3 bucket are disabled unless credentials + are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available + on request + +* S3 tests using [MinIO](https://min.io/) locally are enabled if the + `minio server` process is found running. If you're running MinIO with custom + settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and + `MINIO_PORT` to override the defaults. + +## Running checks + +You can run package checks by using `devtools::check()` and check test coverage +with `covr::package_coverage()`. + +```r +# All package checks +devtools::check() + +# See test coverage statistics +covr::report() +covr::package_coverage() +``` + +For full package validation, you can run the following commands from a terminal. + +``` +R CMD build . +R CMD check arrow_*.tar.gz --as-cran +``` + + +## Running additional CI checks + +On a pull request, there are some actions you can trigger by commenting on the +PR. We have additional CI checks that run nightly and can be requested on demand +using an internal tool called +[crossbow](https://arrow.apache.org/docs/developers/crossbow.html). +A few important GitHub comment commands are shown below. + +#### Run all extended R CI tasks +``` +@github-actions crossbow submit -g r +``` + +This runs each of the R-related CI tasks. + +#### Run a specific task +``` +@github-actions crossbow submit {task-name} +``` + +See the `r:` group definition near the beginning of the [crossbow configuration](https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml) +for a list of glob expression patterns that match names of items in the `tasks:` +list below it. + +#### Run linting and documentation building tasks + +``` +@github-actions autotune +``` + +This will run and fix lint C++ linting errors, run R documentation (among other +cleanup tasks), run styler on any changed R code, and commit the resulting +updates to the branch. diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index f8bff4f5dcd40..3869a36e65338 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -7,684 +7,31 @@ vignette: > %\VignetteEncoding{UTF-8} --- -```{r setup-options, include=FALSE} -knitr::opts_chunk$set(error = TRUE, eval = FALSE) -# Get environment variables describing what to evaluate -run <- tolower(Sys.getenv("RUN_DEVDOCS", "false")) == "true" -macos <- tolower(Sys.getenv("DEVDOCS_MACOS", "false")) == "true" -ubuntu <- tolower(Sys.getenv("DEVDOCS_UBUNTU", "false")) == "true" -windows <- tolower(Sys.getenv("DEVDOCS_WINDOWS", "false")) == "true" -sys_install <- tolower(Sys.getenv("DEVDOCS_SYSTEM_INSTALL", "false")) == "true" -# Update the source knit_hook to save the chunk (if it is marked to be saved) -knit_hooks_source <- knitr::knit_hooks$get("source") -knitr::knit_hooks$set(source = function(x, options) { - # Extra paranoia about when this will write the chunks to the script, we will - # only save when: - # * CI is true - # * RUN_DEVDOCS is true - # * options$save is TRUE (and a check that not NULL won't crash it) - if (as.logical(Sys.getenv("CI", FALSE)) && run && !is.null(options$save) && options$save) - cat(x, file = "script.sh", append = TRUE, sep = "\n") - # but hide the blocks we want hidden: - if (!is.null(options$hide) && options$hide) { - return(NULL) - } - knit_hooks_source(x, options) -}) -``` +If you're interested in contributing to arrow, this vignette explains our approach, +at a high-level. If you're looking for more detailed content, you may want to +look at one of the following links: -```{bash, save=run, hide=TRUE} -# Stop on failure, echo input as we go -set -e -set -x -``` +* [setting up a development environment and building the components that make up the Arrow project and R package](https://arrow.apache.org/docs/r/articles/developers/setup.html) +* [common Arrow dev workflow tasks](https://arrow.apache.org/docs/r/articles/developers/workflow.html) +* [running R with the C++ debugger attached](https://arrow.apache.org/docs/r/articles/developers/debugging.html) +# Approach to implementing functionality -```{bash, save=run & windows, hide=TRUE} -# For some reason CRAN Mirror goes missing in CI -echo 'options(repos=structure(c(CRAN="https://cloud.r-project.org")))' > $HOME/.Rprofile -``` +Our general philosophy when implementing functionality is to match to existing +R function signatures which may be familiar to users, whilst exposing any +additional functionality available via Arrow. The intention is to allow users +to be able to use their existing code with minimal changes, or new code or +approaches to learn. -If you're looking to contribute to arrow, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: +There are a number of ways in which we do this: -* how to build the components that make up the Arrow project and R package -* workflows that developers use -* some common troubleshooting steps and solutions +* when implementing a function with an R equivalent, support the arguments +available in R version as much as possible - use the original parameter names +and translate to the arrow parameter name inside the function -This document is intended only for **developers** of Apache Arrow or the Arrow R package. R package users do not need to do any of this setup. If you're looking for how to install Arrow, see [the instructions in the readme](https://arrow.apache.org/docs/r/#installation). +* if there are arrow parameters which do not exist in the R function, allow the +user to pass in those options through too -This document is a work in progress and will grow and change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but custom configurations might conflict with these instructions and there are differences of opinion across developers about how to set up development environments like this. - -We welcome any feedback you have about things that are confusing or additions you would like to see here - please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if you have any suggestions or requests. - -# Developer environment setup - -## R-only {.tabset} - -Windows and macOS users who wish to contribute to the R package and -don't need to alter libarrow (Arrow's C++ library) may be able to obtain a -recent version of the library without building from source. - -### Linux - -On Linux, you can download a .zip file containing libarrow from the -nightly repository. - -To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: - -``` -nightly <- s3_bucket("arrow-r-nightly") -nightly$ls("libarrow/bin") -``` -Version numbers in that repository correspond to dates. - -You'll need to create a `libarrow` directory inside the R package directory and unzip the zip file containing the compiled libarrow binary files into it. - -### macOS -On macOS, you can install libarrow using [Homebrew](https://brew.sh/): - -```bash -# For the released version: -brew install apache-arrow -# Or for a development version, you can try: -brew install apache-arrow --HEAD -``` - -### Windows - -On Windows, you can download a .zip file containing libarrow from the nightly repository. - -To see what nightlies are available, you can use arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: - -``` -nightly <- s3_bucket("arrow-r-nightly") -nightly$ls("libarrow/bin") -``` -Version numbers in that repository correspond to dates. - -You can set the `RWINLIB_LOCAL` environment variable to point to the zip file containing libarrow before installing the arrow R package. - - -## R and C++ - -If you need to alter both libarrow and the R package code, or if you can't get a binary version of the latest libarrow elsewhere, you'll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). - -There are five major steps to the process. - -### Step 1 - Install dependencies {.tabset} - -When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install _outside_ of the build process are [cmake](https://cmake.org/) (for configuring the build) and [openssl](https://www.openssl.org/) if you are building with S3 support. - -For a faster build, you may choose to pre-install more C++ library dependencies (such as [lz4](http://lz4.github.io/lz4/), [zstd](https://facebook.github.io/zstd/), etc.) on the system so that they don't need to be built from source in the libarrow build. - -#### Ubuntu -```{bash, save=run & ubuntu} -sudo apt install -y cmake libcurl4-openssl-dev libssl-dev -``` - -#### macOS -```{bash, save=run & macos} -brew install cmake openssl -``` - -#### Windows - -The package can be built on Windows using [RTools 4](https://cran.r-project.org/bin/windows/Rtools/). It can be built for mingw32 (i386), mingw64 (x64), or ucrt64 (UCRT x64). mingw64 is the recommended 64-bit installation. - -Open the corresponding RTools Bash, for example "Rtools MinGW 64-bit" for mingw64. - -Install CMake and Ninja with: - -```{bash, save=run & windows} -pacman --sync --refresh --noconfirm \ - ${MINGW_PACKAGE_PREFIX}-cmake \ - ${MINGW_PACKAGE_PREFIX}-ninja \ - ${MINGW_PACKAGE_PREFIX}-openssl -export CMAKE_GENERATOR=Ninja -``` - -You will need to add R to your path. For a user-level installation, R will be at something like `~/Documents/R/R-4.1.2/bin`. For a global installation, R will be at something like `/c/Program\ Files/R/R-4.1.2/bin`. The R on your path needs to match the architecture you are compiling for, so if you are compiling on 32-bit specify `.../bin/i386` instead of `.../bin/x64`. - -```{bash} -export PATH=~/Documents/R/R-4.1.2/bin/x64:$PATH -``` - -You can install additional dependencies like so. Note that you are limited to the packages in [the RTools repo](https://github.com/r-windows/rtools-packages), which does not contain every dependency used by Arrow. - -```{bash} -pacman --sync --refresh --noconfirm \ - ${MINGW_PACKAGE_PREFIX}-boost \ - ${MINGW_PACKAGE_PREFIX}-lz4 \ - ${MINGW_PACKAGE_PREFIX}-protobuf \ - ${MINGW_PACKAGE_PREFIX}-snappy \ - ${MINGW_PACKAGE_PREFIX}-thrift \ - ${MINGW_PACKAGE_PREFIX}-zlib \ - ${MINGW_PACKAGE_PREFIX}-zstd \ - ${MINGW_PACKAGE_PREFIX}-aws-sdk-cpp \ - ${MINGW_PACKAGE_PREFIX}-re2 \ - ${MINGW_PACKAGE_PREFIX}-libutf8proc -``` - -### Step 2 - Configure the libarrow build - -We recommend that you configure libarrow to be built to a user-level directory rather than a system directory for your development work. This is so that the development version you are using doesn't overwrite a released version of libarrow you may already have installed, and so that you are also able work with more than one version of libarrow (by using different `ARROW_HOME` directories for the different versions). - -In the example below, libarrow is installed to a directory called `dist` that has the same parent directory as the `arrow` checkout. Your installation of the Arrow R package can point to any directory with any name, though we recommend *not* placing it inside of the `arrow` git checkout directory as unwanted changes could stop it working properly. - -```{bash, save=run & !sys_install} -export ARROW_HOME=$(pwd)/dist -mkdir $ARROW_HOME -``` - -_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where you set `$ARROW_HOME`, before launching R and using arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time. - -```{bash, save=run & ubuntu & !sys_install} -export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH -echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile -``` - -_Special instructions on Windows:_ You will need to add `$ARROW_HOME/bin` to your `PATH` if you are using dynamic libraries (which is recommended). - -```{bash, save=run & windows} -export PATH=$ARROW_HOME/bin:$PATH -echo "export PATH=\"$ARROW_HOME/bin:$PATH\"" >> ~/.bash_profile -``` - -Start by navigating in a terminal to the `arrow` repository. You will need to create a directory into which the C++ build will put its contents. We recommend that you make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). Next, change directories to be inside `cpp/build`: - -```{bash, save=run & !sys_install} -pushd arrow -mkdir -p cpp/build -pushd cpp/build -``` - -You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in libarrow using `-D` flags: - -#### {.tabset} - -##### Linux / Mac OS - -```{bash, save=run & !sys_install & !windows} -cmake \ - -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ - -DCMAKE_INSTALL_LIBDIR=lib \ - -DARROW_COMPUTE=ON \ - -DARROW_CSV=ON \ - -DARROW_DATASET=ON \ - -DARROW_EXTRA_ERROR_CONTEXT=ON \ - -DARROW_FILESYSTEM=ON \ - -DARROW_INSTALL_NAME_RPATH=OFF \ - -DARROW_JEMALLOC=ON \ - -DARROW_JSON=ON \ - -DARROW_PARQUET=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZLIB=ON \ - .. -``` - -##### Windows - -```{bash, save=run & !sys_install & windows} -cmake \ - -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ - -DCMAKE_INSTALL_LIBDIR=lib \ - -DARROW_COMPUTE=ON \ - -DARROW_CSV=ON \ - -DARROW_DATASET=ON \ - -DARROW_EXTRA_ERROR_CONTEXT=ON \ - -DARROW_FILESYSTEM=ON \ - -DARROW_INSTALL_NAME_RPATH=OFF \ - -DARROW_JSON=OFF \ - -DARROW_PARQUET=ON \ - -DARROW_WITH_LZ4=OFF \ - -DARROW_WITH_SNAPPY=OFF \ - -DARROW_WITH_ZLIB=ON \ - -DARROW_MIMALLOC=OFF \ - -DARROW_S3=OFF \ - -DARROW_WITH_BROTLI=ON \ - -DARROW_WITH_BZ2=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZSTD=ON \ - .. -``` - -#### {-} - -`..` refers to the C++ source directory: you're in `cpp/build` and the source is in `cpp`. - -**For Windows**: some options, including `-DARROW_JEMALLOC`, are not supported on Windows. - -#### Enabling more Arrow features - -To enable optional features including: S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags to your call to `cmake` (the trailing `\` makes them easier to paste into a bash shell on a new line): - -```bash - -DARROW_MIMALLOC=ON \ - -DARROW_S3=ON \ - -DARROW_WITH_BROTLI=ON \ - -DARROW_WITH_BZ2=ON \ - -DARROW_WITH_LZ4=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZSTD=ON \ -``` - -Other flags that may be useful: - -* `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. - -* `-DCMAKE_BUILD_TYPE=debug` or `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be useful for debugging. You probably don't want to do this generally because a debug build is much slower at runtime than the default `release` build. - -* `-DARROW_BUILD_STATIC=ON` and `-DARROW_BUILD_SHARED=OFF` if you want to use static libraries instead of dynamic libraries. With static libraries there isn't a risk of the R package linking to the wrong library, but it does mean if you change the C++ code you have to recompile both the C++ libraries and the R package. Compilers typically will link to static libraries only if the dynamic ones are not present, which is why we need to set `-DARROW_BUILD_SHARED=OFF`. If you are switching after compiling and installing previously, you may need to remove the `.dll` or `.so` files from `$ARROW_HOME/dist/bin`. - -_Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace. - -### Step 3 - Building libarrow - -You can add `-j#` at the end of the command here too to speed up compilation by running in parallel (where `#` is the number of cores you have available). - -```{bash, save=run & !(sys_install & ubuntu)} -cmake --build . --target install -j8 -``` - -### Step 4 - Build the Arrow R package - -Once you've built libarrow, you can install the R package and its -dependencies, along with additional dev dependencies, from the git -checkout: - -```{bash, save=run} -popd # To go back to the root directory of the project, from cpp/build -pushd r -R -e "install.packages('remotes'); remotes::install_deps(dependencies = TRUE)" -R CMD INSTALL --no-multiarch . -``` - -The `--no-multiarch` flag makes it only compile on the "main" architecture. This will compile for the architecture that the R in your path corresponds to. If you compile on one architecture and then switch to another, make sure to pass the `--preclean` flag so that the R package code is recompiled for the new architecture. Otherwise, you may see errors like `LoadLibrary failure: %1 is not a valid Win32 application`. - -#### Compilation flags - -If you need to set any compilation flags while building the C++ -extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For -example, if you are using `perf` to profile the R extensions, you may -need to set - -```bash -export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer -``` - -#### Recompiling the C++ code - -With the setup described here, you should not need to rebuild the Arrow library or even the C++ source in the R package as you iterate and work on the R package. The only time those should need to be rebuilt is if you have changed the C++ in the R package (and even then, `R CMD INSTALL .` should only need to recompile the files that have changed) _or_ if the libarrow C++ has changed and there is a mismatch between libarrow and the R package. If you find yourself rebuilding either or both each time you install the package or run tests, something is probably wrong with your set up. - -
-For a full build: a `cmake` command with all of the R-relevant optional dependencies turned on. Development with other languages might require different flags as well. For example, to develop Python, you would need to also add `-DARROW_PYTHON=ON` (though all of the other flags used for Python are already included here). -

- -```bash -cmake \ - -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ - -DCMAKE_INSTALL_LIBDIR=lib \ - -DARROW_COMPUTE=ON \ - -DARROW_CSV=ON \ - -DARROW_DATASET=ON \ - -DARROW_EXTRA_ERROR_CONTEXT=ON \ - -DARROW_FILESYSTEM=ON \ - -DARROW_INSTALL_NAME_RPATH=OFF \ - -DARROW_JEMALLOC=ON \ - -DARROW_JSON=ON \ - -DARROW_MIMALLOC=ON \ - -DARROW_PARQUET=ON \ - -DARROW_S3=ON \ - -DARROW_WITH_BROTLI=ON \ - -DARROW_WITH_BZ2=ON \ - -DARROW_WITH_LZ4=ON \ - -DARROW_WITH_SNAPPY=ON \ - -DARROW_WITH_ZLIB=ON \ - -DARROW_WITH_ZSTD=ON \ - .. -``` -

-
- -## Installing a version of the R package with a specific git reference - -If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run: - -```{r} -remotes::install_github("apache/arrow/r", build = FALSE) -``` - -The `build = FALSE` argument is important so that the installation can access the -C++ source in the `cpp/` directory in `apache/arrow`. - -As with other installation methods, setting the environment variables `LIBARROW_MINIMAL=false` and `ARROW_R_DEV=true` will provide a more full-featured version of Arrow and provide more verbose output, respectively. - -For example, to install from the (fictional) branch `bugfix` from `apache/arrow` you could run: - -```r -Sys.setenv(LIBARROW_MINIMAL="false") -remotes::install_github("apache/arrow/r@bugfix", build = FALSE) -``` - -Developers may wish to use this method of installing a specific commit -separate from another Arrow development environment or system installation -(e.g. we use this in [arrowbench](https://github.com/ursacomputing/arrowbench) -to install development versions of libarrow isolated from the system install). If -you already have libarrow installed system-wide, you may need to set -some additional variables in order to isolate this build from your system libraries: - -* Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for libarrow and attempt to build from the same source at the repository+ref given. - -* You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so: -```{r} -withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)) -``` - -# Common developer workflow tasks - -The `arrow/r` directory contains a `Makefile` to help with some common tasks from the command line (e.g. `make test`, `make doc`, `make clean`, etc.). - -## Loading arrow - -You can load the R package via `devtools::load_all()`. - -## Rebuilding the documentation - -The R documentation uses the [`@examplesIf`](https://roxygen2.r-lib.org/articles/rd.html#functions) tag introduced in `roxygen2` version 7.1.1.9001, which hasn't yet been released on CRAN at the time of writing. If you are making changes which require updating the documentation, please install the development version of `roxygen2` from GitHub. - -```{r} -remotes::install_github("r-lib/roxygen2") -``` - -You can use `devtools::document()` and `pkgdown::build_site()` to rebuild the documentation and preview the results. - -```r -# Update roxygen documentation -devtools::document() - -# To preview the documentation website -pkgdown::build_site(preview=TRUE) -``` - -## Styling and linting - -### R code - -The R code in the package follows [the tidyverse style](https://style.tidyverse.org/). On PR submission (and on pushes) our CI will run linting and will flag possible errors on the pull request with annotations. - -To run the [lintr](https://github.com/jimhester/lintr) locally, install the lintr package (note, we currently use a fork that includes fixes not yet accepted upstream, see how lintr is being installed in the file `ci/docker/linux-apt-lint.dockerfile` for the current status) and then run - -```{r} -lintr::lint_package("arrow/r") -``` - -You can automatically change the formatting of the code in the package using the [styler](https://styler.r-lib.org/) package. There are two ways to do this: - -1. Use the comment bot to do this automatically with the command `@github-actions autotune` on a PR, and commit it back to the branch. - -2. Run the styler locally either via Makefile commands: - -```bash -make style # (for only the files changed) -make style-all # (for all files) -``` - -or in R: - -```{r} -# note the two excluded files which should not be styled -styler::style_pkg(exclude_files = c("tests/testthat/latin1.R", "data-raw/codegen.R")) -``` - -The styler package will fix many styling errors, thought not all lintr errors are automatically fixable with styler. The list of files we intentionally do not style is in `r/.styler_excludes.R`. - -### C++ code - -The arrow package uses some customized tools on top of [cpp11](https://cpp11.r-lib.org/) to prepare its -C++ code in `src/`. This is because there are some features that are only enabled -and built conditionally during build time. If you change C++ code in the R -package, you will need to set the `ARROW_R_DEV` environment variable to `true` -(optionally, add it to your `~/.Renviron` file to persist across sessions) so -that the `data-raw/codegen.R` file is used for code generation. The `Makefile` -commands also handles this automatically. - -We use Google C++ style in our C++ code. The easiest way to accomplish this is -use an editors/IDE that formats your code for you. Many popular editors/IDEs -have support for running `clang-format` on C++ files when you save them. -Installing/enabling the appropriate plugin may save you much frustration. - -Check for style errors with - -```bash -./lint.sh -``` - -Fix any style issues before committing with - -```bash -./lint.sh --fix -``` - -The lint script requires Python 3 and `clang-format-8`. If the command -isn't found, you can explicitly provide the path to it like: - -```bash -CLANG_FORMAT=$(which clang-format-8) ./lint.sh -``` - -On macOS, you can get this by installing LLVM via Homebrew and running the script as: -```bash -CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh -``` - -_Note_ that the lint script requires Python 3 and the Python dependencies -(note that `cmake_format is pinned to a specific version): - -* autopep8 -* flake8 -* cmake_format==0.5.2 - -## Running tests - -Tests can be run either using `devtools::test()` or the Makefile alternative. - -```r -# Run the test suite, optionally filtering file names -devtools::test(filter="^regexp$") - -# or the Makefile alternative from the arrow/r directory in a shell: -make test file=regexp -``` - -Some tests are conditionally enabled based on the availability of certain -features in the package build (S3 support, compression libraries, etc.). -Others are generally skipped by default but can be enabled with environment -variables or other settings: - -* All tests are skipped on Linux if the package builds without the C++ libarrow. - To make the build fail if libarrow is not available (as in, to test that - the C++ build was successful), set `TEST_R_WITH_ARROW=true` - -* Some tests are disabled unless `ARROW_R_DEV=true` - -* Tests that require allocating >2GB of memory to test Large types are disabled - unless `ARROW_LARGE_MEMORY_TESTS=true` - -* Integration tests against a real S3 bucket are disabled unless credentials - are set in `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; these are available - on request - -* S3 tests using [MinIO](https://min.io/) locally are enabled if the - `minio server` process is found running. If you're running MinIO with custom - settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and - `MINIO_PORT` to override the defaults. - -## Running checks - -You can run package checks by using `devtools::check()` and check test coverage -with `covr::package_coverage()`. - -```r -# All package checks -devtools::check() - -# See test coverage statistics -covr::report() -covr::package_coverage() -``` - -For full package validation, you can run the following commands from a terminal. - -``` -R CMD build . -R CMD check arrow_*.tar.gz --as-cran -``` - - -## Running additional CI checks - -On a pull request, there are some actions you can trigger by commenting on the -PR. We have additional CI checks that run nightly and can be requested on demand -using an internal tool called -[crossbow](https://arrow.apache.org/docs/developers/crossbow.html). -A few important GitHub comment commands are shown below. - -#### Run all extended R CI tasks -``` -@github-actions crossbow submit -g r -``` - -This runs each of the R-related CI tasks. - -#### Run a specific task -``` -@github-actions crossbow submit {task-name} -``` - -See the `r:` group definition near the beginning of the [crossbow configuration](https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml) -for a list of glob expression patterns that match names of items in the `tasks:` -list below it. - -#### Run linting and documentation building tasks - -``` -@github-actions autotune -``` - -This will run and fix lint C++ linting errors, run R documentation (among other -cleanup tasks), run styler on any changed R code, and commit the resulting -updates to the branch. - -# Summary of environment variables - -* See the user-facing [Install vignette](install.html) for a large number of - environment variables that determine how the build works and what features - get built. -* `TEST_OFFLINE_BUILD`: When set to `true`, the build script will not download - prebuilt the C++ library binary. - It will turn off any features that require a download, unless they're available - in the `tools/cpp/thirdparty/download/` subfolder of the tar.gz file. - `create_package_with_all_dependencies()` creates that subfolder. - Regardless of this flag's value, `cmake` will be downloaded if it's unavailable. -* `TEST_R_WITHOUT_LIBARROW`: When set to `true`, skip tests that would require - the C++ Arrow library (that is, almost everything). - -# Troubleshooting - -Note that after any change to libarrow, you must reinstall it and -run `make clean` or `git clean -fdx .` to remove any cached object code -in the `r/src/` directory before reinstalling the R package. This is -only necessary if you make changes to libarrow source; you do not -need to manually purge object files if you are only editing R or C++ -code inside `r/`. - -## Arrow library - R package mismatches - -If libarrow and the R package have diverged, you will see errors like: - -``` -Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): - unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': - dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx - Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so - Expected in: flat namespace - in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so -Error: loading failed -Execution halted -ERROR: loading failed -``` - -To resolve this, try [rebuilding the Arrow library](#step-3-building-arrow). - -## Multiple versions of libarrow - -If you are installing from a user-level directory, and you already have a -previous installation of libarrow in a system directory, you get you may get -errors like the following when you install the R package: - -``` -Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): - unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': - dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib - Referenced from: /usr/local/lib/libparquet.400.dylib - Reason: image not found -``` - -If this happens, you need to make sure that you don't let R link to your system -library when building arrow. You can do this a number of different ways: - -* Setting the `MAKEFLAGS` environment variable to `"LDFLAGS="` (see below for an example) this is the recommended way to accomplish this -* Using {withr}'s `with_makevars(list(LDFLAGS = ""), ...)` -* adding `LDFLAGS=` to your `~/.R/Makevars` file (the least recommended way, though it is a common debugging approach suggested online) - -```{bash, save=run & !sys_install & macos, hide=TRUE} -# Setup troubleshooting section -# install a system-level arrow on macOS -brew install apache-arrow -``` - - -```{bash, save=run & !sys_install & ubuntu, hide=TRUE} -# Setup troubleshooting section -# install a system-level arrow on Ubuntu -sudo apt update -sudo apt install -y -V ca-certificates lsb-release wget -wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb -sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb -sudo apt update -sudo apt install -y -V libarrow-dev -``` - -```{bash, save=run & !sys_install & macos} -MAKEFLAGS="LDFLAGS=" R CMD INSTALL . -``` - - -## `rpath` issues - -If the package fails to install/load with an error like this: - -``` - ** testing if installed package can be loaded from temporary location - Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): - unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': - dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib -``` - -ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on -macOS to prevent problems at link time and is a no-op on other platforms). -Alternatively, try setting the environment variable `R_LD_LIBRARY_PATH` to -wherever Arrow C++ was put in `make install`, e.g. `export -R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. - -When installing from source, if the R and C++ library versions do not -match, installation may fail. If you've previously installed the -libraries and want to upgrade the R package, you'll need to update the -Arrow C++ library first. - -For any other build/configuration challenges, see the [C++ developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html). - -## Other installation issues - -There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See [the installation vignette](./install.html#how-dependencies-are-resolved) for more information. +* where necessary add extra arguments to the function signature for features +that don't exist in R but do in Arrow (e.g. passing in a schema when reading a +CSV dataset)