Henrik Bengtsson edited this page Jan 11, 2016 · 44 revisions

Wishlist for R

People commented:

List of features and modifications I would love to see in R:

Basic data types

  • Internal HASNA(x) flag indicating whether x has missing values (HASNA=1), has none (HASNA=0), or whether this is unknown (HASNA=2). This flag can be set by any function that has scanned x for missing values. This would allow functions to skip expensive testing for missing values whenever HASNA=0. (Currently it is up to the user to keep track of this and use na.rm=FALSE, iff supported.)

    • Luke [Tierney] is changing the SEXP header for reference counting. Thanks to the need for alignment, we will get some extra bits. We have already decided to use one of those for this purpose. Another bit will track whether a vector is sorted. /ML
      • That's good news. Will there be two bits for sorted, to specify increasing versus decreasing ordering? /HB
        • This would be an extremely cheap check (binary search for an element different than the first in the worst case, assuming NAs-at-end or NAs-at-beginning). Not sure it's worth a valuable header bit. /GB
      • What about character vectors; will they ever be flagged as sorted? For instance, how will you know in which locale such a vector was sorted, e.g. if you first sorted/collated it lexicographically using the C locale but then work in the en_US.UTF-8 locale? /HB
  • Generic support for dimension-aware attributes that are acknowledged whenever the object is subsetted. For vectors we have names(), for matrices and data frames we have rownames() and colnames(), and for arrays and other objects we have dimnames().

    • DISCUSSION: Issue #2
    • This is essentially Biobase::AnnotatedDataFrame and S4Vectors::DataFrame. One interesting direction would be to consider the meta columns as grouping factors and use them to implement pivot-table functionality. /ML
    • Prototype:
> x <- matrix(1:12, ncol=4)
> colnames(x) <- c("A", "B", "C", "D")
> colattr(x, 'gender') <- c("male", "male", "female", "male")
> x

     male male female male
        A    B      C    D
[1,]    1    4      7   10
[2,]    2    5      8   11
[3,]    3    6      9   12

> x[,2:3]
     male female
        B      C
[1,]    4      7
[2,]    5      8
[3,]    6      9
  • Add support for dim(x) <- dims, where dims has one NA value, which is then inferred from length(x) and na.omit(dims). If incompatible, then an error is given. For example,
> x <- matrix(1:12, ncol=4)
> dim(x)
[1] 3 4
> dim(x) <- c(NA, 3)
> dim(x)
[1] 4 3

Comment: The R.utils::dimNA() function implements this.
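
The inference rule can be sketched in a few lines of plain R. The helper name set_dim_na() is hypothetical and only illustrates the wished-for behavior of dim(x) <- dims; it is not how R.utils implements it.

```r
# Hypothetical helper: infer the single NA entry in 'dims' from length(x).
set_dim_na <- function(x, dims) {
  na <- is.na(dims)
  if (sum(na) > 1L) stop("at most one dimension may be NA")
  if (any(na)) {
    known <- prod(dims[!na])  # product of the specified dimensions
    if (known == 0 || length(x) %% known != 0) {
      stop("length(x) = ", length(x), " is not a multiple of ", known)
    }
    dims[na] <- length(x) %/% known
  }
  dim(x) <- dims
  x
}

x <- matrix(1:12, ncol = 4)
y <- set_dim_na(x, c(NA, 3))
dim(y)
```

An incompatible request such as set_dim_na(x, c(NA, 5)) gives an error, since 12 is not a multiple of 5.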

  • Preserve element names in multi-dimensional subsetting. For instance, with x <- matrix(1:6, nrow=2); names(x) <- letters[1:6] we get that names(x[,1:2]) is NULL.
  • Allow attributes dim and dimnames for environments. Currently we get attr(env, "dim") <- c(2, 3) : invalid first argument.

Function calls

  • Explicitly specify the value of an argument as "missing". For instance, calling value <- missing() and foo(x=value) should resolve missing(x) as TRUE. Comment: See matrixStats discussion.
    • This is already possible by doing foo(x=,). This works even if foo only takes one argument. /GB
      • Clarified my wish; value needs to be passed explicitly, i.e. need to be able to call foo(a=v1, b=v2, c=v3) where zero or more of v1, v2 and v3 should be missing so sometimes foo(a=, b=v2, c=v3) and sometimes foo(a=v1, b=, c=). /HB
        • So long as you are in a call-frame and the value is being passed from above, you can do this, e.g. the code below. Can you tell me more about the context of the use-case here? /GB
f = function(a) f2(b = a)
f2 = function(b) missing(b)
f()
[1] TRUE
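
A workaround that is available today, short of the wished-for missing() constructor, is to build the argument list with alist() and dispatch via do.call(). This is only a sketch; foo and its arguments a, b and c are the placeholder names from the discussion above.

```r
# foo() reports which of its arguments are missing.
foo <- function(a, b, c) c(a = missing(a), b = missing(b), c = missing(c))

# alist() keeps empty arguments as "missing" instead of evaluating them,
# so the set of missing arguments can be decided programmatically.
args <- alist(a = , b = 2, c = )
res <- do.call(foo, args)
res
```

The drawback, and presumably the reason for the wish, is that the empty argument cannot be stored in an ordinary variable and passed along like any other value.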

Evaluation

  • value <- sandbox(...) which analogously to evalq(local(...)) evaluates an R expression but without leaving any side effects and preserving all options, environments, connections, sinks, graphics devices, etc. The effect should be as if evaluating the expression in a separate R process (after importing global variables and loading packages) and returning the value to the calling R process.

  • source(..., args=...) - pass / override command-line arguments when calling source().

  • rscript(..., args=...) - run an R script (with command-line arguments) in a separate process (via system2("Rscript", ...)). Should (optionally?) preserve the same setup (e.g. .libPaths(), options(), ...) as the calling R session.

Parallel processing

  • Allow for mc.cores = 0 in parallel::mclapply() and friends, cf. Issue #7

Graphics

  • Support for one-sided plot limits, e.g. plot(5:10, xlim=c(0,+Inf)) where xlim[2] is inferred from data, cf. xlim=NULL.

  • Standardized graphics device settings and API. For instance, we have ps.options() but no png.options(). For some devices we can set the default width and height, whereas for others the defaults are hardwired to the arguments of the device function. Comment: The R.devices package tries to work around this.
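
Until one-sided limits are supported natively, the idea can be approximated by resolving the infinite entries before plotting. The helper resolve_lim() below is a hypothetical name, and the sketch ignores the small padding that plot() normally adds around the data range.

```r
# Hypothetical helper: replace infinite limits with the finite data range.
resolve_lim <- function(lim, x) {
  stopifnot(length(lim) == 2L)
  r <- range(x, finite = TRUE)
  infinite <- is.infinite(lim)
  lim[infinite] <- r[infinite]  # lower from min(x), upper from max(x)
  lim
}

y <- 5:10
x <- seq_along(y)  # plot(5:10) uses the index 1:6 on the x axis
xlim <- resolve_lim(c(0, +Inf), x)
xlim
```

The result can then be passed on as plot(x, y, xlim = xlim).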

Files

  • Atomic writing to file to avoid incomplete/corrupt files being written. This can be achieved by writing to a temporary file/directory and then renaming it when writing/saving is complete. This can be made optional, e.g. saveRDS(x, file="foo.rds", atomic=TRUE).
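
A minimal sketch of the write-then-rename approach, using a hypothetical wrapper save_rds_atomic() rather than the wished-for atomic=TRUE argument. It assumes the temporary file lives in the same directory as the target, so that the rename does not cross file systems; whether the rename itself is truly atomic depends on the operating system.

```r
# Hypothetical wrapper: write to a temporary file, then rename into place.
save_rds_atomic <- function(x, file) {
  tmp <- tempfile(pattern = paste0(basename(file), "."),
                  tmpdir = dirname(file), fileext = ".tmp")
  # Clean up the temporary file if anything fails before the rename.
  on.exit(if (file.exists(tmp)) file.remove(tmp), add = TRUE)
  saveRDS(x, file = tmp)
  if (!file.rename(tmp, file)) {
    stop("failed to rename ", tmp, " to ", file)
  }
  invisible(file)
}

pathname <- tempfile(fileext = ".rds")
save_rds_atomic(1:3, pathname)
readRDS(pathname)
```

A reader of the target file then sees either the old content or the complete new content, never a partially written file.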

Basic classes

  • A simple class for files, e.g. pathname <- p("R/zzz.R") and pathnames <- p("R/000.R", "R/zzz.R"). Moreover, for instance, pathnames <- dir("R/") should effectively return pathnames <- p(dir("R/")).

  • A simple class for regular expressions, e.g. gsub(re("^[a-z]+"), x). Also fixed expressions, e.g. gsub(fe("(abc)"), x). This could allow for things such as using x[re("a.a")] to get the subset x[c("aba", "aea")].

R system and configuration

Calling R

  • Support URLs in addition to local files when calling R -f or Rscript, e.g. Rscript http://callr.org/install#MASS.

  • Package scripts via Rscript R.rsp::rfile, which calls script rfile.R in system.file("exec", package="R.rsp") iff it exists. Similarly for R CMD, e.g. R CMD R.rsp::rfile. Also, if package is not explicitly specified, the exec directory of all packages should be scanned (only for R CMD), e.g. R CMD rfile. See also R-devel thread R CMD <custom>?

  • R CMD check --flavor=<flavor>: Instead of hard-coded tests as in R CMD check --as-cran, support for custom test suites, which themselves could be R packages, e.g. R CMD check --flavor=CRAN (R package check.CRAN) and R CMD check --flavor=Bioconductor (R package check.Bioconductor). In the bigger picture, this will separate R core and CRAN.

  • Rscript -p <n> foo.R (or --processes=<n>) for specifying that a (maximum of) <n> cores may be used including the main process. This would set option mc.cores to <n>-1, cf. help('options'). As an alternative, environment variable R_PROCESSES can be set. The default is <n> = 1. See also R-devel thread 'SUGGESTION: Environment variable R_MAX_MC_CORES for maximum number of cores'.

Exception handling and core dumps

  • Use "An exceptional error occurred that R could not recover from. The R session is now aborting ..." instead of just "aborting ...", because with the latter it is not always clear where that message comes from, i.e. it could have been output by something else.

Random number generation (RNG)

  • Function randomSeed(action, seed, kind) for interacting with .Random.seed: .Random.seed holds the current RNG state. It must live in the global environment (it's ignored anywhere else). If the RNG state is not initiated, .Random.seed does not exist. /HB
  • FACT: Because one cannot assume that .Random.seed exists, one must always check with exists(".Random.seed", envir=globalenv(), inherits=FALSE). Even if R would always initiate the RNG state, .Random.seed could be removed by the user or other code at any time. /HB
  • FACT: The above leads to cumbersome code for getting, setting and resetting the RNG state involving exists(".Random.seed", envir=globalenv(), inherits=FALSE), get(".Random.seed", envir=globalenv(), inherits=FALSE), assign(".Random.seed", seed, envir=globalenv(), inherits=FALSE) and rm(".Random.seed", envir=globalenv(), inherits=FALSE) calls. Also, R CMD check will complain about assignments to the global environment, so one needs to trick it by working with envir=genv where genv <- globalenv(). /HB
  • PROPOSAL: Hide the above mess by randomSeed(action, seed, kind), where randomSeed("get") would return the current value of .Random.seed (or NULL if non-existing), and randomSeed("set", seed=s) would assign .Random.seed <- s (unless length(s) == 1L, in which case set.seed(s) is called instead). With randomSeed("set", seed=s, kind=k) one can set RNGkind(k) and the new seed at the same time. This function would also push/pop the current RNG (kind, seed) states such that they can be reset by randomSeed("reset"). For L'Ecuyer-CMRG RNG streams (useful for asynchronous processing), randomSeed("advance") could be used to advance to the next RNG stream. /HB
  • PROPOSAL: With a function, such as randomSeed(), R could do much more validation and eventually move away from having .Random.seed in the global environment, which is rather unsafe and error prone. /HB
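
To make the proposal concrete, here is a rough sketch of the "get" and "set" actions only. The name random_seed() is hypothetical, and the kind argument as well as the "reset" and "advance" actions are left out. It uses the genv <- globalenv() trick mentioned above.

```r
# Hypothetical sketch of randomSeed("get") and randomSeed("set", seed = s).
random_seed <- function(action = c("get", "set"), seed = NULL) {
  action <- match.arg(action)
  genv <- globalenv()
  old <- if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
    get(".Random.seed", envir = genv, inherits = FALSE)
  } else NULL
  if (action == "get") return(old)

  if (is.null(seed)) {
    # Restoring a non-initiated RNG state means removing .Random.seed
    if (!is.null(old)) rm(list = ".Random.seed", envir = genv)
  } else if (length(seed) == 1L) {
    set.seed(seed)
  } else {
    assign(".Random.seed", seed, envir = genv)
  }
  invisible(old)  # the previous state, so that the caller can restore it
}

set.seed(42)
state <- random_seed("get")
x <- runif(3)
random_seed("set", seed = state)
y <- runif(3)  # same draws as x, since the RNG state was restored
```

Returning the previous state from "set" is a design choice of this sketch; the proposal instead keeps a stack of states inside the function.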

Packages, libraries and repositories

  • Enforce that all namespaces can be unloaded / all packages be detached. /HB

  • The system-library directory should be read-only after installing R and/or not accept installation of non-base packages. If additional packages are installed there, an end-user is forced to have those packages on their library path. It is better to install any additional site-wide packages in a site-wide library, cf. .Library.site and R_LIBS_SITE. This way the user can choose to include the site-wide library/libraries or not.

  • One package library per repository, e.g. ~/R/library/3.1/CRAN/, ~/R/library/3.1/Bioconductor/, and ~/R/library/3.1/R-Forge/. This way it is easy to include/exclude complete sets of packages. install.packages() should install packages to the corresponding directory, cf. how update.packages() updates packages where they live (I think).

  • Repository metadata that provides information about a repository. This can be provided as a DCF file REPOSITORY in the root of the repository URL, e.g. http://cran.r-project.org/REPOSITORY and http://www.bioconductor.org/packages/release/bioc/REPOSITORY. The content of REPOSITORY could be:

Repository: BioCsoft_3.1
Title: Bioconductor release Software repository
Depends: R (>= 3.2.0)
Description: R package repository for Bioconductor release 3.1 branch.
Maintainer: Bioconductor Webmaster <webmaster@bioconductor.org>
URL: http://www.bioconductor.org/packages/release/bioc
SeeAlso: http://www.bioconductor.org/about/mirrors/mirror-how-to/
IsMirror: TRUE
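
Since the proposed file is plain DCF, client code could parse it with the existing read.dcf(). The sketch below reads the example content from a text connection instead of a URL, because no such REPOSITORY file is known to exist; the field names are the ones suggested above.

```r
# Example REPOSITORY content, as proposed above (not an existing file).
lines <- c(
  "Repository: BioCsoft_3.1",
  "Title: Bioconductor release Software repository",
  "Depends: R (>= 3.2.0)",
  "Description: R package repository for Bioconductor release 3.1 branch.",
  "Maintainer: Bioconductor Webmaster <webmaster@bioconductor.org>",
  "URL: http://www.bioconductor.org/packages/release/bioc",
  "SeeAlso: http://www.bioconductor.org/about/mirrors/mirror-how-to/",
  "IsMirror: TRUE"
)

con <- textConnection(lines)
meta <- read.dcf(con)  # character matrix: one row, one column per field
close(con)

meta[1, "Repository"]
as.logical(meta[1, "IsMirror"])
```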

Build Process

  • R CMD xyz has few external hooks besides calling the cleanup scripts. It would be nice if one or more additional scripts could be called prior to R CMD build and maybe also before R CMD INSTALL. R CMD build needs to call a script that calls a) Rcpp::compileAttributes() to update RcppExports.{cpp,R} based on the declared C++ interfaces and b) roxygen2::roxygenize() to update man/ based on R/ (and this should happen after the previous step). /DE
    • I think the hooks should almost exclusively be at the R CMD INSTALL step. That is where the package library is "made" (in the sense of make). Essentially, these hooks would be extensions to the fact that you can already provide a configure script, by letting you specify, e.g. the R-based engine to build docs or auto-generate C or R code. /GB

Miscellaneous

  • Use 'KiB', 'MiB', 'GiB', 'TiB', ... for byte sizes, cf. Issue #6
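
For illustration, formatting byte sizes with binary (IEC) prefixes could look like the sketch below. The helper format_bytes() is a hypothetical name and the four significant digits are an arbitrary choice.

```r
# Hypothetical helper: format byte counts using IEC binary prefixes.
format_bytes <- function(n) {
  units <- c("B", "KiB", "MiB", "GiB", "TiB")
  k <- findInterval(n, 1024^(1:4))  # 0 for < 1 KiB, 1 for < 1 MiB, ...
  sprintf("%.4g %s", n / 1024^k, units[k + 1L])
}

format_bytes(c(512, 1536, 1024^2))
```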