Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a single column to be used as rownames in as.matrix #2702

Merged
merged 22 commits into from
Apr 7, 2018
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,8 @@ The response has been positive ([this tweet](https://twitter.com/MattDowle/statu
```
Thus, `with=` should no longer be needed in any cases. Please change to using the `..` prefix and in a few years we will start to formally deprecate and remove the `with=` parameter. If this is well received, the `..` prefix could be expanded to symbols appearing in `i=` and `by=`, too.

19. `as.matrix.data.table` method now has an additional `rownames` argument allowing for a single column to be used as the `rownames` after conversion to a `matrix`. Thanks to @sritchie73 for the suggestion, use cases, [#2692](https://github.com/Rdatatable/data.table/issues/2692) and implementation [PR#2702](https://github.com/Rdatatable/data.table/pull/2702) and @MichaelChirico for additional use cases.

#### BUG FIXES

1. The new quote rules handles this single field `"Our Stock Screen Delivers an Israeli Software Company (MNDO, CTCH)<\/a> SmallCapInvestor.com - Thu, May 19, 2011 10:02 AM EDT<\/cite><\/div>Yesterday in \""Google, But for Finding
Expand Down
63 changes: 55 additions & 8 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1881,17 +1881,64 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# x
#}


as.matrix.data.table <- function(x,...)
{
dm <- dim(x)
cn <- names(x)
as.matrix.data.table <- function(x, rownames, ...) {
rn <- NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we have a style guide on this, but I note that the corresponding CRAN cheat for [.data.table symbols are defined in the package environment rather than the function body:

https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L11

  • side is cutting that tiny bit of overhead for use cases that might repeatedly call this method; - side is increasing potential for unintentional collisions (so if moving outside the body, perhaps use some more obscure name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are defined there because they are exported. rn won't be used by user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap - rn is internal here, it will contain the vector of rownames to put in the matrix (after all the processing in if (!missing(rownames)) {}. rnc will contain the index of the column in x to be dropped.

rnc <- NULL
if (!missing(rownames)) { # Convert rownames to a column index if possible
if (is.null(rownames)) {
warning("rownames is NULL, ignoring rownames")
} else if (length(rownames) == nrow(x)) {
# rownames argument is a vector of row names, no column in x to drop.
rn <- rownames
rnc <- NULL
} else if (length(rownames) != 1) {
stop("rownames must be a single column in x or a vector of row names of length nrow(x)")
} else if (is.na(rownames)) {
warning("rownames is NA, ignoring rownames")
} else if (identical(rownames, FALSE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find it a bit weird that rownames = TRUE is accepted, but rownames = FALSE is incorrect usage and results in this warning. Setting rownames dynamically by some condition evaluating to TRUE or FALSE seems like a natural use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop the warning

warning("rownames is FALSE, ignoring rownames")
} else if (!(is.logical(rownames) || is.character(rownames) || is.numeric(rownames))) {
# E.g. because rownames is some sort of object that cant be converted to a column index
stop("rownames must be TRUE, a column index, a column name in x, or a vector of row names")
} else {
if (identical(rownames, TRUE)) {
if (haskey(x)) {
rownames <- key(x)
if (length(rownames) > 1) {
warning("rownames is TRUE but multiple keys found in key(x), using first column instead")
rownames <- 1
}
} else {
rownames <- 1
}
}
if (is.character(rownames)) { # Handles cases where rownames is a column name, or key(x) from TRUE
rnc <- chmatch(rownames, names(x))
if (is.na(rnc)) stop(rownames, " is not a column of x")
} else { # rownames is an index already
if (rownames < 1 || rownames > ncol(x))
stop("rownames is ", rownames, " which is outside the column number range [1,ncol=", ncol(x), "]")
rnc <- rownames
}
}
}
# If the rownames argument has been used, and is a single column,
# extract that column's index (rnc) and drop it from x
if (!is.null(rnc)) {
rn <- x[[rnc]]
dm <- dim(x) - c(0, 1)
cn <- names(x)[-rnc]
X <- x[, .SD, .SDcols = cn]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x <- x[, -..rnc] now works here in dev instead of these 2 lines. A copy is needed in this case at this point, iiuc, otherwise, x[, (rnc):=NULL] to remove that column by reference, currently. And maybe x[, ..rnc := NULL] in future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes I discovered -.. was implemented as I was working on this. Ultimately I decided not to use -..rnc as you then have to define a dummy ..rnc at the top of the function to avoid R CMD check --as-cran complaining ..rnc is a global variable with no visible binding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I see. That is a shame about the no visible binding note.
In general we've tried to use idiomatic data.table internally, and then deal with the no-visible-binding note explicitly by adding a dummy NULL as you say. This way, when folk look at the internals they see how we use data.table ourselves. If and when there is ever a solution for the CRAN note, we can fix it one distinct place by taking away the NULL definitions.

} else {
dm <- dim(x)
cn <- names(x)
X <- x
}
if (any(dm == 0L))
return(array(NA, dim = dm, dimnames = list(NULL, cn)))
return(array(NA, dim = dm, dimnames = list(rn, cn)))
p <- dm[2L]
n <- dm[1L]
collabs <- as.list(cn)
X <- x
class(X) <- NULL
non.numeric <- non.atomic <- FALSE
all.logical <- TRUE
Expand Down Expand Up @@ -1936,7 +1983,7 @@ as.matrix.data.table <- function(x,...)
}
X <- unlist(X, recursive = FALSE, use.names = FALSE)
dim(X) <- c(n, length(X)/n)
dimnames(X) <- list(NULL, unlist(collabs, use.names = FALSE))
dimnames(X) <- list(rn, unlist(collabs, use.names = FALSE))
X
}

Expand Down
23 changes: 23 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -11544,6 +11544,29 @@ test(1894.11, DT[, sum(z)*..z], 72L)
setnames(DT, "z", "..z")
test(1894.12, DT[, sum(y)*..z], error="..z in j is looking for z in calling scope, but a column '..z' exists. Column names should not start with ..")

# Allow column to be used as rownames when converting to matrix #2702
DT = data.table(id = letters[1:4], X = 1:4, Y = 5:8)
mat <- matrix(1:8, ncol = 2, dimnames=list(letters[1:4], c("X", "Y")))
mat2 <- matrix(c(letters[1:4], 1:8), ncol=3, dimnames=list(NULL, c("id", "X", "Y")))
mat3 <- matrix(c(letters[1:4], 1:8), ncol=3, dimnames=list(1:4, c("id", "X", "Y")))
test(1895.01, as.matrix(DT, 1), mat)
test(1895.02, as.matrix(DT, "id"), mat)
test(1895.03, as.matrix(DT, TRUE), mat)
setkey(DT, id)
test(1895.04, as.matrix(DT, TRUE), mat)
test(1895.05, as.matrix(DT, 1:4), mat3)
# errors
test(1895.06, as.matrix(DT, -1), error="rownames is -1 which is outside the column number range")
test(1895.07, as.matrix(DT, "Z"), error="Z is not a column of x")
test(1895.08, as.matrix(DT, c(1,2)), error="rownames must be a single column in x or a vector of row names of length nrow(x)")
test(1895.09, as.matrix(DT, complex(1)), error="rownames must be TRUE, a column index, a column name in x, or a vector of row names")
# warnings
test(1895.10, as.matrix(DT, NA), mat2, warning="rownames is NA, ignoring rownames")
test(1895.11, as.matrix(DT, NULL), mat2, warning="rownames is NULL, ignoring rownames")
test(1895.12, as.matrix(DT, FALSE), mat2, warning="rownames is FALSE, ignoring rownames")
setkey(DT, id, X)
test(1895.13, as.matrix(DT, TRUE), mat, warning="rownames is TRUE but multiple keys found in key(x), using first column instead")


###################################
# Add new tests above this line #
Expand Down
63 changes: 63 additions & 0 deletions man/as.matrix.Rd
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
\name{as.matrix}
\alias{as.matrix}
\alias{as.matrix.data.table}
\title{Convert a data.table to a matrix}
\description{
Converts a \code{data.table} into a \code{matrix}, optionally using one
of the columns in the \code{data.table} as the \code{matrix} \code{rownames}.
}
\usage{
\method{as.matrix}{data.table}(x, rownames, ...)}

\arguments{
\item{x}{a \code{data.table}}
\item{rownames}{optional, a single column name or column index to use as
the \code{rownames} in the returned \code{matrix}. If \code{TRUE} the
\code{\link{key}} of the \code{data.table} will be used if it is a
single column, otherwise the first column in the \code{data.table} will
be used. Alternative a vector of length \code{nrow(x)} to assign as the
row names of the returned \code{matrix}.}
\item{\dots}{additional arguments to be passed to or from methods.}
}

\details{
\code{\link{as.matrix}} is a generic function in base R. It dispatches to
\code{as.matrix.data.table} if its \code{x} argument is a \code{data.table}.

The method for \code{data.table}s will return a character matrix if there
are only atomic columns and any non-(numeric/logical/complex) column,
applying \code{\link{as.vector}} to factors and \code{\link{format}} to other
non-character columns. Otherwise, the usual coercion hierarchy (logical <
integer < double < complex) will be used, e.g., all-logical data frames
will be coerced to a logical matrix, mixed logical-integer will give an
integer matrix, etc.

An additional argument \code{rownames} is provided for \code{as.matrix.data.table}
to facilitate conversions to matrices where the \code{\link{rownames}} are stored
in a single column of \code{x}, e.g. the first column after using
\code{\link{dcast.data.table}}.
}

\value{
A new \code{matrix} containing the contents of \code{x}.
}

\seealso{
\code{\link{data.table}}, \code{\link{as.matrix}}, \code{\link{data.matrix}}
\code{\link{array}}
}

\examples{
(dt1 <- data.table(A = letters[1:10], X = 1:10, Y = 11:20))
as.matrix(dt1) # character matrix
as.matrix(dt1, rownames = "A")
as.matrix(dt1, rownames = 1)
as.matrix(dt1, rownames = TRUE)

(dt1 <- data.table(A = letters[1:10], X = 1:10, Y = 11:20))
setkey(dt1, A)
as.matrix(dt1, rownames = TRUE)
}

\keyword{ array }