#5636 #2

DorisAmoakohene · 2023-10-06T17:10:27Z

This is a performance regression I want to reproduce, is trying to compare the performance of data.table and repair_sum2one.

I want to draw a graph showing the time each of the two took and compare, since the author stated that repair_sum2one is 5* faster than data.table

my result gives me a plot with just one point.
@tdhock kindly assist.

parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                   digits = 4)
n <- 10000
newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                dummy = sample(c("d1", "d2"), n, replace = TRUE))

repair_sum2one <- function(configuration, parameters) {
  isreal <- names(which(parameters$types[colnames(configuration)] == "r"))
  digits <- parameters$digits[isreal]
  c_real <- unlist(configuration[isreal])
  c_real <- c_real / sum(c_real)
  c_real[-1] <- round(c_real[-1], digits[-1])
  c_real[1] <- 1 - sum(c_real[-1])
  configuration[isreal] <- c_real
  return(configuration)
}

atime.list <- atime::atime(
  N = nrow(newConfigurations),
  setup = {
    parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                       digits = 4)
    n <- 10000
    newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                    dummy = sample(c("d1", "d2"), n, replace = TRUE))
  },
  "repair_sum2one" = {
    j <- colnames(newConfigurations)
    for (i in seq_len(nrow(newConfigurations))) {
      set(newConfigurations, i, j = j,
          value = repair_sum2one(as.data.frame(newConfigurations[i]), parameters))
    }
  },
  verbose = 2
)

The text was updated successfully, but these errors were encountered:

tdhock · 2023-10-06T19:51:14Z

first of all I don't think this issue should "compare the performance of data.table and repair_sum2one" because data.table is being used to set the value which is computed by repair_sum2one. In this issue I think you should compare the current version with a past version, or with a PR which fixes this issue.
The reason why there is only one data point (N=10000) is because that is the only one you specified,

atime.list <- atime::atime(
  N = nrow(newConfigurations),
  setup = {
    parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                       digits = 4)
    n <- 10000
    newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                    dummy = sample(c("d1", "d2"), n, replace = TRUE))
  },

instead, you should specify a range on log scale, like as.integer(10^seq(1, 4, by=0.5)) etc.
Also your setup is problematic because N is un-used. setup should depend on big N, which is the data size, that is varied by atime. (little n is different, and maybe was used in the original example, but atime uses big N)

DorisAmoakohene · 2023-10-07T01:09:42Z

@tdhock well noted let me get on that now

tdhock · 2023-10-09T04:08:08Z

please try comparing current data.table master with Rdatatable/data.table#4488 which is apparently a fix for this issue.

DorisAmoakohene · 2023-10-09T04:09:37Z

sure, on it

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#5636 #2

#5636 #2

DorisAmoakohene commented Oct 6, 2023

tdhock commented Oct 6, 2023

DorisAmoakohene commented Oct 7, 2023

tdhock commented Oct 9, 2023

DorisAmoakohene commented Oct 9, 2023

#5636 #2

#5636 #2

Comments

DorisAmoakohene commented Oct 6, 2023

tdhock commented Oct 6, 2023

DorisAmoakohene commented Oct 7, 2023

tdhock commented Oct 9, 2023

DorisAmoakohene commented Oct 9, 2023