Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#5636 #2

Open
DorisAmoakohene opened this issue Oct 6, 2023 · 4 comments
Open

#5636 #2

DorisAmoakohene opened this issue Oct 6, 2023 · 4 comments

Comments

@DorisAmoakohene
Copy link
Owner

Rdatatable/data.table#5636

This is a performance regression I want to reproduce, is trying to compare the performance of data.table and repair_sum2one.

I want to draw a graph showing the time each of the two took and compare, since the author stated that repair_sum2one is 5* faster than data.table

my result gives me a plot with just one point.
@tdhock kindly assist.

parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                   digits = 4)
n <- 10000
newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                dummy = sample(c("d1", "d2"), n, replace = TRUE))

repair_sum2one <- function(configuration, parameters) {
  isreal <- names(which(parameters$types[colnames(configuration)] == "r"))
  digits <- parameters$digits[isreal]
  c_real <- unlist(configuration[isreal])
  c_real <- c_real / sum(c_real)
  c_real[-1] <- round(c_real[-1], digits[-1])
  c_real[1] <- 1 - sum(c_real[-1])
  configuration[isreal] <- c_real
  return(configuration)
}

atime.list <- atime::atime(
  N = nrow(newConfigurations),
  setup = {
    parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                       digits = 4)
    n <- 10000
    newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                    dummy = sample(c("d1", "d2"), n, replace = TRUE))
  },
  "repair_sum2one" = {
    j <- colnames(newConfigurations)
    for (i in seq_len(nrow(newConfigurations))) {
      set(newConfigurations, i, j = j,
          value = repair_sum2one(as.data.frame(newConfigurations[i]), parameters))
    }
  },
  verbose = 2
)

Screenshot 2023-10-06 100915

@tdhock
Copy link

tdhock commented Oct 6, 2023

first of all I don't think this issue should "compare the performance of data.table and repair_sum2one" because data.table is being used to set the value which is computed by repair_sum2one. In this issue I think you should compare the current version with a past version, or with a PR which fixes this issue.
The reason why there is only one data point (N=10000) is because that is the only one you specified,

atime.list <- atime::atime(
  N = nrow(newConfigurations),
  setup = {
    parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                       digits = 4)
    n <- 10000
    newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                    dummy = sample(c("d1", "d2"), n, replace = TRUE))
  },

instead, you should specify a range on log scale, like as.integer(10^seq(1, 4, by=0.5)) etc.
Also your setup is problematic because N is un-used. setup should depend on big N, which is the data size, that is varied by atime. (little n is different, and maybe was used in the original example, but atime uses big N)

@DorisAmoakohene
Copy link
Owner Author

@tdhock well noted let me get on that now

@tdhock
Copy link

tdhock commented Oct 9, 2023

please try comparing current data.table master with Rdatatable/data.table#4488 which is apparently a fix for this issue.

@DorisAmoakohene
Copy link
Owner Author

sure, on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants