Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create custom PipeOp - winsorization #583

Open
MislavSag opened this issue May 31, 2021 · 2 comments
Open

Create custom PipeOp - winsorization #583

MislavSag opened this issue May 31, 2021 · 2 comments
Labels
Type: New PipeOp Issue suggests a new PipeOp

Comments

@MislavSag
Copy link

Hi,

I am trying to develop my first custom PipeOp and kindly ask for feedback on my try.

I want to implement a winsorization process on train / test data set.

Here is my try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "scale.always.simple", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("minval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamDbl$new("maxval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, minval = NULL, maxval = NULL, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals)
    }
  ),
  
  private = list(
    
    .train_dt = function(dt, levels, target) {
      self$state <- list()
      pv = self$param_set$get_values(tags = "winsorize_tag")
      wnsr = lapply(dt, winsorize)
      wnsr
    },
    
    .predict_dt = function(dt, levels) {
      dt
      }
  )
)


# This function is minor modification of Winsorize function from DescTools package
winsorize <- function (x, minval = NULL, maxval = NULL, probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7) {
  if (is.null(minval) || is.null(maxval)) {
    xq <- quantile(x = x, probs = c(probs_low, probs_high), na.rm = na.rm, type = type)
    if (is.null(minval)) 
      minval <- xq[1L]
    if (is.null(maxval)) 
      maxval <- xq[2L]
  }
  x[x < minval] <- minval
  x[x > maxval] <- maxval
  return(x)
}


# try pipe
task = tsk("iris")
task$head()
task$data()[, max(Petal.Length)] # 6.9
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new())
gr$plot()
result = gr$train(task)
result[[1]]$data()
result[[1]]$data()[, max(Petal.Length)] # 6.1

I would like to know:

  1. Is this a good approach to develop this kind of PipeOps?
  2. Does this approach apply winsorize on whole data at one or it applies winsorize on train and test set separately?
  3. Should I make PR for this function? If yes, where do you have guidelines for PR's?

Best,

Mislav

@pfistfl
Copy link
Member

pfistfl commented Jul 14, 2021

Hi @MislavSag

Sorry for the silence, a few comments and perhaps answers:

  1. In general yes!
  2. The $train function decides what happens with the training data. In order to apply to test data, you would need to also apply winsorize in $predict. But in order to do this, you would need to store the quantiles from training data and apply those during test. Integrating winsorize as a member function, therefore, seems sensible.
  3. In general yes, in practice it depends how often we think the feature will be used. We are currently focussing a little more on stability and speedups of the core package and less on new features. I think we need to think about new PipeOps like the ones proposed and how we can maintain them.
    https://github.com/mlr-org/mlr3#contributing-to-mlr3

@MislavSag
Copy link
Author

MislavSag commented Feb 2, 2023

After long time I have started to develope in mlr3 again and I neew to use winsorization again :)

Here is my current try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
library(paradox)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    groups = NULL,
    initialize = function(id = "winsorization", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals, feature_types = c("numeric", "integer"))
    }
  ),

  private = list(

    .select_cols = function(task) {
      self$groups = task$groups
      task$feature_names
    },

    .train_dt = function(dt, levels, target) {
      # params
      pv = self$param_set$get_values(tags = "winsorize_tag")

      # state variables
      if (!(is.null(self$groups))) {
        row_ids  = self$groups[group == self$groups[nrow(self$groups), group], row_id]
        q = dt[row_ids, lapply(.SD,
                               quantile,
                               probs = c(pv$probs_low, pv$probs_high),
                               na.rm = pv$na.rm,
                               type = pv$type)]
      } else {
        q = dt[, lapply(.SD,
                        quantile,
                        probs = c(pv$probs_low, pv$probs_high),
                        na.rm = pv$na.rm,
                        type = pv$type)]
      }
      self$state = list(
        minvals = q[1],
        maxvals = q[2]
      )

      # dt object train
      if (!(is.null(self$groups))) {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        }), by = self$groups[, group]]
        dt = dt[, -1]
      } else {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        })]
      }
      dt
    },

    .predict_dt = function(dt, levels) {
      dt = dt[, Map(function(a, b) ifelse(a < b, b, a), .SD, self$state$minvals)]
      dt = dt[, Map(function(a, b) ifelse(a > b, b, a), .SD, self$state$maxvals)]
      dt
    }
  )
)


library("mlr3")

# no group variable
task = tsk("iris")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()
predres = gr$predict(task)

# group variable
dt = tsk("iris")$data()
dt[, monthid := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species")
task$set_col_roles("monthid", "group")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()


I kindly as k for your feedback. Would you use similar apprach?

I am trying to apply winsorization on tables with a time dimensions. In this, TS setting, we ussually want to apply winsorization on cross section basis. In my case across months. I couldn't figure out a better approach than using the groups filed. I am not sure if that would mess integration with other pipes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: New PipeOp Issue suggests a new PipeOp
Projects
None yet
Development

No branches or pull requests

3 participants