Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] deprecate the use of 'info' in Dataset #4573

Merged
merged 12 commits into from
Sep 1, 2021
Merged

Conversation

jameslamb
Copy link
Collaborator

Addresses the rest of #4543 (changes not covered by #4571), as part of #4310.

  • deprecates keyword argument info in lgb.Dataset()
  • add keyword arguments group, weight, init_score, and label to lgb.Dataset()

Notes for Reviewers

While working through this, I realized that today the only way to pass new parameters through to Dataset when using lgb.Dataset.create.valid() is to pass them through .... So this PR also proposes adding a new keyword params to that function (and the corresponding public method Dataset$create_valid()), so that ... can be eliminated in release 4.0.0 while still allowing customization of parameters within lgb.Dataset.create.valid() calls.

No existing unit tests need to be changed in this PR, since there are not any that initialize a list info and pass it in. All of the tests and demo examples in this project are already passing weight, group, label, and init_score directly as keyword arguments.

@jameslamb jameslamb mentioned this pull request Aug 29, 2021
21 tasks
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm qualified enough to provide a thoughtful review, but I have a few comments/questions below:

Comment on lines -62 to -65
# Store as param
params[[key]] <- additional_params[[key]]

}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like additional params like min_data are lost here before 4.0.0 release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! I don't think so, but it's not obvious at all what I did, so let me share some links.

First, Dataset in the R package is not exported, so its constructor can be considered "private" in the package's API. That isn't true for any other public methods on this class, since they can be called by user code once a function like lgb.Dataset() returns a Dataset. So as long as all the exported methods / functions maintain backwards-compatibility, no user code will be broken.

Since this PR adds each of the elements of INFO_KEYS as keyword arguments in Dataset$new(), the only thing that could be left in ... is information you want to include in params.

As of this PR, exported entrypoints that create a Dataset will handle merging params and ... before calling Dataset$new().

Dataset$create_valid()

params <- modifyList(additional_params, params)
# the Dataset's existing parameters should be overwritten by any passed in to this call
params <- modifyList(private$params, params)
# Create new dataset
ret <- Dataset$new(
data = data
, params = params
, reference = self
, colnames = private$colnames
, categorical_feature = private$categorical_feature
, predictor = private$predictor
, free_raw_data = private$free_raw_data
, used_indices = NULL
, info = info
, label = label
, weight = weight
, group = group
, init_score = init_score
)

lgb.Dataset()

additional_params <- list(...)
params <- modifyList(params, additional_params)
if (length(additional_params) > 0L) {
warning(paste0(
"lgb.Dataset: Found the following passed through '...': "
, paste(names(additional_params), collapse = ", ")
, ". These will be used, but in future releases of lightgbm, this warning will become an error. "
, "Add these to 'params' instead. See ?lgb.Dataset for documentation on how to call this function."
))
}
# Create new dataset
return(
invisible(Dataset$new(
data = data
, params = params
, reference = reference
, colnames = colnames
, categorical_feature = categorical_feature
, predictor = NULL
, free_raw_data = free_raw_data
, used_indices = NULL
, info = info
, label = label
, weight = weight
, group = group
, init_score = init_score
))

=====

this made me realize that one place, Dataset$slice(), is missing the "combine params before calling Dataset$new() logic. I'll open a separate PR for that.

slice = function(idxset, ...) {
additional_params <- list(...)
if (length(additional_params) > 0L) {
warning(paste0(
"Dataset$slice(): Found the following passed through '...': "
, paste(names(additional_params), collapse = ", ")
, ". These are ignored and should be removed. "
, "In future releases of lightgbm, this warning will become an error."
))
}
# Perform slicing
return(
Dataset$new(
data = NULL
, params = private$params
, reference = self
, colnames = private$colnames
, categorical_feature = private$categorical_feature
, predictor = private$predictor
, free_raw_data = private$free_raw_data
, used_indices = sort(idxset, decreasing = FALSE)
, info = NULL
, ...
)
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of this PR, exported entrypoints that create a Dataset will handle merging params and ... before calling Dataset$new().

Ah, here is the key point! Thanks a lot for as always detailed explanation!

R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
R-package/R/lgb.Dataset.R Outdated Show resolved Hide resolved
jameslamb and others added 2 commits August 31, 2021 00:32
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any other comments, thanks!

params <- modifyList(additional_params, params)

# the Dataset's existing parameters should be overwritten by any passed in to this call
params <- modifyList(private$get_params(), params)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my mistake, this should have been self$get_params(), not private. Will fix in a local commit because I also might as well merge in latest master if this PR is going to get another CI run anyway.

@jameslamb
Copy link
Collaborator Author

I'm not sure I'm qualified enough to provide a thoughtful review

Thanks very much as always for the reviews, @StrikerRUS !

I'm confident in merging this without another reviewer's approval since it's an important step towards the next release, and since:

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants