Discussion: ColumnOptions actually a good name? #2884
Comments
Another thing is in the case of |
On your last note, @Ivanidzo4ka suggested to internalize the I agree with that approach. |
I feel quite strongly about keeping the name of the
The reason why I believe that we should not rename In the My intuition for question 2 is therefore that we should have a separate class that contains the parameters of the trained transformations that we want to make public and that it should not be named Regarding question 1 I think it would be beneficial for the consistency of the code base to make the fields mutable like in the trainers |
Then you would need to create two objects: one with mutable fields (for configuration during initial construction) and one with immutable fields (to actually store inside the estimators and transformers). Otherwise you have estimators and transformers whose schema propagation and transformation logic can change, and our entire pipeline is based on the assumption that it does not change. |
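To make the two-object idea concrete, here is a minimal sketch in Python rather than in the library's own language. Every name in it (`ColumnSettings`, `ColumnSnapshot`, `Estimator`, `ngram_length`) is hypothetical and chosen for illustration only; none of them is the actual ML.NET API. It shows a mutable settings object that the user fills in, and an immutable copy that the estimator takes at construction so that later edits to the settings cannot change its behavior.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass
class ColumnSettings:
    """Hypothetical mutable per-column settings, edited freely by the user."""
    output_name: str
    input_name: str
    ngram_length: int = 2


@dataclass(frozen=True)
class ColumnSnapshot:
    """Hypothetical immutable copy held inside the estimator."""
    output_name: str
    input_name: str
    ngram_length: int


class Estimator:
    """Hypothetical estimator that never stores the mutable settings object."""

    def __init__(self, settings):
        # Copy every field now, so later changes to `settings` are not observed.
        self._columns = tuple(
            ColumnSnapshot(s.output_name, s.input_name, s.ngram_length)
            for s in settings
        )

    def output_columns(self):
        # Schema propagation reads only the frozen snapshots.
        return [c.output_name for c in self._columns]


settings = ColumnSettings(output_name="Features", input_name="Text")
estimator = Estimator([settings])

# Mutating the user's object after construction has no effect on the estimator.
settings.output_name = "Renamed"
print(estimator.output_columns())
```

The user can keep holding on to `settings` if they want to inspect or reuse it; the estimator does not need to expose it.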
Well in the trainers we solve this issue by simply not allowing access to the |
If you believe that it is valuable to have access to the settings of a trainer/transformer once it is trained (say for model tracking) maybe we should make the |
Right. Not only do we not even expose access to them, we shouldn't be even storing the
This is not what I am saying. I think it's useful to have this mutable thing. I also see no reason for it to be accessible through the estimator itself -- that's pointless, and can be easily solved by having the user just hold on to the object themselves. You also cannot expose a mutable object via estimators and transformers, and we do not. I am asking a rather simpler question than anything posed here: given all that, is it appropriate for this thing to be named |
The following FYI are the estimators where we expose an extension method on
So I'm wondering, for these, do we need for v1 these "per column" configuration options, if these things are not acting like options? (If we do we can do that for post v1.) And if they're not acting like options, once we hide them, should we rename these back to info? Just wondering about that. |
We talked offline and here is the conclusion we reached and the related work item. Initially the The work item is:
|
Thank you for writing this up, @artidoro! Let me just be explicit on one point:
While there may be similarities, I think the internal immutable structure may differ in some important respects, since it represents two distinct states -- the settings we use to control training, and the results of that training. So they needn't resemble each other any more than training hyperparameters are the result of training a model. Let's give a specific example. So, for example, if we take the value to key mapping estimator and transformer, it will differ in that when configuring the estimator, we will have a parameter that controls whether the keys and values are ordered. The configuration state controls this parameter. But the trained state has no need to -- it is irrelevant by that time, the ordering of the keys is now set. Likewise, the trained state contains the mapping from keys to values, which is information that simply was not knowable. So I would say that this requirement that they resemble each other is perhaps not correct. I would say that, to speak more properly, and to speak in a way that reflects what we have done with the command line tool and suchlike, that the |
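The value-to-key mapping example above can be sketched as follows, again in Python and with purely hypothetical names (`KeyMappingSettings`, `TrainedKeyMapping`, `fit`, `sort_keys`); this is not the actual estimator or transformer. The point is only that the configuration carries the ordering flag, while the trained state carries the learned keys and has no ordering flag at all.

```python
from dataclasses import dataclass


@dataclass
class KeyMappingSettings:
    """Hypothetical training-time configuration."""
    input_name: str
    output_name: str
    sort_keys: bool = False  # hypothetical flag: order keys by value, or by first occurrence


@dataclass(frozen=True)
class TrainedKeyMapping:
    """Hypothetical trained state: the ordering is already fixed in `keys`."""
    input_name: str
    output_name: str
    keys: tuple  # position in the tuple is the key index


def fit(settings, values):
    # dict.fromkeys keeps the first occurrence of each value, in order.
    distinct = list(dict.fromkeys(values))
    if settings.sort_keys:
        distinct.sort()
    # The ordering flag is consumed here and is not carried into the result.
    return TrainedKeyMapping(settings.input_name, settings.output_name, tuple(distinct))


trained = fit(KeyMappingSettings("Label", "LabelKey", sort_keys=True), ["b", "a", "b", "c"])
print(trained.keys)
```

In this sketch the two classes share the column names but otherwise hold different information, which is the distinction the comment draws between settings that control training and the results of that training.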
Or maybe I misunderstand and am conflating the immutable structures in the estimators and transforms, and you were talking about the other one. ;-) That's more likely upon reflection, sorry. |
My first attempt at solving this in PR #2893 brought up two issues:
|
@TomFinley suggests to hide The benefit of this approach is that it would allow us to spend more time thinking about a better solution to the problem post v1. The downside for v1 is that we will not allow a transformation to be applied on multiple columns at the same time and we will not provide a structure to specify the settings of a transformer/estimator. |
So at this point we have the following possible approaches to solve the issue:
What are your thoughts @eerhardt, @TomFinley, @Ivanidzo4ka, @shauheen, @sfilipi ? |
I think long term, the third option would align with the trainers, entry point and cmd usage. The only debate, IMO, is whether there is enough time to do it. |
I agree @sfilipi, and I kind of feel like we don't have enough time. I thought we might, till I saw the issues that arose in the PR, and that made me super nervous. The change is actually I think fairly involved. The class that we're talking about is in many cases not over the We got away with this sort of transition far more easily when we moved from Will it just be a simple matter of just "copying" the Sometimes those options classes are tied up in the inheritance structure of those
But even if we somehow magically did all that work in the next few minutes and it was done, we still have a major problem. These options classes collectively have many hundreds of public members, whose surface has not been reviewed at all. All the work we've previously done via @rogancarr and @sfilipi to make sure they were consistent would have to be done on them immediately, and we simply do not have time to do that, even if the work was somehow able to be completed and checked in right now.
There's not enough time till March 26. To say we're going to do all this work in the next 10 working days is, I think, unrealistic. I'd be surprised if it took less than a month to do it altogether, much less review the public surface etc. And for what? So that you can in the public API transform multiple columns at once, a scenario that at least internally has seen only nominal usage? Even in the unlikely event that becomes important, I think we can come up with a simpler answer to that problem if it really comes down to it.
So: I think hiding and working towards imagining a better solution is a perfectly fine answer here. In the best case, we discover it isn't actually that important. And if they do, maybe we can do something a bit more specialized. (I think only a handful of transformers really benefit from the multi-column logic.) |
In the short term, we can always give work-around samples, so I'm okay with hiding and delaying to v1.1. |
So @artidoro these are the ones I have observed as being practically useful in that list:
Note also that the hash versions of the value-to-key are also useful for multi-column mapping. |
Any reason not to add to all of them that support multiple columns?
|
Summarizing what happened after the above discussion:
What we need to do before v1:
The above lists by Tom and Gleb were answering the question of which transforms should have multicolumn mapping re-enabled. |
What is the best way to re-enable multi-column mapping for the few transforms that would benefit from it?
This makes me wonder whether it is even a good idea to introduce multi-column mapping until we expose @TomFinley @eerhardt @glebuk @sfilipi @singlis @wschin @Ivanidzo4ka |
After discussing with @shauheen we decided to do this post v1. So we will not enable multi-column mapping before v1. |
Closing as this is not a Project 13 issue any longer. No breaking API change is required. |
In #2878, @eerhardt had a comment that we should consider, the gist of which was: since all of our Options classes have mutable properties, is it appropriate for ColumnOptions to be called this, since they are not and often cannot be mutable? We also have issue #2854 where @rogancarr thought he couldn't get normalization information out of the structure since it was named options, so this is not actually as academic an issue as I might have thought, say, a few days ago.
The approach taken in #2709 was that these structures created for configuration of the per-column options should be called options, and that it was (apparently) assumed to be irrelevant whether those items were mutable or not. Now, I'm not saying we should revert that PR necessarily, but it is something to consider, since it seems to be confusing people.
Now then, the structures themselves obviously must not be mutable, since they are often the same structures used in the associated estimators and transformers to project schema, e.g., here it is for the n-gram hashing estimator:
machinelearning/src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs
Line 1077 in a558010
Here it is in the transformer:
machinelearning/src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs
Line 1077 in a558010
So, just something to think about, whether it was in fact a good idea for this thing to be called "options" really, in all the cases we named it options. Maybe we could have a refinement on the policy of naming this thing? Or maybe we decide to just live with it, because the confusion of calling all these things "options" vs. "info" vs. "whatever" is greater than this inconsistency in roles?
I'm fine with leaving it as is, but I do see some confusion so I think we should think about it, and at least formulate a position.
/cc @eerhardt and @rogancarr and @sfilipi and @artidoro ...