Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: ColumnOptions actually a good name? #2884

Closed
TomFinley opened this issue Mar 7, 2019 · 23 comments
Closed

Discussion: ColumnOptions actually a good name? #2884

TomFinley opened this issue Mar 7, 2019 · 23 comments
Assignees
Labels
API Issues pertaining the friendly API question Further information is requested
Milestone

Comments

@TomFinley
Copy link
Contributor

In #2878, @eerhardt had a comment that we should consider, the gist of which was, since all of our Options classes have mutable properties, is it appropriate for ColumnOptions to be called this, since they are not and often cannot be mutable? We also have issue #2854 where @rogancarr thought he couldn't get normalization information out of the structure since it was named options, so this is not actually as academic an issue than I might have thought, say, a few days ago.

The approach taken in #2709 was that these structures created for configuration of the per-column options should be called options, and that it was (apparently) assumed to be irrelevant whether those items were mutable or not. Now, I'm not saying we should revert that PR necessarily, but it is something to consider, since it seems to be confusing people.

Now then, the structures themselves obviously must not be mutable, since they are often the same structures used in the associated estimators and transformers to project schema, e.g., here it is for the n-gram hashing estimator:

private readonly ColumnOptions[] _columns;

Here it is in the transformer:

private readonly ColumnOptions[] _columns;

So, just something to think about, whether it was in fact a good idea for this thing to be called "options" really, in all the cases we named it options. Maybe we could have a refinement on the policy of naming this thing? Or maybe we decide to just live with it, because the confusion of calling all these things "options" vs. "info" vs. "whatever" is greater than this inconsistency in roles?

I'm fine with leaving it as is, but I do see some confusion so I think we should think about it, and at least formulate a psoition.

/cc @eerhardt and @rogancarr and @sfilipi and @artidoro ...

@TomFinley TomFinley added question Further information is requested API Issues pertaining the friendly API labels Mar 7, 2019
@TomFinley
Copy link
Contributor Author

Another thing is in the case of NgramHashing this ColumnOptions never actually appears anywhere as an input or an output in the public API, but that's probably just an oversight (more usual would be, expose it as a IReadOnlyCollection<ColumnOptions> on the transformer, as we do elsewhere, we just forgot to do that, but we can do that post v1 since that's strictly additive.) More a question on the naming.

@artidoro
Copy link
Contributor

artidoro commented Mar 7, 2019

On your last note, @Ivanidzo4ka suggested to internalize the IReadOnlyCollection<ColumnOptions> Columns everywhere as part of the scrubbing tasks for transforms. I believe it has already been carried out in some places.

I agree with that approach.

@artidoro
Copy link
Contributor

artidoro commented Mar 7, 2019

I feel quite strongly about keeping the name of the ColumnOptions classes used to specify the settings of the estimator/transformers as it is. However, I think there are two separate questions to answer in this issue:

  1. Should ColumnOptions have mutable fields in a similar fashion to the trainers Options?
  2. How should we access parameters for trained transformers, like in the NormalizingTransformer case brought up by Tom and Rogan?

The reason why I believe that we should not rename ColumnOptions is that I think we should not rely on ColumnOptions to access parameters of trained transformers. I think it's actually not possible to use the objects that we use to specify settings for the estimator/transformer to store trained parameters of the transformer.

In the NormalizingTransformer, for example, there is a class that is created only for the purpose of making the normalizer parameters public. ColumnOptions in NormalizingTransformer is not used to specify settings for the estimator/transformer. In fact, I think it was mistakenly renamed ColumnOptions (by me in #2709) while it should probably indicate that it provides information about the trained transformer rather than the settings of the transformer.

My intuition for question 2 is therefore that we should have a separate class that contains the parameters of the trained transformations that we want to make public and that it should not be named ColumnOptions but rather {TransformName}ColumnInfos or something along those lines.

Regarding question 1 I think it would be beneficial for the consistency of the code base to make the fields mutable like in the trainers Options. There is no need for those fields to be not mutable, as they will only be used to specify the estimator/transformer settings. But this would require some work since many of the estimators have public ColumnOptions. The question is whether the added benefit justifies the extra work.

@TomFinley
Copy link
Contributor Author

Regarding question 1 I think it would be beneficial for the consistency of the code base to make the fields mutable like in the trainers Options. There is no need for those fields to be not mutable, as they will only be used to specify the estimator/transformer settings. But this would require some work since many of the estimators have public ColumnOptions. The question is whether the added benefit justifies the extra work.

Then you would need to create two objects. One with mutable fields (for configuration during initial construction) and one with immutable fields (to actually store inside the estimators and transformers). Otherwise you have estimators and transformers with changing behavior on their schema propogation and transformation logic, and our entire pipeline is based on the assumption that that does not change.

@artidoro
Copy link
Contributor

artidoro commented Mar 7, 2019

Well in the trainers we solve this issue by simply not allowing access to the Options of the trainer once they have been specified. There is no public property that makes the Options of the trainer accessible after the trainer is instantiated.

@artidoro
Copy link
Contributor

artidoro commented Mar 7, 2019

If you believe that it is valuable to have access to the settings of a trainer/transformer once it is trained (say for model tracking) maybe we should make the Options of the trainers non mutable similar to ColumnOptions. This would prevent the need to create a second immutable class to store the Options.

@TomFinley
Copy link
Contributor Author

TomFinley commented Mar 7, 2019

Right. Not only do we not even expose access to them, we shouldn't be even storing the Options class at all. (And, in all new code I am aware of created within the past few years, we do not. Just in some legacy code we haven't had the chance to properly write.) But that still leaves us with teh question, of what is this object? If it doesn't behave like an Options, should it be named an Options?

If you believe that it is valuable to have access to the settings of a trainer/transformer once it is trained (say for model tracking) maybe we should make the Options of the trainers non mutable similar to ColumnOptions. This would prevent the need to create a second immutable class to store the Options.

This is not what I am saying. I think it's useful to have this mutable thing. I also see no reason for it to be accessible through the estimator itself -- that's pointless, and can be easily solved by having the user just hold on to the object themselves. You also cannot expose a mutable object via estimators and transformers, and we do not.

I am asking a rather simpler question than anything posed here: given all that, is it appropriate for this thing to be named ColumnOptions? Since the most essenital property of Options (indeed its only reason for existing at all) is that it is mutable, but this ColumnOptions thing, which used to be ColumnInfo, is not, and does not resemble Options at all?)

@TomFinley
Copy link
Contributor Author

The following FYI are the estimators where we expose an extension method on MLContext with the ColumnOptions.

OneHotEncodingEstimator
TypeConvertingEstimator
KeyToVectorMappingEstimator
ValueToKeyMappingEstimator
MissingValueReplacingEstimator
VectorToImageConvertingEstimator
ImagePixelExtractingEstimator
NormalizingEstimator
PrincipalComponentAnalysisEstimator
RandomFourierFeaturizingEstimator
WordEmbeddingsExtractingEstimator
LatentDirichletAllocationEstimator
NgramExtractingEstimator
WordTokenizingEstimator

So I'm wondering, for these, do we need for v1 these "per column" configuration options, if these things are not acting like options? (If we do we can do that for post v1.) And if they're not acting like options, once we hide them, should we rename these back to info? Just wondering about that.

@artidoro artidoro self-assigned this Mar 7, 2019
@artidoro
Copy link
Contributor

artidoro commented Mar 8, 2019

We talked offline and here is the conclusion we reached and the related work item.

Initially the ColumnInfo objects were meant to store internal settings of the estimator/transformers, and that's why they were immutable. However, with time we have started exposing them to the users as ways to specify settings of the transformer/estimators and subsequently we have renamed them ColumnOptions to match the Options class in trainers. This poses some problems (as it emerged from the discussion above} because as Options they should be mutable but as storage of settings they should be immutable.

The work item is:

  1. Create a new mutable class which will be named ColumnOptions that matches the immutable one.
  2. The immutable one will remain internal to store the settings of the transformer/estimators.

@TomFinley
Copy link
Contributor Author

Thank you for writing this up @artidoro ! Let me just be explicit on one point:

  1. Create a new mutable class which will be named ColumnOptions that matches the immutable one.

While there may be similarities, I think the internal immutable structure may differ in some important respects, since it represents two distinct states -- the settings we use to control training, and the results of that training. So they needn't resemble each other any more than training hyperparameters are the result of training a model.

Let's give a specific example. So, for example, if we take the value to key mapping estimator and transformer, it will differ in that when configuring the estimator, we will have a parameter that controls whether the keys and values are ordered.

The configuration state controls this parameter. But the trained state has no need to -- it is irrelevant by that time, the ordering of the keys is now set. Likewise, the trained state contains the mapping from keys to values, which is information that simply was not knowable. So I would say that this requirement that they resemble each other is perhaps not correct.

I would say that, to speak more properly, and to speak in a way that reflects what we have done with the command line tool and suchlike, that the ColumnOptions class will, when it exists, resemble the Options class that exists for the same transform. Which, admittedly, we have not exposed too often in our public API, but which exists for the sake of the command line and GUI.

@shauheen shauheen added this to the 0319 milestone Mar 8, 2019
@TomFinley
Copy link
Contributor Author

Or maybe I misunderstand and am conflating the immutable structures in the estimators and transforms, and you were talking about the other one. ;-) That's more likely upon reflection sorry

@artidoro
Copy link
Contributor

My first attempt at solving this in PR #2893 brought up two issues:

  1. We already have a class Column in the Transformers that has all the fields that we need in the ColumnOptions class. It is therefore not the right approach to create yet another one.
  2. Using the exisiting Column as the way to specify the settings of the transformers (the new ColumnOptions ) has the issue that the fields would be nullable.

@artidoro
Copy link
Contributor

@TomFinley suggests to hide ColumnOptions altogether from the public surface of the estimators/transformers #2893 (comment)

The benefit of this approach is that it would allow us to spend more time thinking about a better solution to the problem post v1.

The downside for v1 is that we will not allow a transformation to be applied on multiple columns at the same time and we will not provide a structure to specify the settings of a transformer/estimator.

@artidoro
Copy link
Contributor

artidoro commented Mar 11, 2019

So at this point we have the following possible approaches to solve the issue:

  • Hide ColumnOptions altogether from the public surface of the estimators/transformers. We would only have the simple scenario MLContext extension to instantiate the transformer.

    • 👍 Time to think about a better solution post V1
    • 👎 Not possible to define a transformation on multiple columns, and no structure to specify settings
  • Use Column

    • 👍 No new structure, can define a transformation on multiple columns
    • 👎 Column has nullable types (see
      internal sealed class Column : OneToOneColumn
      {
      [Argument(ArgumentType.AtMostOnce, HelpText = "Number of bits to hash into. Must be between 1 and 31, inclusive", ShortName = "bits")]
      public int? HashBits;
      [Argument(ArgumentType.AtMostOnce, HelpText = "Hashing seed")]
      public uint? Seed;
      [Argument(ArgumentType.AtMostOnce, HelpText = "Whether the position of each term should be included in the hash",
      ShortName = "ord")]
      public bool? Ordered;
      [Argument(ArgumentType.AtMostOnce, HelpText = "Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.",
      ShortName = "ih")]
      public int? InvertHash;
      ), we would need to do renamings for fields in the base classes and make sure that the Column class structure make sense.
  • Expose Options instead of ColumnOptions. We already have the documentation, and know what the public surface should be in terms of which fields to expose. We just need to update the names of the fields and copy the documentation from the current ColumnOptions over.

    • 👍 No new structure, more expressive API, consistent with trainers.
    • 👎 We would need to do renamings for fields in the base classes and make sure that the Column class structure make sense.

What are your thoughts @eerhardt, @TomFinley, @Ivanidzo4ka , @shauheen, @sfilipi ?
I am leaning towards the third approach the more I think about it.

@sfilipi
Copy link
Member

sfilipi commented Mar 11, 2019

I think long term, the third options would aling with the trainers, entry point and cmd usage. The only debate, IMO is: is there enough time to do it.

@TomFinley
Copy link
Contributor Author

TomFinley commented Mar 11, 2019

I think long term, the third options would aling with the trainers, entry point and cmd usage. The only debate, IMO is: is there enough time to do it.

I agree @sfilipi, and I kind of feel like we don't have enough time. I thought we might, till I saw the issues that arose in the PR, and that made me super nervous.

The change is actually I think fairly involved. The class that we're talking about is in many cases not over the Estimators, but rather an entirely different thing called an IDataTransform. (We retired IDataTransform from the public surface per #1995, but it remains sort of the "central" concept of entry-points, command line, etc., and is often tightly coupled with the inheritance structure of that containing class.)

We got away with this sort of transition far more easily when we moved from ITrainer to ITrainerEstimator in our current pipelines, because in fact those already closely resembled each other and remain now the same class. In the current world they are not the same class at all: what used to be IDataTransformer has been split into three distinct types of classes per #581. But even with that advantage, the transition of ITrainer to ITrainerEstimator was not seemless by any stretch of the imagination. So we're talking a fundamentally more difficult change, and as icing on the cake there are more of the transforms than trainers to boot.

Will it just be a simple matter of just "copying" the Options class in the "old" IDataTransform or "new" ITransformer over to the corresponding new IEstimator? Maybe, but even if this copying were to happen without problems it would represent a fairly major undertaking, if we took it on for all non-training estimators. And I think that if we take it on for any of them we probably should have a consistent story for all of them.

Sometimes those options classes are tied up in the inheritance structure of those IDataTransform or ITransformer implementors -- teasing that apart will be hard. You can't really do the copying without disentangling that inheritance structure. We did not have time to even just deprecate the IDataTransform in our internal command-line/entry-point code. We definitely don't have time to do that work, and move over the old Options class.

But even if we somehow magically did all that work in the next few minutes and it was done, we still have a major problem. These options classes collectively have many hundreds of public members, whose surface has not been reviewed at all. All the work we've previously done via @rogancarr and @sfilipi to make sure they were consistent would have to be done on them immediately, and we simply do not have time to do that, even if the work was somehow able to be completed and checked in right now.

There's not enough time till March 26. To say we're going to do all this work in the next 10 working days is, I think, unrealistic. I'd be surprised if it took less than a month to do it altogether, much less review the public surface etc. And for what? So that you can in the public API transform multiple columns at once, a scenario that at least internally has seen only nominal usage? Even in the unlikely event that becomes important, I think we can come up with a simpler answer to that problem if it really comes down to it. So: I think hiding and working towards imagining a better solution is a perfectly fine answer here. In the best case, we discover it isn't actually that important. And if they do, maybe we can do something a bit more specialized. (I think only a handful of transformers really benefit from the multi-column logic.)

/cc @eerhardt @shauheen @rogancarr

@rogancarr
Copy link
Contributor

In the short term, we can always give work-around samples, so I'm okay with hiding and delaying to v1.1.

@TomFinley
Copy link
Contributor Author

TomFinley commented Mar 20, 2019

So @artidoro these are the ones I have observed as being practically useful in that list:

  • OneHotEncodingEstimator
  • TypeConvertingEstimator
  • KeyToVectorMappingEstimator
  • ValueToKeyMappingEstimator
  • NormalizingEstimator

Note also that the hash versions of the value-to-key is also useful for multi-column mapping.

@glebuk
Copy link
Contributor

glebuk commented Mar 21, 2019

Any reason not to add to all of them that support multiple columns?
At least the following need to be added to the list above:

  • OneHotHashEncodingEstimator (so that it is symmetrical with OneHotEncodingEstimator)
  • TextFeaturizingEstimator
  • MissingValueEstimator
  • FeatureSelectionCatalog.*
  • KeyMappingToValueEstimator (to be symmetric with valuetokey)

@artidoro
Copy link
Contributor

artidoro commented Mar 21, 2019

Summarizing what happened after the above discussion:

What we need to do before v1:

  • Re-enable multicolumn mapping for some transforms.

The above lists by Tom and Gleb were answering the question of which transforms should have multicolumn mapping re-enabled.

@artidoro
Copy link
Contributor

artidoro commented Mar 21, 2019

What is the best way to re-enable multi-column mapping for the few transforms that would benefit from it?
Here are our possibilities:

  1. Introduce a class which stores InputColumnName and OutputColumnName, and have an array of objects of that class something like:
    params InputOutputColumnPair[] columPairs

    • This would introduce a new class to the public surface. If we think about exposing Options in the near future this would be yet another class that is used to set parameters in a transform.
  2. Use an array of tuples like:
    params (string outputColumnName, string inputColumnName)[] columns

    • Tuples apparently don't work well with F#. Is this something we can be fine with for now, waiting for the Options to be exposed?
  3. Use two parallel arrays expressing the input and output names respectively:
    string[] inputNames, string[] outputNames, ....

    • Two separate arrays are not ideal when specifying pairs. It would easily introduce errors with different sizes and mismatches.
  4. Use an array of strings without specifying the name of the output column:
    params string[] columnName

    • This would introduce an inconsistency with the rest of the API since this would be the only place where we don't allow to specify the name of the output column. We would have inputColumnName = ouputColumnaName = columnName.

This makes me wonder whether it is even a good idea to introduce multi-column mapping until we expose Options.

@TomFinley @eerhardt @glebuk @sfilipi @singlis @wschin @Ivanidzo4ka

@artidoro
Copy link
Contributor

After discussing with @shauheen we decided to do this post v1. So we will not enable multi-column mapping before v1.

@artidoro
Copy link
Contributor

Closing as this is not a Project 13 issue any longer. No breaking API change is required.
Issue #3068 tracks the enabling of multicolumn mapping for some estimators.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API question Further information is requested
Projects
None yet
Development

No branches or pull requests

6 participants