
Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380

Closed
TomFinley opened this issue Apr 17, 2019 · 7 comments · Fixed by #3408


TomFinley commented Apr 17, 2019

I was attempting to write documentation for IEstimator and ITransformer. One of the core responsibilities of these is, as expressed in #581, schema propagation, so as to, among other benefits, do the sort of "pre-fit validation" scenario that has proven so troublesome to us in the past (this w.r.t. #267).

However, I struggled to detect a meaningful unifying "principle" behind IEstimator and ITransformer. The people who worked on them presumably thought there was one, but when I encountered what people thought it was (something about annotations being a 'subset' of one or the other), there seemed to be numerous "exceptions" that rendered the principle meaningless. Indeed, I am starting to suspect that the original design had no guiding scenario beyond "work until things compile and run on what we have so far," which is not really an acceptable state for IEstimator and ITransformer, core concepts of this API, especially if we hope to describe them in such a way that people can implement these interfaces on their own.

In this issue I outline the trouble that I have observed. The points here are subtle, but they are important insofar as resolving them is, I think, crucial for us to define the relationship between IEstimator and the ITransformer returned by fitting that estimator.

CCing @artidoro, @shauheen since I know they had some interest in this problem. I am taking the unusual step of marking this as both bug and code-sanitation, since it touches on our need for a consistent architectural story among these key structures, and since some genuine bugs have been uncovered along the way.

A Troublesome Example

So, I'll start with the seemingly innocuous code that led me to think something is architecturally wrong here. Consider the IsNormalized annotation in the KeyToVectorMappingEstimator; here is the condition under which the estimator claims this annotation will be present:

if (!colInfo.OutputCountVector || (col.Kind == SchemaShape.Column.VectorKind.Scalar))
    metadata.Add(new SchemaShape.Column(AnnotationUtils.Kinds.IsNormalized, SchemaShape.Column.VectorKind.Scalar, BooleanDataViewType.Instance, false));

Meanwhile, the resulting transformer has subtly different logic: it adds the annotation when the input source value count is 1, which of course happens when the input is scalar, but also happens when the input is a vector that happens to have size one. (Vector size is a concept that has no meaning in SchemaShape, and is basically unknowable pre-fit.)

if (!_parent._columns[iinfo].OutputCountVector || srcValueCount == 1)
{
    ValueGetter<bool> getter = (ref bool dst) =>
    {
        dst = true;
    };
    builder.Add(AnnotationUtils.Kinds.IsNormalized, BooleanDataViewType.Instance, getter);

This leads to an interesting decision, because from a certain perspective the addition of the annotation in the ITransformer code is correct. (At the time it was originally written, some years, I believe, before the concepts of IEstimator and SchemaShape were introduced, it was in fact unambiguously correct, since there was no concept of pre-fit schema validation as we have now.) But in the world of IEstimator, the transformer is using information that is only available post-Fit, in the sense that it can only be expressed in a DataViewSchema, not a SchemaShape -- that this is a vector type that happens to have length exactly one -- and that in some circumstances could not even be known pre-Fit. (E.g., consider a value-to-key mapping estimator that winds up finding only one value.)

So, the IEstimator makes no claim that the IsNormalized data is there, but due to runtime logic that had existed prior to this time, it is in fact there.
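To make the mismatch concrete, here is a tiny model of the two predicates (a Python sketch, not ML.NET code; the function names are hypothetical). The estimator's check can only see the vector kind, while the transformer's check sees the realized size:

```python
# Hypothetical model of the two predicates above, not ML.NET code.
# A pre-fit "shape" knows only the vector kind; a post-fit column
# also knows the realized size.

def estimator_says_normalized(output_count_vector, kind):
    # Mirrors the SchemaShape-side logic: only scalars qualify.
    return (not output_count_vector) or kind == "Scalar"

def transformer_says_normalized(output_count_vector, src_value_count):
    # Mirrors the runtime logic: any source of size 1 qualifies,
    # including a vector that merely happens to have size 1.
    return (not output_count_vector) or src_value_count == 1

# A vector column of realized size 1: the estimator promises nothing,
# but the transformer delivers the annotation anyway.
assert not estimator_says_normalized(True, "Vector")
assert transformer_says_normalized(True, src_value_count=1)
```

The point of the sketch is only that the second predicate depends on `src_value_count`, a quantity that does not exist pre-fit.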

This was known, I believe, to the originators of the concept of IEstimator and SchemaShape, and at one point the claim was deliberately framed as: in the area of annotations specifically, the claim in SchemaShape is a subset of what is actually present among the non-hidden columns of the DataViewSchema. This is in fact what the relevant method called from TestEstimatorCore does. (Though it is reversed from what is actually in the comment -- what it actually tests is that what is delivered is a superset of what is promised.)

private void CheckSameSchemaShape(SchemaShape promised, SchemaShape delivered)
{
    Assert.True(promised.Count == delivered.Count);
    var sortedCols1 = promised.OrderBy(x => x.Name);
    var sortedCols2 = delivered.OrderBy(x => x.Name);
    foreach (var (x, y) in sortedCols1.Zip(sortedCols2, (x, y) => (x, y)))
    {
        Assert.Equal(x.Name, y.Name);
        // We want the 'promised' metadata to be a superset of 'delivered'.
        Assert.True(y.IsCompatibleWith(x), $"Mismatch on {x.Name}");
    }
}

This notion of "subsetting" being the requirement is mentioned in the documentation and testing I could find on the relationship between IEstimator and ITransformer (see also here). But is superset actually the right thing? Let's consider another very important transformer: key-to-value. In that case, the estimator relies on a complete description of the KeyValue annotation to detect what the input type is! So in that case this mere talk of superset and subset becomes increasingly complex, because sometimes complete annotation information is necessary, and sometimes it is not.
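In set terms, what the test above actually enforces is that, per column, the delivered annotations form a superset of the promised ones, with compatible types. A rough model of that containment check (a Python sketch with hypothetical names, not the real IsCompatibleWith):

```python
def is_compatible(promised, delivered):
    """Model of the containment check: 'promised' and 'delivered' map
    annotation kind -> type name. Every promised annotation must be
    delivered with a matching type; extra delivered annotations are fine."""
    return all(delivered.get(kind) == typ for kind, typ in promised.items())

promised = {"SlotNames": "VectorString"}
delivered = {"SlotNames": "VectorString", "IsNormalized": "Bool"}

assert is_compatible(promised, delivered)      # delivered is a superset: OK
assert not is_compatible(delivered, promised)  # a promised kind is missing: fail
```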

If we want an ITransformer and IEstimator to be paired, this suggests a mutuality of information. One of the core tenets of IDataView is composability, but the foundation of that composability is not that we have merely created a system that happens to accommodate the set of components we ourselves wrote. (Indeed, this is one thing that I argue in one of our documents about the IDataView system and why it works.)

So right now we have a system where annotations are sometimes subsets, but possibly not, depending on whether we consider some combination of annotations, estimators, and transformers valuable. That seems like a pretty ad hoc way of engineering a system. How can we make any meaningful statement about a requirement of the form, "we have this requirement, but possibly not; it depends only on whether we've yet written anything that cares"?

What does that mean? Are we prepared to make it part of the contract of IEstimator and ITransformer that some sorts of annotations are allowed to impact the schema (KeyValues must!), but that some are not (we were effectively assuming in a few places that SlotNames and IsNormalized could not, based on our uneven propagation of them)? That doesn't seem like a good idea. Are we allowed to evolve that definition in any way? So I wanted something much simpler, which leads me to this:

Proposed Schema Relationship Between an Estimator and its Transformer

Edit: This principle is wrong; see this comment. But I still think enforcing the principle in our own tests, on our own components, makes sense and does far more good than harm; we just can't insist on it as a principle of the estimators and transformers themselves.

So, to make the proposed relationship concrete, I'll consider the following code (for any est and data):

IEstimator est = ...;
IDataView data = ...;
var schema = data.Schema;

var promisedShape = est.GetOutputSchema(SchemaShape.Create(schema));
var deliveredShape = SchemaShape.Create(est.Fit(data).GetOutputSchema(schema));
  1. If both constructions of promisedShape and deliveredShape succeed, they should be the same, that is, the two constructed SchemaShape objects should be indistinguishable (besides, of course, say, tests on reference equality).

  2. If the construction of deliveredShape would succeed, then the construction of promisedShape should succeed as well. (That is, the schema-propagation logic of an estimator should be no more strict than the schema-propagation logic of its produced transformer.)

Both represent a somewhat more "restrictive" view of the relationship between a paired IEstimator and ITransformer. The previous state was a sort of ad hoc free-for-all. I have not yet established that this will be completely possible; I know it will require some behavior changes in some ITransformers, especially around their annotations, but so far I have not seen any "show stoppers." Maybe I will, though.
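The two numbered properties above can be stated as a reusable property check over any (estimator, data) pair. Here is a schematic version with a toy estimator standing in for the real API (a Python sketch; every name here is hypothetical and does not correspond to ML.NET):

```python
# Hypothetical sketch of the proposed contract, not ML.NET code.

def check_estimator_contract(est, data, shape_of):
    promised_failed = False
    promised = None
    try:
        promised = est.get_output_schema(shape_of(data.schema))
    except ValueError:
        promised_failed = True

    try:
        delivered = shape_of(est.fit(data).get_output_schema(data.schema))
    except ValueError:
        return True  # post-fit path failed; nothing further to compare

    # Property 2: the estimator must be no stricter than its transformer.
    assert not promised_failed, "estimator rejected what the transformer accepted"
    # Property 1: if both succeed, the shapes must be indistinguishable.
    assert promised == delivered, "promised and delivered shapes differ"
    return True

class ToyData:
    schema = {"x": "float"}

class ToyEstimator:
    """Adds a 'y' column; its transformer applies the same schema logic."""
    def get_output_schema(self, shape):
        return {**shape, "y": "float"}
    def fit(self, data):
        outer = self
        class ToyTransformer:
            def get_output_schema(self, schema):
                return outer.get_output_schema(dict(schema))
        return ToyTransformer()

# 'dict' plays the role of SchemaShape.Create in this toy model.
assert check_estimator_contract(ToyEstimator(), ToyData(), dict)
```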

Bugs

Unfortunately, since this "requirement" is meaningless unless we test for it, it is best if I introduce the stricter test and the fixes all at the same time. Doing so results in several test failures -- some due to the stricter requirements, but some due to IEstimator implementations that were just plain wrong. I will outline in comments below the bugs I have found, since I have not completed that work yet. (I have only fixed two tests so far, and each takes me a few hours. Regrettably, given the subtlety of the issues at stake, this is one of those situations that requires more finesse and consideration than other changes.)

@TomFinley TomFinley added bug Something isn't working code-sanitation Code consistency, maintainability, and best practices, moreso than any public API. labels Apr 17, 2019
@TomFinley TomFinley self-assigned this Apr 17, 2019
@TomFinley

The first round of four bugs covers failures found while fixing tests in CategoricalHashStatic and NgramWorkout. Beyond that, I still have seven more known test failures; these are just the ones I have fixed so far:

HashingEstimator

The "kind" of vector-ness being reported is quite incorrect. The vector-kind (by which I mean: scalar, vector, or variable-sized vector) of the output should just be the vector-kind of the input, full stop. This conditional ?: expression is not sensible.

result[colInfo.Name] = new SchemaShape.Column(colInfo.Name,
    col.ItemType is VectorDataViewType ? SchemaShape.Column.VectorKind.Vector : SchemaShape.Column.VectorKind.Scalar,
    NumberDataViewType.UInt32, true, new SchemaShape(metadata));

KeyToVector and KeyToBinaryVector

This one is less a bug and more a consequence. This was the example I listed above illustrating the disparity between IEstimator and ITransformer. The natural consequence, if we follow the proposal to its logical conclusion, is that the corresponding transformer instances no longer produce the annotation they never claimed in the case of any vector input, even when the vector length just so happens to be 1.

I do not expect the effect of this to be significant. The rule as it stood was more an attempt to be cute and clever than to produce real value. But it does mean that, going forward, in that very specific special case we will no longer be producing that annotation. (I expect, however, that this case was rare. At least, so I hope.)

NgramHashingEstimator

Here we claim to have slot names, but generally there will not be slot names.

metadata.Add(new SchemaShape.Column(AnnotationUtils.Kinds.SlotNames, SchemaShape.Column.VectorKind.Vector, TextDataViewType.Instance, false));


TomFinley commented Apr 17, 2019

Found based on the CountFeatureSelectionWorkout and MutualInformationSelectionWorkout tests...

The feature selection estimators were both reporting that they would be producing normalized columns unconditionally. Naturally, the reality (as correctly coded in the transformer) is that the output will be annotated as normalized if the input is annotated as being normalized. So, I made the estimator logic likewise conditional, just as it was in the transformer.


TomFinley commented Apr 18, 2019

Here's a fun one in KeyToValueMappingEstimator:

if (col.Annotations.TryFindColumn(AnnotationUtils.Kinds.SlotNames, out var slotCol))
    metadata = new SchemaShape(new[] { slotCol });

If an annotation named SlotNames exists but does not really have the right type, the estimator will still propagate it, but the transformer will not. This was leading to some test failures.

Found via the PairwiseCouplingTrainer test failure, but in principle it probably affected multiple tests, since this is such a widely used estimator. Fixing this also seems to have fixed a prior failure in MetacomponentsFeaturesRenamed.


TomFinley commented Apr 18, 2019

The meta-multiclass trainer estimators have this key method to produce their annotations on their produced columns.

private SchemaShape.Column[] GetOutputColumnsCore(SchemaShape inputSchema)
{
    if (LabelColumn.IsValid)
    {
        bool success = inputSchema.TryFindColumn(LabelColumn.Name, out var labelCol);
        Contracts.Assert(success);
        var metadata = new SchemaShape(labelCol.Annotations.Where(x => x.Name == AnnotationUtils.Kinds.KeyValues)
            .Concat(MetadataForScoreColumn()));
        return new[]
        {
            new SchemaShape.Column(DefaultColumnNames.Score, SchemaShape.Column.VectorKind.Vector, NumberDataViewType.Single, false, new SchemaShape(AnnotationUtils.AnnotationsForMulticlassScoreColumn(labelCol))),
            new SchemaShape.Column(DefaultColumnNames.PredictedLabel, SchemaShape.Column.VectorKind.Scalar, NumberDataViewType.UInt32, true, metadata)
        };
    }
    else
        return new[]
        {
            new SchemaShape.Column(DefaultColumnNames.Score, SchemaShape.Column.VectorKind.Vector, NumberDataViewType.Single, false, new SchemaShape(MetadataForScoreColumn())),
            new SchemaShape.Column(DefaultColumnNames.PredictedLabel, SchemaShape.Column.VectorKind.Scalar, NumberDataViewType.UInt32, true, new SchemaShape(MetadataForScoreColumn()))
        };
}

This method has a number of problems and is more or less completely wrong. Note the usage of MetadataForScoreColumn for both columns, even on the predicted label column. There is also the problem that even for score columns, it was not updated after #3101 to reflect the situation where we have the label column's annotations. The correct solution is to change the code in a few ways, including using the helper function below, which most other multiclass trainers were already using, but this one, for some reason, was not.

public static IEnumerable<SchemaShape.Column> AnnotationsForMulticlassScoreColumn(SchemaShape.Column? labelColumn = null)

The end result will be cleaner and more direct.

This was brought to my attention as another bug revealed by PairwiseCouplingTrainer, but fixing it also fixed a few other test failures, like OVAWithAllConstructorArgs.


TomFinley commented Apr 18, 2019

Here we have the estimator for calibrators producing its schema shape (note the usage of AnnotationUtils.GetTrainerOutputAnnotation):

outColumns[DefaultColumnNames.Probability] = new SchemaShape.Column(DefaultColumnNames.Probability,
    SchemaShape.Column.VectorKind.Scalar,
    NumberDataViewType.Single,
    false,
    new SchemaShape(AnnotationUtils.GetTrainerOutputAnnotation(true)));

Here's the actual DataViewSchema column that winds up getting produced by the transformer (note the null for the annotations data):

new DataViewSchema.DetachedColumn(DefaultColumnNames.Probability, NumberDataViewType.Single, null)

These are inconsistent, since the estimator is claiming it will be providing four columns of annotations, when in reality the transformer produces no annotation data whatsoever.

The simplest thing, I suppose, would be to make the estimator match the transformer. But it seems like preserving the annotations might actually be more correct and useful, so that is what I am going to do: if the input score column has annotations of the right type and values, they will be passed along; of course, if the score column lacks these annotations, the output will not have them either. Anyway, that is what I am doing after discussions with @sfilipi.

Also I noted that a number of calibrator estimator tests weren't quite testing the right thing, so now they should be.

@TomFinley

OK, now have a PR. An unfortunate side effect of this is that what I had hoped would be a relatively straightforward bit of documentation work has in fact not happened at all. 😛

@TomFinley

Regarding the "Proposed Schema Relationship Between an Estimator and its Transformer" that I wrote yesterday, there is a fairly subtle point I somehow only just realized. This cannot be the principle: imagine you had a SlotNames annotation on a vector-valued column of length 3, but the annotation itself had length 4. That is totally legal. So the idea that the SchemaShape's annotations are potentially a superset is in fact correct.

That said, given the sheer volume of bugs found, having tests that ensure our own components don't pull this sort of thing has proven more helpful than harmful. So, I'll keep the tests and bug fixes, but we can't really make this a principle of the interfaces themselves. It conflicts logically with the other principle of IDataView: annotations are optional, and an annotation of an unexpected type is never an error condition; it is just as if the data isn't really there when you try to use it for a specific purpose. Which is interesting. But I still think our tests should treat that as an error, since I'd hope none of our components are pulling things like that.
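The optionality principle mentioned here can be modeled directly: an annotation of unexpected form is not an error, it is simply unusable, as if absent. A toy version (a Python sketch with hypothetical names, not ML.NET code):

```python
def usable_slot_names(column_size, slot_names):
    # Models the IDataView convention: an annotation of the wrong
    # form is never an error; it is simply as if it were not there.
    if slot_names is not None and len(slot_names) == column_size:
        return slot_names
    return None

# A length-4 SlotNames annotation on a length-3 vector is legal...
assert usable_slot_names(3, ["a", "b", "c"]) == ["a", "b", "c"]
# ...but a consumer that needs real slot names treats it as absent.
assert usable_slot_names(3, ["a", "b", "c", "d"]) is None
```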

@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022