-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380
Comments
The first round of four bugs are failures found through fixing tests on HashingEstimatorThe "kind" of vector-ness of the input is quite incorrect. The vector-kind (by which I mean, scalar, vector, variable sized vector) of the output should just be the vector-kind of the input, full stop. This conditional
KeyToVector and KeyToBinaryVectorThis one is less of a bug, and more of a consequence. This was the example I listed above for why the disparity between I do not expect the effect of this to be significant. The rule as it stood was merely more of an attempt to be cute and clever than to produce real value. But it would mean that going forward in that very, very specific special case, we would no longer be producing that value. (I expect however, that htis was rare. At least, so I hope.) NgramHashingEstimatorHere we claim to have slot names, but generally there will not be slot names.
|
Found based off The feature selection estimators were both reporting that they would be producing normalized columns unconditionally. Naturally, the reality (as correctly coded in the transformer) is that the output will be annotated as normalized if the input is annotated as being normalized. So, I made the estimator logic likewise conditional, just as it was in the transformer. |
Here's a fun one in machinelearning/src/Microsoft.ML.Data/Transforms/KeyToValue.cs Lines 536 to 537 in 4728a74
If something named Found via the |
The meta-multiclass trainer estimators have this key method to produce their annotations on their produced columns. Lines 161 to 182 in 8675ec5
This method has a number of problems and is more or less completely wrong. Note the usage of
The end result will be cleaner and more direct. This was brought to my attention as another bug revealed by |
Here we have the estimator for calibrators producing its schema shape (note the usage of machinelearning/src/Microsoft.ML.Data/Prediction/CalibratorCatalog.cs Lines 101 to 105 in 243ff02
Here's the actual
These are inconsistent, since the estimator is claiming it will be providing four columns of annotations, when in reality the transformer produces no annotation data whatsoever. The simplest thing, I suppose, would be to make the estimator match the transformer. But, it seems like keeping the association might actually be more correct and useful, so that is what I am going to do. If the input score column has the metadata of the right type and values and everything, then it will be passed along. Of course, if the score doesn't have these annotations, it will not produce annotations itself. Anyway, that is what I am doing after discussions with @sfilipi. Also I noted that a number of calibrator estimator tests weren't quite testing the right thing, so now they should be. |
OK, now have a PR. An unfortunate side effect of this is what I had hoped would be a relatively straightforward work of documentation has in fact not happened at all. 😛 |
Regarding the "Proposed Schema Relationship Between an Estimator and its Transformer" that I wrote yesterday, there's a fairly subtle point here I somehow just realized. This cannot be the principle, because, let's imagine that you had something called That said, I do think that having tests for our components to ensure they don't pull this sort of thing has, by the sheer volume of bugs found, proven to be more helpful than harmful. So, I'll keep the tests and bug fixes, but we can't really make this a principle. It conflicts logically with the other principle of |
I was attempting to write documentation for
IEstimator
andITransformer
. One of the core responsibilities of these is, as expressed in #581, schema propagation, so as to, among other benefits, do the sort of "pre-fit validation" scenario that has proven so troublesome to us in the past (this w.r.t. #267).However, what I find is that I struggled to detect a meaningful unifying "principle" behind
IEstimator
andITransformer
. I think the people that worked on it thought there was a principle, but when I encountered what people though tit was (something about annotations being a 'subset' of one or the other), it seemed like there were numerous "exceptions" that made the principle meaningless. . Indeed, I'm starting to suspect that there was no scenario by the original designer aside from "work until things compile and run on what we have so far," which is not really an acceptable state for something likeIEstimator
andITransformer
which are core concepts of this API, especially if we hope to describe them in such a way that people can implement these interfaces on their own.In this issue I outline the trouble that I have observed. The points here are subtle, but they are important insofar as resolving them is, I think, crucial for us to define the relationship between
IEstimator
and theITransformer
returned by fitting that estimator.CCing @artidoro, @shauheen since I know they had some interest in this problem... going to mark as the unusual step of both bug and code-sanitation, since it both touches upon our need to have a consistent architectural story among these key structures, as well as there having been some genuine bugs that have been uncovered.
A Troublesome Example
So, I'll start with the same seemingly innocuous code that led me to think there is something architecturally wrong at stake here: Consider the
IsNormalized
annotation, in theKeyToVectorMappingEstimator
, here is the condition where this annotation is mentioned as coming:machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 803 to 804 in 73e29c8
Meanwhile, the resulting transformer has subtly different logic. If the input source value count is
1
, which of course happens when something is scalar, but also happens when the input is vector and happens to have vector size of one. (A concept that has no meaning inSchemaShape
, and is basically unknowable.)machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 339 to 345 in 73e29c8
This leads to an interesting decision, because from a certain perspective, the addition of the metadata in the
ITransformer
code is correct (and, at the time it was originally written, some years I believe before the concept ofIEstimator
andSchemaShape
was introduced, was in fact unambiguously correct, since there was no concept as we have now of pre-Fit
schema validation.) But in the world ofIEstimator
, we are using some information that in some cases could be used post-Fit
in the sense that it can only be expressed in aDataViewSchema
, not aSchemaShape
-- that this is a vector type that happens to have length exactly one -- and that in many cases could not even be known pre-Fit
under some circumstnaces. (E.g., consider a value-to-key mapping estimator that winds up finding only one value.)So, the
IEstimator
makes no claim that theIsNormalized
data is there, but due to runtime logic that had existed prior to this time, it is in fact there.This was known, I believe, the originators of the concept of
IEstimator
andSchemaShape
, which at one point was deliberately made as, "all right, we consider the claim inSchemaShape
to be, in the area of annotations specifically, a subset of what is actually present among the non-hidden columns ofDatavViewSchema
. This is in fact what the relevant method called fromTestEstimatorCore
does. (Though, it is reversed from what is actually in the comment -- what it actually tests is that what is delivered is a superset of what is promised).machinelearning/test/Microsoft.ML.TestFramework/DataPipe/TestDataPipeBase.cs
Lines 151 to 163 in 3b576fe
This notion of "subsetting" being the requirement is mentioned in the documentation and testing I could find on the relationship between
IEstimator
andITransformer
(see also here). But is superset actually the right thing? Let's consider another very important transformer: key-to-value. In that case, the estimators relies on a complete description of theKeyValue
annotation to detect what the input type is! So in that case this mere talk of superset and subset becomes increasingly complex, because sometimes it is necessary, and sometimes it is not.If we want an
ITransformer
andIEstimator
to be paired, this suggests a mutuality of information. One of the core tenants ofIDataView
is composability, but the foundation of that composibility isn't because we've just created a system that works to just accomodate the set of components we ourselves wrote. (Indeed, this is one thing that I argue in one of our documents about theIDataView
system and why it works.)So right now we have this system where sometimes annotations are subsets, but possibly not, depending on whether we consider a combination of annotations and estimators and transformers valuable. But, that seems like a pretty ad hoc way of engineering a system. How is it possible at all to make any meaningful statements about such a thing like, "we have this requirement, but possibly not, it only depends on if we've written anything yet that cares."
What does that mean? Are we prepared to make as part of the contract of
IEstimator
andITransformer
that some sorts of annotations are allowed to impact the schema (KeyValues
must!), but that some are not (we effectively were having the assumption in a few places thatSlotNames
andIsNormalized
could not, based on our uneven propagation of them)? That doesn't seem like a good idea. Are we allowed to evolve that definition in any way? So I wanted something much simpler, which leads me to this:Proposed Schema Relationship Between an Estimator and its Transformer
Edit: This principle is wrong, see this comment. But I still think enforcing the principle on our own tests with our own components makes sense and does way more help than harm, but we can't insist on it as a principle of the estimators and transformers themselves.
So, to make the proposed relationship concrete, I'll consider the following code (for any
est
anddata
):If both constructions of
promisedShape
anddeliveredShape
succeed, they should be the same, that is, the two constructedSchemaShape
objects should be indistinguishable (besides, of course, say, tests on reference equality).If the construction of
deliveredShape
would succeed, then the construction ofpromisedShape
should succeed as well. (That is, the schema-propagation logic of an estimator should be no more strict than the schema-propagation logic of its produced transformer.)Both represent a somewhat more "restrictive" view of the relationship of paired
IEstimator
andITransformer
. The previous state was sort of an ad hoc free for all. I have not yetestablished that it will be completely possible to do this; I know it will require some changes of behavior of some
ITransformer
s, especially around their annotations, but so far I have not seen any "show stoppers." Maybe I will though.Bugs
Unfortunately due to the fact that this "requirement" is meaningless unless we test for it, it is best if I introduce the stricter test, and test for this, all at the same time. Doing so results in several test failures -- some were due to stricter requirements, but some were due to the fact that some
IEstimator
implementations were just plain wrong. I will outline in comments below those bugs that I have found, since I have not completed that work yet. (I have only fixed two tests so far, and each takes me a few hours. Regrettably, given the subtlety of the issues at stake, this is one of those situations that requires more finesse and consideration than other changes.)/++
The text was updated successfully, but these errors were encountered: