Self-interactions are not included when the feature has value 1 #698

stephen-hoover · 2015-06-17T22:37:25Z

When using quadratic or cubic interaction terms, self interactions are not created from features which have a value of 1. Example:

echo '1 |f a b' | vw shows 3 features (correct: "a", "b", and the constant).
echo '1 |f a:2 b:2' | vw -q ff shows 6 features, which is also correct: "a", "b", constant, "a * a", "b * b", and "a * b".
echo '1 |f a b' | vw -q ff shows 4 features. If I add an "--invert_hash", the model output is

Constant:116060:0.158008
f^a:14355:0.158008
f^a*f^b:161933:0.158008
f^b:146788:0.158008

which is missing the "a * a" and "b * b" terms.

This behavior also happens at test time. If I extract model coefficients from an invert hash model file and compare predictions generated from those to predictions generated directly from VW, the predictions match everywhere except where a feature has a value of exactly 1.

I found this error in VW version 7.10.2, commit hash dc5532f, running on OS X 10.10.3. I know that the error was not present at commit hash bb68807, but I haven't done a bisection to locate exactly where it was introduced.

The text was updated successfully, but these errors were encountered:

stephen-hoover · 2015-06-17T23:20:36Z

I finished a git-bisect. It appears that the problem was introduced in either the commit ee1fc3c or 72ceea8 from Mon May 4 2015 (neither compiles for me, so I can't say which one).

trufanov-nok · 2015-06-24T10:17:07Z

This is expected behaviour. I'll write an article about new feature generation algorythm in Vw's wiki today. In short - it generates simple combinations of features for interactons of same namespace instead of permutations (as it was before).
That means that for -q ff and 2 features in f you'll get only a*b instead of a*a, a*b, b*a, b*b
And for 3 features in f you'll get only a*b, a*c, b*c

The reason behind that is that your model won’t benefit from b*a feature if it already includes a*b. Although they'll be in different places of memory, they'll have same value after training. It looks like such "duplicate" features also don't help with hash collisions resolution.
Also features like a*a or a*a*a is useless if your model includes feature a - they all will have same value after training. The only exception is when feature a has weight != 1.0. In this case feature values will be different and thus we don't filter out a*a like interactions for a:2.0.

These new rules of feature interactions generation supposed to decrease number of generated features by filtering out features that won't improve model. The training shall converge faster and results may be slightly better. So this approach is enabled by default.

You still can switch back to the old feature generation engine by specifying --permutations flag in VW's command line.

djstrong · 2015-06-24T12:14:49Z

Why "It looks like such "duplicate" features also don't help with hash collisions resolution."?

trufanov-nok · 2015-06-24T12:31:57Z

Wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-interactions

Why "It looks like such "duplicate" features also don't help with hash collisions resolution."?

I believe that model gains such "unnecessary" features so fast (with grow of features in self-interacting namespace) that very soon they become a source of collisions. For example, -q aa --cubic aaa with 30 features in a will produce 28k features per example, and only ~4.5k with simple combinations mode.

I think I've also seen a paper that proves that you can't solve collision problems by using 2 different hashes in hashing-trick storing hashes to the same address-space. At least not for non toy datasets. Our situation is very similar to usage of 2 hashes for same feature in hashing-trick.

djstrong · 2015-06-24T12:45:09Z

Thanks for answer. I refer to my problem: text classification. In bigram model my data set generates 800M features, I can use only 28 bits (~300M), and I have couple of special non-text features. So I can try to replicate these features using above mentioned method.
Better solution is to hash features by myself and ensure that words does not collide with important features. It would be great if vw could use separate memory for defined namespace.

trufanov-nok · 2015-06-24T13:16:32Z

Well, it would be better to find out beforehand how badly collisions affects your model predictability.
When I was competing on kaggle I've played with -b and found out that my model prediction power doesn't grow much after -b 24. My feature set wasn't big and could fit into RAM, so I assumed that collision effect isn't big enough and fighting with collisions isn't feasible.
But in your case it may means that collision effect is so big that switching between -b27 and -b28 is like dancing around -b10 -b11 for me.

Do you want to make 100% collision protected only a "couple of special non-text features" or all interactions with them? If first and you think this helps you plus know some C++ then it's not difficult to do. You just need to replace "special non-text features" with hashes in some interval [0,100] in your dataset. "By default VW hashes string features and does not hash integer features.". And then plug a small C++ check into VW code that will make sure that other features will never get a hash from this interval. Perhaps there is a easier way but I prefer adjust VW code for my purposes instead playing with scripts etc. I can point you to the right piece of VW code to do that.

And let's continue with this in another thread to not spam this thread as this discussion is offtopic

stephen-hoover · 2015-06-24T15:47:54Z

@trufanov-nok , it's great that the interaction terms no longer include both "a * b" and "b * a". I was happy to see that change in your recent refactor. But not including "a * a" is a problem. If the feature "a" always has value 1, it's true that "a*a" is redundant with "a". The problem comes if "a" can sometimes take the value 1 and sometimes take values other than 1.

Consider building a model on 1 feature with quadratic interaction terms, and say that you end up with a model that looks like y = 0.5 * x + 1.5 * x^2. If x = 1.001, VW outputs a prediction of 2.0035. If x = 0.999, VW outputs a prediction of 1.9965. However, if x = 1.0, VW outputs a prediction of 0.5. I don't think that these cases would be uncommon. You'd run into all the time if your features are counts, for example. I found it because I'm using the iris dataset for testing; it has continuous features, but the continuous value is sometimes exactly 1.

I think that a discontinuity like this in a linear model is undesirable, and it's certainly very counterintuitive. This discontinuity also exists in the cost function when training, which I find equally undesirable. I'd prefer to not have the duplicate "a * b" and "b * a" terms, but if that means the self-interaction terms disappear when they're equal to 1, I can add on the "--permutations" flag.

JohnLangford · 2015-06-24T16:11:28Z

This seems like a convincing argument to me---continuity is certainly
desirable. Typically, the a_a terms are dominated by a_b terms, so
there is little computational need to eliminate them. And w.r.t.
accuracy, discontinuity could bite hard.

-John

On 06/24/2015 11:47 AM, Stephen Hoover wrote:

@trufanov-nok https://github.com/trufanov-nok , it's great that the
interaction terms no longer include both "a * b" and "b * a". I was
happy to see that change in your recent refactor. But not including "a

a" is a problem. If the feature "a" always has value 1, it's true
that "a*a" is redundant with "a". The problem comes if "a" can
sometimes take the value 1 and sometimes take values other than 1.

Consider building a model on 1 feature with quadratic interaction
terms, and say that you end up with a model that looks like y = 0.5 *
x + 1.5 * x^2. If x = 1.001, VW outputs a prediction of 2.0035. If x =
0.999, VW outputs a prediction of 1.9965. However, if x = 1.0, VW
outputs a prediction of 0.5. I don't think that these cases would be
uncommon. You'd run into all the time if your features are counts, for
example. I found it because I'm using the iris dataset for testing; it
has continuous features, but the continuous value is sometimes exactly 1.

I think that a discontinuity like this in a linear model is
undesirable, and it's certainly very counterintuitive. This
discontinuity also exists in the cost function when training, which I
find equally undesirable. I'd prefer to not have the duplicate "a * b"
and "b * a" terms, but if that means the self-interaction terms
disappear when they're equal to 1, I can add on the "--permutations" flag.

—
Reply to this email directly or view it on GitHub
#698 (comment).

trufanov-nok · 2015-06-24T17:18:40Z

Well, I'm convinced too.

The problem is that if we allow a*a feature generation regardless of weight + enable this mode by default then this opens the door for generation of a*a*x features for --cubic interactions etc. And this in its way significantly decrease feature number reduction in simple combinations mode vs permutations mode.

I think out 2 fast fixes:

Let's keep everything as is, but replace "with weght != 1.0" rule with "with explicitly setted weight". This means VW will generate a*a for all features with weights != 1.0 plus for a:1.0 or a:1, but not for just a. This will help for features which are counts. @stephen-hoover will it solve all your cases? I doubt so.

Thus my second proposal is:

To allow a*a generation regardless of weight by default, but add 2 additional options for the command line:
--no_self_interactions to suppress a*a features generation at all. With this mode i'm trying to address datasets with only categorical features. You'll get only simple combinations of features with it which means minimal number of features.
--no_self_interactions_for_default_weighted_features to turn back to current implementation with modification described in proposal 1. I'll document this behaviour and pitfalls that Stephan mentioned. With this mode i'm trying to address datasets which have only few columns for features with weights and others are weightless categorical features.

My idea: let's let experienced user to decide what kind of dataset he has. So he'll be able to reduce number of generated features even more with one of these two flags.

What you guys think?

stephen-hoover · 2015-06-24T17:51:58Z

I actually like your first option, only generating self-interactions for terms with explicit weights. As long as this is well-documented, it should let people distinguish between features which could take on any weight and features which only take on "present" and "not present" values. It does lead to a gotcha if a user mixes explicit and implicit weights, but I do think that it would handle all of my use cases.

I think that another good option is your proposal 2b), to have a command line flag which turns on the behavior in your first proposal. That way a new user wouldn't get tripped up by something they didn't expect, but expert users could still benefit from not generating so many unnecessary interaction terms.

trufanov-nok · 2015-06-24T18:27:44Z

May be sticking with option 1 will be enough, @JohnLangford ?

martinpopel · 2015-06-24T18:54:02Z

I like option 1, it seems intuitive to me:
feature:1 is a integer or real-valued feature which has value 1.
feature is a boolean feature (with a true value).

I agree with the idea of making reasonable defaults for beginners, while allowing experts to tune it. On the other hand, the number of command line options is also a burden for beginners (vw -h | wc -l currently prints 235 and the number is still growing). So I would suggest to not add extra options (--no_self_interactions_for_default_weighted_features) which are not necessary.

JohnLangford · 2015-06-24T19:04:45Z

Option 1 seems nonviable because the only thing available at the moment
of the feature expansion is:
(a) the first character of the namespace.
(b) the feature index.
(c) the feature value.

Our baseline approach should probably be to allow self-interaction
features by default. If we can find a use case where eliminating this
is definitely useful we can add a flag. Martin's concern about
excessive options is real though, so I'm inclined to not add a flag
unless the flag has a known use case. most features occur in
namespaces which have more than one feature and hence these self
interaction terms are always dominated by the cross terms
computationally and representationally. Given this, I expect the
benefits of removing self-interaction terms to be negligible both
computationally and representationally.

-John

On 06/24/2015 02:54 PM, Martin Popel wrote:

I like option 1, it seems intuitive to me:
|feature:1| is a integer or real-valued feature which has value 1.
|feature| is a boolean feature (with a true value).

I agree with the idea of making reasonable defaults for beginners,
while allowing experts to tune it. On the other hand, the number of
command line options is also a burden for beginners (|vw -h | wc -l|
currently prints 235 and the number is still growing). So I would
suggest to not add extra options
(|--no_self_interactions_for_default_weighted_features|) which are not
necessary.

—
Reply to this email directly or view it on GitHub
#698 (comment).

trufanov-nok · 2015-06-24T19:09:27Z

Option 1 seems nonviable because the only thing available at the moment

But what if I add a boolean flag to example class which will be set on if weight was initialized here ?

JohnLangford · 2015-06-24T19:12:06Z

There is more than one feature in an example, so that seems inadequate.

-John

On 06/24/2015 03:09 PM, Alexander Trufanov wrote:

Option 1 seems nonviable because the only thing available at the
moment
But what if I add a boolean flag to example class which will be set on
if weight was initialized here
https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_example.cc#L54
?

—
Reply to this email directly or view it on GitHub
#698 (comment).

trufanov-nok · 2015-06-24T19:33:19Z

Sorry, I meant a bool or char for feature struct.
Or perhaps better v_array in example class to store flags, so it could be reused without new memory allocation.
Such flag is needed only for single feature (not needed for features generated by interaction) so the array won't be big.

Nope, this won't work either. Shall take a time and think about this.

JohnLangford · 2015-06-24T19:47:22Z

It does not seem worth it.

This would imply that every feature takes up significantly more space
(if we want aligned access) and is more complex. A corner case does
not seem to justify the complexity implied as it will slow the common
case down.

-John

On 06/24/2015 03:33 PM, Alexander Trufanov wrote:

Sorry, I meant a bool or char for feature struct.

—
Reply to this email directly or view it on GitHub
#698 (comment).

trufanov-nok · 2015-06-24T21:43:28Z

Ok, i've done a required fix and tested VW with -q ff --cubic fff and 27 features in f.
It produces 18279 features for --permutation, 3006 features with simple combinations and 3654 features with proposed approach. I'll submit a PR for it tomorrow.

…ys true

trufanov-nok mentioned this issue Jun 25, 2015

self-interacting features regardless of their weight #707

Merged

JohnLangford closed this as completed in #707 Jun 26, 2015

mritunjaysharma394 pushed a commit to mritunjaysharma394/vowpal-wiki that referenced this issue Feb 27, 2020

According to VowpalWabbit/vowpal_wabbit#698, self-interaction is alwa…

8de1bff

…ys true

jackgerrits mentioned this issue Jun 18, 2021

Remove feature_self_interactions compile time value and adopt behavior of feature_self_interactions=true #3084

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Self-interactions are not included when the feature has value 1 #698

Self-interactions are not included when the feature has value 1 #698

stephen-hoover commented Jun 17, 2015

stephen-hoover commented Jun 17, 2015

trufanov-nok commented Jun 24, 2015

djstrong commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

djstrong commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

stephen-hoover commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

stephen-hoover commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

martinpopel commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

Self-interactions are not included when the feature has value 1 #698

Self-interactions are not included when the feature has value 1 #698

Comments

stephen-hoover commented Jun 17, 2015

stephen-hoover commented Jun 17, 2015

trufanov-nok commented Jun 24, 2015

djstrong commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

djstrong commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

stephen-hoover commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

stephen-hoover commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

martinpopel commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015

JohnLangford commented Jun 24, 2015

trufanov-nok commented Jun 24, 2015