Generate proc_column only after all preprocessing parameters are merged in to prevent incorrect cached dataset reads #3069
Conversation
ludwig/data/cache/util.py (Outdated)
@@ -12,7 +12,7 @@ def calculate_checksum(original_dataset: CacheableDataset, config: ModelConfigDict):
         "dataset_checksum": original_dataset.checksum,
         "global_preprocessing": config.get(PREPROCESSING, {}),
         "global_defaults": config.get(DEFAULTS, {}),
-        "feature_names": [feature[NAME] for feature in features],
+        "feature_names": [feature[PROC_COLUMN] for feature in features],
Something to note: there is not necessarily a 1:1 mapping between feature name and proc col. It's possible that for a single proc col, there are multiple features that use it.
That said, proc cols are what we care about, so this should be fine (though slightly more aggressive at biasing towards cache misses than is strictly necessary). The one thing I would do is change the key from feature_names to feature_proc_columns.
@tgaddair Ah yeah, good callout.
"It's possible that for a single proc col, there are multiple features that use it."
This would happen if two features have the same preprocessing parameters, right?
If we want to be less aggressive about biasing towards cache misses, what would you recommend? Keeping feature_names and not using proc_column? It's also ok to do that, since we're only computing this at the end of ModelConfig creation.
So, I'll start off by saying it's a little risky, because if we get it wrong, a false positive cache hit is obviously worse than a false negative cache miss. That said, the idea is that technically, so long as two configs (a) have the same columns (not features), and (b) those columns are preprocessed the same, then they can share the same preprocessed dataset. That's independent of how many features there are, their names, etc.
So, to do this the "right" way we would want to:
- Change the proc_column calculation to consider the fully rendered preprocessing of the feature after all fixed preprocessing params have been merged in (we should already be doing this), AND include the feature type (right now we're not doing this for some reason, so this seems like a bit of an oversight that could get us into trouble, unless I'm forgetting something).
- Change the checksum calculation to exclude per-feature metadata like feature types, etc. Only consider the unique SET of proc_column hashes (order and number of duplicates doesn't matter), ludwig_version, and dataset_info.
I think this should be a follow-up PR with A LOT of tests to go with it, though, haha.
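For concreteness, here is a rough, self-contained sketch of that direction. The helper names (hash_dict, compute_proc_column) and the exact hashing scheme are hypothetical and not Ludwig's actual implementation; the point is only to illustrate deriving proc_column from the feature type plus its fully rendered preprocessing, and computing the checksum over the unique set of proc columns together with ludwig_version and dataset info:

```python
import hashlib
import json


def hash_dict(d: dict) -> str:
    # Deterministic hash of a JSON-serializable dict.
    return hashlib.md5(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()


def compute_proc_column(feature: dict) -> str:
    # Include the feature type alongside the fully rendered preprocessing,
    # so two features over the same column with different types don't collide.
    payload = {"type": feature["type"], "preprocessing": feature.get("preprocessing", {})}
    return f"{feature['column']}_{hash_dict(payload)}"


def calculate_checksum(dataset_info: dict, ludwig_version: str, features: list) -> str:
    # Order and duplicate counts don't matter, so use the sorted unique set of
    # proc columns rather than per-feature names/types.
    info = {
        "ludwig_version": ludwig_version,
        "dataset_info": dataset_info,
        "feature_proc_columns": sorted({compute_proc_column(f) for f in features}),
    }
    return hash_dict(info)
```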
Makes sense. I will address this in a follow-up PR!
For now, this PR addresses:
Change the proc_column calculation to consider the fully rendered preprocessing of the feature after all fixed preprocessing params have been merged in (should already be doing this)
Do we feel good with this change for now?
The type thing is a good point. Ideally there's only one type associated with a column and users always use it, but there is nothing enforcing it, so two entries of the input feature list can have the same column and different types. A possible case I can imagine is having two features given the same column that contains ordinal information: you may want to have both a category feature for it and a numerical one (as the numerical one will already know about the relative ordering of the values). I think we never found issues with it for a couple of reasons, though. One is that I doubt anyone has done it. Second is that even if they did, if the two features have different fully rendered preprocessing sections (which I believe is true for all features, maybe with the exception of text/sequence and set/bag), then there wouldn't be any collision anyway, because the hash is computed on the preproc dict. Anyway, I endorse the idea of explicitly including the type when computing the hash.
Regarding the proc column point: I can imagine the case of two different configs leading to the same set of proc columns. That's obvious when two configs only differ by training parameters or encoder parameters, but it could also be the case if they are otherwise identical and only differ by the names of features that use the same column, or by the number of features that use the same column. In all those cases we should keep the same cache. I believe that the implementation of:
"feature_names": [feature[PROC_COLUMN] for feature in features],
is actually incorrect, as it may cover the case of different names, but not the case of a different number of features.
I believe that a correct implementation would be:
"feature_proc_cols": set(feature[PROC_COLUMN] for feature in features),
to avoid repeated entries of PROC_COLUMN coming from different (and a higher number of) features.
A concrete example would be these configs:
{input_features: [{name: a, column: x}, {name: b, column: x}]}
and
{input_features: [{name: a, column: x}, {name: c, column: x}]}
and
{input_features: [{name: a, column: x}, {name: b, column: x}, {name: c, column: x}]}
They should all produce the same cache and the cache that works for one should be reusable for the others.
It would be different if the c feature in the last example had a different type from the others.
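To make the example concrete, here is a small self-contained check, using a hypothetical proc-column derivation (illustrative only, not Ludwig's real hashing). The three configs above produce the same set of proc columns, while giving the c feature a different type changes the set:

```python
import hashlib
import json


def proc_column(feature: dict) -> str:
    # Hypothetical derivation: column name + hash of (type, rendered preprocessing).
    payload = json.dumps(
        {"type": feature.get("type", "category"), "preprocessing": feature.get("preprocessing", {})},
        sort_keys=True,
    )
    return f"{feature['column']}_{hashlib.md5(payload.encode('utf-8')).hexdigest()}"


config_1 = {"input_features": [{"name": "a", "column": "x"}, {"name": "b", "column": "x"}]}
config_2 = {"input_features": [{"name": "a", "column": "x"}, {"name": "c", "column": "x"}]}
config_3 = {
    "input_features": [
        {"name": "a", "column": "x"},
        {"name": "b", "column": "x"},
        {"name": "c", "column": "x"},
    ]
}

sets = [{proc_column(f) for f in cfg["input_features"]} for cfg in (config_1, config_2, config_3)]
assert sets[0] == sets[1] == sets[2]  # same set of proc columns -> same cache

# If feature c had a different type, its proc column (and therefore the set) would change.
config_4 = {"input_features": [{"name": "a", "column": "x"}, {"name": "c", "column": "x", "type": "number"}]}
assert {proc_column(f) for f in config_4["input_features"]} != sets[0]
```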
Anyway, I agree all these corner cases should be tested out in a separate PR, containing only this modification and a very comprehensive set of tests to determine the expected set of proc cols given a config, and whether the checksums match or not. We would need many positive and negative cases.
So maybe the safest thing to do for now is to land this PR by reverting just this line to what it was before, and have a follow-up PR to get full confidence in the logic I proposed above.
@w4nderlust Thanks for the thorough write up on the edge cases.
It seems like between the three of us, we're pretty aligned on what changes should be made going forward:
- The proc_col hash should not just consider preprocessing parameters, but preprocessing parameters along with the feature type. This is also going to require a pretty interesting backwards compatibility layer, because we want to ensure that models that just used preprocessing parameters (and have those proc_cols set) are now updated to reflect the combined preprocessing + feature type hash (see the sketch below).
- Instead of using just feature_names in the info dict to compute the checksum hash, we should use the set of proc col hashes, since it has higher guarantees about ensuring caches are only used when they should be.
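As a rough illustration of the compatibility step mentioned in the first item (hypothetical function names, not Ludwig's actual upgrade machinery), the idea would be to recompute proc_column for older configs so stale cache entries simply miss and get rebuilt:

```python
def upgrade_proc_columns(config: dict) -> dict:
    # Hypothetical upgrade hook: older configs had proc_cols derived from
    # preprocessing alone, so recompute them with the new preprocessing + type
    # scheme (reusing the compute_proc_column helper from the earlier sketch).
    for section in ("input_features", "output_features"):
        for feature in config.get(section, []):
            feature.setdefault("column", feature["name"])
            feature["proc_column"] = compute_proc_column(feature)
    return config
```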
Does that summarize it reasonably? @w4nderlust @tgaddair?
Yep, I think we're all aligned that this is a good next step. Thanks everyone :)
Should we remove all the places in the code where we're setting PROC_COLUMN outside of the generation of the ModelConfig? I think so.
Will take a look at this, I do agree.
Seems like we use
Kinda confusing, I will rename 2. to be
Can you point to the specific parts of the code where we do either of these? There may be a reason, but without concrete examples I can't fully say.
@w4nderlust Here's some links. Everything makes sense in context, but I think longer naming might just be better to clearly express what this represents. I think that we should call items in (2) like
For 1, pretty much the references in this PR.
For 2:
I see. I actually feel ok with the way it is right now. It's kinda the same thing with
Maybe if we really wanted to make a change, one thing we can do about it is whenever we set strings that are external facing we say
Agreed! We can think of a good convention to delineate these, but that will be a separate PR as well. Going to keep the scope of this PR to what's currently in the PR.
Nice change!
This PR fixes situations where two configs using the same dataset would result in the same checksum, but the underlying feature proc_cols would be different because of different feature-specific preprocessing params.
This moves generation of the proc_col to the end of the instantiation of the ModelConfig object.
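To make that ordering concrete, here is a simplified, hypothetical stand-in for what the construction now does (the names, merge precedence, and hashing are illustrative, not the actual ModelConfig code): preprocessing defaults are merged into each feature first, and only then is proc_column derived.

```python
from copy import deepcopy
import hashlib
import json


def _proc_column(feature: dict) -> str:
    # Illustrative only: hash of the fully rendered preprocessing section.
    rendered = json.dumps(feature.get("preprocessing", {}), sort_keys=True)
    return f"{feature['column']}_{hashlib.md5(rendered.encode('utf-8')).hexdigest()}"


def build_model_config(raw_config: dict) -> dict:
    # Hypothetical, simplified stand-in for ModelConfig construction: merge the
    # per-type defaults into each feature's preprocessing first, and only then
    # derive proc_column, so the hash reflects the final preprocessing params.
    config = deepcopy(raw_config)
    features = config.get("input_features", []) + config.get("output_features", [])

    for feature in features:
        merged = dict(config.get("defaults", {}).get(feature["type"], {}).get("preprocessing", {}))
        merged.update(feature.get("preprocessing", {}))  # feature-specific overrides win
        feature["preprocessing"] = merged
        feature.setdefault("column", feature["name"])

    # proc_column is only generated here, after all preprocessing params are merged in.
    for feature in features:
        feature["proc_column"] = _proc_column(feature)

    return config
```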