Generate proc_column only after all preprocessing parameters are merged in to prevent incorrect cached dataset reads #3069
Conversation
ludwig/data/cache/util.py (Outdated)
@@ -12,7 +12,7 @@ def calculate_checksum(original_dataset: CacheableDataset, config: ModelConfigDict):
         "dataset_checksum": original_dataset.checksum,
         "global_preprocessing": config.get(PREPROCESSING, {}),
         "global_defaults": config.get(DEFAULTS, {}),
-        "feature_names": [feature[NAME] for feature in features],
+        "feature_names": [feature[PROC_COLUMN] for feature in features],
Something to note: there is not necessarily a 1:1 mapping between feature name and proc col. It's possible that for a single proc col, there are multiple features that use it.
That said, proc cols are what we care about, so this should be fine (though slightly more aggressive at biasing towards cache misses than is strictly necessary). The one thing I would do is change the key from feature_names to feature_proc_columns.
@tgaddair Ah yeah, good callout.
"It's possible that for a single proc col, there are multiple features that use it."
This would happen if two features have the same preprocessing parameters, right?
If we want to be less aggressive about biasing towards cache misses, what would you recommend? Keeping feature_names and not using proc_column? It's also ok to do that, since we're only computing this at the end of ModelConfig creation.
So, I'll start off by saying it's a little risky, because if we get it wrong, a false positive cache hit is obviously worse than a false negative cache miss. That said, the idea is that technically, so long as two configs (a) have the same columns (not features), and (b) those columns are preprocessed the same, then they can share the same preprocessed dataset. That's independent of how many features there are, their names, etc.
So, to do this the "right" way we would want to:
- Change the proc_column calculation to consider the fully rendered preprocessing of the feature after all fixed preprocessing params have been merged in (we should already be doing this), AND include the feature type (right now we're not doing this for some reason, so this seems like a bit of an oversight that could get us into trouble, unless I'm forgetting something).
- Change the checksum calculation to exclude per-feature metadata like feature types, etc. Only consider the unique SET of proc_column hashes (order and number of duplicates doesn't matter), ludwig_version, and dataset_info.
I think this should be a follow-up PR with A LOT of tests to go with it, though, haha.
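For concreteness, here is a rough, self-contained sketch of that direction. The helper names (hash_dict, compute_proc_column) and the exact hashing scheme are hypothetical and not Ludwig's actual implementation; the point is only to illustrate deriving proc_column from the feature type plus its fully rendered preprocessing, and computing the checksum over the unique set of proc columns together with ludwig_version and dataset info:

```python
import hashlib
import json


def hash_dict(d: dict) -> str:
    # Deterministic hash of a JSON-serializable dict.
    return hashlib.md5(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()


def compute_proc_column(feature: dict) -> str:
    # Include the feature type alongside the fully rendered preprocessing,
    # so two features over the same column with different types don't collide.
    payload = {"type": feature["type"], "preprocessing": feature.get("preprocessing", {})}
    return f"{feature['column']}_{hash_dict(payload)}"


def calculate_checksum(dataset_info: dict, ludwig_version: str, features: list) -> str:
    # Order and duplicate counts don't matter, so use the sorted unique set of
    # proc columns rather than per-feature names/types.
    info = {
        "ludwig_version": ludwig_version,
        "dataset_info": dataset_info,
        "feature_proc_columns": sorted({compute_proc_column(f) for f in features}),
    }
    return hash_dict(info)
```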
Makes sense. I will address this in a follow-up PR!
For now, this PR addresses:
Change the proc_column calculation to consider the fully rendered preprocessing of the feature after all fixed preprocessing params have been merged in (should already be doing this)
Do we feel good with this change for now?
The type thing is a good point. Ideally there's only one type associated with a column and users always use it, but there is nothing enforcing it, so two entries of the input feature list can have the same column and different types. A possible case I can imagine is having two features given the same column that contains ordinal information: you may want to have both a category feature for it and a numerical one (as the numerical one will already know about the relative ordering of the values). I think we never found issues with it for a couple of reasons, though. One is that I doubt anyone has done it. Second is that even if they did, if the two features have different fully rendered preprocessing sections (which I believe is true for all features, maybe with the exception of text/sequence and set/bag), then there wouldn't be any collision anyway, because the hash is computed on the preproc dict. Anyway, I endorse the idea of explicitly including the type when computing the hash.
Regarding the proc column point: I can imagine the case of two different configs leading to the same set of proc columns. That's obvious when two configs only differ by training parameters or encoder parameters, but it could also be the case if they are otherwise identical and only differ by the names of features that use the same column, or by the number of features that use the same column. In all those cases we should keep the same cache. I believe that the implementation of:
"feature_names": [feature[PROC_COLUMN] for feature in features],
is actually incorrect, as it may cover the case of different names, but not the case of a different number of features.
I believe that a correct implementation would be:
"feature_proc_cols": set(feature[PROC_COLUMN] for feature in features),
to avoid repeated entries of PROC_COLUMN coming from different (and a higher number of) features.
A concrete example would be these configs:
{input_features: [{name: a, column: x}, {name: b, column: x}]}
and
{input_features: [{name: a, column: x}, {name: c, column: x}]}
and
{input_features: [{name: a, column: x}, {name: b, column: x}, {name: c, column: x}]}
They should all produce the same cache and the cache that works for one should be reusable for the others.
It would be different if the c feature in the last example had a different type from the others.
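To make the example concrete, here is a small self-contained check, using a hypothetical proc-column derivation (illustrative only, not Ludwig's real hashing). The three configs above produce the same set of proc columns, while giving the c feature a different type changes the set:

```python
import hashlib
import json


def proc_column(feature: dict) -> str:
    # Hypothetical derivation: column name + hash of (type, rendered preprocessing).
    payload = json.dumps(
        {"type": feature.get("type", "category"), "preprocessing": feature.get("preprocessing", {})},
        sort_keys=True,
    )
    return f"{feature['column']}_{hashlib.md5(payload.encode('utf-8')).hexdigest()}"


config_1 = {"input_features": [{"name": "a", "column": "x"}, {"name": "b", "column": "x"}]}
config_2 = {"input_features": [{"name": "a", "column": "x"}, {"name": "c", "column": "x"}]}
config_3 = {
    "input_features": [
        {"name": "a", "column": "x"},
        {"name": "b", "column": "x"},
        {"name": "c", "column": "x"},
    ]
}

sets = [{proc_column(f) for f in cfg["input_features"]} for cfg in (config_1, config_2, config_3)]
assert sets[0] == sets[1] == sets[2]  # same set of proc columns -> same cache

# If feature c had a different type, its proc column (and therefore the set) would change.
config_4 = {"input_features": [{"name": "a", "column": "x"}, {"name": "c", "column": "x", "type": "number"}]}
assert {proc_column(f) for f in config_4["input_features"]} != sets[0]
```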
Anyway, I agree all these corner cases should be tested out in a separate PR, containing only this modification and a very comprehensive set of tests to determine the expected set of proc cols given a config, and whether the checksums match or not. We would need many positive and negative cases.
So maybe the safest thing to do for now is to land this PR by reverting just this line to what it was before, and have a follow-up PR to get full confidence in the logic I proposed above.
@w4nderlust Thanks for the thorough write up on the edge cases.
It seems like between the three of us, we're pretty aligned on what changes should be made going forward:
- The proc_col hash should not just consider preprocessing parameters, but preprocessing parameters along with the feature type. This is also going to require a pretty interesting backwards compatibility layer, because we want to ensure that models that just used preprocessing parameters (and have those proc_cols set) are now updated to reflect the combined preprocessing + feature type hash (see the sketch below).
- Instead of using just feature_names in the info dict to compute the checksum hash, we should use the set of proc col hashes, since it has higher guarantees about ensuring caches are only used when they should be.
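As a rough illustration of the compatibility step mentioned in the first item (hypothetical function names, not Ludwig's actual upgrade machinery), the idea would be to recompute proc_column for older configs so stale cache entries simply miss and get rebuilt:

```python
def upgrade_proc_columns(config: dict) -> dict:
    # Hypothetical upgrade hook: older configs had proc_cols derived from
    # preprocessing alone, so recompute them with the new preprocessing + type
    # scheme (reusing the compute_proc_column helper from the earlier sketch).
    for section in ("input_features", "output_features"):
        for feature in config.get(section, []):
            feature.setdefault("column", feature["name"])
            feature["proc_column"] = compute_proc_column(feature)
    return config
```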
Does that summarize it reasonably? @w4nderlust @tgaddair?
Yep, I think we're all aligned that this is a good next step. Thanks everyone :)
Should we remove all the places in the code where we're setting PROC_COLUMN outside of the generation of the ModelConfig? I think so.
Will take a look at this, I do agree.
Seems like we use
Kinda confusing, I will rename 2. to be
Can you point to the specific parts of the code where we do either of these? There may be a reason, but without concrete examples I can't fully say.
@w4nderlust Here's some links. Everything makes sense in context, but I think longer naming might just be better to clearly express what this represents. I think that we should call items in (2) like
For 1, pretty much the references in this PR.
For 2:
I see. I actually feel ok with the way it is right now. It's kinda the same thing with
Maybe if we really wanted to make a change, one thing we can do about it is whenever we set strings that are external facing we say
Agreed! We can think of a good convention to delineate these, but that will be a separate PR as well. Going to keep the scope of this PR to what's currently in the PR.
Nice change!
This PR fixes situations where two configs using the same dataset would result in the same checksum, but the underlying feature proc_cols would be different because of different feature-specific preprocessing params.
This moves generation of the proc_col to the end of the instantiation of the ModelConfig object.
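To make that ordering concrete, here is a simplified, hypothetical stand-in for what the construction now does (the names, merge precedence, and hashing are illustrative, not the actual ModelConfig code): preprocessing defaults are merged into each feature first, and only then is proc_column derived.

```python
from copy import deepcopy
import hashlib
import json


def _proc_column(feature: dict) -> str:
    # Illustrative only: hash of the fully rendered preprocessing section.
    rendered = json.dumps(feature.get("preprocessing", {}), sort_keys=True)
    return f"{feature['column']}_{hashlib.md5(rendered.encode('utf-8')).hexdigest()}"


def build_model_config(raw_config: dict) -> dict:
    # Hypothetical, simplified stand-in for ModelConfig construction: merge the
    # per-type defaults into each feature's preprocessing first, and only then
    # derive proc_column, so the hash reflects the final preprocessing params.
    config = deepcopy(raw_config)
    features = config.get("input_features", []) + config.get("output_features", [])

    for feature in features:
        merged = dict(config.get("defaults", {}).get(feature["type"], {}).get("preprocessing", {}))
        merged.update(feature.get("preprocessing", {}))  # feature-specific overrides win
        feature["preprocessing"] = merged
        feature.setdefault("column", feature["name"])

    # proc_column is only generated here, after all preprocessing params are merged in.
    for feature in features:
        feature["proc_column"] = _proc_column(feature)

    return config
```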