
Filter out extra fields, deduplicate fields in ingestion #404

Merged

2 commits merged into feast-dev:master on Jan 6, 2020

Conversation

@zhilingc (Collaborator) commented Jan 4, 2020

Fixes #401, and additionally does some preprocessing of the feature row in the ValidateFeatureRowDoFn to (1) filter out extra fields and (2) deduplicate fields.

I'm still not 100% sure if we want to commit to this behaviour.

Pros:

  • Convenient
  • Not failing when extra columns are found makes ingestion jobs more forward-compatible
  • Consistent with behaviour of popular serialization methods

Cons:

  • Upstream issues will not be flagged to the user
  • Order of feature fields lost

That being said, deduplication is done on the entire field rather than by field name, so no information should be lost.
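To make the intended preprocessing concrete, here is a minimal, self-contained sketch of the two steps described above. The Field class and the example field names (driver_id, trips_today) are illustrative stand-ins only, not Feast's actual protobuf types or the implementation in this PR:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class FieldPreprocessingSketch {

  // Stand-in for Feast's Field proto: equality covers both name and value.
  static final class Field {
    final String name;
    final String value;

    Field(String name, String value) {
      this.name = name;
      this.value = value;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Field)) {
        return false;
      }
      Field other = (Field) o;
      return name.equals(other.name) && value.equals(other.value);
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, value);
    }

    @Override
    public String toString() {
      return name + "=" + value;
    }
  }

  public static void main(String[] args) {
    // Field names declared in the FeatureSet spec (hypothetical examples).
    Set<String> featureSetFields = new HashSet<>(Arrays.asList("driver_id", "trips_today"));

    List<Field> incoming =
        Arrays.asList(
            new Field("driver_id", "1001"),
            new Field("driver_id", "1001"),    // exact duplicate -> dropped
            new Field("trips_today", "7"),
            new Field("unknown_column", "x")); // not in the FeatureSet -> dropped

    // (1) Filter out extra fields, (2) deduplicate on the entire field, not just the name.
    // LinkedHashSet keeps the demo deterministic; the real implementation does not
    // guarantee field order (see the cons above).
    Set<Field> cleaned = new LinkedHashSet<>();
    for (Field field : incoming) {
      if (featureSetFields.contains(field.name)) {
        cleaned.add(field);
      }
    }

    System.out.println(new ArrayList<>(cleaned)); // [driver_id=1001, trips_today=7]
  }
}
```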

@feast-ci-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhilingc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhilingc changed the title from "Ingestion fixes" to "Filter out extra fields, deduplicate fields in ingestion" on Jan 4, 2020
String.format(
    "FeatureRow contains field '%s' which does not exist in FeatureSet '%s'. Please check the FeatureRow data.",
    field.getName(), featureSet.getReference());
// skip the extra field
Member

What if we logged statistics on unnecessary fields so that it doesn't result in an actual error?

Collaborator Author

Possible, but would it be OK if I opened a separate PR for this? It would involve introducing a new TupleTag (for warnings).
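For context, a minimal sketch of what routing warnings through an additional output tag could look like with Beam's multi-output ParDo; the tag names, the String element type (standing in for Feast's FeatureRow proto), and the toy validation check are placeholders, not part of this PR:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class WarningTagSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Hypothetical tags: validated rows on the main output, warnings on a side output.
    final TupleTag<String> validRowsTag = new TupleTag<String>() {};
    final TupleTag<String> warningsTag = new TupleTag<String>() {};

    PCollection<String> rows = pipeline.apply(Create.of("row-with-extra-field", "ok-row"));

    PCollectionTuple results =
        rows.apply(
            "ValidateRows",
            ParDo.of(
                    new DoFn<String, String>() {
                      @ProcessElement
                      public void process(ProcessContext context) {
                        String row = context.element();
                        if (row.contains("extra-field")) {
                          // Route a warning to the side output instead of failing the row.
                          context.output(warningsTag, "Dropped extra field from: " + row);
                        }
                        context.output(row); // main output: the (cleaned) row
                      }
                    })
                .withOutputTags(validRowsTag, TupleTagList.of(warningsTag)));

    // Downstream, results.get(warningsTag) could feed a logging or metrics-writing step.
    results.get(warningsTag);
    pipeline.run().waitUntilFinish();
  }
}
```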

@woop (Member) commented Jan 4, 2020

Is it possible to use our metrics client here? We won't get information on the specific fields, but at least we would know something is wrong.

Collaborator Author

We need to bubble the error down to the metrics-writing fn.
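(As an aside: below is a minimal sketch of counting dropped fields with Beam's built-in Metrics API, which surfaces counts through the runner rather than through the Feast metrics-writing fn discussed above; the counter name and the toy check are placeholders, not part of this PR.)

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Beam's built-in Metrics API, not Feast's metrics client: counts surface via the runner.
public class CountExtraFieldsDoFn extends DoFn<String, String> {

  private final Counter extraFieldsDropped =
      Metrics.counter(CountExtraFieldsDoFn.class, "extra_fields_dropped");

  @ProcessElement
  public void process(ProcessContext context) {
    String row = context.element();
    if (row.contains("extra-field")) { // placeholder check for an unexpected field
      extraFieldsDropped.inc();
    }
    context.output(row);
  }
}
```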

@@ -81,6 +82,7 @@ public void processElement(ProcessContext context) {
break;
}
}
fields.add(field);
Member

Will a collision happen here? And if so, how will it be handled?

Member

Would it make sense to detect collisions at the name level?

Collaborator Author

If there is a collision, the second element will not be written. I'm not sure I want to commit to detecting collisions at the name level at this point, because it could result in data loss that would go uncaught by the user.

Member

OK, so if we have two fields with the same name, what happens to the data when it is written to the store (in the most pathological case)?
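To illustrate the collision semantics discussed in this thread: deduplicating on the entire field collapses exact duplicates, while two fields that share a name but carry different values both survive, and which value ends up in the store then depends on the downstream writer. A minimal sketch, with name/value pairs standing in for Feast's Field protos:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CollisionSketch {
  public static void main(String[] args) {
    // Name/value pairs stand in for Feast Field protos; SimpleEntry has value-based equality.
    List<SimpleEntry<String, String>> fields =
        Arrays.asList(
            new SimpleEntry<>("trips_today", "7"),
            new SimpleEntry<>("trips_today", "7"),  // exact duplicate: collapsed
            new SimpleEntry<>("trips_today", "9")); // same name, different value: kept

    Set<SimpleEntry<String, String>> deduplicated = new LinkedHashSet<>(fields);

    // Prints [trips_today=7, trips_today=9] -- the name-level collision survives,
    // and which value wins is decided by whatever writes the row to the store.
    System.out.println(deduplicated);
  }
}
```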

@woop (Member) commented Jan 4, 2020

@zhilingc It's definitely a trade-off if we want to support feature rows that don't exactly comply with the feature set that has been defined, but I do quite like the idea of making Feast a little bit more forward-compatible and not overly brittle.

My intuition is that this approach is sound and that it's the lesser of two evils, especially when we consider our plans w.r.t. removing versions.

@woop mentioned this pull request on Jan 4, 2020
@zhilingc (Collaborator Author) commented Jan 5, 2020

/retry

@feast-ci-bot added the size/M label and removed the size/S label on Jan 6, 2020
@zhilingc (Collaborator Author) commented Jan 6, 2020

/retest

@woop (Member) commented Jan 6, 2020

/lgtm

@feast-ci-bot merged commit 3e25841 into feast-dev:master on Jan 6, 2020
@ches added the backport-candidate label (changes that may be desired for backport to earlier Feast release tracks) on May 4, 2020
Labels: approved, backport-candidate, lgtm, size/M

Successfully merging this pull request may close these issues:

  • Missing argument in error string in ValidateFeatureRowDoFn (#401)