
Filter out extra fields, deduplicate fields in ingestion #404

Merged

2 commits merged into feast-dev:master on Jan 6, 2020

Conversation

@zhilingc (Collaborator) commented Jan 4, 2020

Fixes #401, and additionally does some preprocessing of the feature row in the ValidateFeatureRowDoFn to (1) filter out extra fields and (2) deduplicate fields.

I'm still not 100% sure if we want to commit to this behaviour.

Pros:

  • Convenient
  • Not failing when extra columns are found makes ingestion jobs more forward-compatible
  • Consistent with behaviour of popular serialization methods

Cons:

  • Upstream issues will not be flagged to the user
  • Order of feature fields lost

That being said, deduplication is done on the entire field rather than by field name, so no information should be lost.
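To make the intended preprocessing concrete, here is a minimal, self-contained sketch of the two steps described above. The Field class and the example field names (driver_id, trips_today) are illustrative stand-ins only, not Feast's actual protobuf types or the implementation in this PR:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class FieldPreprocessingSketch {

  // Stand-in for Feast's Field proto: equality covers both name and value.
  static final class Field {
    final String name;
    final String value;

    Field(String name, String value) {
      this.name = name;
      this.value = value;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Field)) {
        return false;
      }
      Field other = (Field) o;
      return name.equals(other.name) && value.equals(other.value);
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, value);
    }

    @Override
    public String toString() {
      return name + "=" + value;
    }
  }

  public static void main(String[] args) {
    // Field names declared in the FeatureSet spec (hypothetical examples).
    Set<String> featureSetFields = new HashSet<>(Arrays.asList("driver_id", "trips_today"));

    List<Field> incoming =
        Arrays.asList(
            new Field("driver_id", "1001"),
            new Field("driver_id", "1001"),    // exact duplicate -> dropped
            new Field("trips_today", "7"),
            new Field("unknown_column", "x")); // not in the FeatureSet -> dropped

    // (1) Filter out extra fields, (2) deduplicate on the entire field, not just the name.
    // LinkedHashSet keeps the demo deterministic; the real implementation does not
    // guarantee field order (see the cons above).
    Set<Field> cleaned = new LinkedHashSet<>();
    for (Field field : incoming) {
      if (featureSetFields.contains(field.name)) {
        cleaned.add(field);
      }
    }

    System.out.println(new ArrayList<>(cleaned)); // [driver_id=1001, trips_today=7]
  }
}
```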

@feast-ci-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhilingc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhilingc changed the title from "Ingestion fixes" to "Filter out extra fields, deduplicate fields in ingestion" on Jan 4, 2020
String.format(
    "FeatureRow contains field '%s' which does not exist in FeatureSet '%s'. Please check the FeatureRow data.",
    field.getName(), featureSet.getReference());
// skip the extra field
Member

What if we logged statistics on unnecessary fields so that it doesn't result in an actual error?

Collaborator Author

Possible, but would it be OK if I opened a separate PR for this? It would involve introducing a new TupleTag (for warnings).
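For context, a minimal sketch of what routing warnings through an additional output tag could look like with Beam's multi-output ParDo; the tag names, the String element type (standing in for Feast's FeatureRow proto), and the toy validation check are placeholders, not part of this PR:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class WarningTagSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Hypothetical tags: validated rows on the main output, warnings on a side output.
    final TupleTag<String> validRowsTag = new TupleTag<String>() {};
    final TupleTag<String> warningsTag = new TupleTag<String>() {};

    PCollection<String> rows = pipeline.apply(Create.of("row-with-extra-field", "ok-row"));

    PCollectionTuple results =
        rows.apply(
            "ValidateRows",
            ParDo.of(
                    new DoFn<String, String>() {
                      @ProcessElement
                      public void process(ProcessContext context) {
                        String row = context.element();
                        if (row.contains("extra-field")) {
                          // Route a warning to the side output instead of failing the row.
                          context.output(warningsTag, "Dropped extra field from: " + row);
                        }
                        context.output(row); // main output: the (cleaned) row
                      }
                    })
                .withOutputTags(validRowsTag, TupleTagList.of(warningsTag)));

    // Downstream, results.get(warningsTag) could feed a logging or metrics-writing step.
    results.get(warningsTag);
    pipeline.run().waitUntilFinish();
  }
}
```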

@woop (Member) commented Jan 4, 2020

Is it possible to use our metrics client here? We won't get information on the specific fields, but at least we would know something is wrong.

Collaborator Author

We need to bubble the error down to the metrics-writing fn.
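(As an aside: below is a minimal sketch of counting dropped fields with Beam's built-in Metrics API, which surfaces counts through the runner rather than through the Feast metrics-writing fn discussed above; the counter name and the toy check are placeholders, not part of this PR.)

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Beam's built-in Metrics API, not Feast's metrics client: counts surface via the runner.
public class CountExtraFieldsDoFn extends DoFn<String, String> {

  private final Counter extraFieldsDropped =
      Metrics.counter(CountExtraFieldsDoFn.class, "extra_fields_dropped");

  @ProcessElement
  public void process(ProcessContext context) {
    String row = context.element();
    if (row.contains("extra-field")) { // placeholder check for an unexpected field
      extraFieldsDropped.inc();
    }
    context.output(row);
  }
}
```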

@@ -81,6 +82,7 @@ public void processElement(ProcessContext context) {
break;
}
}
fields.add(field);
Member

Will a collision happen here? And if so, how will it be handled?

Member

Would it make sense to detect collisions at the name level?

Collaborator Author

If there is a collision, the second element will not be written. I'm not sure I want to commit to detecting collisions at the name level at this point, because it could result in data loss that would go uncaught by the user.

Member

OK, so if we have two fields with the same name, what happens to the data when it is written to the store (in the most pathological case)?
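To illustrate the collision semantics discussed in this thread: deduplicating on the entire field collapses exact duplicates, while two fields that share a name but carry different values both survive, and which value ends up in the store then depends on the downstream writer. A minimal sketch, with name/value pairs standing in for Feast's Field protos:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CollisionSketch {
  public static void main(String[] args) {
    // Name/value pairs stand in for Feast Field protos; SimpleEntry has value-based equality.
    List<SimpleEntry<String, String>> fields =
        Arrays.asList(
            new SimpleEntry<>("trips_today", "7"),
            new SimpleEntry<>("trips_today", "7"),  // exact duplicate: collapsed
            new SimpleEntry<>("trips_today", "9")); // same name, different value: kept

    Set<SimpleEntry<String, String>> deduplicated = new LinkedHashSet<>(fields);

    // Prints [trips_today=7, trips_today=9] -- the name-level collision survives,
    // and which value wins is decided by whatever writes the row to the store.
    System.out.println(deduplicated);
  }
}
```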

@woop (Member) commented Jan 4, 2020

@zhilingc It's definitely a trade-off if we want to support feature rows that don't exactly comply with the feature set that has been defined, but I do quite like the idea of making Feast a little bit more forward-compatible and not overly brittle.

My intuition is that this approach is sound and that it's the lesser of two evils, especially when we consider our plans w.r.t. removing versions.

@woop mentioned this pull request on Jan 4, 2020
@zhilingc (Collaborator Author) commented Jan 5, 2020

/retry

@feast-ci-bot added the size/M label and removed the size/S label on Jan 6, 2020
@zhilingc (Collaborator Author) commented Jan 6, 2020

/retest

@woop (Member) commented Jan 6, 2020

/lgtm

@feast-ci-bot merged commit 3e25841 into feast-dev:master on Jan 6, 2020
@ches added the backport-candidate label (changes that may be desired for backport to earlier Feast release tracks) on May 4, 2020
Labels: approved, backport-candidate, lgtm, size/M

Successfully merging this pull request may close these issues:

  • Missing argument in error string in ValidateFeatureRowDoFn (#401)