Fix Online Serving unable to retrieve feature data after Feature Set update. #908

Merged
merged 16 commits into from
Aug 1, 2020
@mrzzy mrzzy commented Jul 30, 2020

Problem

Feast Online Serving throws an error when the user attempts Feature Retrieval in the following scenario:

  • user applies a feature set.
  • user ingests feature data for that feature set.
  • user updates the feature set to add or archive features.
  • user is unable to retrieve Feature Data from Online Serving and gets the following error:
*status.statusError: rpc error: code = DataLoss desc = Failed to decode FeatureRow from bytes retrieved from redis: Possible data corruption

What this PR does / why we need it:
Update Ingestion's RedisCustomIO to encode feature rows by setting each field's name to a hash of its actual name.

  • A hash is used instead of the full name to limit the increase in storage when encoded feature rows are stored in Redis.

This changes the encoding of Feature Rows stored in Redis.
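
To make the name hashing concrete, here is a minimal, dependency-free sketch. It is not the Feast code: `FieldNameHash` and `hashName` are hypothetical names, and MurmurHash3 (x86, 32-bit, seed 0) is written out by hand instead of calling Guava's `Hashing.murmur3_32()`. The hex rendering (the four bytes in little-endian order) is an assumption intended to mirror Guava's `HashCode.toString()`, and only plain ASCII field names are considered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Feast.
class FieldNameHash {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    // MurmurHash3 x86 32-bit with seed 0 over the given bytes.
    static int murmur3_32(byte[] data) {
        int h1 = 0; // seed
        int len = data.length;
        int blocks = len / 4;
        // Body: mix each full 4-byte little-endian block into the state.
        for (int i = 0; i < blocks; i++) {
            int o = i * 4;
            int k1 = (data[o] & 0xff)
                    | ((data[o + 1] & 0xff) << 8)
                    | ((data[o + 2] & 0xff) << 16)
                    | ((data[o + 3] & 0xff) << 24);
            k1 *= C1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= C2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }
        // Tail: mix the remaining 1 to 3 bytes, if any.
        int tail = blocks * 4;
        int k1 = 0;
        switch (len & 3) {
            case 3:
                k1 ^= (data[tail + 2] & 0xff) << 16; // fall through
            case 2:
                k1 ^= (data[tail + 1] & 0xff) << 8; // fall through
            case 1:
                k1 ^= (data[tail] & 0xff);
                k1 *= C1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= C2;
                h1 ^= k1;
        }
        // Finalization: fold in the length and avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Renders the 32-bit hash as 8 lowercase hex characters,
    // lowest byte first (assumed to match Guava's HashCode.toString()).
    static String hashName(String name) {
        int h = murmur3_32(name.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(8);
        for (int i = 0; i < 4; i++) {
            sb.append(String.format("%02x", (h >>> (8 * i)) & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example field names are made up for this sketch.
        for (String name : new String[] {"daily_transactions", "total_orders"}) {
            System.out.println(name + " -> " + hashName(name));
        }
    }
}
```

Whatever the length of the field name, the stored key is always 8 characters, which is what keeps the per-field overhead fixed.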

Update Online Serving's FeatureRowDecoder to support decoding Feature Rows by name hash.

  • Fields missing from the encoded Feature Row are decoded as empty values.
  • Extra fields in the encoded Feature Row are omitted from the decoded Feature Row.

FeatureRowDecoder continues to support decoding Feature Rows that are already stored in Redis in the existing encoding.
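
The decoding rules above can be sketched as follows. This is a simplified illustration, not the actual `FeatureRowDecoder`: an encoded row is modelled as a plain map from name hash to a string value, `NameHashDecoder` and its methods are hypothetical, and CRC32 from the JDK stands in for the murmur3 hash used in the PR so that the sketch has no dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.zip.CRC32;

// Hypothetical, simplified decoder for illustration only.
class NameHashDecoder {
    // Stand-in hash: CRC32 rendered as 8 hex chars (the PR uses murmur3_32).
    static String nameHash(String name) {
        CRC32 crc = new CRC32();
        crc.update(name.getBytes(StandardCharsets.UTF_8));
        return String.format("%08x", crc.getValue());
    }

    // encoded:   name hash -> stored value, as written at ingestion time.
    // specNames: feature names in the feature set as it exists now.
    static Map<String, Optional<String>> decode(
            Map<String, String> encoded, List<String> specNames) {
        Map<String, Optional<String>> decoded = new LinkedHashMap<>();
        for (String name : specNames) {
            // A feature added after ingestion has no stored value: decode as empty.
            decoded.put(name, Optional.ofNullable(encoded.get(nameHash(name))));
        }
        // Stored fields whose hash matches no current feature are never
        // looked up, so archived features drop out of the result.
        return decoded;
    }

    public static void main(String[] args) {
        // Row ingested while the feature set had features "a" and "b".
        Map<String, String> encoded = new HashMap<>();
        encoded.put(nameHash("a"), "1.0");
        encoded.put(nameHash("b"), "2.0");
        // Feature set later updated: "b" archived, "c" added.
        System.out.println(decode(encoded, Arrays.asList("a", "c")));
    }
}
```

Because lookup is keyed on the name hash and not on field position, adding or archiving features no longer shifts the meaning of the stored values.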

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

Encoding of Feature Rows stored in Redis hash changed:
- The name of each field in an encoded Feature Row is now set to a hash of the field's actual name.

.map(
    name -> {
      String nameHash =
          Hashing.murmur3_32().hashString(name, StandardCharsets.UTF_8).toString();

Member

@woop woop Jul 30, 2020

What is the length of the hash string? Just want to make sure it's as small as possible.

Collaborator Author

8 characters.

Collaborator Author

Assuming that the fields stored in the feature row are float values (32 bit, 4 bytes), this would mean a ~3x increase in space consumption.
@woop @pyalex @khorshuheng
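
The ~3x figure can be reproduced with a back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: it counts the raw value bytes plus one byte per hash character and ignores protobuf tag and length overhead. `StorageEstimate` and `growthFactor` are hypothetical names.

```java
// Hypothetical estimate for illustration only; ignores protobuf overhead.
class StorageEstimate {
    // Ratio of (value bytes + hash chars) to value bytes alone.
    static double growthFactor(int valueBytes, int hashChars) {
        return (double) (valueBytes + hashChars) / valueBytes;
    }

    public static void main(String[] args) {
        int hashChars = 8; // length of the hex hash string
        System.out.println("4-byte value: " + growthFactor(4, hashChars) + "x");
        System.out.println("8-byte value: " + growthFactor(8, hashChars) + "x");
    }
}
```

Under these assumptions a 4-byte float grows by a factor of 3, while an 8-byte value only doubles, so the relative cost shrinks as values get larger.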

Collaborator

@pyalex pyalex Aug 1, 2020

Not all stored fields are floats; there are a lot of strings and int64 values as well. So it is not so bad.

Collaborator

But I guess that we can safely cut the hash string to 4-5 chars.
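
One way to weigh a shorter hash is to estimate how likely two field names in the same feature set are to collide. The sketch below uses the standard birthday approximation and assumes the truncated hex hash is uniformly distributed; `TruncatedHashCollision` and `collisionProbability` are hypothetical names, and the field count of 100 is an arbitrary example.

```java
// Hypothetical estimate for illustration only; assumes a uniform hash.
class TruncatedHashCollision {
    // Birthday approximation: p = 1 - exp(-n(n-1) / (2N)),
    // with n fields and N = 16^hexChars possible hash values.
    static double collisionProbability(int fields, int hexChars) {
        double space = Math.pow(16, hexChars);
        return 1.0 - Math.exp(-(double) fields * (fields - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        int fields = 100; // example feature set size, chosen arbitrarily
        for (int hexChars : new int[] {4, 5, 8}) {
            System.out.printf(
                    "%d hex chars: p = %.6f%n",
                    hexChars, collisionProbability(fields, hexChars));
        }
    }
}
```

For a feature set with 100 fields this gives roughly a 7% chance of at least one collision at 4 hex characters and about 0.5% at 5, so truncation trades storage for a risk that grows with the number of fields.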

@feast-ci-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrzzy, pyalex


pyalex commented Aug 1, 2020

/test test-end-to-end-batch-dataflow

pyalex commented Aug 1, 2020

/lgtm

@pyalex pyalex added the kind/bug label Aug 1, 2020
@feast-ci-bot feast-ci-bot merged commit 6803457 into feast-dev:master Aug 1, 2020
@mrzzy mrzzy mentioned this pull request Aug 2, 2020
pyalex pushed a commit that referenced this pull request Aug 2, 2020
…update. (#908)

* Update RedisCustomIO to write FeatureRows with field's name set to hash of field.

* Update FeatureRowDecoder to decode by name hash instead of order

* Bump pytest order numbers by 2 to make space for new tests

* Revert "Bump pytest order numbers by 2 to make space for new tests"

This reverts commit aecc9a6e9a70be3fd84d04f81442b518be01a4c6.

* Added e2e to check that feature rows with missing or extra fields can be retrieved

* Clarify docs about Feature Row v1 encoding and Feature Row v2 encoding

* Fix python lint

* Update FeatureRowDecoder's isEncodedV2 check to use anyMatch()

* Make missing field/extra field e2e tests independent of other tests.

* Update FeatureRowDecoder if/else statement into 2 ifs

* Fix python and java lint

* Fix java unit test failures

* Fix ImportJobTest java unit test

* Sync github workflows with master

* Sync .github folder with master for fix

* Replace v1/v2 encoding with v1/v2 decoder in docs

4 participants