
Encode feature row before storing in Redis #530

Merged 2 commits into feast-dev:master on Mar 13, 2020

Conversation

khorshuheng
Collaborator

What this PR does / why we need it:
Currently, feature names are stored together with the values in Redis, which results in an excessive memory footprint when feature names are long strings.

This PR encodes the Feature Row prior to storing it in Redis. Encoding strategy:

  1. Entities are removed from the Redis value, as they already exist in the key.
  2. Feature names are removed, leaving only the values. The names can be reconstructed during serving from the corresponding feature set spec; since the feature set information is present in the key, the spec is retrievable.

In cases where the Feature Row comes from an external Kafka source, where it is possible for the feature row to be malformed, there are a few checks in place:

  1. It is fine for the feature row to have feature fields in a different order than specified in the feature set spec, as the encoding process sorts the features alphabetically by name.
  2. Duplicated fields are tolerated.
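The encoding steps above can be sketched as follows. This is an illustrative sketch with simplified stand-in types (`Field`, plain strings), not the actual Feast protobuf classes:

```java
import java.util.*;
import java.util.stream.*;

public class EncodeSketch {
    record Field(String name, String value) {}

    // Drop entity fields (already present in the Redis key), tolerate
    // duplicates, sort features alphabetically by name, keep values only.
    static List<String> encode(List<Field> row, Set<String> entityNames,
                               List<String> specFeatureNames) {
        Map<String, String> byName = new HashMap<>();
        for (Field f : row) {
            if (!entityNames.contains(f.name())) {
                byName.put(f.name(), f.value()); // duplicated fields: last one wins
            }
        }
        return specFeatureNames.stream()
                .sorted() // alphabetical order, independent of input order
                .map(byName::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Field> row = List.of(
                new Field("driver_id", "42"),   // entity: removed
                new Field("trips_today", "7"),
                new Field("avg_rating", "4.8"),
                new Field("trips_today", "7")); // duplicate: tolerated
        System.out.println(encode(row, Set.of("driver_id"),
                List.of("trips_today", "avg_rating"))); // [4.8, 7]
    }
}
```

Because the output order is derived from the sorted spec, a malformed input order does not change the stored byte layout.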

Good to have, but out of scope for this PR:

  1. Under normal circumstances, the decoding process should always succeed, unless the feature set spec obtained during serving is somehow inconsistent with the spec used to encode the feature row due to a bug. There should be a way to verify this at run time. One approach is to store a checksum/fingerprint of the FeatureSet configuration along with the value in Redis. This is not implemented in this PR, as the implementation details need more discussion.
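The fingerprint idea could look roughly like the following. This is a hypothetical sketch; `SpecFingerprint` and its hashing scheme are not part of this PR or the Feast codebase:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

public class SpecFingerprint {
    // Fingerprint a spec by hashing its (sorted) feature names.
    static String fingerprint(List<String> sortedFeatureNames) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(String.join(",", sortedFeatureNames)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        // Stored alongside the encoded value at ingestion time; at serving
        // time, a mismatch would reveal that the spec has drifted.
        String atWrite = fingerprint(List.of("avg_rating", "trips_today"));
        String atRead  = fingerprint(List.of("avg_rating", "trips_today"));
        System.out.println(atWrite.equals(atRead)); // true
    }
}
```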

Which issue(s) this PR fixes:

Fixes #515.

Does this PR introduce a user-facing change?:

After this patch, the ingestion job will start storing feature rows in encoded format. However, the serving job can still read non-encoded feature rows, so it is not necessary to re-ingest data written prior to the patch, unless the user wishes to reduce the memory footprint of the existing keys.
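One way serving can keep reading legacy rows is sketched below. This is an assumption-laden illustration with simplified stand-in types: it assumes encoded rows carry values only, while non-encoded rows still carry field names, so the presence of names distinguishes the two formats:

```java
import java.util.List;

public class BackwardCompatibleRead {
    record Field(String name, String value) {}

    // Encoded rows have empty field names; legacy rows do not.
    static boolean isEncoded(List<Field> storedFields) {
        return storedFields.stream().allMatch(f -> f.name().isEmpty());
    }

    public static void main(String[] args) {
        List<Field> legacy  = List.of(new Field("trips_today", "7")); // pre-patch row
        List<Field> encoded = List.of(new Field("", "7"));            // post-patch row
        System.out.println(isEncoded(legacy));  // false: use the stored names
        System.out.println(isEncoded(encoded)); // true: reconstruct names from spec
    }
}
```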


@khorshuheng
Collaborator Author

/test test-core-and-ingestion

private byte[] getValue(FeatureRow featureRow) {
  FeatureRowEncoder encoder =
      new FeatureRowEncoder(featureSets.get(featureRow.getFeatureSet()).getSpec());
  return encoder.encode(featureRow).toByteArray();
}
Member


Would this make more sense as a static method encode(featureRow, featureSet)?

Collaborator Author


Yup, a static method does make more sense, unless we need to support multiple encoding mechanisms, which is unlikely.
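The suggested static shape might look like this. A sketch only, with simplified stand-in types (`Field`, `Spec`) in place of the actual `FeatureRow`/`FeatureSetSpec` protobufs:

```java
import java.util.*;
import java.util.stream.*;

public class StaticEncoderSketch {
    record Field(String name, String value) {}
    record Spec(List<String> featureNames) {}

    // encode(featureRow, spec): no encoder instance needed as long as a
    // single encoding mechanism suffices.
    static List<String> encode(List<Field> featureRow, Spec spec) {
        Map<String, String> byName = featureRow.stream()
                .collect(Collectors.toMap(Field::name, Field::value, (a, b) -> b));
        return spec.featureNames().stream().sorted()
                .map(byName::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Spec spec = new Spec(List.of("b_feature", "a_feature"));
        List<Field> row = List.of(new Field("a_feature", "1"),
                                  new Field("b_feature", "2"));
        System.out.println(encode(row, spec)); // [1, 2]
    }
}
```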

 * @return boolean
 */
public Boolean isEncodingValid(FeatureRow featureRow) {
  return featureRow.getFieldsList().size() == spec.getFeaturesList().size();
}
Member


If you discover an old feature row (that is still within max age), how will you interpret it? Are we using the versions here to be able to interpret old feature rows?

Collaborator Author


The Redis key contains both the feature set name and version. Since both ingestion and serving encode/decode based on the feature set name and version specified in the Redis key, the spec should be consistent, and isEncodingValid should therefore always return true.

Originally I added this method to partially address the concern raised in #515:

If field values are only going to be associated with field names at runtime by external configuration has any thought been given to a method for ensuring that the same configuration that was used to write the data is the configuration used to read the data? Something such as a checksum/fingerprint of the FeatureSet configuration stored alongside the data in Redis (or in the key) will help prevent a mismatch of configuration and data due to a bug somewhere else in the system.

Perhaps I should remove this method from this PR?

Member


For this thread, and for the PR as a whole (especially the part quoted below): does this approach fall down and need to be reworked as versions get removed?

In cases where the Feature Row comes from an external Kafka source, where it is possible for the feature row to be malformed, there are a few checks in place:

  1. It is fine for the feature row to have feature fields in a different order than specified in the feature set spec, as the encoding process sorts the features alphabetically by name.

Member


Oh yeah, @zhilingc has brought this up in a comment thread on her versions removal RFC. Ouch 🤕

@khorshuheng
Collaborator Author

/test test-end-to-end-batch

@zhilingc
Collaborator

/lgtm

@zhilingc
Collaborator

/approve

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khorshuheng, zhilingc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit e7a1a39 into feast-dev:master Mar 13, 2020
khorshuheng added a commit to khorshuheng/feast that referenced this pull request Mar 16, 2020
* Encode feature row before storing in Redis

* Include encoding as part of RedisMutationDoFn

Co-authored-by: Khor Shu Heng <khor.heng@gojek.com>
khorshuheng added a commit that referenced this pull request Mar 16, 2020
* Encode feature row before storing in Redis

* Include encoding as part of RedisMutationDoFn

Co-authored-by: Khor Shu Heng <khor.heng@gojek.com>

Successfully merging this pull request may close these issues.

Optimize Redis memory foot print for ingestion / serving
5 participants