Feature set versioning #386

Closed
khorshuheng opened this issue Dec 23, 2019 · 3 comments · Fixed by #676

Comments

@khorshuheng
Collaborator

khorshuheng commented Dec 23, 2019

This issue was created to compile feedback from users regarding feature set versioning in Feast.

Feedback

  • Some users do not care about feature versioning at all and would like to always use the latest feature set. The current workaround is to make an API call to retrieve the latest version of a feature set, then use it to construct the correct feature id string (a sketch of this follows the list).
  • Bumping up the feature set version might be redundant in the case of backward-compatible changes, as neither the ingestion nor the retrieval workflow should be affected.
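
The workaround in the first point looks roughly like the sketch below. This is a minimal sketch assuming the Python SDK of this era: the `Client` construction, the `get_feature_set()` call, and the `name:version:feature` id format are assumptions for illustration, not verified API.

```python
# Minimal sketch of the "fetch latest version, then build the id" workaround.
# Client construction, get_feature_set(), and the id format are assumptions.
from feast import Client

client = Client(core_url="core.feast.example:6565")  # hypothetical endpoint

# Fetch the feature set without pinning a version; the response carries
# the latest registered version number.
feature_set = client.get_feature_set("customer_transactions")

# Manually assemble the fully qualified feature id string,
# e.g. "customer_transactions:3:total_spend".
feature_id = f"customer_transactions:{feature_set.version}:total_spend"
```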

Potential solutions and associated complexity

  • We can make the feature set version optional in the feature id. In the case where the feature set version is not specified, we should use the latest version. However, this raises the question of whether it should be handled on the SDK side or the server side. Either way, parsing the feature id string will become significantly more complicated (see the parser sketch after this list).
  • We can scrap feature set versioning and, in return, disallow any backward-incompatible changes to a feature set. For example, there should be no changes to data types and no column drops. A column can be marked as deprecated, however, and will be filtered out during ingestion and retrieval. The risk of removing feature set versioning is that there might be legitimate use cases for versioning, such as when multiple teams share a feature set and not all of them want to use the latest version.
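
To make the parsing complexity of the first option concrete, an optional-version feature id parser might look like the following. The `feature_set[:version]:feature` format and all names here are illustrative assumptions, not Feast's actual parsing code.

```python
# Illustrative parser for an optional-version feature id of the assumed
# form "feature_set[:version]:feature". A sketch, not Feast's parser.
from typing import NamedTuple, Optional

class FeatureRef(NamedTuple):
    feature_set: str
    version: Optional[int]  # None means "resolve to the latest version"
    feature: str

def parse_feature_id(feature_id: str) -> FeatureRef:
    parts = feature_id.split(":")
    if len(parts) == 3:
        name, version, feature = parts
        return FeatureRef(name, int(version), feature)
    if len(parts) == 2:
        name, feature = parts
        return FeatureRef(name, None, feature)
    raise ValueError(f"malformed feature id: {feature_id!r}")
```

A two-segment id resolves to the latest version; whether that resolution happens in the SDK or on the server is exactly the open question above.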

Comments from more Feast users will be useful in determining which direction we should pursue for future versions of Feast. In particular, it would be good to know how existing users are utilizing feature set versioning.

@woop
Member

woop commented Jan 4, 2020

One of the motivations for versions was to allow a Feast Serving deployment to subscribe to specific feature sets which don't change over time. This would decouple the users who modify feature sets from production systems.

The fear was that if we didn't have versions, users would implement their own versioning strategies in the feature set name itself, which would result in loads of feature sets in the global namespace.

My thoughts on the current state of Feast as it relates to versions:

  • Given that we have implemented project-level isolation, I believe that versions are even less useful than before, and in fact they simply add unnecessary complexity.
  • Data scientists have argued that versions don't add any value to their workflow.
  • @thirteen37 has argued that in most database systems it is up to the administrator to ensure safe migrations from one state to another. For example, if there is a production dependency on a table, then the administrator should ensure that the correct columns are added, data is backfilled, both the previous and future clients are supported, and a rollback can happen if something goes wrong. Taking on this responsibility within Feast adds a lot of complexity that seems outside of our capabilities and/or value proposition.

To get back to your proposals above, which I will number (1) and (2):

  1. This is already implemented. Users do not have to provide versions when requesting features for retrieval (see the sketch after this list). However, this only happens on the retrieval side, not ingestion. It also does not address all the issues with versions and the complexity they introduce.
  2. I am more inclined to take this approach. We can scrap versions altogether.
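
For reference, a retrieval call per point (1) might omit the version segment entirely and let Serving resolve the latest version. The method name and argument shapes below are assumptions about the SDK of this period, shown only to make the behaviour concrete.

```python
# Hypothetical versionless retrieval call; names and argument shapes are
# assumptions, not verified SDK signatures.
from feast import Client

client = Client(serving_url="serving.feast.example:6566")  # hypothetical

# The feature reference carries no version segment, so Serving resolves
# the latest version on the caller's behalf.
response = client.get_online_features(
    feature_refs=["customer_transactions:total_spend"],
    entity_rows=[{"customer_id": 1001}],
)
```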

@woop
Member

woop commented Jan 4, 2020

Below is a straw-man proposal for us to attack. Let's assume that we remove versions entirely. The functionality would then be as follows.

Terminology:

  • Ingestion: the Python SDK that writes to Kafka, or any stream that pushes feature rows to Kafka.
  • Population: jobs that consume feature rows from Kafka and populate a Feast store like Redis, BigQuery, Hive, or Cassandra.

Functionality:

  • Users can still create, register, and manage feature sets through the Core API as well as SDK.
  • Users can add new features. This adds a new column in an RDBMS like BigQuery or extra fields when writing to a KV store like Redis.
  • Users can delete features. This does not remove columns in an RDBMS, but simply marks them as deprecated. These columns/fields cannot be reused and cannot be queried. They are invisible to end users, but the data itself continues to exist in stores unless garbage collection is implemented (a small sketch of this behaviour follows the list).
  • Users cannot update the type of an existing feature. This could be implemented at a later date if necessary, but it does not seem worth the added development work.
  • Users cannot add, remove, or modify entities in their feature set. It's not clear that there is a strong need for this, and it's also not clear how this would impact certain stores that require indexing on entities.
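
A store-agnostic sketch of the deprecation behaviour described above; the registry structure and function names are hypothetical, not a proposed implementation.

```python
# Sketch of the "delete = deprecate" rule: deprecated features are hidden
# from ingestion and retrieval, and their names can never be reused.
# DEPRECATED and both functions are hypothetical names.
DEPRECATED = {"customer_transactions": {"old_total_spend"}}

def visible_fields(feature_set, fields):
    """Filter out deprecated fields so they are invisible to end users."""
    hidden = DEPRECATED.get(feature_set, set())
    return {name: value for name, value in fields.items() if name not in hidden}

def check_new_feature(feature_set, name):
    """Refuse to reuse a deprecated feature name."""
    if name in DEPRECATED.get(feature_set, set()):
        raise ValueError(f"{name!r} is deprecated in {feature_set!r} and cannot be reused")
```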

Implementation:

  • Versions are not exposed to end users, neither at the feature set level nor, especially, at feature retrieval. Optionally, we can still maintain version tracking as a form of metadata, but it would only be a counter that increments upon changes to a feature set. It would have no explicit functionality in our APIs.
  • Ingestion does not depend on versions. Users can do either a batch ingestion or streaming ingestion to a Kafka topic based purely on a feature set name, as well as the fields within a feature set at that point in time.
  • The population jobs that fill up stores don't care about versions. They parse source data on the stream based on their current understanding of feature sets and feature row contents (sketched after this list).
    • If there are missing or erroneous fields in a feature row, then a failure occurs and the row is sent to a dead-letter queue. "Deleted" features are allowed to be missing.
    • If there are additional/unwanted fields in a feature row, then these are ignored and the rest of the row is parsed. This ties into #404 (Filter out extra fields, deduplicate fields in ingestion).
    • When a population job writes to a store, it writes null values to any deleted feature/field.
  • If a change is made to a feature set then the feature set moves to an updating state. Batch ingestions cannot be started until these changes are applied. Once the changes are applied, batch ingestion can function again and will use the latest schema. Any ingestion that has already been running will still continue to run, including streaming ingestion.
  • Retrieval will allow for additional columns or fields to exist, but will simply ignore them. If fields are missing, then retrieval should fail. It's possible that different "versions" of a feature set could be looked up in a single query, for example multiple keys for an online retrieval could have different fields. As long as the correct fields exist for the feature set at that point in time, the query should succeed.
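
A compact sketch of the population-job rules above: missing active fields dead-letter the row, extra fields are ignored, and deleted fields are written as nulls. All names here are hypothetical; this illustrates the proposed behaviour, not Feast's job code.

```python
# Hypothetical per-row logic for a population job, per the rules above.
def process_row(row, schema, deprecated, dead_letter):
    active = schema - deprecated
    missing = active - row.keys()
    if missing:
        # Missing (non-deleted) fields are a failure: dead-letter the row.
        dead_letter.append({"row": row, "reason": f"missing fields: {sorted(missing)}"})
        return None
    out = {field: row[field] for field in active}      # extra fields are ignored
    out.update({field: None for field in deprecated})  # nulls for deleted fields
    return out

# Example: "old_total_spend" was deleted, "unexpected" is an extra field.
dead_letter = []
schema = {"total_spend", "txn_count", "old_total_spend"}
row = {"total_spend": 42.0, "txn_count": 7, "unexpected": "ignored"}
print(process_row(row, schema, {"old_total_spend"}, dead_letter))
# -> {'total_spend': 42.0, 'txn_count': 7, 'old_total_spend': None}  (key order may vary)
```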

@zhilingc
Collaborator

I've drafted a quick RFC for this, in hopes of picking this up soon (hopefully after the storage refactor).

https://docs.google.com/document/d/1P44LHd724JloQtpn10naAg5MlzcrkeOlN3QzAn9O8uY
