Feature set versioning #386

Closed
khorshuheng opened this issue Dec 23, 2019 · 3 comments · Fixed by #676

Comments

@khorshuheng
Collaborator

khorshuheng commented Dec 23, 2019

This issue was created to compile feedback from users regarding feature set versioning in Feast.

Feedback

  • Some users do not care about feature versioning at all and would like to always use the latest feature set. The current workaround is to make an API call to retrieve the latest version of a feature set, then use it to construct the correct feature id string (a sketch of this follows the list).
  • Bumping up the feature set version might be redundant in the case of backward-compatible changes, as neither the ingestion nor the retrieval workflow should be affected.
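
The workaround in the first point looks roughly like the sketch below. This is a minimal sketch assuming the Python SDK of this era: the `Client` construction, the `get_feature_set()` call, and the `name:version:feature` id format are assumptions for illustration, not verified API.

```python
# Minimal sketch of the "fetch latest version, then build the id" workaround.
# Client construction, get_feature_set(), and the id format are assumptions.
from feast import Client

client = Client(core_url="core.feast.example:6565")  # hypothetical endpoint

# Fetch the feature set without pinning a version; the response carries
# the latest registered version number.
feature_set = client.get_feature_set("customer_transactions")

# Manually assemble the fully qualified feature id string,
# e.g. "customer_transactions:3:total_spend".
feature_id = f"customer_transactions:{feature_set.version}:total_spend"
```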

Potential solutions and associated complexity

  • We can make the feature set version optional in the feature id. In the case where the feature set version is not specified, we should use the latest version. However, this raises the question of whether it should be handled on the SDK side or the server side. Either way, parsing the feature id string will become significantly more complicated (see the parser sketch after this list).
  • We can scrap feature set versioning and, in return, disallow any backward-incompatible changes to a feature set. For example, there should be no changes to data types and no column drops. A column can be marked as deprecated, however, and will be filtered out during ingestion and retrieval. The risk of removing feature set versioning is that there might be legitimate use cases for versioning, such as when multiple teams share a feature set and not all of them want to use the latest version.
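
To make the parsing complexity of the first option concrete, an optional-version feature id parser might look like the following. The `feature_set[:version]:feature` format and all names here are illustrative assumptions, not Feast's actual parsing code.

```python
# Illustrative parser for an optional-version feature id of the assumed
# form "feature_set[:version]:feature". A sketch, not Feast's parser.
from typing import NamedTuple, Optional

class FeatureRef(NamedTuple):
    feature_set: str
    version: Optional[int]  # None means "resolve to the latest version"
    feature: str

def parse_feature_id(feature_id: str) -> FeatureRef:
    parts = feature_id.split(":")
    if len(parts) == 3:
        name, version, feature = parts
        return FeatureRef(name, int(version), feature)
    if len(parts) == 2:
        name, feature = parts
        return FeatureRef(name, None, feature)
    raise ValueError(f"malformed feature id: {feature_id!r}")
```

A two-segment id resolves to the latest version; whether that resolution happens in the SDK or on the server is exactly the open question above.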

Comments from more Feast users will be useful in determining which direction we should pursue for future versions of Feast. In particular, it would be good to know how existing users are utilizing feature set versioning.

@woop
Member

woop commented Jan 4, 2020

One of the motivations for versions was to allow a Feast Serving deployment to subscribe to specific feature sets which don't change over time. This would decouple the users who modify feature sets from production systems.

The fear was that if we didn't have versions, users would implement their own versioning strategies in the feature set name itself, which would result in loads of feature sets in the global namespace.

My thoughts on the current state of Feast as it relates to versions:

  • Given that we have implemented project-level isolation, I believe that versions are even less useful than before, and in fact they simply add unnecessary complexity.
  • Data scientists have argued that versions don't add any value to their workflow.
  • @thirteen37 has argued that in most database systems it is up to the administrator to ensure safe migrations from one state to another. For example, if there is a production dependency on a table, then the administrator should ensure that the correct columns are added, data is backfilled, both the previous and future clients are supported, and a rollback can happen if something goes wrong. Taking on this responsibility within Feast adds a lot of complexity that seems outside of our capabilities and/or value proposition.

To get back to your proposals above, which I will number (1) and (2):

  1. This is already implemented. Users do not have to provide versions when requesting features for retrieval (see the sketch after this list). However, this only happens on the retrieval side, not ingestion. It also does not address all the issues with versions and the complexity they introduce.
  2. I am more inclined to take this approach. We can scrap versions altogether.
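
For reference, a retrieval call per point (1) might omit the version segment entirely and let Serving resolve the latest version. The method name and argument shapes below are assumptions about the SDK of this period, shown only to make the behaviour concrete.

```python
# Hypothetical versionless retrieval call; names and argument shapes are
# assumptions, not verified SDK signatures.
from feast import Client

client = Client(serving_url="serving.feast.example:6566")  # hypothetical

# The feature reference carries no version segment, so Serving resolves
# the latest version on the caller's behalf.
response = client.get_online_features(
    feature_refs=["customer_transactions:total_spend"],
    entity_rows=[{"customer_id": 1001}],
)
```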

@woop
Member

woop commented Jan 4, 2020

Below is a straw-man proposal for us to attack. Let's assume that we remove versions entirely. The functionality would then be as follows.

Terminology:

  • Ingestion: the Python SDK that writes to Kafka, or any stream that pushes feature rows to Kafka.
  • Population: jobs that consume feature rows from Kafka and populate a Feast store like Redis, BigQuery, Hive, or Cassandra.

Functionality:

  • Users can still create, register, and manage feature sets through the Core API as well as SDK.
  • Users can add new features. This adds a new column in an RDBMS like BigQuery or extra fields when writing to a KV store like Redis.
  • Users can delete features. This does not remove columns in an RDBMS, but simply marks them as deprecated. These columns/fields cannot be reused and cannot be queried. They are invisible to end users, but the data itself continues to exist in stores unless garbage collection is implemented (a small sketch of this behaviour follows the list).
  • Users cannot update the type of an existing feature. This could be implemented at a later date if necessary, but it does not seem worth the added development work.
  • Users cannot add, remove, or modify entities in their feature set. It's not clear that there is a strong need for this, and it's also not clear how this would impact certain stores that require indexing on entities.
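
A store-agnostic sketch of the deprecation behaviour described above; the registry structure and function names are hypothetical, not a proposed implementation.

```python
# Sketch of the "delete = deprecate" rule: deprecated features are hidden
# from ingestion and retrieval, and their names can never be reused.
# DEPRECATED and both functions are hypothetical names.
DEPRECATED = {"customer_transactions": {"old_total_spend"}}

def visible_fields(feature_set, fields):
    """Filter out deprecated fields so they are invisible to end users."""
    hidden = DEPRECATED.get(feature_set, set())
    return {name: value for name, value in fields.items() if name not in hidden}

def check_new_feature(feature_set, name):
    """Refuse to reuse a deprecated feature name."""
    if name in DEPRECATED.get(feature_set, set()):
        raise ValueError(f"{name!r} is deprecated in {feature_set!r} and cannot be reused")
```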

Implementation:

  • Versions are not exposed to end users, neither at the feature set level nor, especially, at feature retrieval. Optionally, we can still maintain version tracking as a form of metadata, but it would only be a counter that increments upon changes to a feature set. It would have no explicit functionality in our APIs.
  • Ingestion does not depend on versions. Users can do either a batch ingestion or streaming ingestion to a Kafka topic based purely on a feature set name, as well as the fields within a feature set at that point in time.
  • The population jobs that fill up stores don't care about versions. They parse source data on the stream based on their current understanding of feature sets and feature row contents (sketched after this list).
    • If there are missing or erroneous fields in a feature row, then a failure occurs and the row is sent to a dead-letter queue. "Deleted" features are allowed to be missing.
    • If there are additional/unwanted fields in a feature row, then these are ignored and the rest of the row is parsed. This ties into #404 (Filter out extra fields, deduplicate fields in ingestion).
    • When a population job writes to a store, it writes null values to any deleted feature/field.
  • If a change is made to a feature set then the feature set moves to an updating state. Batch ingestions cannot be started until these changes are applied. Once the changes are applied, batch ingestion can function again and will use the latest schema. Any ingestion that has already been running will still continue to run, including streaming ingestion.
  • Retrieval will allow for additional columns or fields to exist, but will simply ignore them. If fields are missing, then retrieval should fail. It's possible that different "versions" of a feature set could be looked up in a single query, for example multiple keys for an online retrieval could have different fields. As long as the correct fields exist for the feature set at that point in time, the query should succeed.
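
A compact sketch of the population-job rules above: missing active fields dead-letter the row, extra fields are ignored, and deleted fields are written as nulls. All names here are hypothetical; this illustrates the proposed behaviour, not Feast's job code.

```python
# Hypothetical per-row logic for a population job, per the rules above.
def process_row(row, schema, deprecated, dead_letter):
    active = schema - deprecated
    missing = active - row.keys()
    if missing:
        # Missing (non-deleted) fields are a failure: dead-letter the row.
        dead_letter.append({"row": row, "reason": f"missing fields: {sorted(missing)}"})
        return None
    out = {field: row[field] for field in active}      # extra fields are ignored
    out.update({field: None for field in deprecated})  # nulls for deleted fields
    return out

# Example: "old_total_spend" was deleted, "unexpected" is an extra field.
dead_letter = []
schema = {"total_spend", "txn_count", "old_total_spend"}
row = {"total_spend": 42.0, "txn_count": 7, "unexpected": "ignored"}
print(process_row(row, schema, {"old_total_spend"}, dead_letter))
# -> {'total_spend': 42.0, 'txn_count': 7, 'old_total_spend': None}  (key order may vary)
```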

@zhilingc
Collaborator

I've drafted a quick RFC for this, in hopes of picking this up soon (hopefully after the storage refactor).

https://docs.google.com/document/d/1P44LHd724JloQtpn10naAg5MlzcrkeOlN3QzAn9O8uY
