Add Labels to Source Object #835

Closed
davidheryanto opened this issue Jun 29, 2020 · 10 comments
Labels
kind/feature New feature or request

Comments

@davidheryanto
Collaborator

davidheryanto commented Jun 29, 2020

Is your feature request related to a problem? Please describe.
Currently, a source in Feast is defined as follows in the FeatureSet spec:

spec:
  name: transaction
  maxAge: 86400s
  entities:
  - name: transaction_id
  features:
  - name: amount
    valueType: FLOAT
  ...
  # Source definition
  source:
    kafkaSourceConfig:
      bootstrapServers: mybroker:9092
      topic: transaction
    type: KAFKA

From a management perspective, when we list all the sources we use in Feast, we sometimes want to filter or group them along organisational lines, for instance by the team managing the source or the country the source data comes from.

The current Source proto, however, does not allow us to attach such additional information.

Describe the solution you'd like
Add a labels field to the Source proto, similar to the labels field we currently have in FeatureSet:
https://github.com/feast-dev/feast/blob/master/protos/feast/core/FeatureSet.proto#L63

message Source {

    // The kind of data source Feast should connect to in order to retrieve FeatureRow value
    SourceType type = 1;

    // Source specific configuration
    oneof source_config {
        KafkaSourceConfig kafka_source_config = 2;
    }

    // Labels are key-value pairs attached to a Source object. They are intended to help users
    // organize Source objects in Feast. They do not directly imply semantics to the core system.
    //
    // Requirements for label keys and values:
    // - Keys have a min length of 1 character and max length of 63 characters. Values can
    //   be empty and have a max length of 63 characters.
    // - Can only contain lowercase letters, numeric characters, underscores and dashes. All
    //   characters must use UTF-8 encoding.
    // - Keys can only start with a lowercase letter.
    map<string, string> labels = 3;
}
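
For illustration, a minimal sketch of how a labeled Source could be constructed in Java once the proposed field exists. The feast.proto.core package name and the team/country label values are assumptions; putLabels/getLabelsMap are the accessors protobuf would generate for the proposed map<string, string> labels field.

import feast.proto.core.SourceProto.KafkaSourceConfig;
import feast.proto.core.SourceProto.Source;
import feast.proto.core.SourceProto.SourceType;

public class LabeledSourceExample {
  public static void main(String[] args) {
    // Build a Source equivalent to the YAML spec above, carrying the proposed labels.
    Source source =
        Source.newBuilder()
            .setType(SourceType.KAFKA)
            .setKafkaSourceConfig(
                KafkaSourceConfig.newBuilder()
                    .setBootstrapServers("mybroker:9092")
                    .setTopic("transaction"))
            // putLabels is the generated accessor for the proposed labels map field.
            .putLabels("team", "pricing")
            .putLabels("country", "id")
            .build();

    // Prints the attached labels, e.g. {team=pricing, country=id}.
    System.out.println(source.getLabelsMap());
  }
}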

Describe alternatives you've considered
N/A

Additional context
This label information in the Source object can potentially also be used to label the ingestion jobs that consume the respective sources, in order to help organize ingestion job objects.

@davidheryanto davidheryanto added the kind/feature New feature or request label Jun 29, 2020
@woop
Member

woop commented Jun 29, 2020

Unfortunately I think the labels need to be within the KafkaSourceConfig in order for them to be picked up by jobs. @mrzzy cmiiw

@davidheryanto
Collaborator Author

davidheryanto commented Jun 29, 2020

Looking at this, I believe the JobManager has access to the Source object, so it can access the labels field in the Source:
https://github.com/feast-dev/feast/blob/master/core/src/main/java/feast/core/job/dataflow/DataflowJobManager.java#L247

Unless you feel these labels should only apply to Sources of type KAFKA?
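
A hedged sketch of that idea: a job manager could copy the (proposed) Source labels into the labels it attaches to the job at creation time. This is not Feast code; the helper below and the feast.proto.core package name are assumptions, and only getLabelsMap() follows from the proposed proto change.

import feast.proto.core.SourceProto.Source;

import java.util.HashMap;
import java.util.Map;

public class SourceLabelPropagation {

  // Combine the labels Feast itself manages on a job with the user-provided labels from
  // the proposed Source.labels field. In DataflowJobManager the merged map could then be
  // handed to the runner when the job is created.
  static Map<String, String> jobLabels(Source source, Map<String, String> feastManagedLabels) {
    Map<String, String> labels = new HashMap<>(feastManagedLabels);
    labels.putAll(source.getLabelsMap()); // getLabelsMap() is generated for the proposed map field
    return labels;
  }
}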

@mrzzy
Collaborator

mrzzy commented Jun 30, 2020

Sources were generalized in #685 to contain a type and an encoded config field. The Job model was also updated to copy the type and config fields wholesale into the Job model, to be more representative of how the Ingestion Job works.

The way to get this to work with the least amount of engineering is to embed the labels in the KafkaSourceConfig. However, I agree with @davidheryanto that labels should not be coupled to Kafka. Getting generic source labels to work is more involved, as we need to update how we spawn/embed sources in jobs.

@woop
Member

woop commented Jul 1, 2020

Sources were generalized in #685 to contain a type and an encoded config field. The Job model was also updated to copy the type and config fields wholesale into the Job model, to be more representative of how the Ingestion Job works.

The way to get this to work with the least amount of engineering is to embed the labels in the KafkaSourceConfig. However, I agree with @davidheryanto that labels should not be coupled to Kafka. Getting generic source labels to work is more involved, as we need to update how we spawn/embed sources in jobs.

This problem probably indicates that we have chosen a poor abstraction with config and type.

@davidheryanto
Collaborator Author

davidheryanto commented Jul 1, 2020

OK, so you feel we should not add a labels field to the Source proto? @mrzzy @woop

The main objective is to add metadata to Source objects so that we can organize and categorize the Sources we have.

@mrzzy
Collaborator

mrzzy commented Jul 4, 2020

@davidheryanto Under the data model, each source is 1:1 with each Feature Set. This means each source is intrinsically tied to its Feature Set rather than being an independent Source object (i.e. it might be more fitting to look at each source as an embedded field of a Feature Set). Furthermore, the current data model allows sources to be duplicated (i.e. same config and type), which makes attempting to categorize and attribute them difficult even if this were implemented.

I think a viable solution to this problem is to label the Feature Sets. Feature Sets are fundamentally a data ingestion concept, defining a data source/schema for how data is sourced into Feast. As an added plus, Feature Sets already support labeling, and we also recently added an API to filter Feature Sets by labels.

As for automatically labeling Ingestion Jobs with labels, we could add automatic propagation of the labels from the Feature Sets to the Ingestion Jobs.
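
As a side note on the label-filter API mentioned above, here is a minimal client-side sketch of selecting Feature Sets by label; the server-side filter does the equivalent for you. The feast.proto.core package name is an assumption, while getSpec()/getLabelsMap() are the accessors generated for the existing FeatureSetSpec labels field.

import feast.proto.core.FeatureSetProto.FeatureSet;

import java.util.List;
import java.util.stream.Collectors;

public class FeatureSetLabelFilter {

  // Keep only the Feature Sets whose spec carries the given label key/value pair.
  static List<FeatureSet> filterByLabel(List<FeatureSet> featureSets, String key, String value) {
    return featureSets.stream()
        .filter(fs -> value.equals(fs.getSpec().getLabelsMap().get(key)))
        .collect(Collectors.toList());
  }
}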

@woop
Member

woop commented Jul 4, 2020

As for automatically labeling Ingestion Jobs with labels, we could add automatic propagation of the labels from the Feature Sets to the Ingestion Jobs.

I actually quite like this idea.

@davidheryanto
Collaborator Author

davidheryanto commented Jul 5, 2020

This is not directly related to the issue topic, but regarding the propagation of labels from FeatureSet objects to IngestionJobs:

As for automatically labeling Ingestion Jobs with labels, we could add automatic propagation of the labels from the Feature Sets to the Ingestion Jobs

Since, I believe, an IngestionJob can currently ingest data for multiple FeatureSets, what should our merge approach be for the labels in the IngestionJob when multiple FeatureSets contain different labels?

For example, an ingestion job that involves these 3 FeatureSets:

  1. FeatureSet1, labels: key1:val1, key2:val2

  2. FeatureSet2, labels: key1:val1-alt, key3:val3

  3. FeatureSet3, labels: key2:val2-alt

What should be the final labels in the IngestionJob?

I think appending the values for labels with the same key may not be suitable (e.g. for label key key1 the value is val1,val1-alt) because the resulting label values may be too long and not that useful for categorization.

@mrzzy
Collaborator

mrzzy commented Jul 7, 2020

I am in favor of appending the values from the individual Feature Sets when the label name conflicts. This does not mean that multiple values are concatenated into a single value, but rather that a single label on an Ingestion Job can be associated with a set of unique values.

In the case you described, the labels exposed by the Ingestion Job would look something like this:

  • key1: Set<val1, val1-alt>
  • key2: Set<val2, val2-alt>
  • key3: Set<val3>

I think appending the values for labels with the same key may not be suitable (e.g. for label key key1 the value is val1,val1-alt) because the resulting label values may be too long and not that useful for categorization.

A less abstract walkthrough of how this might be used for categorization:

  1. FeatureSet1: team: pricing, product: pricing, model: xgboost
  2. FeatureSet2: team: pricing, model: dnn
  3. FeatureSet3: team: marketing, product: incentives

The corresponding Ingestion Job would expose the following labels:

  • team: Set<pricing, marketing>
  • product: Set<incentives, pricing>
  • model: Set<xgboost, dnn>

The Ingestion Job listing API added in #548 can be extended to allow matching of labels based on any of the associated values, so that either team=pricing or team=marketing would list the Ingestion Job in question.
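
A rough sketch of the set-based merge described above, in plain Java (Java 9+ for List.of/Map.of). The Feature Set and Ingestion Job types are elided; only the label maps are shown, and the method name is hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LabelMergeExample {

  // Merge per-FeatureSet labels into multi-valued job labels: a conflicting key keeps
  // every distinct value instead of being overwritten or concatenated into one string.
  static Map<String, Set<String>> mergeLabels(List<Map<String, String>> featureSetLabels) {
    Map<String, Set<String>> merged = new HashMap<>();
    for (Map<String, String> labels : featureSetLabels) {
      for (Map.Entry<String, String> label : labels.entrySet()) {
        merged.computeIfAbsent(label.getKey(), k -> new HashSet<>()).add(label.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Map<String, String>> featureSetLabels =
        List.of(
            Map.of("team", "pricing", "product", "pricing", "model", "xgboost"),
            Map.of("team", "pricing", "model", "dnn"),
            Map.of("team", "marketing", "product", "incentives"));

    // Prints the merged labels, e.g. {product=[pricing, incentives], model=[dnn, xgboost],
    // team=[pricing, marketing]} (map/set ordering is not guaranteed).
    System.out.println(mergeLabels(featureSetLabels));
  }
}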

@davidheryanto
Collaborator Author

Again, this is unrelated to the issue title, but rather about how to use the available labels from Feast resources to label ingestion jobs:

I think that since we have moved to providing FeatureSet specs (dynamically) AFTER ingestion job creation, versus BEFORE job creation previously, it is trickier to propagate the labels from FeatureSets to ingestion jobs.

This is because (at least for Dataflow jobs) labels can only be provided at job creation time. This means that, with our current approach, we have no FeatureSet label information available at Dataflow job creation time to propagate to the Dataflow job.

I will close this issue if we have no further alternatives for now.
@woop @mrzzy
