Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalesce FeatureRows for improved "latest" value consistency in serving stores #88

Closed
tims opened this issue Jan 18, 2019 · 6 comments
Closed
Assignees
Labels
kind/feature New feature or request

Comments

@tims
Copy link
Contributor

tims commented Jan 18, 2019

Is your feature request related to a problem? Please describe.
Now that we plan to store features in the serving store as Latest only, instead of the full history. We need to add logic to ensure writes of feature values from older rows don't clobber new ones. This is especially a problem for batch, but we'd like to solve it generally for streaming as well.

Describe the solution you'd like
For batch streams, we will combine feature rows by key. Where the key is entity name + entity key.

For streaming, we will also do this, but we emit rows with a timer, keeping the previous value in state for future combines. We will evict state with another timeout timer, this will allow for very large key infrequently updated key spaces to be less demanding.

The feature rows should be stored in the warehouse store in the combined form, not as their raw separate rows, because we want it to reflect the rows as they are written to the serving stores. This ensures that a the equivalent of a Select * AS OF, query would reflect what was available in the serving stores.

We should also make sure that we can turn this feature off as an option in each import spec, as this could introduce performance hits in very high throughput streams with large backlogs. And is also not required for batch jobs that know they only have one row per entity key.

Describe alternatives you've considered
Alternatives are to keep each feature row separately in the serving stores and combine them in the serving api (which is what we started with). This is not as performant as many use cases require, and we'd prefer to optimize for serving speed. We are willing to accept that we cannot guarantee true consistency using the above approach, as it's possible that data already in the store may conflict with data being ingested in a new job. But we will assume that incoming feature data is eventually consistent and will overwrite any issues.

Additional context
The PR I'm working on for this issue is WIP and will require #87 to be finished first, as we should be grouping on the {entityName}+{entityKey}, not with the event timestamp + granularity. It would be more work to deal with the latest values in the stores store separately, so it's best we wait.

@tims
Copy link
Contributor Author

tims commented Jan 18, 2019

/assign tims

@tims
Copy link
Contributor Author

tims commented Jan 18, 2019

Should we have this enabled or disabled by default?

@tims tims added the kind/feature New feature or request label Jan 18, 2019
@tims tims changed the title Coalesce FeatureRows for improve "latest" values consistency in serving stores Coalesce FeatureRows for improved "latest" value consistency in serving stores Jan 18, 2019
@zhilingc
Copy link
Collaborator

I don't think you should coalesce rows for the warehouse store. The warehouse store should have the most complete dataset possible, for all timestamps, regardless whether the feature rows were late or not.

@pradithya
Copy link
Collaborator

Should we have this enabled or disabled by default?

I think it should be ON by default as it impact the correctness of ingestion result

@tims
Copy link
Contributor Author

tims commented Feb 8, 2019

Pushed change to PR so it defaults to true.

Also, @zhilingc the PR no longer coalesce's rows writing to the warehouse, they get written as is.

@tims tims closed this as completed Feb 12, 2019
@tims
Copy link
Contributor Author

tims commented Feb 12, 2019

Closed with merge of Coalesce rows PR #89

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants