Make materialization more scalable + performant #2594

Closed
adchia opened this issue Apr 21, 2022 · 2 comments
Labels: Community Contribution Needed, kind/feature, kind/project, priority/p0, wontfix

Comments

adchia (Collaborator) commented Apr 21, 2022

This issue collects common problems users face when materializing features to the online store in Feast.

User problems

Users with large datasets often struggle to load data into the online store reliably enough to meet their online serving needs.

1. Materialization in the default provider is not scalable

As per #2071,

Currently, the materialization process loads all the data from the Offline Store to an Arrow table, then converts all the data to Protobuf, then writes all the data to the Online Store. This process requires holding the entire dataset in memory which is not practical.
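One way to avoid holding the whole dataset in memory is to stream it through the convert and write steps in fixed-size batches. The following is a minimal sketch of that idea, not Feast's implementation: `batched`, `materialize_in_batches`, `convert`, and `write_batch` are hypothetical names introduced here, and rows are modeled as plain dictionaries rather than Arrow or Protobuf objects.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows, reading the input lazily."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def materialize_in_batches(
    rows: Iterable[Row],
    convert: Callable[[Row], Any],            # hypothetical: row -> online-store record
    write_batch: Callable[[List[Any]], None],  # hypothetical: writes one batch online
    batch_size: int = 1000,
) -> int:
    """Convert and write rows batch by batch; returns the number of rows written.

    Only one batch is held in memory at a time, as long as `rows` itself is lazy.
    """
    written = 0
    for batch in batched(rows, batch_size):
        write_batch([convert(row) for row in batch])
        written += len(batch)
    return written
```

With a lazy source (for example a generator over offline store results), peak memory is bounded by `batch_size` rather than by the size of the dataset.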

2. Materialization can be slow

For users who work with a small number of feature views and do not have a large number of unique entities, Feast's Python-based materialization works fine. However, this does not hold true for many users.

The default provider is slow to materialize data. Users report incremental materialization taking multiple hours or, worse, never completing.

Users have had to build custom providers to solve this (e.g. by kicking off Dataflow or Spark jobs to materialize large amounts of data more quickly).
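The core idea behind such custom providers is to split the data into partitions and write them in parallel. The sketch below illustrates that with a thread pool standing in for a distributed engine; it is an assumption-laden toy, not how Dataflow, Spark, or Feast do it. `partition_for`, `materialize_partitioned`, `write_partition`, and the `entity_key` field are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def partition_for(entity_key: str, num_partitions: int) -> int:
    """Map an entity key to a partition using a hash that is stable across runs."""
    digest = hashlib.md5(entity_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def materialize_partitioned(
    rows: Iterable[Row],
    write_partition: Callable[[List[Row]], int],  # hypothetical: returns rows written
    num_partitions: int = 4,
) -> int:
    """Group rows by entity key hash and write each group on its own worker."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row["entity_key"], num_partitions)].append(row)

    # One worker per partition; a real job would run these on separate machines
    # and have each worker read its own slice from the offline store.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        counts = list(pool.map(write_partition, partitions))
    return sum(counts)
```

Because all rows for a given entity key land in the same partition, no two workers write the same entity at the same time. For brevity this sketch groups rows in memory first, which a production job would avoid.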

3. Materialization is not always reliable

There are several Datastore-specific issues, such as #2027 and #2323, where batch write transactions can time out:

File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 69, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 The referenced transaction has expired or is no longer valid.

In Datastore, there are also contention errors (#1575):

Materializing 1 feature views from 2021-04-29 21:19:00-07:00 to 2021-04-29 21:19:05-07:00 into the datastore online store.
my_fv:
  8%|████▋                                                    | 2720/33120 [00:07<01:18, 388.10it/s]

google.api_core.exceptions.Aborted: 409 too much contention on these datastore entities. please try again. entity groups:
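Errors like the contention abort above are typically transient, and a common mitigation is to retry the failed batch with exponential backoff and jitter. The sketch below shows that pattern under stated assumptions: `TransientWriteError` is a hypothetical stand-in for whatever retryable exceptions the online store client raises, and `write_with_retry` and `write_batch` are illustrative names, not part of Feast or the Datastore client.

```python
import random
import time
from typing import Any, Callable, List


class TransientWriteError(Exception):
    """Hypothetical stand-in for retryable errors such as contention aborts."""


def write_with_retry(
    write_batch: Callable[[List[Any]], Any],  # hypothetical online-store writer
    batch: List[Any],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call write_batch(batch), retrying transient failures with backoff.

    The delay cap doubles after each failed attempt (up to max_delay), and the
    actual wait is drawn uniformly from [0, cap] so concurrent writers spread out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_batch(batch)
        except TransientWriteError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

Retries help only if each attempt is a fresh write; an expired transaction, as in the first traceback, would need a new transaction per attempt, and smaller batches reduce the chance of hitting the timeout in the first place.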

Describe the solution you'd like


There are multiple ways of addressing this. Some ideas

adchia added the kind/feature and priority/p1 labels on Apr 21, 2022
adchia added the priority/p0 and Community Contribution Needed labels and removed the priority/p1 label on Apr 21, 2022
adchia added the kind/project label on May 24, 2022
adchia assigned achals and unassigned tsotnet on May 24, 2022
achals (Member) commented Jun 26, 2022

RFC for scalable and pluggable data loading: https://docs.google.com/document/d/1J7XdwwgQ9dY_uoV9zkRVGQjK9Sy43WISEW6D5V9qzGo/edit


(July 20th)

This has now landed and is live on Feast master: https://rtd.feast.dev/en/master/index.html#module-feast.infra.materialization

stale bot commented Dec 20, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Dec 20, 2022
stale bot closed this as completed on Dec 31, 2022
Projects
Status: Done
3 participants