
docs: Feature Repo, Types and FAQ doc updates #3049

Merged: 2 commits, Aug 9, 2022
15 changes: 12 additions & 3 deletions docs/getting-started/concepts/data-ingestion.md
@@ -1,6 +1,6 @@
# Data ingestion

### Data source
## Data source

The data source refers to the raw underlying data (e.g. a table in BigQuery).

@@ -18,7 +18,7 @@ Feast supports primarily **time-stamped** tabular data as data sources. There ar
* **\[Alpha] Stream sources** allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
* **(Experimental) Request data sources:** This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into **on-demand feature views**, which allow lightweight feature engineering and combining features across sources.

### Batch data ingestion
## Batch data ingestion

Ingesting from batch sources is only necessary to power real-time models. This is done through **materialization**. Under the hood, Feast manages an _offline store_ (to scalably generate training data from batch sources) and an _online store_ (to provide low-latency access to features for real-time models).

@@ -58,6 +58,8 @@ materialize_python = PythonOperator(

<summary>Code example: CLI based materialization</summary>



#### How to run this in the CLI

```bash
# Illustrative: incrementally materialize features up to the current time.
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

@@ -77,7 +79,14 @@ materialize_bash = BashOperator(

</details>

### Stream data ingestion
### Batch data schema inference

If the `schema` parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during `feast apply`.
How it does so depends on the offline store implementation. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse, or, if a query is provided to the source, by running the query with a `LIMIT` clause and inspecting the result.
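
For example, a source defined without a `schema` leaves inference to `feast apply`. A minimal sketch (the source name, table, and timestamp column below are illustrative):

```python
from feast import BigQuerySource

# No `schema` argument: during `feast apply`, Feast infers column names
# and types by inspecting the warehouse table (or, for a query-based
# source, by running the query with a LIMIT clause).
driver_stats_source = BigQuerySource(
    name="driver_stats",                         # illustrative name
    table="my_project.my_dataset.driver_stats",  # illustrative table
    timestamp_field="event_timestamp",
)
```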


## Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.

14 changes: 14 additions & 0 deletions docs/getting-started/concepts/feast-types.md
@@ -0,0 +1,14 @@
# Data Types in Feast

Feast frequently has to mediate data across platforms and systems, each with its own unique type system.
To make this possible, Feast itself has a type system for all the types it is able to handle natively.

Feast's type system is built on top of [protobuf](https://github.com/protocolbuffers/protobuf). The messages that make up the type system can be found [here](https://github.com/feast-dev/feast/blob/master/protos/feast/types/Value.proto), and the corresponding Python classes that wrap them can be found [here](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/types.py).

Feast supports primitive data types (numerical values, strings, bytes, booleans, and timestamps). The only complex data type Feast supports is the array, and arrays cannot contain other arrays.
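
As a sketch of how these types appear in user code (the field names are illustrative):

```python
from feast import Field
from feast.types import Array, Float32, Int64, String

# Primitives plus Array; nested arrays (e.g. Array(Array(String)))
# are not supported.
fields = [
    Field(name="driver_id", dtype=Int64),
    Field(name="conv_rate", dtype=Float32),
    Field(name="recent_trip_cities", dtype=Array(String)),
]
```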

Each feature or schema field in Feast is associated with a data type, which is stored in Feast's [registry](registry.md). These types are also used to ensure that Feast operates on values correctly (e.g. making sure that timestamp columns used for [point-in-time correct joins](point-in-time-joins.md) actually have the timestamp type).

As a result, each system that Feast interacts with needs a way to translate data types from the native platform into a Feast type. For example, Snowflake SQL types are converted to Feast types [here](https://rtd.feast.dev/en/master/feast.html#feast.type_map.snowflake_python_type_to_feast_value_type). The onus is therefore on authors of offline or online store connectors to make sure that this type mapping happens correctly.

**Note**: Feast currently does *not* support a null type in its type system.
13 changes: 13 additions & 0 deletions docs/getting-started/concepts/feature-repo.md
@@ -0,0 +1,13 @@
# Feature Repository
**Collaborator:** @achals oops just realized we already have a page for feature repos

we should make sure to merge this in with that page somehow

**Member (Author):** I'll merge


## Feature Repo

A feature repository is a collection of Python files that define entities, feature views, and data sources. A feature repo also has a `feature_store.yaml` file at its root.

Users can collaborate by making and reviewing changes to Feast object definitions (feature views, entities, etc.) in the feature repo.
These objects must then be applied, through either the API or the CLI, before they are available to downstream Feast operations (such as materialization or retrieving online features). Internally, Feast looks only at the registry when performing these operations, not at the feature repo directly.
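
A minimal sketch of one such file, assuming a recent Feast release (all names and paths are illustrative):

```python
# example_repo.py -- one Python file inside a feature repo
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_stats_source,
)
```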

## Declarative Feature Definitions

When using the CLI to apply changes (via `feast apply`), the CLI determines the state of the feature repo from the source files and updates the registry state to reflect the definitions in the feature repo files.
This means that new feature views are added to the registry, existing feature views are updated as necessary, and Feast objects removed from the source files are deleted from the registry.
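
Applying can also be done programmatically through the Python SDK. A hedged sketch, reusing the illustrative objects from the repo file above:

```python
from feast import FeatureStore

# Equivalent in spirit to running `feast apply` in the repo root:
# registers the objects in the registry so that materialization and
# online retrieval can see them.
store = FeatureStore(repo_path=".")
store.apply([driver, driver_stats_source, driver_stats_fv])
```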
2 changes: 2 additions & 0 deletions docs/getting-started/concepts/feature-view.md
@@ -150,6 +150,8 @@ Together with [data sources](data-ingestion.md), they indicate to Feast where to

Feature names must be unique within a [feature view](feature-view.md#feature-view).

Each field can have additional metadata associated with it, specified as key-value [tags](https://rtd.feast.dev/en/master/feast.html#feast.field.Field).
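
A sketch of attaching tags to a field (the keys and values are illustrative):

```python
from feast import Field
from feast.types import Float32

conv_rate = Field(
    name="conv_rate",
    dtype=Float32,
    # Arbitrary key-value metadata stored alongside the field.
    tags={"owner": "ride-team", "description": "Driver conversion rate"},
)
```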

## \[Alpha] On demand feature views

On demand feature views allow data scientists to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths, as sketched below.
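
A minimal sketch of an on-demand feature view, assuming an existing feature view `driver_stats_fv` and a request source (exact decorator arguments vary across Feast versions):

```python
import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Request-time input: only available when the prediction request arrives.
vals_to_add = RequestSource(
    name="vals_to_add",
    schema=[Field(name="val_to_add", dtype=Int64)],
)

@on_demand_feature_view(
    sources=[driver_stats_fv, vals_to_add],  # driver_stats_fv assumed to exist
    schema=[Field(name="conv_rate_plus_val", dtype=Float64)],
)
def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
    # Runs in both the historical and online retrieval paths.
    df = pd.DataFrame()
    df["conv_rate_plus_val"] = features_df["conv_rate"] + features_df["val_to_add"]
    return df
```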
13 changes: 12 additions & 1 deletion docs/getting-started/faq.md
@@ -46,7 +46,7 @@ It is a good idea though to lock down the registry file so only the CI/CD pipeli

### Does Feast support streaming sources?

Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support [push based ingestion](../reference/data-sources/push.md). Streaming transformations are actively being worked on.
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support [push based ingestion](../reference/data-sources/push.md). Feast also defines a [stream processor](../tutorials/building-streaming-features.md) that allows a deeper integration with stream sources.
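
As a sketch, pushing a freshly computed row through the Push API might look like this (the push source name and columns are illustrative):

```python
import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

event_df = pd.DataFrame({
    "driver_id": [1001],
    "event_timestamp": [pd.Timestamp.now(tz="UTC")],
    "conv_rate": [0.85],
})

# Writes the rows to the online store via the registered push source.
store.push("driver_stats_push_source", event_df)
```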

### Does Feast support feature transformation?

@@ -94,6 +94,17 @@ The list of supported offline and online stores can be found [here](../reference

Yes. Using a GCP or AWS provider in `feature_store.yaml` primarily sets default offline / online stores and configures where the remote registry file can live (Using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.

### What is the difference between a data source and an offline store?

The data source and the offline store are closely tied but distinct concepts.
The offline store controls how Feast talks to a data store for historical feature retrieval, while a data source points to a specific table (or query) within a data store. Offline stores are infrastructure-level connectors to data stores like Snowflake (see the sketch after this list).

Additional differences:

- Data sources may be specific to a project (e.g. feed ranking), but offline stores are agnostic and used across projects.
- A Feast project may define several data sources that power different feature views, but a Feast project has a single offline store.
- Feast users typically need to define data sources when using Feast, but only need to use or configure existing offline stores rather than creating new ones.
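
To make the contrast concrete, a hedged sketch of a data source declaration (the offline store itself would be configured separately in `feature_store.yaml`; all names below are illustrative):

```python
from feast import SnowflakeSource

# Points at one specific table; the Snowflake offline store configured
# in feature_store.yaml is the connector that executes queries against it.
driver_stats = SnowflakeSource(
    database="FEAST",
    schema="PUBLIC",
    table="DRIVER_STATS",
    timestamp_field="EVENT_TIMESTAMP",
)
```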

### How can I add a custom online store?

Please follow the instructions [here](../how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md).
5 changes: 5 additions & 0 deletions docs/tutorials/using-scalable-registry.md
@@ -13,6 +13,11 @@ However, there are inherent limitations with a file-based registry, since changing

An alternative to the file-based registry is the [SQLRegistry](https://rtd.feast.dev/en/latest/feast.infra.registry_stores.html#feast.infra.registry_stores.sql.SqlRegistry) which ships with Feast. This implementation stores the registry in a relational database, and allows for changes to individual objects atomically.
Under the hood, the SQL Registry implementation uses [SQLAlchemy](https://docs.sqlalchemy.org/en/14/) to abstract over the different databases. Consequently, any [database supported](https://docs.sqlalchemy.org/en/14/core/engines.html#supported-databases) by SQLAlchemy can be used by the SQL Registry.
The following databases are supported and tested out of the box:
- PostgreSQL
- MySQL
- SQLite

Feast can use the SQL Registry via a config change in the `feature_store.yaml` file. An example configuration:

```yaml
registry:
  registry_type: sql
  # Illustrative SQLAlchemy connection string; any supported database works.
  path: postgresql://postgres:mysecretpassword@127.0.0.1:55001/feast
```