
docs: Feature Repo, Types and FAQ doc updates #3049

Merged: 2 commits, Aug 9, 2022
15 changes: 12 additions & 3 deletions docs/getting-started/concepts/data-ingestion.md
@@ -1,6 +1,6 @@
# Data ingestion

### Data source
## Data source

The data source refers to the raw underlying data (e.g. a table in BigQuery).

@@ -18,7 +18,7 @@ Feast supports primarily **time-stamped** tabular data as data sources. There ar
* **\[Alpha] Stream sources** allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.
* **(Experimental) Request data sources:** This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into **on-demand feature views**, which allow lightweight feature engineering and combining features across sources.

### Batch data ingestion
## Batch data ingestion

Ingesting from batch sources is only necessary to power real-time models. This is done through **materialization**. Under the hood, Feast manages an _offline store_ (to scalably generate training data from batch sources) and an _online store_ (to provide low-latency access to features for real-time models).

@@ -58,6 +58,8 @@ materialize_python = PythonOperator(

<summary>Code example: CLI based materialization</summary>



#### How to run this in the CLI

```bash
# Illustrative: incrementally materialize features up to the current time.
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```

@@ -77,7 +79,14 @@ materialize_bash = BashOperator(

</details>

### Stream data ingestion
### Batch data schema inference

If the `schema` parameter is not specified when defining a data source, Feast attempts to infer the schema of the data source during `feast apply`.
How it does so depends on the offline store implementation. For the offline stores that ship with Feast out of the box, this inference is performed by inspecting the schema of the table in the cloud data warehouse, or, if a query is provided to the source, by running the query with a `LIMIT` clause and inspecting the result.
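
For example, a source defined without a `schema` leaves inference to `feast apply`. A minimal sketch (the source name, table, and timestamp column below are illustrative):

```python
from feast import BigQuerySource

# No `schema` argument: during `feast apply`, Feast infers column names
# and types by inspecting the warehouse table (or, for a query-based
# source, by running the query with a LIMIT clause).
driver_stats_source = BigQuerySource(
    name="driver_stats",                         # illustrative name
    table="my_project.my_dataset.driver_stats",  # illustrative table
    timestamp_field="event_timestamp",
)
```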


## Stream data ingestion

Ingesting from stream sources happens either via a Push API or via a contrib processor that leverages an existing Spark context.

14 changes: 14 additions & 0 deletions docs/getting-started/concepts/feast-types.md
@@ -0,0 +1,14 @@
# Data Types in Feast

Feast frequently has to mediate data across platforms and systems, each with its own unique type system.
To make this possible, Feast itself has a type system for all the types it is able to handle natively.

Feast's type system is built on top of [protobuf](https://github.com/protocolbuffers/protobuf). The messages that make up the type system can be found [here](https://github.com/feast-dev/feast/blob/master/protos/feast/types/Value.proto), and the corresponding Python classes that wrap them can be found [here](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/types.py).

Feast supports primitive data types (numerical values, strings, bytes, booleans, and timestamps). The only complex data type Feast supports is the array, and arrays cannot contain other arrays.
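
As a sketch of how these types appear in user code (the field names are illustrative):

```python
from feast import Field
from feast.types import Array, Float32, Int64, String

# Primitives plus Array; nested arrays (e.g. Array(Array(String)))
# are not supported.
fields = [
    Field(name="driver_id", dtype=Int64),
    Field(name="conv_rate", dtype=Float32),
    Field(name="recent_trip_cities", dtype=Array(String)),
]
```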

Each feature or schema field in Feast is associated with a data type, which is stored in Feast's [registry](registry.md). These types are also used to ensure that Feast operates on values correctly (e.g. making sure that timestamp columns used for [point-in-time correct joins](point-in-time-joins.md) actually have the timestamp type).

As a result, each system that Feast interacts with needs a way to translate data types from the native platform into a Feast type. For example, Snowflake SQL types are converted to Feast types [here](https://rtd.feast.dev/en/master/feast.html#feast.type_map.snowflake_python_type_to_feast_value_type). The onus is therefore on authors of offline or online store connectors to make sure that this type mapping happens correctly.

**Note**: Feast currently does *not* support a null type in its type system.
13 changes: 13 additions & 0 deletions docs/getting-started/concepts/feature-repo.md
@@ -0,0 +1,13 @@
# Feature Repository
**Collaborator:** @achals oops just realized we already have a page for feature repos

we should make sure to merge this in with that page somehow

**Member (Author):** I'll merge


## Feature Repo

A feature repository is a collection of Python files that define entities, feature views, and data sources. A feature repo also has a `feature_store.yaml` file at its root.

Users can collaborate by making and reviewing changes to Feast object definitions (feature views, entities, etc.) in the feature repo.
These objects must then be applied, through either the API or the CLI, before they are available to downstream Feast operations (such as materialization or retrieving online features). Internally, Feast looks only at the registry when performing these operations, not at the feature repo directly.
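
A minimal sketch of one such file, assuming a recent Feast release (all names and paths are illustrative):

```python
# example_repo.py -- one Python file inside a feature repo
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=driver_stats_source,
)
```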

## Declarative Feature Definitions

When using the CLI to apply changes (via `feast apply`), the CLI determines the state of the feature repo from the source files and updates the registry state to reflect the definitions in the feature repo files.
This means that new feature views are added to the registry, existing feature views are updated as necessary, and Feast objects removed from the source files are deleted from the registry.
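
Applying can also be done programmatically through the Python SDK. A hedged sketch, reusing the illustrative objects from the repo file above:

```python
from feast import FeatureStore

# Equivalent in spirit to running `feast apply` in the repo root:
# registers the objects in the registry so that materialization and
# online retrieval can see them.
store = FeatureStore(repo_path=".")
store.apply([driver, driver_stats_source, driver_stats_fv])
```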
2 changes: 2 additions & 0 deletions docs/getting-started/concepts/feature-view.md
@@ -150,6 +150,8 @@ Together with [data sources](data-ingestion.md), they indicate to Feast where to

Feature names must be unique within a [feature view](feature-view.md#feature-view).

Each field can have additional metadata associated with it, specified as key-value [tags](https://rtd.feast.dev/en/master/feast.html#feast.field.Field).
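
A sketch of attaching tags to a field (the keys and values are illustrative):

```python
from feast import Field
from feast.types import Float32

conv_rate = Field(
    name="conv_rate",
    dtype=Float32,
    # Arbitrary key-value metadata stored alongside the field.
    tags={"owner": "ride-team", "description": "Driver conversion rate"},
)
```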

## \[Alpha] On demand feature views

On demand feature views allow data scientists to use existing features and request-time data (features only available at request time) to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths, as sketched below.
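
A minimal sketch of an on-demand feature view, assuming an existing feature view `driver_stats_fv` and a request source (exact decorator arguments vary across Feast versions):

```python
import pandas as pd

from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64, Int64

# Request-time input: only available when the prediction request arrives.
vals_to_add = RequestSource(
    name="vals_to_add",
    schema=[Field(name="val_to_add", dtype=Int64)],
)

@on_demand_feature_view(
    sources=[driver_stats_fv, vals_to_add],  # driver_stats_fv assumed to exist
    schema=[Field(name="conv_rate_plus_val", dtype=Float64)],
)
def transformed_conv_rate(features_df: pd.DataFrame) -> pd.DataFrame:
    # Runs in both the historical and online retrieval paths.
    df = pd.DataFrame()
    df["conv_rate_plus_val"] = features_df["conv_rate"] + features_df["val_to_add"]
    return df
```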
13 changes: 12 additions & 1 deletion docs/getting-started/faq.md
@@ -46,7 +46,7 @@ It is a good idea though to lock down the registry file so only the CI/CD pipeli

### Does Feast support streaming sources?

Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support [push based ingestion](../reference/data-sources/push.md). Streaming transformations are actively being worked on.
Yes. In earlier versions of Feast, we used Feast Spark to manage ingestion from stream sources. In the current version of Feast, we support [push based ingestion](../reference/data-sources/push.md). Feast also defines a [stream processor](../tutorials/building-streaming-features.md) that allows a deeper integration with stream sources.
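
As a sketch, pushing a freshly computed row through the Push API might look like this (the push source name and columns are illustrative):

```python
import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

event_df = pd.DataFrame({
    "driver_id": [1001],
    "event_timestamp": [pd.Timestamp.now(tz="UTC")],
    "conv_rate": [0.85],
})

# Writes the rows to the online store via the registered push source.
store.push("driver_stats_push_source", event_df)
```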

### Does Feast support feature transformation?

@@ -94,6 +94,17 @@ The list of supported offline and online stores can be found [here](../reference

Yes. Using a GCP or AWS provider in `feature_store.yaml` primarily sets default offline / online stores and configures where the remote registry file can live (Using the AWS provider also allows for deployment to AWS Lambda). You can override the offline and online stores to be in different clouds if you wish.

### What is the difference between a data source and an offline store?

The data source and the offline store are closely tied but distinct concepts.
The offline store controls how Feast talks to a data store for historical feature retrieval, while a data source points to a specific table (or query) within a data store. Offline stores are infrastructure-level connectors to data stores like Snowflake (see the sketch after this list).

Additional differences:

- Data sources may be specific to a project (e.g. feed ranking), but offline stores are agnostic and used across projects.
- A Feast project may define several data sources that power different feature views, but a Feast project has a single offline store.
- Feast users typically need to define data sources when using Feast, but only need to use or configure existing offline stores rather than creating new ones.
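
To make the contrast concrete, a hedged sketch of a data source declaration (the offline store itself would be configured separately in `feature_store.yaml`; all names below are illustrative):

```python
from feast import SnowflakeSource

# Points at one specific table; the Snowflake offline store configured
# in feature_store.yaml is the connector that executes queries against it.
driver_stats = SnowflakeSource(
    database="FEAST",
    schema="PUBLIC",
    table="DRIVER_STATS",
    timestamp_field="EVENT_TIMESTAMP",
)
```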

### How can I add a custom online store?

Please follow the instructions [here](../how-to-guides/customizing-feast/adding-support-for-a-new-online-store.md).
5 changes: 5 additions & 0 deletions docs/tutorials/using-scalable-registry.md
@@ -13,6 +13,11 @@ However, there are inherent limitations with a file-based registry, since changing

An alternative to the file-based registry is the [SQLRegistry](https://rtd.feast.dev/en/latest/feast.infra.registry_stores.html#feast.infra.registry_stores.sql.SqlRegistry) which ships with Feast. This implementation stores the registry in a relational database, and allows for changes to individual objects atomically.
Under the hood, the SQL Registry implementation uses [SQLAlchemy](https://docs.sqlalchemy.org/en/14/) to abstract over the different databases. Consequently, any [database supported](https://docs.sqlalchemy.org/en/14/core/engines.html#supported-databases) by SQLAlchemy can be used by the SQL Registry.
The following databases are supported and tested out of the box:
- PostgreSQL
- MySQL
- SQLite

Feast can use the SQL Registry via a config change in the `feature_store.yaml` file. An example configuration:

```yaml
registry:
  registry_type: sql
  # Illustrative SQLAlchemy connection string; any supported database works.
  path: postgresql://postgres:mysecretpassword@127.0.0.1:55001/feast
```