[HOPSWORKS-3248] rename on_demand_feature_group to external_feature_group (#669)

davitbzh authored Jul 12, 2022
1 parent 2102bfa commit c2a86ca

Showing 26 changed files with 396 additions and 236 deletions.
6 changes: 3 additions & 3 deletions auto_doc.py
@@ -42,9 +42,9 @@
        ],
    ),
},
-    "on_demand_feature_group.md": {
-        "fg_create": ["hsfs.feature_store.FeatureStore.create_on_demand_feature_group"],
-        "fg_get": ["hsfs.feature_store.FeatureStore.get_on_demand_feature_group"],
+    "external_feature_group.md": {
+        "fg_create": ["hsfs.feature_store.FeatureStore.create_external_feature_group"],
+        "fg_get": ["hsfs.feature_store.FeatureStore.get_external_feature_group"],
    "fg_properties": keras_autodoc.get_properties(
        "hsfs.feature_group.OnDemandFeatureGroup"
    ),
2 changes: 1 addition & 1 deletion docs/integrations/hdinsight.md
@@ -90,7 +90,7 @@ The Hadoop and Spark installations of the HDInsight cluster need to be configured

!!! attention "Using Hive and the Feature Store"

-    HDInsight clusters cannot use their local Hive when being configured for the Feature Store as the Feature Store relies on custom Hive binaries and its own Metastore which will overwrite the local one. If you rely on Hive for feature engineering then it is advised to write your data to an external data storage such as ADLS from your main HDInsight cluster and in the Feature Store, create an [on-demand](https://docs.hopsworks.ai/overview/#feature-groups) Feature Group on the storage container in ADLS.
+    HDInsight clusters cannot use their local Hive when being configured for the Feature Store as the Feature Store relies on custom Hive binaries and its own Metastore which will overwrite the local one. If you rely on Hive for feature engineering then it is advised to write your data to an external data storage such as ADLS from your main HDInsight cluster and in the Feature Store, create an [external](https://docs.hopsworks.ai/overview/#feature-groups) Feature Group on the storage container in ADLS.

Hadoop hadoop-env.sh:
```
5 changes: 2 additions & 3 deletions docs/integrations/storage-connectors.md
@@ -1,6 +1,6 @@
# Storage Connectors

-You can define storage connectors in Hopsworks for batch and streaming data sources. Storage connectors securely store information in Hopsworks about how to securely connect to external data stores. They can be used in both programs and in Hopsworks to easily and securely connect and ingest data to the Feature Store. External (on-demand) Feature Groups can also be defined with storage connectors, where only the metadata is stored in Hopsworks.
+You can define storage connectors in Hopsworks for batch and streaming data sources. Storage connectors securely store information in Hopsworks about how to securely connect to external data stores. They can be used in both programs and in Hopsworks to easily and securely connect and ingest data to the Feature Store. External Feature Groups can also be defined with storage connectors, where only the metadata is stored in Hopsworks.

Storage connectors provide two main mechanisms for authentication: using credentials or an authentication role (IAM Role on AWS or Managed Identity on Azure). Hopsworks supports both a single IAM role (AWS) or Managed Identity (Azure) for the whole Hopsworks cluster or more advanced multiple IAM roles (AWS) or Managed Identities (Azure) that can only be assumed by users with a specific role in a specific project.

@@ -21,9 +21,8 @@ Storage connectors provide two main mechanisms for authentication: using credentials

## Programmatic Connectors (Spark, Python, Java/Scala, Flink)

It is also possible to use the rich ecosystem of connectors available in programs run on Hopsworks. Spark alone has dozens of open-source libraries for connecting to relational databases, key-value stores, file systems, object stores, search databases, and graph databases. In Hopsworks, you can securely save your credentials as secrets, and securely access them with API calls when you need to connect to your external store.

## Next Steps

For more information about how to use the Feature Store, see the [Quickstart Guide](../quickstart.md).

2 changes: 1 addition & 1 deletion docs/integrations/storage-connectors/snowflake.md
@@ -1,6 +1,6 @@
Snowflake is a popular cloud-native data warehouse service and supports scalable feature computation with SQL. However, Snowflake is not viable as an online feature store that serves features to models in production: with its columnar database layout, its latency is too high compared to OLTP databases or key-value stores.

-To interact with Snowflake and to register and read [external feature groups](../../../generated/on_demand_feature_group) users need to define a storage connector using the UI:
+To interact with Snowflake and to register and read [external feature groups](../../../generated/external_feature_group) users need to define a storage connector using the UI:

<p align="center">
<figure>
4 changes: 2 additions & 2 deletions docs/setup.md
@@ -8,8 +8,8 @@ If you are using Spark or Python within Hopsworks, there is no further configuration

## Storage Connectors

-Storage connectors encapsulate the configuration information needed for a Spark or Python execution engine to securely read and write to a specific storage. The [storage connector guide](integrations/storage-connectors.md) explains step by step how to configure different data sources (such as S3, Azure Data Lake, Redshift, Snowflake, any JDBC data source) and how they can be used to ingest data and define external (on-demand) Feature Groups.
+Storage connectors encapsulate the configuration information needed for a Spark or Python execution engine to securely read and write to a specific storage. The [storage connector guide](integrations/storage-connectors.md) explains step by step how to configure different data sources (such as S3, Azure Data Lake, Redshift, Snowflake, any JDBC data source) and how they can be used to ingest data and define external Feature Groups.

## Databricks

Connecting to the Feature Store from Databricks requires setting up a Feature Store API Key for Databricks and installing one of the HSFS client libraries on your Databricks cluster. The [Databricks integration guide](integrations/databricks/configuration.md) explains step by step how to connect to the Feature Store from Databricks.
105 changes: 105 additions & 0 deletions docs/templates/external_feature_group.md
@@ -0,0 +1,105 @@
# External Feature Groups

External Feature Groups are Feature Groups for which the data is stored on an external storage system (e.g. a data warehouse, S3, ADLS).
From an API perspective, external feature groups can be used in the same way as regular feature groups. Users can pick features from external feature groups to create training datasets. External feature groups can also be used as a data source to create derived features, meaning features on which additional feature engineering is applied.

External feature groups rely on [Storage Connectors](../../integrations/storage-connectors/) to identify the location of the data and to authenticate with the external storage.
When the external feature group is defined on top of an external database capable of running SQL statements (i.e. when using the JDBC, Redshift or Snowflake connectors), the external feature group needs to be defined as a SQL statement. SQL statements can contain feature engineering transformations; when the external feature group is read, the SQL statement is pushed down to the storage for execution.

=== "Python"

!!! example "Define a SQL based external feature group"
```python
# Retrieve the storage connector defined before
redshift_conn = fs.get_storage_connector("telco_redshift_cluster")
telco_ext = fs.create_external_feature_group(name="telco_redshift",
                                             version=1,
                                             query="select * from telco",
                                             description="External feature group for telecom customer data",
                                             storage_connector=redshift_conn,
                                             statistics_config=True)
telco_ext.save()
```

=== "Scala"

!!! example "Define a SQL based external feature group"
```scala
val redshiftConn = fs.getRedshiftConnector("telco_redshift_cluster")
val telcoExt = (fs.createExternalFeatureGroup()
  .name("telco_redshift_scala")
  .version(2)
  .query("select * from telco")
  .description("External feature group for telecom customer data")
  .storageConnector(redshiftConn)
  .statisticsConfig(new StatisticsConfig(true, true, true, false))
  .build())
telcoExt.save()
```
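
Because the SQL statement is executed by the external store, feature engineering can be embedded directly in the query. A hedged sketch (the table and column names below are hypothetical, not part of this commit):

```python
# The aggregation below runs inside Redshift whenever the feature group is read.
sql = """
select customer_id,
       avg(monthly_charges) as avg_monthly_charges,
       count(*) as num_contracts
from telco
group by customer_id
"""

telco_agg = fs.create_external_feature_group(name="telco_agg",
                                             version=1,
                                             query=sql,
                                             description="Features aggregated inside Redshift",
                                             storage_connector=redshift_conn)
telco_agg.save()
```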


When defining an external feature group on top of an object store or external filesystem (i.e. when using the S3 or the ADLS connector), the underlying data is required to have a schema. The underlying data can be stored in ORC, Parquet, Delta, Hudi or Avro, and the schema for the feature group will be extracted from the files' metadata.

=== "Python"

!!! example "Define a data format based external feature group"
```python
# Retrieve the storage connector defined before
s3_conn = fs.get_storage_connector("telco_s3_bucket")
telco_ext = fs.create_external_feature_group(name="telco_s3",
                                             version=1,
                                             data_format="parquet",
                                             description="External feature group for telecom customer data",
                                             storage_connector=s3_conn,
                                             statistics_config=True)
telco_ext.save()
```

=== "Scala"

!!! example "Define a data format based external feature group"
```scala
val s3Conn = fs.getS3Connector("telco_s3_bucket")
val telcoExt = (fs.createExternalFeatureGroup()
  .name("telco_s3")
  .version(1)
  .dataFormat(ExternalDataFormat.PARQUET)
  .description("External feature group for telecom customer data")
  .storageConnector(s3Conn)
  .statisticsConfig(new StatisticsConfig(true, true, true, false))
  .build())
telcoExt.save()
```

## Use cases

There are two use cases in which a user can benefit from external feature groups:

- **Existing feature engineering pipelines**: users who have recently migrated to the Hopsworks Feature Store may have existing feature engineering pipelines in production. They can register the output of these existing pipelines as external feature groups in Hopsworks and immediately use their features to build training datasets. With external feature groups, users do not have to modify the existing pipelines to write to the Hopsworks Feature Store.

- **Data Ingestion**: external feature groups can be used as a data source. The benefit of using external feature groups to ingest data from external sources is that the Hopsworks Feature Store keeps track of where the data is located and how to authenticate with the external storage system. In addition, the Hopsworks Feature Store also tracks the schema of the underlying data and makes sure that, if the underlying schema changes, the ingestion pipeline fails with a clear error. A sketch of such an ingestion job follows this list.
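
A minimal sketch of an ingestion job, assuming a Spark engine (reading external feature groups is not supported from the Python engine, see the warning below) and hypothetical feature group and column names:

```python
# Read from the external feature group; the SQL statement backing it is
# pushed down to the external store and the result arrives as a DataFrame.
telco_ext = fs.get_external_feature_group("telco_redshift", version=1)
df = telco_ext.read()

# Hypothetical derived feature computed during ingestion.
derived_df = df.withColumn("monthly_charges_usd", df.monthly_charges / 100)

# Persist the result as a regular (cached) feature group in Hopsworks.
telco_derived = fs.create_feature_group(name="telco_derived",
                                        version=1,
                                        primary_key=["customer_id"],
                                        description="Derived telecom customer features")
telco_derived.save(derived_df)
```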

## Limitations

Hopsworks Feature Store does not support time-travel capabilities for external feature groups. Moreover, as the data resides on external systems, external feature groups cannot be made available online for low-latency serving. To make data from an external feature group available online, users need to define an online-enabled feature group and have a job that periodically reads data from the external feature group and writes it to the online feature group, as sketched below.
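
A hedged sketch of such a synchronization job (Spark engine assumed; the online feature group and its schema are hypothetical):

```python
# Periodically scheduled job: copy the latest data from the external
# feature group into an online-enabled feature group.
telco_ext = fs.get_external_feature_group("telco_redshift", version=1)

# Assumes an online-enabled feature group with a matching schema already exists.
telco_online = fs.get_feature_group("telco_online", version=1)

# insert() writes to the offline store and, because the feature group is
# online-enabled, also refreshes the online store used for serving.
telco_online.insert(telco_ext.read())
```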

!!! warning "Python support"

Currently the HSFS library does not support calling the `read()` or `show()` methods on external feature groups. Likewise, it is not possible to call the `read()` or `show()` methods on queries containing external feature groups.
Nevertheless, external feature groups can be used from a Python engine to create training datasets.
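
For example, a minimal sketch of creating a training dataset from an external feature group with the Python engine (the feature and dataset names are hypothetical):

```python
# Selecting features only builds query metadata; no data is read
# by the Python client itself.
telco_ext = fs.get_external_feature_group("telco_redshift", version=1)
query = telco_ext.select(["customer_id", "monthly_charges"])

# Materializing the training dataset is executed on the cluster.
td = fs.create_training_dataset(name="churn_training_data",
                                version=1,
                                data_format="csv")
td.save(query)
```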

## Creation

{{fg_create}}

## Retrieval

{{fg_get}}

## Properties

{{fg_properties}}

## Methods

{{fg_methods}}
105 changes: 0 additions & 105 deletions docs/templates/on_demand_feature_group.md

This file was deleted.

@@ -16,7 +16,7 @@

package com.logicalclocks.hsfs;

-public enum OnDemandDataFormat {
+public enum ExternalDataFormat {
  ORC,
  PARQUET,
  AVRO,
