diff --git a/docs/concepts/feathr-capabilities.md b/docs/concepts/feathr-capabilities.md index 6b4c5df23..65881d2ea 100644 --- a/docs/concepts/feathr-capabilities.md +++ b/docs/concepts/feathr-capabilities.md @@ -143,8 +143,8 @@ schema = AvroJsonSchema(schemaStr=""" } """) stream_source = KafKaSource(name="kafkaStreamingSource", - kafkaConfig=KafkaConfig(brokers=["feathrazureci.servicebus.windows.net:9093"], - topics=["feathrcieventhub"], + kafkaConfig=KafkaConfig(brokers=[".servicebus.windows.net:9093"], + topics=[""], schema=schema) ) diff --git a/docs/concepts/feature-definition.md b/docs/concepts/feature-definition.md index 51ddb6742..7116d6630 100644 --- a/docs/concepts/feature-definition.md +++ b/docs/concepts/feature-definition.md @@ -23,7 +23,7 @@ See an examples below: ```python batch_source = HdfsSource(name="nycTaxiBatchSource", - path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", + path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", event_timestamp_column="lpep_dropoff_datetime", timestamp_format="yyyy-MM-dd HH:mm:ss") ``` diff --git a/docs/concepts/feature-registry.md b/docs/concepts/feature-registry.md index fe56218b3..9bc00b275 100644 --- a/docs/concepts/feature-registry.md +++ b/docs/concepts/feature-registry.md @@ -61,7 +61,7 @@ Alternatively, you can set the feature registry and the API endpoint in the conf ```yaml feature_registry: # The API endpoint of the registry service - api_endpoint: "https://feathr-sql-registry.azurewebsites.net/api/v1" + api_endpoint: "https://.azurewebsites.net/api/v1" ``` ### Register and List Features diff --git a/docs/concepts/materializing-features.md b/docs/concepts/materializing-features.md index 13466427c..28d824525 100644 --- a/docs/concepts/materializing-features.md +++ b/docs/concepts/materializing-features.md @@ -96,7 +96,7 @@ The API call is very similar to materializing features to online 
store, and here ```python client = FeathrClient() -offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/") +offlineSink = HdfsSink(output_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/materialize_offline_test_data/") # Materialize two features into a Offline store. settings = MaterializationSettings("nycTaxiMaterializationJob", sinks=[offlineSink], @@ -121,14 +121,14 @@ settings = MaterializationSettings("nycTaxiTable", ``` This will materialize features with cutoff time from `2020/05/10` to `2020/05/20` correspondingly, and the output will have 11 folders, from -`abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily step (i.e. even if you specify an hourly step, the generated features in offline store will still be presented in a daily hierarchy). For more details on how `BackfillTime` works, refer to the [BackfillTime section](#feature-backfill) above. +`abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/10` to `abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`. Note that currently Feathr only supports materializing data in daily step (i.e. even if you specify an hourly step, the generated features in offline store will still be presented in a daily hierarchy). For more details on how `BackfillTime` works, refer to the [BackfillTime section](#feature-backfill) above. You can also specify the format of the materialized features in the offline store by using `execution_configurations` like below. 
Please refer to the [documentation](../how-to-guides/feathr-job-configuration.md) here for those configuration details. ```python from feathr import HdfsSink -offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_data/") +offlineSink = HdfsSink(output_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/materialize_offline_data/") # Materialize two features into a Offline store. settings = MaterializationSettings("nycTaxiMaterializationJob", sinks=[offlineSink], @@ -141,7 +141,7 @@ For reading those materialized features, Feathr has a convenient helper function ```python from feathr import get_result_df -path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/" +path = "abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/" res = get_result_df(client=client, format="parquet", res_url=path) ``` diff --git a/docs/concepts/point-in-time-join.md b/docs/concepts/point-in-time-join.md index 910352690..1e51bddb6 100644 --- a/docs/concepts/point-in-time-join.md +++ b/docs/concepts/point-in-time-join.md @@ -105,12 +105,12 @@ And below shows the join definitions: feature_query = FeatureQuery( feature_list=["feature_X"], key=UserId) settings = ObservationSettings( - observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", + observation_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", event_timestamp_column="Date", timestamp_format="MM/DD") client.get_offline_features(observation_settings=settings, feature_query=feature_query, - output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro") + output_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/demo_data/output.avro") ``` ## 
Advanced Point-in-time Lookup diff --git a/docs/quickstart_synapse.md b/docs/quickstart_synapse.md index 5cc2830a5..5dee17931 100644 --- a/docs/quickstart_synapse.md +++ b/docs/quickstart_synapse.md @@ -164,8 +164,8 @@ The following feature join config is used: ```python feature_query = [FeatureQuery(feature_list=["f_location_avg_fare"], key=["DOLocationID"])] settings = ObservationSettings( - observation_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", - output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro", + observation_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv", + output_path="abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/demo_data/output.avro", event_timestamp_column="lpep_dropoff_datetime", timestamp_format="yyyy-MM-dd HH:mm:ss") client.get_offline_features(feature_query=feature_query, observation_settings=settings) ``` diff --git a/docs/samples/customer360/Customer360.ipynb b/docs/samples/customer360/Customer360.ipynb index 664ae5b3e..4b202e13a 100644 --- a/docs/samples/customer360/Customer360.ipynb +++ b/docs/samples/customer360/Customer360.ipynb @@ -1,34 +1,30 @@ { - "cells":[ + "cells": [ { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"a89791bc-cfc2-4105-a541-a3392af3c314", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "a89791bc-cfc2-4105-a541-a3392af3c314", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Feathr Feature Store For Customer360 on Azure - Demo Notebook" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"f4072c36-b190-4c8a-af43-dc004854aea4", - "showTitle":false, - "title":"" + 
"cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "f4072c36-b190-4c8a-af43-dc004854aea4", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "This notebook illustrates the use of Feathr Feature Store to create one of the use case for Customer 360. This usecase predicts Sales amount by the Discount offered. It includes following steps:\n", " \n", "1. Install and set up Feathr with Azure\n", @@ -45,18 +41,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"1632aaa6-35de-4d7f-9f88-ecfb1f927bb0", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1632aaa6-35de-4d7f-9f88-ecfb1f927bb0", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Prerequisite: Provision cloud resources\n", "\n", "First step is to provision required cloud resources if you want to use Feathr. Feathr provides a python based client to interact with cloud resources.\n", @@ -70,40 +64,34 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"80223a8e-8901-421c-b63d-4e11a6da5d88", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "80223a8e-8901-421c-b63d-4e11a6da5d88", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Sample Dataset\n", "\n", "In this demo, we use Feathr Feature Store to showcase Customer360 Features using Feathr. The dataset can be mounted onto a azure blob storage account and seen by executing the following command. 
The dataset is present in the current directory and it is referenced from [here](https://community.tableau.com/s/question/0D54T00000CWeX8SAL/sample-superstore-sales-excelxls)" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"d38f6dc4-51f7-44cd-a82d-cd08e08260e4", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "d38f6dc4-51f7-44cd-a82d-cd08e08260e4", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "\n", "key = \"blobstorekey\"\n", "acnt = \"studiofeathrazuredevsto\"\n", @@ -134,56 +122,48 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"32261356-9c9e-4988-9754-ad6fc1c447e1", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "32261356-9c9e-4988-9754-ad6fc1c447e1", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Prerequisite: Install Feathr\n", "\n", "Install Feathr using pip:" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"5c988222-113b-49b2-8069-d5a44a9cb05b", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "5c988222-113b-49b2-8069-d5a44a9cb05b", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "! 
pip install --force-reinstall git+https://github.com/linkedin/feathr.git@registry_fix#subdirectory=feathr_project pandavro scikit-learn" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"1d87942f-db42-48cd-bf8f-f79c3214ce92", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1d87942f-db42-48cd-bf8f-f79c3214ce92", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Prerequisite: Configure the required environment\n", "\n", "In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. If you use Feathr CLI to create a workspace, you should have a folder with a file called `feathr_config.yaml` in it with all the required configurations. Otherwise, update the configuration below.\n", @@ -192,22 +172,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"571bc437-8a46-4f7f-83aa-2bf50e5c5cbb", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "571bc437-8a46-4f7f-83aa-2bf50e5c5cbb", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "import tempfile\n", "yaml_config = \"\"\"\n", "\n", @@ -235,13 +211,13 @@ " s3_endpoint: 's3.amazonaws.com'\n", " jdbc:\n", " jdbc_enabled: false\n", - " jdbc_database: 'feathrtestdb'\n", - " jdbc_table: 'feathrtesttable'\n", + " jdbc_database: ''\n", + " jdbc_table: ''\n", " snowflake:\n", " snowflake_enabled: false\n", - " url: \"dqllago-ol19457.snowflakecomputing.com\"\n", - " user: \"feathrintegration\"\n", - " role: \"ACCOUNTADMIN\"\n", + " url: \".snowflakecomputing.com\"\n", + " user: \"\"\n", + " role: \"\"\n", 
"spark_config:\n", " spark_cluster: 'databricks'\n", " spark_result_output_parts: '1'\n", @@ -252,20 +228,20 @@ " executor_size: 'Small'\n", " executor_num: 1\n", " databricks:\n", - " workspace_instance_url: \"https://adb-6578934.54.azuredatabricks.net/\"\n", + " workspace_instance_url: \"https://.azuredatabricks.net/\"\n", " workspace_token_value: \"\"\n", " config_template: '{\"run_name\":\"\",\"new_cluster\":{\"spark_version\":\"9.1.x-scala2.12\",\"node_type_id\":\"Standard_D3_v2\",\"num_workers\":2,\"spark_conf\":{}},\"libraries\":[{\"jar\":\"\"}],\"spark_jar_task\":{\"main_class_name\":\"\",\"parameters\":[\"\"]}}'\n", " \n", " work_dir: 'dbfs:/customer360'\n", "online_store:\n", " redis:\n", - " host: 'studio-feathrazure-dev-redis.redis.cache.windows.net'\n", + " host: '.redis.cache.windows.net'\n", " port: 6380\n", " ssl_enabled: True\n", "feature_registry:\n", " purview:\n", " type_system_initialization: true\n", - " purview_name: 'studio-feathrazure-dev-pview'\n", + " purview_name: ''\n", " delimiter: '__'\n", "\"\"\"\n", "# write this configuration string to a temporary location and load it to Feathr\n", @@ -275,38 +251,32 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b37406db-23a6-40c4-966f-ccc0f8a3c853", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b37406db-23a6-40c4-966f-ccc0f8a3c853", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Import necessary libraries" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"3a5438fa-42fa-40eb-9a4e-d24ac68f9042", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": 
"3a5438fa-42fa-40eb-9a4e-d24ac68f9042", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "import glob\n", "import os\n", "import tempfile\n", @@ -331,40 +301,34 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"3bda4f77-8418-460f-83ad-bb442f9a0525", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "3bda4f77-8418-460f-83ad-bb442f9a0525", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Setup necessary environment variables\n", "\n", "You have to setup the environment variables in order to run this sample. More environment variables can be set by referring to [feathr_config.yaml](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) and use that as the source of truth. It should also have more explanations on the meaning of each variable." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"041041b0-ac69-4ab5-a993-509471bf334c", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "041041b0-ac69-4ab5-a993-509471bf334c", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "import os\n", "os.environ['REDIS_PASSWORD'] = ''\n", "os.environ['AZURE_CLIENT_ID'] = ''\n", @@ -378,75 +342,63 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"ec09a17d-ec64-4b9f-999f-9a71a508eaed", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "ec09a17d-ec64-4b9f-999f-9a71a508eaed", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Initialize a feathr client" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"88aec1c1-2bdc-42d2-918d-48c0e28fdd0f", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "88aec1c1-2bdc-42d2-918d-48c0e28fdd0f", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client = FeathrClient(config_path=tmp.name)" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"64f272f0-7008-4de7-89fe-a9d32f5573a0", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "64f272f0-7008-4de7-89fe-a9d32f5573a0", + 
"showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Define Sources Section\n", "A feature source is needed for anchored features that describes the raw data in which the feature values are computed from. See the python documentation to get the details on each input column." ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"6162dfd7-0791-4e9b-8200-da7710272c1e", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "6162dfd7-0791-4e9b-8200-da7710272c1e", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "batch_source = HdfsSource(name=\"cosmos_final_data\",\n", " path=\"abfss://container@blobaccountname.dfs.core.windows.net/data/customer360.csv\",\n", " event_timestamp_column=\"sales_order_dt\",\n", @@ -454,56 +406,48 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"9d08913f-e416-46e3-9bf3-31f50e41139f", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "9d08913f-e416-46e3-9bf3-31f50e41139f", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Defining Features with Feathr:\n", "In Feathr, a feature is viewed as a function, mapping from entity id or key, and timestamp to a feature value." 
] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"aa303679-7be2-430b-8194-19c90a28c4af", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "aa303679-7be2-430b-8194-19c90a28c4af", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Define Anchors and Features\n", "A feature is called an anchored feature when the feature is directly extracted from the source data, rather than computed on top of other features." ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"9d95d006-8d9a-4e63-b7b0-2c88a166c6cb", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "9d95d006-8d9a-4e63-b7b0-2c88a166c6cb", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "f_sales_cust_id = Feature(name = \"f_sales_cust_id\",\n", " feature_type = STRING, transform = \"sales_cust_id\" )\n", "\n", @@ -538,39 +482,33 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b84cebbe-884f-4665-9df7-dc3a16037fc5", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b84cebbe-884f-4665-9df7-dc3a16037fc5", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Define Derived Features\n", "Derived features are the features that are computed from other features. They could be computed from anchored features, or other derived features." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"d620ae76-b8ed-4bfe-a0dd-2e50ffd79212", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "d620ae76-b8ed-4bfe-a0dd-2e50ffd79212", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "f_total_sales_amount = DerivedFeature(name = \"f_total_sales_amount\",\n", " feature_type = FLOAT,\n", " input_features = [f_sales_item_quantity,f_sales_sell_price],\n", @@ -589,40 +527,34 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"3230211b-c978-44d3-9996-edd53fa952f0", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "3230211b-c978-44d3-9996-edd53fa952f0", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Define Aggregate features and anchor the features to batch source.\n", "\n", "Note that if the data source is from the observation data, the source section should be INPUT_CONTEXT to indicate the source of those defined anchors." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"ef204054-6638-4ff5-ba46-330256f553ed", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "ef204054-6638-4ff5-ba46-330256f553ed", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "customer_ID = TypedKey(key_column=\"sales_cust_id\",\n", " key_column_type=ValueType.INT32,\n", " description=\"customer ID\",\n", @@ -655,112 +587,94 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"1e803a63-80b6-40ad-9419-422ca1db3d97", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1e803a63-80b6-40ad-9419-422ca1db3d97", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Building Features\n", "And then we need to build those features so that it can be consumed later. Note that we have to build both the \"anchor\" and the \"derived\" features (which is not anchored to a source)." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"949a79bd-e4e9-487a-9a5c-b04cdecba3b3", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "949a79bd-e4e9-487a-9a5c-b04cdecba3b3", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.build_features(anchor_list=[request_anchor,agg_anchor], derived_feature_list=[f_total_sales_amount, f_total_sales_discount,f_total_amount_paid])" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"684aef42-53e1-4548-b604-9a581abda253", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "684aef42-53e1-4548-b604-9a581abda253", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Registering Features\n", "We can also register the features with an Apache Atlas compatible service, such as Azure Purview, and share the registered features across teams:" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"c486ea65-9ee9-4f73-aa61-33873ade8fae", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "c486ea65-9ee9-4f73-aa61-33873ade8fae", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.register_features()" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"00f5b4d9-5054-4a53-a511-2b8380c08ef5", - 
"showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "00f5b4d9-5054-4a53-a511-2b8380c08ef5", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.list_registered_features(project_name=\"customer360\")" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"30e09585-3917-4d3b-8681-15360ad74972", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "30e09585-3917-4d3b-8681-15360ad74972", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Create training data using point-in-time correct feature join\n", "A training dataset usually contains entity id columns, multiple feature columns, event timestamp column and label/target column.\n", "\n", @@ -768,22 +682,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"2c2baf2c-835f-4aa9-8e01-b7c0e8711081", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "2c2baf2c-835f-4aa9-8e01-b7c0e8711081", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "feature_query = FeatureQuery(\n", " feature_list=[\"f_avg_item_ordered_by_customer\",\"f_avg_customer_discount_amount\",\"f_avg_customer_sales_amount\",\"f_total_sales_discount\"], key=customer_ID)\n", "settings = ObservationSettings(\n", @@ -793,39 +703,33 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"8cf3eb6a-d014-429f-9f57-28aa2870785d", - "showTitle":false, - 
"title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "8cf3eb6a-d014-429f-9f57-28aa2870785d", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Materialize feature value into offline storage\n", "While Feathr can compute the feature value from the feature definition on-the-fly at request time, it can also pre-compute and materialize the feature value to offline and/or online storage." ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"f1209ef3-f865-44fd-8721-14c6fa131d1b", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "f1209ef3-f865-44fd-8721-14c6fa131d1b", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.get_offline_features(observation_settings=settings,\n", " feature_query=feature_query,\n", " output_path=\"abfss://container@blobaccountname.dfs.core.windows.net/data/output/output.avro\")\n", @@ -833,38 +737,32 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b54f11fe-0a2e-4223-9090-68b12d3b3fb4", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b54f11fe-0a2e-4223-9090-68b12d3b3fb4", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Reading training data from offline storage" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"f805989c-d7f6-43e5-bd3c-0299c2f1beb7", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + 
"application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "f805989c-d7f6-43e5-bd3c-0299c2f1beb7", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "path = '/mnt/studio-feathrazure-dev-fs/cosmos/output/output'\n", "df= spark.read.format(\"avro\").load(path)\n", "\n", @@ -873,40 +771,34 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"7e398ee6-e2eb-4cf6-90b1-a71dc693a2c0", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "7e398ee6-e2eb-4cf6-90b1-a71dc693a2c0", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "####Train a ML model\n", "\n", "After getting all the features, let's train a machine learning model with the converted feature by Feathr:" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"a6948c79-0b06-41a7-8df0-4332d40a5b8a", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "a6948c79-0b06-41a7-8df0-4332d40a5b8a", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "X = df['f_total_sales_discount']\n", "y = df['f_total_sales_amount']\n", "\n", @@ -938,39 +830,33 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"69cb5413-7327-4e64-81d3-f76010a6af52", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "69cb5413-7327-4e64-81d3-f76010a6af52", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Materialize feature value into online 
storage\n", "We can push the generated features to the online store like below" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"706dcf1b-64d1-47d0-8bbe-88c8af82a464", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "706dcf1b-64d1-47d0-8bbe-88c8af82a464", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "\n", "redisSink = RedisSink(table_name=\"Customer360\")\n", "settings = MaterializationSettings(\"cosmos_feathr_table\",\n", @@ -982,64 +868,54 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"6f6cba7c-255b-4713-8c0b-023bdb4c2c55", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "6f6cba7c-255b-4713-8c0b-023bdb4c2c55", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "#### Fetching feature value for online inference\n", "For features that are already materialized by the previous step, their latest value can be queried via the client's get_online_features or multi_get_online_features API." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"40409c79-79fc-400e-a32b-fce3bdc682e6", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "40409c79-79fc-400e-a32b-fce3bdc682e6", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.get_online_features(feature_table = \"Customer360\",\n", " key = \"KB-16240\",\n", " feature_names = ['f_avg_item_ordered_by_customer'])" ] } ], - "metadata":{ - "application/vnd.databricks.v1+notebook":{ - "dashboards":[ - - ], - "language":"python", - "notebookMetadata":{ - "pythonIndentUnit":4 - }, - "notebookName":"Customer360_MS_V2", - "notebookOrigID":2897062443582288, - "widgets":{ - - } - }, - "language_info":{ - "name":"python" + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [], + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 + }, + "notebookName": "Customer360_MS_V2", + "notebookOrigID": 2897062443582288, + "widgets": {} + }, + "language_info": { + "name": "python" } }, - "nbformat":4, - "nbformat_minor":0 + "nbformat": 4, + "nbformat_minor": 0 } diff --git a/docs/samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb b/docs/samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb index 648ddbc9d..82aaf3832 100644 --- a/docs/samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb +++ b/docs/samples/databricks/databricks_quickstart_nyc_taxi_driver.ipynb @@ -342,15 +342,16 @@ " wasb_enabled: true\n", " s3:\n", " s3_enabled: false\n", - " s3_endpoint: 's3.amazonaws.com'\n", + " s3_endpoint: ''\n", " jdbc:\n", " jdbc_enabled: false\n", - " jdbc_database: 'feathrtestdb'\n", - " jdbc_table: 'feathrtesttable'\n", + " jdbc_database: ''\n", + " jdbc_table: ''\n", " snowflake:\n", - " 
url: \"dqllago-ol19457.snowflakecomputing.com\"\n", - " user: \"feathrintegration\"\n", - " role: \"ACCOUNTADMIN\"\n", + " snowflake_enabled: false\n", + " url: \".snowflakecomputing.com\"\n", + " user: \"\"\n", + " role: \"\"\n", "spark_config:\n", " # choice for spark runtime. Currently support: azure_synapse, databricks\n", " # The `databricks` configs will be ignored if `azure_synapse` is set and vice versa.\n", @@ -359,13 +360,13 @@ "\n", "online_store:\n", " redis:\n", - " host: 'feathrazuretest3redis.redis.cache.windows.net'\n", + " host: '.redis.cache.windows.net'\n", " port: 6380\n", " ssl_enabled: True\n", "feature_registry:\n", " purview:\n", " type_system_initialization: true\n", - " purview_name: 'feathrazuretest3-purview1'\n", + " purview_name: ''\n", " delimiter: '__'\n", "\"\"\"\n", "tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)\n", diff --git a/docs/samples/fraud_detection_demo.ipynb b/docs/samples/fraud_detection_demo.ipynb index 45d6d7982..88c672160 100644 --- a/docs/samples/fraud_detection_demo.ipynb +++ b/docs/samples/fraud_detection_demo.ipynb @@ -220,36 +220,36 @@ " wasb_enabled: true\n", " s3:\n", " s3_enabled: false\n", - " s3_endpoint: 's3.amazonaws.com'\n", + " s3_endpoint: ''\n", " jdbc:\n", " jdbc_enabled: false\n", - " jdbc_database: 'feathrtestdb'\n", - " jdbc_table: 'feathrtesttable'\n", + " jdbc_database: ''\n", + " jdbc_table: ''\n", " snowflake:\n", - " snowflake_enabled: true\n", - " url: \"dqllago-ol19457.snowflakecomputing.com\"\n", - " user: \"feathrintegration\"\n", - " role: \"ACCOUNTADMIN\"\n", + " snowflake_enabled: false\n", + " url: \".snowflakecomputing.com\"\n", + " user: \"\"\n", + " role: \"\"\n", "spark_config:\n", " spark_cluster: 'azure_synapse'\n", " spark_result_output_parts: '1'\n", " azure_synapse:\n", - " dev_url: 'https://feathrazuretest3synapse.dev.azuresynapse.net'\n", + " dev_url: 'https://.dev.azuresynapse.net'\n", " pool_name: 'spark3'\n", - " workspace_dir: 
'abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/fraud_detection_test'\n", + " workspace_dir: 'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/fraud_detection_test'\n", " executor_size: 'Small'\n", " executor_num: 1\n", " databricks:\n", - " workspace_instance_url: 'https://adb-2474129336842816.16.azuredatabricks.net'\n", + " workspace_instance_url: 'https://.azuredatabricks.net'\n", " config_template: {'run_name':'','new_cluster':{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{}},'libraries':[{'jar':''}],'spark_jar_task':{'main_class_name':'','parameters':['']}}\n", " work_dir: 'dbfs:/fraud_detection_test'\n", "online_store:\n", " redis:\n", - " host: 'feathrazuretest3redis.redis.cache.windows.net'\n", + " host: '.redis.cache.windows.net'\n", " port: 6380\n", " ssl_enabled: True\n", "feature_registry:\n", - " api_endpoint: \"https://feathr-sql-registry.azurewebsites.net/api/v1\"\n", + " api_endpoint: \"https://.azurewebsites.net/api/v1\"\n", "\"\"\"\n", "tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)\n", "with open(tmp.name, \"w\") as text_file:\n", diff --git a/docs/samples/product_recommendation_demo.ipynb b/docs/samples/product_recommendation_demo.ipynb index 4ead35504..aa7699eb5 100644 --- a/docs/samples/product_recommendation_demo.ipynb +++ b/docs/samples/product_recommendation_demo.ipynb @@ -221,10 +221,10 @@ " jdbc_database: 'feathrtestdb'\n", " jdbc_table: 'feathrtesttable'\n", " snowflake:\n", - " snowflake_enabled: true\n", - " url: \"dqllago-ol19457.snowflakecomputing.com\"\n", - " user: \"feathrintegration\"\n", - " role: \"ACCOUNTADMIN\"\n", + " snowflake_enabled: false\n", + " url: \".snowflakecomputing.com\"\n", + " user: \"\"\n", + " role: \"\"\n", "spark_config:\n", " spark_cluster: 'azure_synapse'\n", " spark_result_output_parts: '1'\n", diff --git a/docs/samples/product_recommendation_demo_advanced.ipynb 
b/docs/samples/product_recommendation_demo_advanced.ipynb index 89c9c63e5..fff2a1cd5 100644 --- a/docs/samples/product_recommendation_demo_advanced.ipynb +++ b/docs/samples/product_recommendation_demo_advanced.ipynb @@ -1,18 +1,16 @@ { - "cells":[ + "cells": [ { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"e5545a38-44a7-4aca-be6d-a66c51c75ec8", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "e5545a38-44a7-4aca-be6d-a66c51c75ec8", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "# Feathr Feature Store on Azure Demo Notebook\n", "\n", "This notebook illustrates the use of Feathr Feature Store to create a model that predicts users' ratings for different products on an e-commerce website.\n", @@ -37,18 +35,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"52b7d651-19d4-44b0-a7a8-03549f49e524", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "52b7d651-19d4-44b0-a7a8-03549f49e524", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Prerequisite: Use Quick Start Template to Provision Azure Resources\n", "\n", "The first step is to provision the required cloud resources if you want to use Feathr. 
Feathr provides a python based client to interact with cloud resources.\n", @@ -60,18 +56,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"1ec709d2-62ef-48c7-b915-9790afdac589", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1ec709d2-62ef-48c7-b915-9790afdac589", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Prerequisite: Install Feathr \n", "\n", "Install Feathr using pip:\n", @@ -80,132 +74,110 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"ab5d219b-b827-4f25-9918-d7cb7b47938e", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "ab5d219b-b827-4f25-9918-d7cb7b47938e", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Prerequisite: Configure the required environment with Feathr Quick Start Template\n", "\n", "In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. Run the code below to install Feathr, login to Azure to get the required credentials to access more cloud resources." 
] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"146a1443-ce8b-4b8e-8169-2417af8bcb62", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "146a1443-ce8b-4b8e-8169-2417af8bcb62", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "**REQUIRED STEP: Fill in the resource prefix when provisioning the resources**" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"99b2d855-dae1-4ac8-8492-406dad242326", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "99b2d855-dae1-4ac8-8492-406dad242326", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "resource_prefix = \"ckim2\"" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"95ad2a97-b8e7-4189-8463-51fe419d29c5", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "95ad2a97-b8e7-4189-8463-51fe419d29c5", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "! 
pip install feathr azure-cli pandavro scikit-learn\n" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"a4d7dc6a-d753-4fb6-9683-2766f9a046c7", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "a4d7dc6a-d753-4fb6-9683-2766f9a046c7", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "Login to Azure with a device code (You will see instructions in the output):" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"42cf1691-b8de-48d2-b174-0c269950d470", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "42cf1691-b8de-48d2-b174-0c269950d470", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "! 
az login --use-device-code" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"0f3135eb-15c5-4f46-90ff-881a21cc59df", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "0f3135eb-15c5-4f46-90ff-881a21cc59df", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "import glob\n", "import os\n", "import tempfile\n", @@ -230,18 +202,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"a58b69e8-fbd2-48dd-81cb-85163dfbb676", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "a58b69e8-fbd2-48dd-81cb-85163dfbb676", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "**Permission**\n", "\n", "To proceed with the following steps, you may need additional permission: permission to access the keyvault, permission to access the Storage Blob as a Contributor and permission to submit jobs to Synapse cluster. Skip this step if you have already given yourself the access. 
Otherwise, run the following lines of command in the Cloud Shell before running the cell below.\n", @@ -260,51 +230,39 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - - }, - "outputs":[ - - ], - "source":[ - - ] + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"510120a8-d456-4aa1-9b0b-6e10bd774b78", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "510120a8-d456-4aa1-9b0b-6e10bd774b78", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "**Get all the required credentials from Azure KeyVault**" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b589fc31-11f9-4bea-963a-9dab88cd6689", - "showTitle":true, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b589fc31-11f9-4bea-963a-9dab88cd6689", + "showTitle": true, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Get all the required credentials from Azure Key Vault\n", "key_vault_name=resource_prefix+\"kv\"\n", "synapse_workspace_url=resource_prefix+\"syws\"\n", @@ -335,18 +293,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"4a1f37e9-eb40-4791-9904-19e13a98f5c9", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "4a1f37e9-eb40-4791-9904-19e13a98f5c9", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Prerequisite: Configure the required environment (Don't need to update 
if using the above Quick Start Template)\n", "\n", "In the first step (Provision cloud resources), you should have provisioned all the required cloud resources. If you use Feathr CLI to create a workspace, you should have a folder with a file called `feathr_config.yaml` in it with all the required configurations. Otherwise, update the configuration below.\n", @@ -355,22 +311,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"c7cd2bc7-237c-4170-a9b7-ae94f279bbba", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "c7cd2bc7-237c-4170-a9b7-ae94f279bbba", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "import tempfile\n", "yaml_config = \"\"\"\n", "# Please refer to https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml for explanations on the meaning of each field.\n", @@ -389,32 +341,33 @@ " s3_endpoint: 's3.amazonaws.com'\n", " jdbc:\n", " jdbc_enabled: false\n", - " jdbc_database: 'feathrtestdb'\n", - " jdbc_table: 'feathrtesttable'\n", + " jdbc_database: ''\n", + " jdbc_table: ''\n", " snowflake:\n", - " url: \"dqllago-ol19457.snowflakecomputing.com\"\n", - " user: \"feathrintegration\"\n", - " role: \"ACCOUNTADMIN\"\n", + " snowflake_enabled: false\n", + " url: \".snowflakecomputing.com\"\n", + " user: \"\"\n", + " role: \"\"\n", "spark_config:\n", " spark_cluster: 'azure_synapse'\n", " spark_result_output_parts: '1'\n", " azure_synapse:\n", - " dev_url: 'https://feathrazuretest3synapse.dev.azuresynapse.net'\n", + " dev_url: 'https://.dev.azuresynapse.net'\n", " pool_name: 'spark3'\n", - " workspace_dir: 'abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/feathr_getting_started'\n", + " workspace_dir: 
'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_getting_started'\n", " executor_size: 'Small'\n", " executor_num: 1\n", " databricks:\n", - " workspace_instance_url: 'https://adb-2474129336842816.16.azuredatabricks.net'\n", + " workspace_instance_url: 'https://.azuredatabricks.net'\n", " config_template: {'run_name':'','new_cluster':{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{}},'libraries':[{'jar':''}],'spark_jar_task':{'main_class_name':'','parameters':['']}}\n", " work_dir: 'dbfs:/feathr_getting_started'\n", "online_store:\n", " redis:\n", - " host: 'feathrazuretest3redis.redis.cache.windows.net'\n", + " host: '.redis.cache.windows.net'\n", " port: 6380\n", " ssl_enabled: True\n", "feature_registry:\n", - " api_endpoint: \"https://feathr-sql-registry.azurewebsites.net/api/v1\"\n", + " api_endpoint: \"https://.azurewebsites.net/api/v1\"\n", "\"\"\"\n", "tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)\n", "with open(tmp.name, \"w\") as text_file:\n", @@ -422,18 +375,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"91548af7-5d87-4743-9db4-8fac7ba67804", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "91548af7-5d87-4743-9db4-8fac7ba67804", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Setup necessary environment variables (Skip if using the above Quick Start Template)\n", "\n", "You should setup the environment variables in order to run this sample. More environment variables can be set by referring to [feathr_config.yaml](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) and use that as the source of truth. 
It also has more explanations on the meaning of each variable.\n", @@ -443,75 +394,63 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"794492ed-66b0-4787-adc6-3f234c4739a9", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "794492ed-66b0-4787-adc6-3f234c4739a9", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "# Initialize Feathr Client" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"0c748f9d-210b-4c1d-a414-b30328d5e219", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "0c748f9d-210b-4c1d-a414-b30328d5e219", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client = FeathrClient(config_path=tmp.name)" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"46b45998-d933-4417-b152-7db091c0d5bd", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "46b45998-d933-4417-b152-7db091c0d5bd", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Explore the raw source data\n", "We have 4 datasets to work with: one observation dataset (a.k.a. label dataset), two raw datasets to generate features for users, and one raw dataset to generate features for products." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"591b1801-5783-4d88-b7b7-ff3bbcfa0a9e", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "591b1801-5783-4d88-b7b7-ff3bbcfa0a9e", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Observation dataset (a.k.a. label dataset)\n", "# Observation dataset usually comes with an event_timestamp to denote when the observation happened.\n", "# The label here is product_rating. Our model objective is to predict a user's rating for this product.\n", @@ -520,22 +459,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"11b8a74f-c0e1-4556-9a97-f17f8a90a795", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "11b8a74f-c0e1-4556-9a97-f17f8a90a795", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# User profile dataset\n", "# Used to generate user features\n", "import pandas as pd\n", @@ -543,22 +478,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"12f237da-a7fb-48c2-985e-a8cdfa3bb3fc", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "12f237da-a7fb-48c2-985e-a8cdfa3bb3fc", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# User purchase history dataset.\n", "# Used to generate user features. 
This is activity type data, so we need to use aggregation to generate features.\n", "import pandas as pd\n", @@ -566,22 +497,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"333ef001-50c8-4556-b484-78715b657dbb", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "333ef001-50c8-4556-b484-78715b657dbb", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Product detail dataset.\n", "# Used to generate product features.\n", "import pandas as pd\n", @@ -589,18 +516,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"bdc5a2e1-ccd4-4d61-9168-b0e4f571587b", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "bdc5a2e1-ccd4-4d61-9168-b0e4f571587b", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Defining Features with Feathr\n", "Let's try to create features from those raw source data.\n", "In Feathr, a feature is viewed as a function, mapping from entity id or key, and timestamp to a feature value. 
For more details on feature definition, please refer to the [Feathr Feature Definition Guide](https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-definition.md)\n", @@ -613,18 +538,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"30e2c57d-6487-4d72-bd78-80d17325f1a9", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "30e2c57d-6487-4d72-bd78-80d17325f1a9", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "Note: some features, such as those defined on top of request data, may have no entity key or timestamp.\n", "Such a feature is merely a function/transformation executing against request data at runtime.\n", "For example, the day of week of the request, which is calculated by converting the request UNIX timestamp.\n", @@ -632,18 +555,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"64fc4ef8-ccde-4724-8eff-1263c08de39f", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "64fc4ef8-ccde-4724-8eff-1263c08de39f", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "### Define Sources Section with UDFs\n", "\n", "#### Define Anchors and Features\n", @@ -654,22 +575,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"c32249b5-599b-4337-bebf-c33693354685", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "c32249b5-599b-4337-bebf-c33693354685", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ 
"from pyspark.sql import SparkSession, DataFrame\n", "def feathr_udf_preprocessing(df: DataFrame) -> DataFrame:\n", " from pyspark.sql.functions import col\n", @@ -683,22 +600,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"2961afe9-4bdc-48ba-a63f-229081f557a3", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "2961afe9-4bdc-48ba-a63f-229081f557a3", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Let's define some features for users so our recommendation can be customized for users.\n", "user_id = TypedKey(key_column=\"user_id\",\n", " key_column_type=ValueType.INT32,\n", @@ -734,22 +647,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"4da453e8-a8fd-40b8-a1e6-2a0e7cac3f6e", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "4da453e8-a8fd-40b8-a1e6-2a0e7cac3f6e", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Let's define some features for the products so our recommendation can be customized for products.\n", "product_batch_source = HdfsSource(name=\"productProfileData\",\n", " path=\"wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/product_detail_mock_data.csv\")\n", @@ -779,18 +688,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - 
"application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "78e240b4-dcab-499f-b6ed-72a14bfab968", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "### Window aggregation features\n", "\n", "For window aggregation features, see the supported fields below:\n", @@ -811,22 +718,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b62a9041-73dc-45e1-add5-8fe01ebf355f", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b62a9041-73dc-45e1-add5-8fe01ebf355f", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "purchase_history_data = HdfsSource(name=\"purchase_history_data\",\n", " path=\"wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/user_purchase_history_mock_data.csv\",\n", " event_timestamp_column=\"purchase_date\",\n", @@ -846,39 +749,33 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"a04373b5-8ab9-4c36-892f-6aa8129df999", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "a04373b5-8ab9-4c36-892f-6aa8129df999", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "### Derived Features Section\n", "Derived features are the features that are computed from other features. They could be computed from anchored features, or other derived features." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"688a4562-d8e9-468a-a900-77e750a3c903", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "688a4562-d8e9-468a-a900-77e750a3c903", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "feature_user_purchasing_power = DerivedFeature(name=\"feature_user_purchasing_power\",\n", " key=user_id,\n", " feature_type=FLOAT,\n", @@ -888,55 +785,47 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"f4d8f829-bfbc-4d6f-bc32-3a419a32e3d3", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "f4d8f829-bfbc-4d6f-bc32-3a419a32e3d3", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "We then need to build those features so that they can be consumed later. Note that we have to build both the \"anchor\" features and the \"derived\" features (which are not anchored to a source)." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"4c617bb8-2605-4d40-acc9-2156c86dfc56", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "4c617bb8-2605-4d40-acc9-2156c86dfc56", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.build_features(anchor_list=[user_agg_feature_anchor, user_feature_anchor, product_anchor], derived_feature_list=[\n", "                      feature_user_purchasing_power])" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"6b2877d0-2ab8-4c07-99d4-effc7336ee8a", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "6b2877d0-2ab8-4c07-99d4-effc7336ee8a", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Create training data using point-in-time correct feature join\n", "\n", "A training dataset usually contains entity id columns, multiple feature columns, an event timestamp column, and a label/target column. 
\n", @@ -948,22 +837,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"30302a53-561f-4b85-ba25-8de9fc843c63", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "30302a53-561f-4b85-ba25-8de9fc843c63", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "if client.spark_runtime == 'databricks':\n", " output_path = 'dbfs:/feathrazure_test.avro'\n", "else:\n", @@ -999,40 +884,34 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"cc7b6276-70c1-494f-83ca-53d442e3198a", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "cc7b6276-70c1-494f-83ca-53d442e3198a", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Download the training dataset and show the result\n", "\n", "Let's use the helper function `get_result_df` to download the result and view it:" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"120c9a21-1e1d-4ef5-8fe9-00d35a93cbf1", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "120c9a21-1e1d-4ef5-8fe9-00d35a93cbf1", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "def get_result_df(client: FeathrClient) -> pd.DataFrame:\n", " \"\"\"Download the job result dataset from cloud as a Pandas dataframe.\"\"\"\n", " res_url = client.get_job_result_uri(block=True, timeout_sec=600)\n", @@ -1052,39 +931,33 @@ ] }, { - 
"cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"497d6a3b-94e2-4087-94b1-0a5d7baf3ab3", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "497d6a3b-94e2-4087-94b1-0a5d7baf3ab3", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Train a machine learning model\n", "After getting all the features, let's train a machine learning model with the feature data computed by Feathr:" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"9bd661ae-430e-449b-9a62-9155828de099", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "9bd661ae-430e-449b-9a62-9155828de099", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "from sklearn.ensemble import GradientBoostingRegressor\n", "final_df = df_res\n", "\n", @@ -1123,18 +996,16 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"fda62a21-e7d6-4044-879f-bc05f77d248e", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "fda62a21-e7d6-4044-879f-bc05f77d248e", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Materialize feature value into offline/online storage\n", "\n", "While Feathr can compute the feature value from the feature definition on-the-fly at request time, it can also pre-compute\n", @@ -1144,22 +1015,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - 
"nuid":"3375f18d-cb64-4f13-8789-07b9d9c5835e", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "3375f18d-cb64-4f13-8789-07b9d9c5835e", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Materialize user features\n", "# (You can only materialize features of the same entity key into one table, so we materialize user features first.)\n", "backfill_time = BackfillTime(start=datetime(\n", @@ -1175,34 +1042,30 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"7fb61ed8-6db4-461c-bd86-a5ff268a7c3d", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "7fb61ed8-6db4-461c-bd86-a5ff268a7c3d", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "We can then get the features from the online store (Redis):" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"ed5da7df-8095-403e-91a6-c5d2104eaf68", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "ed5da7df-8095-403e-91a6-c5d2104eaf68", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Fetching feature value for online inference\n", "\n", "For features that are already materialized by the previous step, their latest value can be queried via the client's\n", @@ -1210,82 +1073,68 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"9d8f3710-d2d4-463a-b452-99bd56bb3482", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + 
"application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "9d8f3710-d2d4-463a-b452-99bd56bb3482", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.get_online_features('user_features', '2', [\n", " 'feature_user_age', 'feature_user_gift_card_balance'])" ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"e8aa6e5f-5b2d-4778-bafa-5a3a45fdd3b5", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "e8aa6e5f-5b2d-4778-bafa-5a3a45fdd3b5", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.multi_get_online_features('user_features', ['1', '2'], [\n", " 'feature_user_age', 'feature_user_gift_card_balance'])\n" ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"b19b73c6-7b0e-4b22-8eb1-8afdc328df74", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "b19b73c6-7b0e-4b22-8eb1-8afdc328df74", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "## Materialize product features\n", "\n", "We can also materialize product features into a separate table." 
] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"7a28cc6f-06f7-4915-9f3e-0a057467b77b", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "7a28cc6f-06f7-4915-9f3e-0a057467b77b", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "# Materialize product features\n", "backfill_time = BackfillTime(start=datetime(\n", " 2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))\n", @@ -1300,22 +1149,18 @@ ] }, { - "cell_type":"code", - "execution_count":null, - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"8732aad1-7b22-4efc-8e2c-722030ae8bfb", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "8732aad1-7b22-4efc-8e2c-722030ae8bfb", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.get_online_features('product_feature_setting', '2', [\n", " 'feature_product_price'])\n", "\n", @@ -1324,83 +1169,73 @@ ] }, { - "cell_type":"markdown", - "metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"acd29f4d-715b-4889-954d-b648ea8e2a0f", - "showTitle":false, - "title":"" + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "acd29f4d-715b-4889-954d-b648ea8e2a0f", + "showTitle": false, + "title": "" } }, - "source":[ + "source": [ "### Registering and Fetching features\n", "\n", "We can also register the features with an Apache Atlas compatible service, such as Azure Purview, and share the registered features across teams:" ] }, { - "cell_type":"code", - "execution_count":null, - 
"metadata":{ - "application/vnd.databricks.v1+cell":{ - "inputWidgets":{ - - }, - "nuid":"1255ed12-5030-43b6-b733-5a467874b708", - "showTitle":false, - "title":"" + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "inputWidgets": {}, + "nuid": "1255ed12-5030-43b6-b733-5a467874b708", + "showTitle": false, + "title": "" } }, - "outputs":[ - - ], - "source":[ + "outputs": [], + "source": [ "client.register_features()\n", "client.list_registered_features(project_name=\"feathr_getting_started\")" ] } ], - "metadata":{ - "application/vnd.databricks.v1+notebook":{ - "dashboards":[ - - ], - "language":"python", - "notebookMetadata":{ - "pythonIndentUnit":4 + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [], + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 4 }, - "notebookName":"product_recommendation_demo_advanced", - "notebookOrigID":411375353096492, - "widgets":{ - - } + "notebookName": "product_recommendation_demo_advanced", + "notebookOrigID": 411375353096492, + "widgets": {} }, - "kernelspec":{ - "display_name":"Python 3.9.5 ('base')", - "language":"python", - "name":"python3" + "kernelspec": { + "display_name": "Python 3.9.5 ('base')", + "language": "python", + "name": "python3" }, - "language_info":{ - "codemirror_mode":{ - "name":"ipython", - "version":3 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 }, - "file_extension":".py", - "mimetype":"text/x-python", - "name":"python", - "nbconvert_exporter":"python", - "pygments_lexer":"ipython3", - "version":"3.9.5" + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" }, - "vscode":{ - "interpreter":{ - "hash":"3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf" + "vscode": { + "interpreter": { + "hash": 
"3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf" } } }, - "nbformat":4, - "nbformat_minor":0 + "nbformat": 4, + "nbformat_minor": 0 }