Commit acc1a4d
Update docs to allow Feathr use Spark UDFs (feathr-ai#585)

* Update client.py
* Spark UDF doc
* update docs
* Create feathr-rbac-role-initialization.png
* resolve comments, update docs for RBAC

1 parent 90f328d · Showing 10 changed files with 240 additions and 39 deletions.

---
layout: default
title: Developing Customized Feathr Spark UDF
parent: How-to Guides
---

# Developing Customized Feathr Spark UDF

Feathr provides flexible ways for end users to define featurization logic. One advanced use case is applying complex, customized transformation logic in Feathr. This document describes the steps required to do that.

Although Feathr uses Spark as the execution engine, this is transparent to end users. However, advanced use cases such as Spark UDFs require basic knowledge of Spark.

The idea is to let users define arbitrary functions with the Spark UDF framework, register them as permanent functions in Spark, and have Feathr call those functions.

Most of the content in this document is outside Feathr's scope, but the steps are documented here to make it easier for end users to develop Spark UDFs and understand the workflow a bit better.

## Difference between Spark UDF scopes

Before we get started, there is an important concept to understand: Spark has two types of UDFs. Session-scoped UDFs are available only in the session that registers them, while permanent functions are shared across different sessions.

For example, the [Scalar User Defined Functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) page in the Spark documentation includes an example like this:

```scala
val spark = SparkSession
  .builder()
  .appName("Spark SQL UDF scalar example")
  .getOrCreate()

spark.udf.register("oneArgFilter", (n: Int) => { n > 5 })
spark.range(1, 10).createOrReplaceTempView("test")
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show()
```

The way `spark.udf.register("oneArgFilter", (n: Int) => { n > 5 })` is called indicates that this is a session-scoped UDF, which cannot be shared across different sessions.

If we instead want to share a UDF across different sessions, we should use the `CREATE FUNCTION` statement, which creates a temporary or permanent function in Spark, as shown below. Refer to the [Spark documentation](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-function.html) to learn more.

```SQL
CREATE OR REPLACE FUNCTION simple_feathr_udf_add20_string AS 'org.example.SimpleFeathrUDFString' USING JAR 'dbfs:/FileStore/jars/SimpleFeathrUDF.jar';
```

Basically, temporary functions are scoped at the session level, whereas permanent functions are created in the persistent catalog and made available to all sessions. The resources specified in the `USING` clause are made available to all executors when the function is executed for the first time.

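To make the contrast concrete, here is a minimal sketch of both forms; the temporary function name below is hypothetical, and both statements assume the JAR path used later in this guide:

```SQL
-- Session-scoped: visible only in the current session.
CREATE TEMPORARY FUNCTION simple_feathr_udf_add20_string_tmp AS 'org.example.SimpleFeathrUDFString' USING JAR 'dbfs:/FileStore/jars/SimpleFeathrUDF.jar';

-- Permanent: stored in the persistent catalog and visible to all sessions.
CREATE OR REPLACE FUNCTION simple_feathr_udf_add20_string AS 'org.example.SimpleFeathrUDFString' USING JAR 'dbfs:/FileStore/jars/SimpleFeathrUDF.jar';
```
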
## Step 1: Creating a JAR package for the UDF

According to the [Spark documentation](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-function.html), end users need to implement a class extending one of the base classes below:

- Should extend `UDF` or `UDAF` in the `org.apache.hadoop.hive.ql.exec` package.
- Should extend `AbstractGenericUDAFResolver`, `GenericUDF`, or `GenericUDTF` in the `org.apache.hadoop.hive.ql.udf.generic` package.
- Should extend `UserDefinedAggregateFunction` in the `org.apache.spark.sql.expressions` package.

Currently, only row-level transformations are supported in Feathr, i.e. you should always extend `UDF` in the `org.apache.hadoop.hive.ql.exec` package.

## Write a simple UDF

For example, we can develop a Java class like the one below: a new class called `SimpleFeathrUDFString` that takes a string as input, parses it as an integer, and returns the number plus 20.

```java
package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

public class SimpleFeathrUDFString extends UDF {
    public int evaluate(String value) {
        int number = Integer.parseInt(value);
        return number + 20;
    }
}
```

Note that this minimal example does not validate its input: `Integer.parseInt` throws a `NumberFormatException` on null or non-numeric values, so add input handling as needed for real workloads.

The corresponding `pom.xml` will look like the one below. Remember to include `org.apache.hive:hive-exec` and `org.apache.hadoop:hadoop-common` as dependencies so that the JAR compiles successfully:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SimpleFeathrUDF</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.3.1</version>
        </dependency>
    </dependencies>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
</project>
```

Compile the code into a JAR file, for example by running `mvn package` from the project directory. If you are using an IDE such as IntelliJ, you can instead create an artifact like this:

![Feathr Spark UDF](../images/feathr-spark-udf-artifact.png)

After that, you should have a JAR file containing your UDF code.

## Step 2: Upload the JAR to the Spark cluster and register the UDF

The second step is to upload the JAR we just compiled to a shared location and register it in the Spark environment.

For Databricks, you can upload it to the DBFS instance that your current Spark cluster is using, like below:

![Feathr Spark Upload](../images/feathr-spark-udf-upload.png)

For Synapse, you can upload the JAR to the default storage account associated with the Spark cluster.

After uploading the JAR, register the UDF as a permanent function like this. It is usually a one-time task, so you can use the built-in notebooks in the Spark cluster to do it.

For example, in Databricks, you can log in to a notebook and execute the command below:

![Feathr Spark UDF](../images/feathr-spark-udf-test.png)

```SQL
CREATE OR REPLACE FUNCTION simple_feathr_udf_add20_string AS 'org.example.SimpleFeathrUDFString' USING JAR 'dbfs:/FileStore/jars/SimpleFeathrUDF.jar';
```

For more on the syntax, refer to the [Spark docs](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-function.html).

You can test whether the UDF is registered in Spark by running the following in the notebook environment:

```SQL
CREATE TABLE IF NOT EXISTS feathr_test_table(c1 INT);
INSERT INTO feathr_test_table VALUES (1), (2);
SELECT simple_feathr_udf_add20_string(c1) AS function_return_value FROM feathr_test_table;
```

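If the function is registered correctly, the `SELECT` above returns 21 and 22 (the inputs 1 and 2, each plus 20). As an additional sanity check, a sketch like the following should confirm the registration and clean up the test table:

```SQL
-- Show which class the function is bound to.
DESCRIBE FUNCTION simple_feathr_udf_add20_string;

-- Optional cleanup of the test table.
DROP TABLE IF EXISTS feathr_test_table;
```
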
## Step 3: Using the UDF in Feathr

The only caveat here is that some Feathr optimizations must be disabled. Feathr is designed to optimize for large-scale workloads, but UDFs are black boxes to the optimizer, so we need to turn off certain optimizations, such as bloom filters, to allow UDFs to run.

This is straightforward to do. In Feathr, when calling the `get_offline_features` or `materialize_features` APIs, specify `execution_configurations={"spark.feathr.row.bloomfilter.maxThreshold":"0"}` so that Feathr doesn't apply this optimization to the UDFs, like below:

```python
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path=output_path,
                            execution_configurations={"spark.feathr.row.bloomfilter.maxThreshold": "0"}
                            )
```

```python
client.materialize_features(settings, execution_configurations={"spark.feathr.row.bloomfilter.maxThreshold": "0"})
```

That's it! You can use the UDF just like a regular Spark function. For example, we might want to define a feature like this:

```python
Feature(name="f_udf_transform",
        feature_type=INT32,
        transform="simple_feathr_udf_add20_string(PULocationID)")
```

You will see a result like this, where `f_udf_transform` has transformed the `PULocationID` column by adding 20 to its values:

![Feathr Spark UDF](../images/feathr-spark-udf-result.png)

For more details on how to call these UDFs in Feathr, please refer to the [Feathr User Defined Functions (UDFs) document](../concepts/feathr-udfs.md).