Commit da00316 by Steffen Grohsschmiedt, committed Jan 21, 2021 (1 parent: 4867510). Showing 20 changed files with 429 additions and 3 deletions.
# Configure HDInsight for the Hopsworks Feature Store

To enable HDInsight to access the Hopsworks Feature Store, you need to set up a Hopsworks API key, add a script action to your HDInsight cluster, and apply a set of configurations to it.

!!! info "Prerequisites"
    An HDInsight cluster with cluster type Spark is required to connect to the Feature Store. You can either use an existing cluster or create a new one.

!!! info "Network Connectivity"

    To be able to connect to the Feature Store, ensure that your HDInsight cluster and the Hopsworks Feature Store are in the same [Virtual Network](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview), or that [Virtual Network Peering](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-manage-peering) is set up between the two networks. In addition, ensure that the Network Security Group of your Hopsworks instance is configured to allow incoming traffic from your HDInsight cluster on ports 443, 3306, 8020, 30010, 9083 and 9085. See [Network security groups](https://docs.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) for more information.
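To sanity-check connectivity before installing anything, you can probe these ports from a machine in the HDInsight network. The following is a minimal sketch, assuming `MY_INSTANCE.cloud.hopsworks.ai` is replaced with the DNS name of your Feature Store instance:

```python
import socket

HOST = "MY_INSTANCE.cloud.hopsworks.ai"  # DNS of your Feature Store instance
PORTS = [443, 3306, 8020, 30010, 9083, 9085]

for port in PORTS:
    try:
        # attempt a plain TCP connection with a short timeout
        with socket.create_connection((HOST, port), timeout=5):
            print("port {}: reachable".format(port))
    except OSError as e:
        print("port {}: NOT reachable ({})".format(port, e))
```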
## Step 1: Set up a Hopsworks API key

For HDInsight clusters to be able to communicate with the Hopsworks Feature Store, the clients running on HDInsight need access to a Hopsworks API key.

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *API keys*. Give the key a name and select the scopes listed below before creating the key. Make sure you have the key handy for the next steps.

!!! success "Scopes"
    The API key should contain at least the following scopes:

    1. featurestore
    2. project
    3. job

<p align="center">
  <figure>
    <img src="../../../assets/images/azure/hdinsight/step-0.png" alt="Generating an API key on Hopsworks">
    <figcaption>API keys can be created in the User Settings on Hopsworks</figcaption>
  </figure>
</p>

!!! info
    You are only able to retrieve the API key once. If you forget to copy it to your clipboard, delete it and create a new one.
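Optionally, you can verify the key before using it on the cluster. The sketch below calls the same `getProjectInfo` endpoint that the script action in Step 2 uses; `MY_INSTANCE`, `MY_PROJECT` and `MY_API_KEY` are placeholders for your values:

```python
import json
import urllib.request

HOST = "MY_INSTANCE.cloud.hopsworks.ai"
PROJECT = "MY_PROJECT"
API_KEY = "MY_API_KEY"

req = urllib.request.Request(
    "https://{}/hopsworks-api/api/project/getProjectInfo/{}".format(HOST, PROJECT),
    headers={"Authorization": "ApiKey " + API_KEY},
)
with urllib.request.urlopen(req) as resp:
    info = json.load(resp)

# a valid key with the project scope returns the project metadata, including its id
print(info["projectId"])
```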
## Step 2: Use a script action to install the Feature Store connector

HDInsight requires Hopsworks connectors to be able to communicate with the Hopsworks Feature Store. These connectors can be installed with the script action shown below. Copy the content into a file, name the file `hopsworks.sh`, and replace MY_INSTANCE, MY_PROJECT, MY_VERSION, MY_API_KEY and MY_CONDA_ENV with your values. Copy the `hopsworks.sh` file into any storage that is readable by your HDInsight clusters and take note of the URI of that file, e.g., `https://account.blob.core.windows.net/scripts/hopsworks.sh`.

The script action needs to be applied to head and worker nodes, and can be applied during cluster creation or to an existing cluster. Make sure to persist the script action so that it is run on newly created nodes. For more information about how to use script actions, see [Customize Azure HDInsight clusters by using script actions](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux).

!!! attention "Matching Hopsworks version"
    The **major version of `HSFS`** needs to match the **major version of Hopsworks**. Check [PyPI](https://pypi.org/project/hsfs/#history) for available releases.

<p align="center">
  <figure>
    <img src="../../assets/images/hopsworks-version.png" alt="HSFS version needs to match the major version of Hopsworks">
    <figcaption>You find the Hopsworks version inside any of your project's settings tab on Hopsworks</figcaption>
  </figure>
</p>

Feature Store script action:
```bash
set -e

HOST="MY_INSTANCE.cloud.hopsworks.ai" # DNS of your Feature Store instance
PROJECT="MY_PROJECT"                  # Name of your Hopsworks Feature Store project
HSFS_VERSION="MY_VERSION"             # The major version of HSFS needs to match the major version of Hopsworks
API_KEY="MY_API_KEY"                  # The API key to authenticate with Hopsworks
CONDA_ENV="MY_CONDA_ENV"              # py35 is the default for HDI 3.6

apt-get --assume-yes install python3-dev
apt-get --assume-yes install jq

# Install the HSFS client library into the cluster's conda environment
/usr/bin/anaconda/envs/$CONDA_ENV/bin/pip install hsfs==$HSFS_VERSION

# Look up the numeric project id for the project name
PROJECT_ID=$(curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/getProjectInfo/$PROJECT | jq -r .projectId)

mkdir -p /usr/lib/hopsworks
chown root:hadoop /usr/lib/hopsworks
cd /usr/lib/hopsworks

# Download the Hopsworks client jars and the custom Hive binaries
curl -o client.tar.gz -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/client

tar -xvf client.tar.gz
tar -xzf client/apache-hive-*-bin.tar.gz
mv apache-hive-*-bin apache-hive-bin
rm client.tar.gz
rm client/apache-hive-*-bin.tar.gz

# Fetch the project's TLS credentials: keystore, truststore and key material password
curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .kStore | base64 -d > keyStore.jks

curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .tStore | base64 -d > trustStore.jks

echo -n $(curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .password) > material_passwd

chown -R root:hadoop /usr/lib/hopsworks
```
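Once the script action has run, a quick sanity check on a cluster node can confirm the installation. This is a minimal sketch, assuming the default `py35` conda environment; adjust the interpreter path if you chose a different environment:

```python
import os
import subprocess

# check that the client jars and TLS material were installed by the script action
assert os.path.isdir("/usr/lib/hopsworks/client"), "Hopsworks client jars missing"
for name in ("keyStore.jks", "trustStore.jks", "material_passwd"):
    path = os.path.join("/usr/lib/hopsworks", name)
    assert os.path.isfile(path), "{} missing".format(path)

# check that hsfs is importable from the target conda environment
subprocess.check_call([
    "/usr/bin/anaconda/envs/py35/bin/python",
    "-c", "import hsfs; print(hsfs.__version__)",
])
```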
## Step 3: Configure HDInsight for Feature Store access

The Hadoop and Spark installations of the HDInsight cluster need to be configured in order to access the Feature Store. This can be achieved either by using a [bootstrap script](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-bootstrap) when creating clusters, or by using [Ambari](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-manage-ambari) on existing clusters. Apply the following configurations to your HDInsight cluster.

!!! attention "Using Hive and the Feature Store"

    HDInsight clusters cannot use their local Hive when configured for the Feature Store, as the Feature Store relies on custom Hive binaries and its own metastore, which overrides the local one. If you rely on Hive for feature engineering, it is advised to write your data from your main HDInsight cluster to external storage such as ADLS, and to use a second HDInsight cluster to read that data and access the Feature Store.
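A rough sketch of the two-cluster pattern described above, assuming an existing Spark session (`spark`), e.g. in a notebook; the Hive table name and ADLS path are hypothetical placeholders:

```python
# On the main cluster (local Hive): engineer features and persist them to ADLS
df = spark.sql("SELECT id, feature_a, feature_b FROM my_hive_table")  # hypothetical table
df.write.mode("overwrite").parquet(
    "abfs://container@account.dfs.core.windows.net/features"  # hypothetical ADLS Gen2 path
)

# On the Feature Store-configured cluster: read the features back for use with HSFS
features = spark.read.parquet("abfs://container@account.dfs.core.windows.net/features")
```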
Hadoop `hadoop-env.sh`:
```
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hopsworks/client/*
```

Hadoop `core-site.xml`:
```
hops.ipc.server.ssl.enabled=true
fs.hopsfs.impl=io.hops.hopsfs.client.HopsFileSystem
client.rpc.ssl.enabled.protocol=TLSv1.2
hops.ssl.keystore.name=/usr/lib/hopsworks/keyStore.jks
hops.rpc.socket.factory.class.default=io.hops.hadoop.shaded.org.apache.hadoop.net.HopsSSLSocketFactory
hops.ssl.keystores.passwd.name=/usr/lib/hopsworks/material_passwd
hops.ssl.hostname.verifier=ALLOW_ALL
hops.ssl.trustore.name=/usr/lib/hopsworks/trustStore.jks
```

Spark `spark-defaults.conf`:
```
spark.executor.extraClassPath=/usr/lib/hopsworks/client/*
spark.driver.extraClassPath=/usr/lib/hopsworks/client/*
spark.sql.hive.metastore.jars=/usr/lib/hopsworks/apache-hive-bin/lib/*
```

Spark `hive-site.xml`:
```
hive.metastore.uris=thrift://MY_HOPSWORKS_INSTANCE_PRIVATE_IP:9083
```
!!! info
    Replace MY_HOPSWORKS_INSTANCE_PRIVATE_IP with the private IP address of your Hopsworks Feature Store.
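To verify that the configuration took effect, you can inspect the live configuration from a PySpark session on the cluster; a minimal sketch using the property names set above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark should report the custom Hive metastore jars set in spark-defaults.conf
print(spark.conf.get("spark.sql.hive.metastore.jars", "not set"))

# the Hadoop configuration should carry the Hopsworks TLS settings from core-site.xml
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("hops.ipc.server.ssl.enabled"))
print(hadoop_conf.get("hops.ssl.keystore.name"))
```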
## Step 4: Configure the Feature Store for HDInsight

In order for the Feature Store to work correctly with HDInsight, ensure that it is using Parquet instead of ORC for storing features. Open Hopsworks and navigate to the Admin panel:

<p align="center">
  <figure>
    <img src="../../../assets/images/azure/hdinsight/variables-step-0.png" alt="Open the admin panel">
    <figcaption>Open the admin panel</figcaption>
  </figure>
</p>

In the admin panel, select `Edit variables`:

<p align="center">
  <figure>
    <img src="../../../assets/images/azure/hdinsight/variables-step-1.png" alt="Select Edit variables">
    <figcaption>Select Edit variables</figcaption>
  </figure>
</p>

Search for the name `featurestore_default_storage_format` and ensure it is set to `PARQUET`. To set it to Parquet, select the edit icon to the right, update the value and accept the change. Press `Reload variables` to ensure that everything is updated correctly:

<p align="center">
  <figure>
    <img src="../../../assets/images/azure/hdinsight/variables-step-2.png" alt="Ensure featurestore_default_storage_format is set to PARQUET">
    <figcaption>Ensure featurestore_default_storage_format is set to PARQUET</figcaption>
  </figure>
</p>
## Step 5: Connect to the Feature Store

You are now ready to connect to the Hopsworks Feature Store, for instance using a Jupyter notebook in HDInsight with a PySpark3 kernel:

```python
import hsfs

# Put the API key into Key Vault for any production setup:
# see https://azure.microsoft.com/en-us/services/key-vault/
secret_value = 'MY_API_KEY'

# Create a connection
conn = hsfs.connection(
    host='MY_INSTANCE.cloud.hopsworks.ai',  # DNS of your Feature Store instance
    port=443,                               # Port to reach your Hopsworks instance, defaults to 443
    project='MY_PROJECT',                   # Name of your Hopsworks Feature Store project
    api_key_value=secret_value,             # The API key to authenticate with Hopsworks
    hostname_verification=True              # Disable for self-signed certificates
)

# Get the feature store handle for the project's feature store
fs = conn.get_feature_store()
```
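As a quick smoke test of the connection, you can read an existing feature group; `MY_FEATURE_GROUP` is a placeholder for a feature group in your project:

```python
# read a feature group into a Spark DataFrame (PySpark3 kernel)
fg = fs.get_feature_group('MY_FEATURE_GROUP', version=1)
df = fg.read()
df.show(5)
```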
## Next Steps

For more information about how to use the Feature Store, see the [Quickstart Guide](../quickstart.md).
# Azure Machine Learning Designer Integration

Connecting to the Feature Store from the Azure Machine Learning Designer requires setting up a Feature Store API key for the Designer and installing the **HSFS** library on the Designer. This guide explains step by step how to connect to the Feature Store from the Azure Machine Learning Designer.

!!! info "Network Connectivity"

    To be able to connect to the Feature Store, ensure that the Network Security Group of your Hopsworks instance on Azure is configured to allow incoming traffic from your compute target on ports 443, 9083 and 9085. See [Network security groups](https://docs.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) for more information. If your compute target is not in the same VNet as your Hopsworks instance and the Hopsworks instance is not accessible from the internet, you will need to configure [Virtual Network Peering](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-manage-peering).
## Generate an API key

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *API keys*. Give the key a name and select the job, featurestore and project scopes before creating the key. Copy the key into your clipboard for the next step.

!!! success "Scopes"
    The API key should contain at least the following scopes:

    1. featurestore
    2. project
    3. job

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-0.png" alt="Generate an API key on Hopsworks">
    <figcaption>API keys can be created in the User Settings on Hopsworks</figcaption>
  </figure>
</p>
!!! info
    You are only able to retrieve the API key once. If you forget to copy it to your clipboard, delete it and create a new one.
|
||
To connect to the Feature Store from the Azure Machine Learning Designer, create a new pipeline or open an existing one: | ||
|
||
<p align="center"> | ||
<figure> | ||
<img src="../../assets/images/azure/designer/step-1.png" alt="Add an Execute Python Script step"> | ||
<figcaption>Add an Execute Python Script step</figcaption> | ||
</figure> | ||
</p> | ||
|
||
In the pipeline, add a new `Execute Python Script` step and replace the Python script from the next step: | ||
|
||
<p align="center"> | ||
<figure> | ||
<img src="../../assets/images/azure/designer/step-2.png" alt="Add the code to access the Feature Store"> | ||
<figcaption>Add the code to access the Feature Store</figcaption> | ||
</figure> | ||
</p> | ||
|
||
!!! info "Updating the script" | ||
|
||
Replace MY_VERSION, MY_API_KEY, MY_INSTANCE, MY_PROJECT and MY_FEATURE_GROUP with the respective values. The major version set for MY_VERSION needs to match the major version of Hopsworks. Check [PyPI](https://pypi.org/project/hsfs/#history) for available releases. | ||
|
||
<p align="center"> | ||
<figure> | ||
<img src="../../assets/images/hopsworks-version.png" alt="HSFS version needs to match the major version of Hopsworks"> | ||
<figcaption>You find the Hopsworks version inside any of your Project's settings tab on Hopsworks</figcaption> | ||
</figure> | ||
</p> | ||
|
||
```python | ||
import os | ||
import importlib.util | ||
|
||
|
||
package_name = 'hsfs' | ||
version = 'MY_VERSION' | ||
spec = importlib.util.find_spec(package_name) | ||
if spec is None: | ||
import os | ||
os.system(f"pip install %s[hive]==%s" % (package_name, version)) | ||
|
||
# Put the API key into Key Vault for any production setup: | ||
# See, https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-secrets-in-runs | ||
#from azureml.core import Experiment, Run | ||
#run = Run.get_context() | ||
#secret_value = run.get_secret(name="fs-api-key") | ||
secret_value = 'MY_API_KEY' | ||
|
||
def azureml_main(dataframe1 = None, dataframe2 = None): | ||
|
||
import hsfs | ||
conn = hsfs.connection( | ||
host='MY_INSTANCE.cloud.hopsworks.ai', # DNS of your Feature Store instance | ||
port=443, # Port to reach your Hopsworks instance, defaults to 443 | ||
project='MY_PROJECT', # Name of your Hopsworks Feature Store project | ||
api_key_value=secret_value, # The API key to authenticate with Hopsworks | ||
hostname_verification=True, # Disable for self-signed certificates | ||
engine='hive' # Choose Hive as engine | ||
) | ||
fs = conn.get_feature_store() # Get the project's default feature store | ||
|
||
return fs.get_feature_group('MY_FEATURE_GROUP', version=1).read(), | ||
``` | ||
|
||
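A note on the `return` statement: the `Execute Python Script` step maps the tuple returned by `azureml_main` to its output ports, and with `engine='hive'` the `read()` call returns a pandas DataFrame, so the feature group data becomes the step's first output.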
Select a compute target and save the step. The step is now ready to use:

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-3.png" alt="Select a compute target">
    <figcaption>Select a compute target</figcaption>
  </figure>
</p>
As a next step, you have to connect the previously created `Execute Python Script` step with the next step in the pipeline. For instance, to export the features to a CSV file, create an `Export Data` step:

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-4.png" alt="Add an Export Data step">
    <figcaption>Add an Export Data step</figcaption>
  </figure>
</p>
Configure the `Export Data` step to write to your data store of choice:

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-5.png" alt="Configure the Export Data step">
    <figcaption>Configure the Export Data step</figcaption>
  </figure>
</p>
Connect the two steps by drawing a line between them:

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-6.png" alt="Connect the steps">
    <figcaption>Connect the steps</figcaption>
  </figure>
</p>
Finally, submit the pipeline and wait for it to finish:

!!! info "Performance on the first execution"

    The `Execute Python Script` step can be slow when executed for the first time, as the HSFS library needs to be installed on the compute target. Subsequent executions on the same compute target should use the already installed library.

<p align="center">
  <figure>
    <img src="../../assets/images/azure/designer/step-7.png" alt="Execute the pipeline">
    <figcaption>Execute the pipeline</figcaption>
  </figure>
</p>
## Next Steps

For more information about how to use the Feature Store, see the [Quickstart Guide](../quickstart.md).