Add docs for MLStudio and HDInsight
Steffen Grohsschmiedt committed Jan 21, 2021
1 parent 4867510 commit da00316
Showing 20 changed files with 429 additions and 3 deletions.
Binary file added docs/assets/images/azure/designer/step-0.png
Binary file added docs/assets/images/azure/designer/step-1.png
Binary file added docs/assets/images/azure/designer/step-2.png
Binary file added docs/assets/images/azure/designer/step-3.png
Binary file added docs/assets/images/azure/designer/step-4.png
Binary file added docs/assets/images/azure/designer/step-5.png
Binary file added docs/assets/images/azure/designer/step-6.png
Binary file added docs/assets/images/azure/designer/step-7.png
Binary file added docs/assets/images/azure/hdinsight/step-0.png
Binary file added docs/assets/images/azure/hdinsight/variables-step-0.png
Binary file added docs/assets/images/azure/hdinsight/variables-step-1.png
Binary file added docs/assets/images/azure/hdinsight/variables-step-2.png
Binary file added docs/assets/images/azure/notebooks/step-0.png
Binary file added docs/assets/images/azure/notebooks/step-1.png
2 changes: 1 addition & 1 deletion docs/integrations/emr/emr_configuration.md
@@ -212,7 +212,7 @@ Your EMR cluster will now be able to access your Hopsworks Feature Store.
## Next Steps

If you use Python, then install the [HSFS library](https://pypi.org/project/hsfs/). The Scala version of the library has already been installed to your EMR cluster.
Use the [Connection API](../../../generated/api/connection_api/) to connect to the Hopsworks Feature Store.
Use the [Connection API](../../../generated/api/connection_api/) to connect to the Hopsworks Feature Store. For more information about how to use the Feature Store, see the [Quickstart Guide](../../quickstart.md).

!!! attention "Matching Hopsworks version"
The **major version of `HSFS`** needs to match the **major version of Hopsworks**.
182 changes: 182 additions & 0 deletions docs/integrations/hdinsight.md
@@ -0,0 +1,182 @@
# Configure HDInsight for the Hopsworks Feature Store
To enable HDInsight to access the Hopsworks Feature Store, you need to set up a Hopsworks API key and add a script action and configurations to your HDInsight cluster.

!!! info "Prerequisites"
    An HDInsight cluster of cluster type Spark is required to connect to the Feature Store. You can either use an existing cluster or create a new one.

!!! info "Network Connectivity"

    To be able to connect to the Feature Store, please ensure that your HDInsight cluster and the Hopsworks Feature Store are either in the same [Virtual Network](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview) or that [Virtual Network Peering](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-manage-peering) is set up between the two networks. In addition, ensure that the Network Security Group of your Hopsworks instance is configured to allow incoming traffic from your HDInsight cluster on ports 443, 3306, 8020, 30010, 9083 and 9085. See [Network security groups](https://docs.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) for more information.
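
For example, an inbound NSG rule covering these ports could be added with the Azure CLI. The following is a sketch: the resource group, NSG name and source address prefix are placeholders for your environment.

```bash
# Allow inbound traffic from the HDInsight subnet (placeholder address range)
az network nsg rule create \
  --resource-group MY_RESOURCE_GROUP \
  --nsg-name MY_HOPSWORKS_NSG \
  --name AllowHDInsightToFeatureStore \
  --priority 200 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes 10.1.0.0/16 \
  --destination-port-ranges 443 3306 8020 30010 9083 9085
```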

## Step 1: Set up a Hopsworks API key
For HDInsight clusters to be able to communicate with the Hopsworks Feature Store, the clients running on HDInsight need access to a Hopsworks API key.

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *API keys*. Give the key a name and select the project scope before creating the key. Make sure you have the key handy for the next steps.

!!! success "Scopes"
The API key should contain at least the following scopes:

1. featurestore
2. project
3. job

<p align="center">
<figure>
<img src="../../../assets/images/azure/hdinsight/step-0.png" alt="Generating an API key on Hopsworks">
<figcaption>API keys can be created in the User Settings on Hopsworks</figcaption>
</figure>
</p>

!!! info
You are only able to retrieve the API key once. If you forget to copy it to your clipboard, delete it and create a new one.
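
To verify the key, you can call the Hopsworks REST API directly; this is the same endpoint the script action in the next step relies on (replace the placeholders with your values):

```bash
# Returns the project metadata as JSON if the key and its scopes are valid
curl -H "Authorization: ApiKey MY_API_KEY" \
  "https://MY_INSTANCE.cloud.hopsworks.ai/hopsworks-api/api/project/getProjectInfo/MY_PROJECT"
```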

## Step 2: Use a script action to install the Feature Store connector

HDInsight requires Hopsworks connectors to be able to communicate with the Hopsworks Feature Store. These connectors can be installed with the script action shown below. Copy the content into a file named `hopsworks.sh` and replace MY_INSTANCE, MY_PROJECT, MY_VERSION, MY_API_KEY and MY_CONDA_ENV with your values. Upload the `hopsworks.sh` file to any storage that is readable by your HDInsight clusters and take note of its URI, e.g., `https://account.blob.core.windows.net/scripts/hopsworks.sh`.

The script action needs to be applied to head and worker nodes, and can be applied during cluster creation or to an existing cluster. Make sure to persist the script action so that it is run on newly created nodes. For more information about how to use script actions, see [Customize Azure HDInsight clusters by using script actions](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux).
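
As a sketch, applying and persisting the script action on an existing cluster with the Azure CLI could look as follows (the resource group, cluster name and script URI are placeholders):

```bash
az hdinsight script-action execute \
  --resource-group MY_RESOURCE_GROUP \
  --cluster-name MY_HDINSIGHT_CLUSTER \
  --name hopsworks-feature-store \
  --script-uri https://account.blob.core.windows.net/scripts/hopsworks.sh \
  --roles headnode workernode \
  --persist-on-success
```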

!!! attention "Matching Hopsworks version"
The **major version of `HSFS`** needs to match the **major version of Hopsworks**. Check [PyPI](https://pypi.org/project/hsfs/#history) for available releases.

<p align="center">
<figure>
<img src="../../assets/images/hopsworks-version.png" alt="HSFS version needs to match the major version of Hopsworks">
<figcaption>You find the Hopsworks version inside any of your Project's settings tab on Hopsworks</figcaption>
</figure>
</p>

Feature Store script action:
```bash
#!/bin/bash
set -e

HOST="MY_INSTANCE.cloud.hopsworks.ai" # DNS of your Feature Store instance
PROJECT="MY_PROJECT"                  # Name of your Hopsworks Feature Store project
HSFS_VERSION="MY_VERSION"             # The major version of HSFS needs to match the major version of Hopsworks
API_KEY="MY_API_KEY"                  # The API key to authenticate with Hopsworks
CONDA_ENV="MY_CONDA_ENV"              # py35 is the default for HDI 3.6

apt-get --assume-yes install python3-dev
apt-get --assume-yes install jq

# Install the HSFS Python library into the cluster's conda environment
/usr/bin/anaconda/envs/$CONDA_ENV/bin/pip install hsfs==$HSFS_VERSION

# Look up the numeric project ID via the Hopsworks REST API
PROJECT_ID=$(curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/getProjectInfo/$PROJECT | jq -r .projectId)

mkdir -p /usr/lib/hopsworks
chown root:hadoop /usr/lib/hopsworks
cd /usr/lib/hopsworks

# Download and unpack the Hopsworks client bundle, including the custom Hive binaries
curl -o client.tar.gz -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/client

tar -xvf client.tar.gz
tar -xzf client/apache-hive-*-bin.tar.gz
mv apache-hive-*-bin apache-hive-bin
rm client.tar.gz
rm client/apache-hive-*-bin.tar.gz

# Materialize the project's TLS key store, trust store and certificate password
curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .kStore | base64 -d > keyStore.jks

curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .tStore | base64 -d > trustStore.jks

echo -n $(curl -H "Authorization: ApiKey ${API_KEY}" https://$HOST/hopsworks-api/api/project/$PROJECT_ID/credentials | jq -r .password) > material_passwd

chown -R root:hadoop /usr/lib/hopsworks
```

## Step 3: Configure HDInsight for Feature Store access

The Hadoop and Spark installations of the HDInsight cluster need to be configured in order to access the Feature Store. This can be achieved either by using a [bootstrap script](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-bootstrap) when creating clusters or using [Ambari](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-manage-ambari) on existing clusters. Apply the following configurations to your HDInsight cluster.

!!! attention "Using Hive and the Feature Store"

    HDInsight clusters cannot use their local Hive when configured for the Feature Store, as the Feature Store relies on custom Hive binaries and its own metastore, which replaces the local one. If you rely on Hive for feature engineering, it is advised to write your data to an external data storage such as ADLS from your main HDInsight cluster, and to use a second HDInsight cluster to read that data and access the Feature Store.
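
    For example, a minimal sketch of this hand-over via ADLS Gen2 (the storage account, container and path are hypothetical):

    ```python
    # On the main HDInsight cluster: persist engineered features to ADLS Gen2
    # (features_df is the Spark DataFrame produced by your feature engineering job)
    features_df.write.parquet("abfss://features@myaccount.dfs.core.windows.net/driver_features")

    # On the Feature Store-configured cluster: read the data back before ingesting it
    features_df = spark.read.parquet("abfss://features@myaccount.dfs.core.windows.net/driver_features")
    ```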

Hadoop hadoop-env.sh:
```
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hopsworks/client/*
```

Hadoop core-site.xml:
```
hops.ipc.server.ssl.enabled=true
fs.hopsfs.impl=io.hops.hopsfs.client.HopsFileSystem
client.rpc.ssl.enabled.protocol=TLSv1.2
hops.ssl.keystore.name=/usr/lib/hopsworks/keyStore.jks
hops.rpc.socket.factory.class.default=io.hops.hadoop.shaded.org.apache.hadoop.net.HopsSSLSocketFactory
hops.ssl.keystores.passwd.name=/usr/lib/hopsworks/material_passwd
hops.ssl.hostname.verifier=ALLOW_ALL
hops.ssl.trustore.name=/usr/lib/hopsworks/trustStore.jks
```

Spark spark-defaults.conf:
```
spark.executor.extraClassPath=/usr/lib/hopsworks/client/*
spark.driver.extraClassPath=/usr/lib/hopsworks/client/*
spark.sql.hive.metastore.jars=/usr/lib/hopsworks/apache-hive-bin/lib/*
```

Spark hive-site.xml:
```
hive.metastore.uris=thrift://MY_HOPSWORKS_INSTANCE_PRIVATE_IP:9083
```

!!! info
    Replace MY_HOPSWORKS_INSTANCE_PRIVATE_IP with the private IP address of your Hopsworks Feature Store instance.
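
When creating a new cluster, the same settings can be supplied up front in the `configurations` object of the cluster template, as described in the bootstrap documentation linked above. The following is only a sketch: the exact Ambari section names (e.g. `spark2-defaults`, `spark2-hive-site-override`) depend on your HDInsight version, and the `hadoop-env.sh` classpath export still needs to be applied separately.

```json
"configurations": {
    "core-site": {
        "hops.ipc.server.ssl.enabled": "true",
        "fs.hopsfs.impl": "io.hops.hopsfs.client.HopsFileSystem",
        "client.rpc.ssl.enabled.protocol": "TLSv1.2",
        "hops.ssl.keystore.name": "/usr/lib/hopsworks/keyStore.jks",
        "hops.rpc.socket.factory.class.default": "io.hops.hadoop.shaded.org.apache.hadoop.net.HopsSSLSocketFactory",
        "hops.ssl.keystores.passwd.name": "/usr/lib/hopsworks/material_passwd",
        "hops.ssl.hostname.verifier": "ALLOW_ALL",
        "hops.ssl.trustore.name": "/usr/lib/hopsworks/trustStore.jks"
    },
    "spark2-defaults": {
        "spark.executor.extraClassPath": "/usr/lib/hopsworks/client/*",
        "spark.driver.extraClassPath": "/usr/lib/hopsworks/client/*",
        "spark.sql.hive.metastore.jars": "/usr/lib/hopsworks/apache-hive-bin/lib/*"
    },
    "spark2-hive-site-override": {
        "hive.metastore.uris": "thrift://MY_HOPSWORKS_INSTANCE_PRIVATE_IP:9083"
    }
}
```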

## Step 4: Configure the Feature Store for HDInsight

In order for the Feature Store to work correctly with HDInsight, ensure that it is using Parquet instead of ORC for storing features. Open Hopsworks and navigate to the Admin panel:

<p align="center">
<figure>
<img src="../../../assets/images/azure/hdinsight/variables-step-0.png" alt="Open the admin panel">
<figcaption>Open the admin panel</figcaption>
</figure>
</p>

In the admin panel, select `Edit variables`:

<p align="center">
<figure>
<img src="../../../assets/images/azure/hdinsight/variables-step-1.png" alt="Select Edit variables">
<figcaption>Select Edit variables</figcaption>
</figure>
</p>

Search for the name `featurestore_default_storage_format` and ensure it is set to `PARQUET`. To set it to Parquet, select the edit icon to the right, update the value and accept the change. Press `Reload variables` to ensure that everything is updated correctly:

<p align="center">
<figure>
<img src="../../../assets/images/azure/hdinsight/variables-step-2.png" alt="Ensure featurestore_default_storage_format is set to PARQUET">
<figcaption>Ensure featurestore_default_storage_format is set to PARQUET</figcaption>
</figure>
</p>

## Step 5: Connect to the Feature Store

You are now ready to connect to the Hopsworks Feature Store, for instance using a Jupyter notebook in HDInsight with a PySpark3 kernel:

```python
import hsfs

# Put the API key into Key Vault for any production setup:
# See https://azure.microsoft.com/en-us/services/key-vault/
secret_value = 'MY_API_KEY'

# Create a connection
conn = hsfs.connection(
    host='MY_INSTANCE.cloud.hopsworks.ai',  # DNS of your Feature Store instance
    port=443,                               # Port to reach your Hopsworks instance, defaults to 443
    project='MY_PROJECT',                   # Name of your Hopsworks Feature Store project
    api_key_value=secret_value,             # The API key to authenticate with Hopsworks
    hostname_verification=True              # Disable for self-signed certificates
)

# Get the feature store handle for the project's feature store
fs = conn.get_feature_store()
```
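
As a quick check, you can now read features from one of your project's feature groups (the feature group name is a placeholder):

```python
# Read a feature group into a Spark DataFrame
fg = fs.get_feature_group('MY_FEATURE_GROUP', version=1)
df = fg.read()
df.show(5)
```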

## Next Steps

For more information about how to use the Feature Store, see the [Quickstart Guide](../quickstart.md).
147 changes: 147 additions & 0 deletions docs/integrations/mlstudio_designer.md
@@ -0,0 +1,147 @@
# Azure Machine Learning Designer Integration

Connecting to the Feature Store from the Azure Machine Learning Designer requires setting up a Feature Store API key for the Designer and installing the **HSFS** library on the Designer. This guide explains step by step how to connect to the Feature Store from the Azure Machine Learning Designer.

!!! info "Network Connectivity"

    To be able to connect to the Feature Store, please ensure that the Network Security Group of your Hopsworks instance on Azure is configured to allow incoming traffic from your compute target on ports 443, 9083 and 9085. See [Network security groups](https://docs.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) for more information. If your compute target is not in the same VNet as your Hopsworks instance and the Hopsworks instance is not accessible from the internet, you will need to configure [Virtual Network Peering](https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-manage-peering).

## Generate an API key

In Hopsworks, click on your *username* in the top-right corner and select *Settings* to open the user settings. Select *API keys*. Give the key a name and select the job, featurestore and project scopes before creating the key. Copy the key into your clipboard for the next step.

!!! success "Scopes"
The API key should contain at least the following scopes:

1. featurestore
2. project
3. job

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-0.png" alt="Generate an API key on Hopsworks">
<figcaption>API keys can be created in the User Settings on Hopsworks</figcaption>
</figure>
</p>

!!! info
    You are only able to retrieve the API key once. If you fail to copy it to your clipboard, delete it and create a new one.

## Connect to the Feature Store

To connect to the Feature Store from the Azure Machine Learning Designer, create a new pipeline or open an existing one:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-1.png" alt="Add an Execute Python Script step">
<figcaption>Add an Execute Python Script step</figcaption>
</figure>
</p>

In the pipeline, add a new `Execute Python Script` step and replace its default Python script with the script shown below:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-2.png" alt="Add the code to access the Feature Store">
<figcaption>Add the code to access the Feature Store</figcaption>
</figure>
</p>

!!! info "Updating the script"

Replace MY_VERSION, MY_API_KEY, MY_INSTANCE, MY_PROJECT and MY_FEATURE_GROUP with the respective values. The major version set for MY_VERSION needs to match the major version of Hopsworks. Check [PyPI](https://pypi.org/project/hsfs/#history) for available releases.

<p align="center">
<figure>
<img src="../../assets/images/hopsworks-version.png" alt="HSFS version needs to match the major version of Hopsworks">
<figcaption>You find the Hopsworks version inside any of your Project's settings tab on Hopsworks</figcaption>
</figure>
</p>

```python
import os
import importlib.util

# Install HSFS on the compute target if it is not yet available
package_name = 'hsfs'
version = 'MY_VERSION'
spec = importlib.util.find_spec(package_name)
if spec is None:
    os.system(f"pip install {package_name}[hive]=={version}")

# Put the API key into Key Vault for any production setup:
# See https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-secrets-in-runs
#from azureml.core import Experiment, Run
#run = Run.get_context()
#secret_value = run.get_secret(name="fs-api-key")
secret_value = 'MY_API_KEY'

def azureml_main(dataframe1 = None, dataframe2 = None):

    import hsfs
    conn = hsfs.connection(
        host='MY_INSTANCE.cloud.hopsworks.ai',  # DNS of your Feature Store instance
        port=443,                               # Port to reach your Hopsworks instance, defaults to 443
        project='MY_PROJECT',                   # Name of your Hopsworks Feature Store project
        api_key_value=secret_value,             # The API key to authenticate with Hopsworks
        hostname_verification=True,             # Disable for self-signed certificates
        engine='hive'                           # Choose Hive as the engine
    )
    fs = conn.get_feature_store()               # Get the project's default feature store

    return fs.get_feature_group('MY_FEATURE_GROUP', version=1).read(),
```
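
Note the trailing comma in the `return` statement: `azureml_main` is expected to return a tuple of dataframes, so the single result dataframe is wrapped in a one-element tuple.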

Select a compute target and save the step. The step is now ready to use:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-3.png" alt="Select a compute target">
<figcaption>Select a compute target</figcaption>
</figure>
</p>

As a next step, connect the previously created `Execute Python Script` step with the next step in the pipeline. For instance, to export the features to a CSV file, create an `Export Data` step:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-4.png" alt="Add an Export Data step">
<figcaption>Add an Export Data step</figcaption>
</figure>
</p>

Configure the `Export Data` step to write to your data store of choice:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-5.png" alt="Configure the Export Data step">
<figcaption>Configure the Export Data step</figcaption>
</figure>
</p>

Connect the two steps by drawing a line between them:

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-6.png" alt="Connect the steps">
<figcaption>Connect the steps</figcaption>
</figure>
</p>

Finally, submit the pipeline and wait for it to finish:

!!! info "Performance on the first execution"

    The `Execute Python Script` step can be slow the first time it is executed, as the HSFS library needs to be installed on the compute target. Subsequent executions on the same compute target reuse the already installed library.

<p align="center">
<figure>
<img src="../../assets/images/azure/designer/step-7.png" alt="Execute the pipeline">
<figcaption>Execute the pipeline</figcaption>
</figure>
</p>

## Next Steps

For more information about how to use the Feature Store, see the [Quickstart Guide](../quickstart.md).