docs: add overview page for simple DNN and fix some typos (#1879)
* docs: add overview page for simple DNN and fix some typos

* docs: fixup docs for acrolinx

* fix acrolinx

* update

* update

* update

---------
serena-ruan authored Mar 21, 2023
1 parent 87e1c78 commit ed842a5
Showing 13 changed files with 615 additions and 209 deletions.
1 change: 1 addition & 0 deletions website/.gitignore
@@ -26,6 +26,7 @@
!/docs/features/simple_deep_learning
/docs/features/simple_deep_learning/*
!/docs/features/simple_deep_learning/about.md
!/docs/features/simple_deep_learning/installation.md
!/docs/features/spark_serving
/docs/features/spark_serving/*
!/docs/features/spark_serving/about.md
92 changes: 63 additions & 29 deletions website/docs/features/simple_deep_learning/about.md
@@ -1,42 +1,76 @@
---
title: Deep Vision Classification on Databricks
sidebar_label: Deep Vision Classification on Databricks
title: Simple Deep Learning with SynapseML
sidebar_label: About
---

:::note
This is for databricks 10.4.x-gpu-ml-scala2.12 runtime
:::
### Why Simple Deep Learning
Creating a Spark-compatible deep learning system can be challenging for users who may not have a
thorough understanding of deep learning and distributed systems. Additionally, writing custom deep learning
scripts can be cumbersome and time-consuming.
SynapseML aims to simplify this process by building on top of the [Horovod](https://github.com/horovod/horovod) Estimator, a general-purpose
distributed deep learning model that is compatible with SparkML, and [PyTorch Lightning](https://github.com/Lightning-AI/lightning),
a lightweight wrapper around the popular PyTorch deep learning framework.

## 1. Reinstall horovod using our prepared script
SynapseML's simple deep learning toolkit makes it easy to use modern deep learning methods in Apache Spark.
By providing a collection of Estimators, SynapseML enables users to perform distributed transfer learning on
Spark clusters to solve custom machine learning tasks without requiring in-depth domain expertise.
Whether you're a data scientist, data engineer, or business analyst, this project aims to make modern deep-learning methods easy to use for new domain-specific problems.

We build on top of torchvision, horovod and pytorch_lightning, so we need to reinstall horovod by building on specific versions of those packages.
Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload
it to databricks dbfs.
### SynapseML's Simple DNN
SynapseML goes beyond the limited support for deep networks in SparkML and provides out-of-the-box solutions for various common scenarios:
- Visual Classification: Users can apply transfer learning for image classification tasks, using pretrained models and fine-tuning them to solve custom classification problems.
- Text Classification: SynapseML simplifies the process of implementing natural language processing tasks such as sentiment analysis, text classification, and language modeling by providing prebuilt models and tools.
- And more coming soon

Add the path of this script to `Init Scripts` section when configuring the spark cluster.
Restarting the cluster will automatically install horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0.
### Why Horovod
Horovod is a distributed deep learning framework developed by Uber, which has become popular for its ability to scale
deep learning tasks across multiple GPUs and compute nodes efficiently. It's designed to work with TensorFlow, Keras, PyTorch, and Apache MXNet.
- Scalability: Horovod uses efficient communication algorithms like ring-allreduce and hierarchical allreduce, which allow it to scale the training process across multiple GPUs and nodes without significant performance degradation.
- Easy Integration: Horovod can be easily integrated into existing deep learning codebases with minimal changes, making it a popular choice for distributed training.
- Fault Tolerance: Horovod provides fault tolerance features like elastic training. It can dynamically adapt to changes in the number of workers or recover from failures.
- Community Support: Horovod has an active community and is widely used in the industry, which ensures that the framework is continually updated and improved.
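
To make the ring-allreduce idea from the Scalability bullet concrete, here is a toy single-process simulation. It is only a sketch under stated assumptions: Horovod's real implementation exchanges tensors between processes over MPI, Gloo, or NCCL, whereas the hypothetical `ring_allreduce` function below models each worker as a plain Python list and each transfer as a list copy.

```python
def ring_allreduce(worker_data):
    """Simulate ring-allreduce (sum) over equally sized per-worker vectors.

    Illustrative only: `worker_data[i]` stands in for the gradient vector
    held by worker i. Returns the vector each worker holds at the end.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    data = [list(v) for v in worker_data]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all payloads first so the sends happen "simultaneously".
        sends = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(sends):
            dst = (i + 1) % n  # each worker only talks to its right neighbor
            lo, _ = bounds[chunk]
            for k, value in enumerate(payload):
                data[dst][lo + k] += value

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(sends):
            dst = (i + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload

    return data


# Three workers, each with its own "gradient"; all end with the element-wise sum.
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as workers are added, which is the property the bullet refers to.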

## 2. Install SynapseML Deep Learning Component

You could install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification.
Run the following command:
```powershell
pip install https://mmlspark.blob.core.windows.net/pip/$SYNAPSEML_SCALA_VERSION/synapseml_deep_learning-$SYNAPSEML_PYTHON_VERSION-py2.py3-none-any.whl
```
### Why PyTorch Lightning
PyTorch Lightning is a lightweight wrapper around the popular PyTorch deep learning framework, designed to make it
easier to write clean, modular, and scalable deep learning code. PyTorch Lightning has several advantages that
make it an excellent choice for SynapseML's Simple Deep Learning:
- Code Organization: PyTorch Lightning promotes a clean and organized code structure by separating the research code from the engineering code. This property makes it easier to maintain, debug, and share deep learning models.
- Flexibility: PyTorch Lightning retains the flexibility and expressiveness of PyTorch while adding useful abstractions to simplify the training loop and other boilerplate code.
- Built-in Best Practices: PyTorch Lightning incorporates many best practices for deep learning, such as automatic optimization, gradient clipping, and learning rate scheduling, making it easier for users to achieve optimal performance.
- Compatibility: PyTorch Lightning is compatible with a wide range of popular tools and frameworks, including Horovod, which allows users to easily use distributed training capabilities.
- Rapid Development: With PyTorch Lightning, users can quickly experiment with different model architectures and training strategies without worrying about low-level implementation details.

An alternative is installing the SynapseML jar package in library management section, by adding:
```
Coordinate: com.microsoft.azure:synapseml_2.12:SYNAPSEML_SCALA_VERSION
Repository: https://mmlspark.azureedge.net/maven
```
### Sample usage with DeepVisionClassifier
DeepVisionClassifier incorporates all models supported by [torchvision](https://github.com/pytorch/vision).
:::note
If you install the jar package, you need to follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch)
to make horovod recognize our module.
The current version is based on pytorch_lightning v1.5.0 and torchvision v0.12.0
:::
Given a Spark DataFrame that contains an image column and a label column, you can call `fit` on it directly
with DeepVisionClassifier.
```python
train_df = spark.createDataFrame([
("PATH_TO_IMAGE_1.jpg", 1),
("PATH_TO_IMAGE_2.jpg", 2)
], ["image", "label"])

## 3. Try our sample notebook
deep_vision_classifier = DeepVisionClassifier(
backbone="resnet50", # Put your backbone here
store=store, # Corresponding store
callbacks=callbacks, # Optional callbacks
num_classes=17,
batch_size=16,
epochs=epochs,
validation=0.1,
)

You could follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and have a try on your own dataset.
deep_vision_model = deep_vision_classifier.fit(train_df)
```
DeepVisionClassifier runs distributed training on Spark with Horovod under the hood. After fitting, it returns
a DeepVisionModel, which you can use directly for inference:
```python
pred_df = deep_vision_model.transform(test_df)
```

Supported models (the `backbone` parameter of `DeepVisionClassifier`) must be given as the string name of a [Torchvision-supported model](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py).
You can also check by running `backbone in torchvision.models.__dict__`.
## Examples
- [DeepLearning - Deep Vision Classification](../DeepLearning%20-%20Deep%20Vision%20Classification)
- [DeepLearning - Deep Text Classification](../DeepLearning%20-%20Deep%20Text%20Classification)
42 changes: 42 additions & 0 deletions website/docs/features/simple_deep_learning/installation.md
@@ -0,0 +1,42 @@
---
title: Installation Guidance
sidebar_label: Installation Guidance for Deep Vision Classification
---

:::note
This sample uses the Databricks 10.4.x-gpu-ml-scala2.12 runtime.
:::

## 1. Reinstall horovod using our prepared script

We build on top of torchvision, horovod, and pytorch_lightning, so horovod must be reinstalled and built against specific versions of those packages.
Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload
it to Databricks DBFS.

Add the path of this script to the `Init Scripts` section when configuring the Spark cluster.
Restarting the cluster automatically installs horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0.
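
After the cluster restarts, a notebook cell along the following lines could confirm that the pinned versions were picked up. This is a minimal sketch, not part of the installation script: the helper name `find_mismatches` is hypothetical, and the distribution names in `EXPECTED` are assumptions about how the packages are published.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in the step above; the distribution names are assumptions.
EXPECTED = {
    "horovod": "0.25.0",
    "pytorch-lightning": "1.5.0",
    "torchvision": "0.12.0",
}


def find_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build tags such as "+cu113" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means all three packages match; any printed line points at a package the init script did not install as expected.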

## 2. Install SynapseML Deep Learning Component

You can install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification.
Run the following command:
```powershell
pip install synapseml==0.11.0
```

Alternatively, install the SynapseML jar package in the library management section by adding:
```
Coordinate: com.microsoft.azure:synapseml_2.12:0.11.0
Repository: https://mmlspark.azureedge.net/maven
```
:::note
If you install the jar package, follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch)
to ensure horovod recognizes SynapseML.
:::

## 3. Try our sample notebook

Follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and try it on your own dataset.

Supported models (the `backbone` parameter of `DeepVisionClassifier`) must be given as the string name of a [Torchvision-supported model](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py).
You can also check by running `backbone in torchvision.models.__dict__`.
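
The membership check above can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `is_supported_backbone` is a hypothetical name, and a `SimpleNamespace` stands in for `torchvision.models` so the snippet runs without torchvision installed. It also requires the entry to be callable, a small refinement that filters out submodules but is not an exact test for model constructors.

```python
from types import SimpleNamespace


def is_supported_backbone(backbone, models_namespace):
    """True if `backbone` names a callable attribute of a module-like object."""
    candidate = vars(models_namespace).get(backbone)
    return callable(candidate)


# Stand-in for `torchvision.models`; with torchvision installed you would
# pass `torchvision.models` itself instead of this fake namespace.
fake_models = SimpleNamespace(resnet50=lambda: "model", version_string="0.0")

print(is_supported_backbone("resnet50", fake_models))     # True
print(is_supported_backbone("not_a_model", fake_models))  # False
```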
@@ -18,29 +18,30 @@ values={[
```python
from synapse.ml.causal import *
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, BooleanType

schema = StructType([
StructField("Treatment", IntegerType()),
StructField("Outcome", IntegerType()),
StructField("Treatment", BooleanType()),
StructField("Outcome", BooleanType()),
StructField("col2", DoubleType()),
StructField("col3", DoubleType()),
StructField("col4", DoubleType())
])


df = spark.createDataFrame([
(0, 1, 0.30, 0.66, 0.2),
(1, 0, 0.38, 0.53, 1.5),
(0, 1, 0.68, 0.98, 3.2),
(1, 0, 0.15, 0.32, 6.6),
(0, 1, 0.50, 0.65, 2.8),
(1, 1, 0.40, 0.54, 3.7),
(0, 1, 0.78, 0.97, 8.1),
(1, 0, 0.12, 0.32, 10.2),
(0, 1, 0.35, 0.63, 1.8),
(1, 0, 0.45, 0.57, 4.3),
(0, 1, 0.75, 0.97, 7.2),
(1, 1, 0.16, 0.32, 11.7)], schema
(False, True, 0.30, 0.66, 0.2),
(True, False, 0.38, 0.53, 1.5),
(False, True, 0.68, 0.98, 3.2),
(True, False, 0.15, 0.32, 6.6),
(False, True, 0.50, 0.65, 2.8),
(True, True, 0.40, 0.54, 3.7),
(False, True, 0.78, 0.97, 8.1),
(True, False, 0.12, 0.32, 10.2),
(False, True, 0.35, 0.63, 1.8),
(True, False, 0.45, 0.57, 4.3),
(False, True, 0.75, 0.97, 7.2),
(True, True, 0.16, 0.32, 11.7)], schema
)

dml = (DoubleMLEstimator()
@@ -63,18 +64,18 @@ import com.microsoft.azure.synapse.ml.causal._
import org.apache.spark.ml.classification.LogisticRegression

val df = (Seq(
(0, 1, 0.50, 0.60, 0),
(1, 0, 0.40, 0.50, 1),
(0, 1, 0.78, 0.99, 2),
(1, 0, 0.12, 0.34, 3),
(0, 1, 0.50, 0.60, 0),
(1, 0, 0.40, 0.50, 1),
(0, 1, 0.78, 0.99, 2),
(1, 0, 0.12, 0.34, 3),
(0, 0, 0.50, 0.60, 0),
(1, 1, 0.40, 0.50, 1),
(0, 1, 0.78, 0.99, 2),
(1, 0, 0.12, 0.34, 3))
(false, true, 0.50, 0.60, 0),
(true, false, 0.40, 0.50, 1),
(false, true, 0.78, 0.99, 2),
(true, false, 0.12, 0.34, 3),
(false, true, 0.50, 0.60, 0),
(true, false, 0.40, 0.50, 1),
(false, true, 0.78, 0.99, 2),
(true, false, 0.12, 0.34, 3),
(false, false, 0.50, 0.60, 0),
(true, true, 0.40, 0.50, 1),
(false, true, 0.78, 0.99, 2),
(true, false, 0.12, 0.34, 3))
.toDF("Treatment", "Outcome", "col2", "col3", "col4"))

val dml = (new DoubleMLEstimator()
@@ -45,7 +45,12 @@ dml = (DoubleMLEstimator()
.setOutcomeCol("Outcome")
.setOutcomeModel(LogisticRegression())
.setMaxIter(20))
dmlModel = dml.fit(df)
dmlModel = dml.fit(dataset)
```
> Note: all columns in your dataset except "Treatment" and "Outcome" are used as confounders.

After fitting the model, you can get the average treatment effect and its confidence interval:
```python
dmlModel.getAvgTreatmentEffect()
dmlModel.getConfidenceInterval()
```
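
To see which columns end up as confounders under the rule in the note above, a quick check like the following can help. It only illustrates the column-selection rule and is not SynapseML's internal code; the helper name `confounder_columns` is hypothetical, and with a Spark DataFrame you would pass `dataset.columns`.

```python
def confounder_columns(columns, treatment_col="Treatment", outcome_col="Outcome"):
    """Return every column that is neither the treatment nor the outcome."""
    return [c for c in columns if c not in (treatment_col, outcome_col)]


# Column names taken from the example schema in this documentation.
columns = ["Treatment", "Outcome", "col2", "col3", "col4"]
print(confounder_columns(columns))  # ['col2', 'col3', 'col4']
```

If a column should not act as a confounder (an ID column, for example), drop it from the dataset before calling `fit`.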
@@ -3,7 +3,7 @@ title: CognitiveServices - Create Audiobooks
hide_title: true
status: stable
---
# Create Audiobooks using Neural Speech to Text
# Create audiobooks using neural Text to speech

## Step 1: Load libraries and add service information

@@ -38,11 +38,6 @@ spark.sparkContext._jsc.hadoopConfiguration().set(spark_key_setting, storage_key
```


```python
import os
```


```python
import os
from os.path import exists, join
@@ -0,0 +1,108 @@
---
title: AzureSearchIndex - Met Artworks
hide_title: true
status: stable
---
<h1>Creating a searchable Art Database with The MET's open-access collection</h1>

In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using SynapseML. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index.


```python
import os, sys, time, json, requests
from pyspark.ml import Transformer, Estimator, Pipeline
from pyspark.ml.feature import SQLTransformer
from pyspark.sql.functions import lit, udf, col, split
```


```python
from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

from synapse.ml.core.platform import running_on_synapse, find_secret

if running_on_synapse():
    from notebookutils.visualization import display
```


```python
cognitive_key = find_secret("cognitive-api-key")
cognitive_loc = "eastus"
azure_search_key = find_secret("azure-search-key")
search_service = "mmlspark-azure-search"
search_index = "test"
```


```python
data = (
spark.read.format("csv")
.option("header", True)
.load("wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv")
.withColumn("searchAction", lit("upload"))
.withColumn("Neighbors", split(col("Neighbors"), ",").cast("array<string>"))
.withColumn("Tags", split(col("Tags"), ",").cast("array<string>"))
.limit(25)
)
```

<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png" width="800" />


```python
from synapse.ml.cognitive import AnalyzeImage
from synapse.ml.stages import SelectColumns

# define pipeline
describeImage = (
AnalyzeImage()
.setSubscriptionKey(cognitive_key)
.setLocation(cognitive_loc)
.setImageUrlCol("PrimaryImageUrl")
.setOutputCol("RawImageDescription")
.setErrorCol("Errors")
.setVisualFeatures(
["Categories", "Description", "Faces", "ImageType", "Color", "Adult"]
)
.setConcurrency(5)
)

df2 = (
describeImage.transform(data)
.select("*", "RawImageDescription.*")
.drop("Errors", "RawImageDescription")
)
```

<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png" width="800" />

Before writing the results to a Search Index, you must define a schema that specifies the name, type, and attributes of each field in your index. Refer to [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information.
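
As a rough picture of what such a schema looks like, the sketch below builds an index definition as a Python dictionary in the shape described by the Azure Search documentation. The field list and the attributes chosen for each field are assumptions for illustration (only the column names come from this example), and the snippet does not show how the definition is passed to SynapseML.

```python
import json

# Hypothetical index definition; attribute choices are illustrative only.
index_definition = {
    "name": "test",
    "fields": [
        # Exactly one field must be the document key.
        {"name": "ObjectID", "type": "Edm.String", "key": True, "filterable": True},
        {"name": "PrimaryImageUrl", "type": "Edm.String", "retrievable": True},
        {"name": "Tags", "type": "Collection(Edm.String)", "searchable": True},
        {"name": "Neighbors", "type": "Collection(Edm.String)", "retrievable": True},
    ],
}

key_fields = [f["name"] for f in index_definition["fields"] if f.get("key")]
print(key_fields)
print(json.dumps(index_definition, indent=2))
```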


```python
from synapse.ml.cognitive import *

df2.writeToAzureSearch(
subscriptionKey=azure_search_key,
actionCol="searchAction",
serviceName=search_service,
indexName=search_index,
keyCol="ObjectID",
)
```

The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying, refer to [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents).


```python
url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06".format(
search_service, search_index
)
requests.post(
url, json={"search": "Glass"}, headers={"api-key": azure_search_key}
).json()
```
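
The query above can be generalized with a small request builder. This is a sketch under stated assumptions: `build_search_request` is a hypothetical helper, the `top` and `select` body parameters follow the Search Documents REST reference, and nothing is sent over the network until you pass the result to `requests.post`.

```python
API_VERSION = "2019-05-06"  # same API version as the request above


def build_search_request(service, index, api_key, search_text, top=None, select=None):
    """Build the URL, headers, and JSON body for a Search Documents POST."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = {"search": search_text}
    if top is not None:
        body["top"] = top  # maximum number of results to return
    if select:
        body["select"] = ",".join(select)  # comma-separated field names
    headers = {"api-key": api_key}
    return url, headers, body


url, headers, body = build_search_request(
    "mmlspark-azure-search", "test", "<your-key>", "Glass", top=5, select=["ObjectID", "Tags"]
)
print(url)
print(body)
# To run the query: requests.post(url, json=body, headers=headers).json()
```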