Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: add overview page for simple DNN and fix some typos (#1879)
* docs: add overview page for simple DNN and fix some typos
* docs: fixup docs for acrolinx
* fix acrolinx
* update
* update
* update
1 parent 87e1c78, commit ed842a5
Showing 13 changed files with 615 additions and 209 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,42 +1,76 @@ | ||
--- | ||
title: Deep Vision Classification on Databricks | ||
sidebar_label: Deep Vision Classification on Databricks | ||
title: Simple Deep Learning with SynapseML | ||
sidebar_label: About | ||
--- | ||
|
||
:::note | ||
This is for databricks 10.4.x-gpu-ml-scala2.12 runtime | ||
::: | ||
### Why Simple Deep Learning | ||
Creating a Spark-compatible deep learning system can be challenging for users who may not have a | ||
thorough understanding of deep learning and distributed systems. Additionally, writing custom deep learning | ||
scripts may be a cumbersome and time-consuming task. | ||
SynapseML aims to simplify this process by building on top of the [Horovod](https://github.com/horovod/horovod) Estimator, a general-purpose | ||
distributed deep learning model that is compatible with SparkML, and [Pytorch-lightning](https://github.com/Lightning-AI/lightning), | ||
a lightweight wrapper around the popular PyTorch deep learning framework. | ||
|
||
## 1. Reinstall horovod using our prepared script | ||
SynapseML's simple deep learning toolkit makes it easy to use modern deep learning methods in Apache Spark. | ||
By providing a collection of Estimators, SynapseML enables users to perform distributed transfer learning on | ||
Spark clusters to solve custom machine learning tasks without requiring in-depth domain expertise. | ||
Whether you're a data scientist, data engineer, or business analyst, this project aims to make modern deep-learning methods easy to use for new domain-specific problems. | ||
|
||
We build on top of torchvision, horovod and pytorch_lightning, so we need to reinstall horovod by building on specific versions of those packages. | ||
Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload | ||
it to databricks dbfs. | ||
### SynapseML's Simple DNN | ||
SynapseML goes beyond the limited support for deep networks in SparkML and provides out-of-the-box solutions for various common scenarios: | ||
- Visual Classification: Users can apply transfer learning for image classification tasks, using pretrained models and fine-tuning them to solve custom classification problems. | ||
- Text Classification: SynapseML simplifies the process of implementing natural language processing tasks such as sentiment analysis, text classification, and language modeling by providing prebuilt models and tools. | ||
- And more coming soon | ||
|
||
Add the path of this script to `Init Scripts` section when configuring the spark cluster. | ||
Restarting the cluster will automatically install horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0. | ||
### Why Horovod | ||
Horovod is a distributed deep learning framework developed by Uber, which has become popular for its ability to scale | ||
deep learning tasks across multiple GPUs and compute nodes efficiently. It's designed to work with TensorFlow, Keras, PyTorch, and Apache MXNet. | ||
- Scalability: Horovod uses efficient communication algorithms like ring-allreduce and hierarchical allreduce, which allow it to scale the training process across multiple GPUs and nodes without significant performance degradation. | ||
- Easy Integration: Horovod can be easily integrated into existing deep learning codebases with minimal changes, making it a popular choice for distributed training. | ||
- Fault Tolerance: Horovod provides fault tolerance features like elastic training. It can dynamically adapt to changes in the number of workers or recover from failures. | ||
- Community Support: Horovod has an active community and is widely used in the industry, which ensures that the framework is continually updated and improved. | ||
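To make the ring-allreduce idea mentioned in the Scalability bullet concrete, here is a minimal single-process simulation in plain Python. It is only an illustrative sketch and not Horovod's implementation: the function name `ring_allreduce` and the list-of-lists data layout are assumptions made for this example. Each simulated worker passes one chunk to its right-hand neighbour per step, first to accumulate partial sums and then to circulate the finished chunks.

```python
def ring_allreduce(worker_data):
    """Simulate a sum ring-allreduce over equal-length lists, one per worker.

    Hypothetical helper for illustration; real frameworks exchange tensors
    over the network instead of copying Python lists.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]
    bufs = [list(values) for values in worker_data]

    def chunk(c):
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        messages = [(i, i - step, [bufs[i][k] for k in chunk(i - step)])
                    for i in range(n)]
        for sender, c, payload in messages:
            receiver = (sender + 1) % n
            for k, value in zip(chunk(c), payload):
                bufs[receiver][k] += value

    # Phase 2 (allgather): circulate the finished chunks around the ring,
    # overwriting stale partial sums on each receiver.
    for step in range(n - 1):
        messages = [(i, i + 1 - step, [bufs[i][k] for k in chunk(i + 1 - step)])
                    for i in range(n)]
        for sender, c, payload in messages:
            receiver = (sender + 1) % n
            for k, value in zip(chunk(c), payload):
                bufs[receiver][k] = value

    return bufs


if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]))
```

In this scheme every worker sends and receives 2 * (n - 1) chunks regardless of how many workers participate, which is why the per-worker traffic stays roughly constant as the ring grows.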
|
||
## 2. Install SynapseML Deep Learning Component | ||
|
||
You could install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification. | ||
Run the following command: | ||
```powershell | ||
pip install https://mmlspark.blob.core.windows.net/pip/$SYNAPSEML_SCALA_VERSION/synapseml_deep_learning-$SYNAPSEML_PYTHON_VERSION-py2.py3-none-any.whl | ||
``` | ||
### Why PyTorch Lightning | ||
PyTorch Lightning is a lightweight wrapper around the popular PyTorch deep learning framework, designed to make it | ||
easier to write clean, modular, and scalable deep learning code. PyTorch Lightning has several advantages that | ||
make it an excellent choice for SynapseML's Simple Deep Learning: | ||
- Code Organization: PyTorch Lightning promotes a clean and organized code structure by separating the research code from the engineering code. This property makes it easier to maintain, debug, and share deep learning models. | ||
- Flexibility: PyTorch Lightning retains the flexibility and expressiveness of PyTorch while adding useful abstractions to simplify the training loop and other boilerplate code. | ||
- Built-in Best Practices: PyTorch Lightning incorporates many best practices for deep learning, such as automatic optimization, gradient clipping, and learning rate scheduling, making it easier for users to achieve optimal performance. | ||
- Compatibility: PyTorch Lightning is compatible with a wide range of popular tools and frameworks, including Horovod, which allows users to easily use distributed training capabilities. | ||
- Rapid Development: With PyTorch Lightning, users can quickly experiment with different model architectures and training strategies without worrying about low-level implementation details. | ||
|
||
An alternative is installing the SynapseML jar package in library management section, by adding: | ||
``` | ||
Coordinate: com.microsoft.azure:synapseml_2.12:SYNAPSEML_SCALA_VERSION | ||
Repository: https://mmlspark.azureedge.net/maven | ||
``` | ||
### Sample usage with DeepVisionClassifier | ||
DeepVisionClassifier incorporates all models supported by [torchvision](https://github.com/pytorch/vision). | ||
:::note | ||
If you install the jar package, you need to follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch) | ||
to make horovod recognize our module. | ||
The current version is based on pytorch_lightning v1.5.0 and torchvision v0.12.0 | ||
::: | ||
By providing a Spark DataFrame that contains an 'imageCol' and a 'labelCol', you can call the 'fit' function | ||
on it directly with DeepVisionClassifier. | ||
```python | ||
train_df = spark.createDataFrame([ | ||
("PATH_TO_IMAGE_1.jpg", 1), | ||
("PATH_TO_IMAGE_2.jpg", 2) | ||
], ["image", "label"]) | ||
|
||
## 3. Try our sample notebook | ||
deep_vision_classifier = DeepVisionClassifier( | ||
backbone="resnet50", # Put your backbone here | ||
store=store, # Corresponding store | ||
callbacks=callbacks, # Optional callbacks | ||
num_classes=17, | ||
batch_size=16, | ||
epochs=epochs, | ||
validation=0.1, | ||
) | ||
|
||
You could follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and have a try on your own dataset. | ||
deep_vision_model = deep_vision_classifier.fit(train_df) | ||
``` | ||
DeepVisionClassifier runs distributed training on Spark with Horovod under the hood. After fitting, it returns | ||
a DeepVisionModel, which you can use for inference directly: | ||
```python | ||
pred_df = deep_vision_model.transform(test_df) | ||
``` | ||
|
||
Supported models (`backbone` parameter for `DeepVisionClassifier`) should be the string name of a [Torchvision-supported model](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py); | ||
you can also check by running `backbone in torchvision.models.__dict__`. | ||
## Examples | ||
- [DeepLearning - Deep Vision Classification](../DeepLearning%20-%20Deep%20Vision%20Classification) | ||
- [DeepLearning - Deep Text Classification](../DeepLearning%20-%20Deep%20Text%20Classification) |
website/docs/features/simple_deep_learning/installation.md: 42 changes (42 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
--- | ||
title: Installation Guidance | ||
sidebar_label: Installation Guidance for Deep Vision Classification | ||
--- | ||
|
||
:::note | ||
This is a sample with databricks 10.4.x-gpu-ml-scala2.12 runtime | ||
::: | ||
|
||
## 1. Reinstall horovod using our prepared script | ||
|
||
We build on top of torchvision, horovod and pytorch_lightning, so we need to reinstall horovod by building on specific versions of those packages. | ||
Download our [horovod installation script](https://mmlspark.blob.core.windows.net/publicwasb/horovod_installation.sh) and upload | ||
it to databricks dbfs. | ||
|
||
Add the path of this script to `Init Scripts` section when configuring the spark cluster. | ||
Restarting the cluster automatically installs horovod v0.25.0 with pytorch_lightning v1.5.0 and torchvision v0.12.0. | ||
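One way to confirm that the init script took effect is to compare the installed package versions against the ones listed above. The following is a minimal sketch and not part of SynapseML: the helper name `find_version_mismatches` is hypothetical, and the distribution names in `EXPECTED_VERSIONS` are assumptions about how the packages are published.

```python
from importlib import metadata

# Versions quoted in the installation notes above. The distribution names
# are assumptions about how each package is registered with pip.
EXPECTED_VERSIONS = {
    "horovod": "0.25.0",
    "pytorch-lightning": "1.5.0",
    "torchvision": "0.12.0",
}


def find_version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu113".
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{package}: expected {wanted}, found {installed}")
```

Running this in a notebook cell after the cluster restarts prints nothing when all three versions match.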
|
||
## 2. Install SynapseML Deep Learning Component | ||
|
||
You could install the single synapseml-deep-learning wheel package to get the full functionality of deep vision classification. | ||
Run the following command: | ||
```powershell | ||
pip install synapseml==0.11.0 | ||
``` | ||
|
||
An alternative is installing the SynapseML jar package in library management section, by adding: | ||
``` | ||
Coordinate: com.microsoft.azure:synapseml_2.12:0.11.0 | ||
Repository: https://mmlspark.azureedge.net/maven | ||
``` | ||
:::note | ||
If you install the jar package, follow the first two cells of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md/#environment-setup----reinstall-horovod-based-on-new-version-of-pytorch) | ||
to ensure horovod recognizes SynapseML. | ||
::: | ||
|
||
## 3. Try our sample notebook | ||
|
||
You could follow the rest of this [sample](./DeepLearning%20-%20Deep%20Vision%20Classification.md) and have a try on your own dataset. | ||
|
||
Supported models (`backbone` parameter for `DeepVisionClassifier`) should be the string name of a [Torchvision-supported model](https://github.com/pytorch/vision/blob/v0.12.0/torchvision/models/__init__.py); | ||
you can also check by running `backbone in torchvision.models.__dict__`. |
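The membership check above can be wrapped in a small helper that validates a backbone name before a training job starts. This is a hedged sketch: `is_supported_backbone` and `list_backbones` are hypothetical names, not SynapseML or torchvision APIs. The helper is slightly stricter than a plain `in` test because it also requires the entry to be callable, on the assumption that model builders are exposed as functions while submodules are not.

```python
def is_supported_backbone(name, namespace):
    """Return True if `name` maps to a public callable in `namespace`."""
    if not isinstance(name, str) or name.startswith("_"):
        return False
    return callable(namespace.get(name))


def list_backbones(namespace):
    """List lowercase public callables, a heuristic for model builder names."""
    return sorted(
        key
        for key, value in namespace.items()
        if callable(value) and key.islower() and not key.startswith("_")
    )


# Usage on a cluster where torchvision is installed:
#   import torchvision
#   is_supported_backbone("resnet50", torchvision.models.__dict__)
```

Passing the namespace as an argument keeps the helper independent of torchvision, so it can be exercised with a plain dictionary.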
...versioned_docs/version-0.11.0/features/other/AzureSearchIndex - Met Artworks.md: 108 changes (108 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
--- | ||
title: AzureSearchIndex - Met Artworks | ||
hide_title: true | ||
status: stable | ||
--- | ||
<h1>Creating a searchable Art Database with The MET's open-access collection</h1> | ||
|
||
In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using SynapseML. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index. | ||
|
||
|
||
```python | ||
import os, sys, time, json, requests | ||
from pyspark.ml import Transformer, Estimator, Pipeline | ||
from pyspark.ml.feature import SQLTransformer | ||
from pyspark.sql.functions import lit, udf, col, split | ||
``` | ||
|
||
|
||
```python | ||
from pyspark.sql import SparkSession | ||
|
||
# Bootstrap Spark Session | ||
spark = SparkSession.builder.getOrCreate() | ||
|
||
from synapse.ml.core.platform import running_on_synapse, find_secret | ||
|
||
if running_on_synapse(): | ||
from notebookutils.visualization import display | ||
``` | ||
|
||
|
||
```python | ||
cognitive_key = find_secret("cognitive-api-key") | ||
cognitive_loc = "eastus" | ||
azure_search_key = find_secret("azure-search-key") | ||
search_service = "mmlspark-azure-search" | ||
search_index = "test" | ||
``` | ||
|
||
|
||
```python | ||
data = ( | ||
spark.read.format("csv") | ||
.option("header", True) | ||
.load("wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv") | ||
.withColumn("searchAction", lit("upload")) | ||
.withColumn("Neighbors", split(col("Neighbors"), ",").cast("array<string>")) | ||
.withColumn("Tags", split(col("Tags"), ",").cast("array<string>")) | ||
.limit(25) | ||
) | ||
``` | ||
|
||
<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png" width="800" /> | ||
|
||
|
||
```python | ||
from synapse.ml.cognitive import AnalyzeImage | ||
from synapse.ml.stages import SelectColumns | ||
|
||
# define pipeline | ||
describeImage = ( | ||
AnalyzeImage() | ||
.setSubscriptionKey(cognitive_key) | ||
.setLocation(cognitive_loc) | ||
.setImageUrlCol("PrimaryImageUrl") | ||
.setOutputCol("RawImageDescription") | ||
.setErrorCol("Errors") | ||
.setVisualFeatures( | ||
["Categories", "Description", "Faces", "ImageType", "Color", "Adult"] | ||
) | ||
.setConcurrency(5) | ||
) | ||
|
||
df2 = ( | ||
describeImage.transform(data) | ||
.select("*", "RawImageDescription.*") | ||
.drop("Errors", "RawImageDescription") | ||
) | ||
``` | ||
|
||
<img src="https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png" width="800" /> | ||
|
||
Before writing the results to a Search Index, you must define a schema that specifies the name, type, and attributes of each field in your index. Refer to [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information. | ||
|
||
|
||
```python | ||
from synapse.ml.cognitive import * | ||
|
||
df2.writeToAzureSearch( | ||
subscriptionKey=azure_search_key, | ||
actionCol="searchAction", | ||
serviceName=search_service, | ||
indexName=search_index, | ||
keyCol="ObjectID", | ||
) | ||
``` | ||
|
||
The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying, refer to [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents). | ||
|
||
|
||
```python | ||
url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06".format( | ||
search_service, search_index | ||
) | ||
requests.post( | ||
url, json={"search": "Glass"}, headers={"api-key": azure_search_key} | ||
).json() | ||
``` |
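The query above can be factored into a small helper that keeps URL construction separate from the network call, which makes it easier to reuse and to inspect before sending. This is a sketch under stated assumptions: `build_search_request` is a hypothetical name, and the optional `top` argument relies on the `top` field of the Search Documents POST body to limit the number of results.

```python
from urllib.parse import quote


def build_search_request(service, index, query, api_key,
                         api_version="2019-05-06", top=None):
    """Return (url, json_body, headers) for a Search Documents POST request."""
    url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version={}".format(
        service, quote(index, safe=""), api_version
    )
    body = {"search": query}
    if top is not None:
        # Limit the number of returned documents.
        body["top"] = top
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, body, headers


# Usage with the variables defined earlier in the notebook:
#   url, body, headers = build_search_request(
#       search_service, search_index, "Glass", azure_search_key, top=5
#   )
#   requests.post(url, json=body, headers=headers).json()
```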