50 changes: 37 additions & 13 deletions docs/set_env_for_training_data_and_reference_doc.md
@@ -6,23 +6,47 @@ Folders [document_training](../data/document_training/) and [field_extraction_pr
2. *Install Azure Storage Explorer:* Azure Storage Explorer is a tool that makes it easy to work with Azure Storage data. Install it and log in with your credentials by following the [guide](https://aka.ms/download-and-install-Azure-Storage-Explorer).
3. *Create or Choose a Blob Container:* Create a blob container from Azure Storage Explorer or use an existing one.
<img src="./create-blob-container.png" width="600" />
4. *Generate a Shared Access Signature (SAS) URL:*
- Right-click the blob container and select `Get Shared Access Signature...` from the menu.
- Check the required permissions: `Read`, `Write`, and `List`
- Click the `Create` button.
<img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
5. *Copy the SAS URL:* After creating the SAS, click `Copy` to get the URL with the token. This will be used as the value for **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
<img src="./copy-access-signature.png" width="600" />
6. *Set Environment Variables in ".env" File:* Depending on which sample you run, you will need to set the required environment variables in [.env](../notebooks/.env).
> NOTE: **REFERENCE_DOC_SAS_URL** can be the same as **TRAINING_DATA_SAS_URL** to reuse the same blob container
- [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as the value of **TRAINING_DATA_SAS_URL**, and a prefix for **TRAINING_DATA_PATH**. You can choose any folder name you like for **TRAINING_DATA_PATH**. For example, you could use "training_files".
4. *Set SAS URL Related Environment Variables in ".env" File:* Depending on which sample you run, you will need to set the required environment variables in [.env](../notebooks/.env). There are two options for setting up the environment variables that supply the required Shared Access Signature (SAS) URL.
- Option A - Generate a SAS URL manually in Azure Storage Explorer
- Right-click the blob container and select `Get Shared Access Signature...` from the menu.
- Check the required permissions: `Read`, `Write`, and `List`
- `Write` is needed for uploading, modifying, or appending blobs
- Click the `Create` button.
<img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
- *Copy the SAS URL:* After creating the SAS, click `Copy` to get the URL with the token. This will be used as the value for **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
<img src="./copy-access-signature.png" width="600" />

- Set the following in [.env](../notebooks/.env).
> NOTE: **REFERENCE_DOC_SAS_URL** can be the same as **TRAINING_DATA_SAS_URL** to reuse the same blob container
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as the value of **TRAINING_DATA_SAS_URL**.
```env
TRAINING_DATA_SAS_URL=<Blob container SAS URL>
```
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as the value of **REFERENCE_DOC_SAS_URL**.
```env
REFERENCE_DOC_SAS_URL=<Blob container SAS URL>
```
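- Optionally, verify the copied SAS URL before running a sample. This is a minimal sketch, assuming the `azure-storage-blob` package is installed and `<Blob container SAS URL>` is replaced with the URL you copied; listing blobs without an error confirms the `Read` and `List` permissions:
```python
# Quick sanity check for a manually generated container SAS URL.
# Assumes: pip install azure-storage-blob
from azure.storage.blob import ContainerClient

sas_url = "<Blob container SAS URL>"  # paste the URL copied from Azure Storage Explorer
container = ContainerClient.from_container_url(sas_url)
for blob in container.list_blobs():
    print(blob.name)  # completing without an error confirms Read/List access
```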
- Option B - Auto-generate the SAS URL via code in the sample notebooks
- Instead of manually creating a SAS URL, you can set the storage account and container information and let the notebook generate a temporary SAS URL at runtime (see the sketch at the end of this option).
> NOTE: **TRAINING_DATA_STORAGE_ACCOUNT_NAME** and **TRAINING_DATA_CONTAINER_NAME** can be the same as **REFERENCE_DOC_STORAGE_ACCOUNT_NAME** and **REFERENCE_DOC_CONTAINER_NAME** to reuse the same blob container
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the storage account name as `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `TRAINING_DATA_CONTAINER_NAME`.
```env
TRAINING_DATA_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
TRAINING_DATA_CONTAINER_NAME=<your-container-name>
```
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the storage account name as `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `REFERENCE_DOC_CONTAINER_NAME`.
```env
REFERENCE_DOC_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
REFERENCE_DOC_CONTAINER_NAME=<your-container-name>
```
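- With Option B, the sample notebooks call a helper (`AzureContentUnderstandingClient.generate_temp_container_sas_url`) to mint a short-lived SAS URL at runtime. The following is a rough sketch of what such a helper might do, not the helper's actual implementation; it assumes the `azure-identity` and `azure-storage-blob` packages and an identity holding a data-plane role (for example, Storage Blob Data Contributor) on the account:
```python
# Sketch: generate a temporary container SAS URL via a user delegation key.
# Assumes: pip install azure-identity azure-storage-blob
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)


def generate_temp_container_sas_url(account_name: str, container_name: str, expiry_hours: int = 1) -> str:
    account_url = f"https://{account_name}.blob.core.windows.net"
    service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
    start = datetime.now(timezone.utc)
    expiry = start + timedelta(hours=expiry_hours)
    # A user delegation key signs the SAS with Microsoft Entra credentials
    # instead of the storage account key.
    delegation_key = service.get_user_delegation_key(start, expiry)
    sas_token = generate_container_sas(
        account_name=account_name,
        container_name=container_name,
        user_delegation_key=delegation_key,
        permission=ContainerSasPermissions(read=True, write=True, list=True),
        expiry=expiry,
    )
    return f"{account_url}/{container_name}?{sas_token}"
```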

5. *Set Folder Prefix in ".env" File:* Depending on which sample you run, you will need to set the required environment variables in [.env](../notebooks/.env).
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add a folder prefix as the value of **TRAINING_DATA_PATH**. You can choose any folder name you like; for example, "training_files".
```env
TRAINING_DATA_SAS_URL=<Blob container SAS URL>
TRAINING_DATA_PATH=<Designated folder path under the blob container>
```
- [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as the value of **REFERENCE_DOC_SAS_URL**, and a prefix for **REFERENCE_DOC_PATH**. You can choose any folder name you like for **REFERENCE_DOC_PATH**. For example, you could use "reference_docs".
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add a folder prefix as the value of **REFERENCE_DOC_PATH**. You can choose any folder name you like; for example, "reference_docs".
```env
REFERENCE_DOC_SAS_URL=<Blob container SAS URL>
REFERENCE_DOC_PATH=<Designated folder path under the blob container>
```
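Once these variables are in place, the notebooks read them at runtime. Below is a minimal sketch of how the values are typically consumed, assuming the `python-dotenv` package; the variable names match the options above, and the fallback logic mirrors the sample notebooks:
```python
# Sketch: loading the .env values used by the samples.
# Assumes: pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

training_data_sas_url = os.getenv("TRAINING_DATA_SAS_URL")  # Option A
storage_account_name = os.getenv("TRAINING_DATA_STORAGE_ACCOUNT_NAME")  # Option B
container_name = os.getenv("TRAINING_DATA_CONTAINER_NAME")  # Option B
training_data_path = os.getenv("TRAINING_DATA_PATH")  # folder prefix from step 5

if not training_data_sas_url and not (storage_account_name and container_name):
    raise ValueError("Set TRAINING_DATA_SAS_URL, or both the storage account and container names.")
```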

52 changes: 34 additions & 18 deletions notebooks/analyzer_training.ipynb
@@ -23,12 +23,11 @@
"\n",
"## Prerequisites\n",
"1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
"1. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related env variables `TRAINING_DATA_SAS_URL` and `TRAINING_DATA_PATH` into the [.env](./.env) file.\n",
" - `TRAINING_DATA_SAS_URL`: SAS URL for your Azure Blob container. \n",
" - `TRAINING_DATA_PATH`: Folder path within the container to upload training data. \n",
"1. Install packages needed to run the sample\n",
"\n",
"\n"
"2. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related environment variables into the [.env](./.env) file.\n",
" - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
" - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n",
" - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where training data will be uploaded.\n",
"3. Install packages needed to run the sample\n"
]
},
{
@@ -119,11 +118,12 @@
"metadata": {},
"source": [
"## Prepare labeled data\n",
"In this step, we will \n",
"- Check whether document files in local folder have corresponding `.labels.json` and `.result.json` files\n",
"- Upload these files to the designated Azure blob storage.\n",
"\n",
"We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** that's set in the Prerequisites step."
"In this step, we will\n",
"- Use `TRAINING_DATA_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n",
"- Try to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
"If this is not set, we attempt to generate the SAS URL automatically using the environment variables `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`.\n",
"- Verify that document files in the local folder have corresponding `.labels.json` and `.result.json` files\n",
"- Upload these files to the Azure Blob storage container specified by the environment variables."
]
},
{
@@ -132,10 +132,26 @@
"metadata": {},
"outputs": [],
"source": [
"TRAINING_DATA_SAS_URL = os.getenv(\"TRAINING_DATA_SAS_URL\")\n",
"TRAINING_DATA_PATH = os.getenv(\"TRAINING_DATA_PATH\")\n",
"\n",
"await client.generate_training_data_on_blob(training_docs_folder, TRAINING_DATA_SAS_URL, TRAINING_DATA_PATH)"
"training_data_sas_url = os.getenv(\"TRAINING_DATA_SAS_URL\")\n",
"if not training_data_sas_url:\n",
" TRAINING_DATA_STORAGE_ACCOUNT_NAME = os.getenv(\"TRAINING_DATA_STORAGE_ACCOUNT_NAME\")\n",
" TRAINING_DATA_CONTAINER_NAME = os.getenv(\"TRAINING_DATA_CONTAINER_NAME\")\n",
" if not TRAINING_DATA_STORAGE_ACCOUNT_NAME and not training_data_sas_url:\n",
" raise ValueError(\n",
" \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
" )\n",
" from azure.storage.blob import ContainerSasPermissions\n",
" # We will need \"Write\" for uploading, modifying, or appending blobs\n",
" training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
" account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,\n",
" container_name=TRAINING_DATA_CONTAINER_NAME,\n",
" permissions=ContainerSasPermissions(read=True, write=True, list=True),\n",
" expiry_hours=1,\n",
" )\n",
"\n",
"training_data_path = os.getenv(\"TRAINING_DATA_PATH\")\n",
"\n",
"await client.generate_training_data_on_blob(training_docs_folder, training_data_sas_url, training_data_path)"
]
},
{
@@ -145,7 +161,7 @@
"## Create analyzer with defined schema\n",
"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
"\n",
"We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** that's set up in the [.env](./.env) file and used in the previous step."
"We use **training_data_sas_url** and **training_data_path** that's set up in the [.env](./.env) file and used in the previous step."
]
},
{
@@ -160,8 +176,8 @@
"response = client.begin_create_analyzer(\n",
" CUSTOM_ANALYZER_ID,\n",
" analyzer_template_path=analyzer_template,\n",
" training_storage_container_sas_url=TRAINING_DATA_SAS_URL,\n",
" training_storage_container_path_prefix=TRAINING_DATA_PATH,\n",
" training_storage_container_sas_url=training_data_sas_url,\n",
" training_storage_container_path_prefix=training_data_path,\n",
")\n",
"result = client.poll_result(response)\n",
"if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
50 changes: 32 additions & 18 deletions notebooks/field_extraction_pro_mode.ipynb
@@ -28,9 +28,10 @@
"source": [
"## Prerequisites\n",
"1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
"1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.\n",
" - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.\n",
" - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs.\n",
"1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up reference document related environment variables in the [.env](./.env) file.\n",
" - You can either set `REFERENCE_DOC_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
" - Or set both `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n",
" - Also set `REFERENCE_DOC_PATH` to specify the folder path within the container where reference documents will be uploaded.\n",
" > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.\n",
"1. Install the required packages to run the sample."
]
@@ -157,12 +158,12 @@
"source": [
"## Prepare reference data\n",
"In this step, we will \n",
"- Use `REFERENCE_DOC_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n",
"- Try to get the SAS URL from the environment variable `REFERENCE_DOC_SAS_URL`.\n",
"If this is not set, we attempt to generate the SAS URL automatically using the environment variables `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`.\n",
"- Use Azure AI service to Extract OCR results from reference documents (if needed).\n",
"- Generate a reference `.jsonl` file.\n",
"- Upload these files to the designated Azure blob storage.\n",
"\n",
"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the Prerequisites step.\n",
"\n"
"- Upload these files to the designated Azure blob storage.\n"
]
},
{
@@ -172,8 +173,8 @@
"outputs": [],
"source": [
"# Load reference storage configuration from environment\n",
"REFERENCE_DOC_SAS_URL = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n",
"REFERENCE_DOC_PATH = os.getenv(\"REFERENCE_DOC_PATH\")"
"reference_doc_path = os.getenv(\"REFERENCE_DOC_PATH\")\n",
"\n",
"reference_doc_sas_url = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n",
"if not reference_doc_sas_url:\n",
" REFERENCE_DOC_STORAGE_ACCOUNT_NAME = os.getenv(\"REFERENCE_DOC_STORAGE_ACCOUNT_NAME\")\n",
" REFERENCE_DOC_CONTAINER_NAME = os.getenv(\"REFERENCE_DOC_CONTAINER_NAME\")\n",
" if REFERENCE_DOC_STORAGE_ACCOUNT_NAME and REFERENCE_DOC_CONTAINER_NAME:\n",
" from azure.storage.blob import ContainerSasPermissions\n",
" # We will need \"Write\" for uploading, modifying, or appending blobs\n",
" reference_doc_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
" account_name=REFERENCE_DOC_STORAGE_ACCOUNT_NAME,\n",
" container_name=REFERENCE_DOC_CONTAINER_NAME,\n",
" permissions=ContainerSasPermissions(read=True, write=True, list=True),\n",
" expiry_hours=1,\n",
" )"
]
},
{
@@ -193,7 +207,7 @@
"# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix \".result.json\"\n",
"# For example, if the original document is \"invoice.pdf\", the OCR result file should be named \"invoice.pdf.result.json\"\n",
"# NOTE: Please comment out the follwing line if you don't have any reference documents.\n",
"await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)"
"await client.generate_knowledge_base_on_blob(reference_docs, reference_doc_sas_url, reference_doc_path, skip_analyze=False)"
]
},
{
@@ -203,7 +217,7 @@
"## Create analyzer with defined schema for Pro mode\n",
"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
"\n",
"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in the [.env](./.env) file and used in the previous step."
"We use **reference_doc_sas_url** and **reference_doc_path** that's set up in the [.env](./.env) file and used in the previous step."
]
},
{
@@ -218,8 +232,8 @@
"response = client.begin_create_analyzer(\n",
" CUSTOM_ANALYZER_ID,\n",
" analyzer_template_path=analyzer_template,\n",
" pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,\n",
" pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,\n",
" pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n",
" pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path,\n",
")\n",
"result = client.poll_result(response)\n",
"if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
@@ -332,8 +346,7 @@
"reference_docs_2 = \"../data/field_extraction_pro_mode/insurance_claims_review/reference_docs\"\n",
"\n",
"# Load reference storage configuration from environment\n",
"REFERENCE_DOC_SAS_URL_2 = os.getenv(\"REFERENCE_DOC_SAS_URL\") # Reuse the same blob container\n",
"REFERENCE_DOC_PATH_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\" # NOTE: Use a different path for the second sample\n",
"reference_doc_path_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\" # NOTE: Use a different path for the second sample\n",
"CUSTOM_ANALYZER_ID_2 = \"pro-mode-sample-\" + str(uuid.uuid4())"
]
},
@@ -352,7 +365,8 @@
"outputs": [],
"source": [
"logging.info(\"Start generating knowledge base for the second sample...\")\n",
"await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)"
"# Reuse the same blob container\n",
"await client.generate_knowledge_base_on_blob(reference_docs_2, reference_doc_sas_url, reference_doc_path_2, skip_analyze=True)"
]
},
{
@@ -372,8 +386,8 @@
"response = client.begin_create_analyzer(\n",
" CUSTOM_ANALYZER_ID_2,\n",
" analyzer_template_path=analyzer_template_2,\n",
" pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,\n",
" pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,\n",
" pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n",
" pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path_2,\n",
")\n",
"result = client.poll_result(response)\n",
"if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",