From e1b52687b934a71a1682134ac12fd3775da385f0 Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Thu, 19 Dec 2024 17:39:54 -0800 Subject: [PATCH 1/3] Platform UI: Embed and chunk recommender --- mint.json | 1 + platform/chunking.mdx | 6 ++++ platform/embedding.mdx | 6 ++++ platform/recommend.mdx | 72 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 85 insertions(+) create mode 100644 platform/recommend.mdx diff --git a/mint.json b/mint.json index 0e633107..6e496c8d 100644 --- a/mint.json +++ b/mint.json @@ -508,6 +508,7 @@ }, "platform/workflows", "platform/jobs", + "platform/recommend", { "group": "API", "pages": [ diff --git a/platform/chunking.mdx b/platform/chunking.mdx index 868e29d7..08c3a7f6 100644 --- a/platform/chunking.mdx +++ b/platform/chunking.mdx @@ -2,6 +2,12 @@ title: Chunking --- + + To get help choosing a chunking strategy and settings for your + [Custom](/platform/workflows#create-a-custom-workflow) workflows, + [request a chunk recommendation from Unstructured](/platform/recommend). + + After partitioning, _chunking_ rearranges the resulting document elements into manageable "chunks" to stay within the limits of an embedding model and to improve retrieval precision. The goal is to retrieve only parts of documents that contain only the information that is relevant to a user's query. You can specify if and how Unstructured chunks diff --git a/platform/embedding.mdx b/platform/embedding.mdx index 1792f8b6..92b64158 100644 --- a/platform/embedding.mdx +++ b/platform/embedding.mdx @@ -2,6 +2,12 @@ title: Embedding --- + + To get help choosing and setting an embedding provider and model for your + [Custom](/platform/workflows#create-a-custom-workflow) workflows, + [request an embed recommendation from Unstructured](/platform/recommend). + + After partitioning, chunking, and summarizing, the _embedding_ step creates arrays of numbers known as _vectors_, representing the text that is extracted by Unstructured. 
These vectors are stored or _embedded_ next to the text itself. These vector embeddings are generated by an diff --git a/platform/recommend.mdx b/platform/recommend.mdx new file mode 100644 index 00000000..8777b3b0 --- /dev/null +++ b/platform/recommend.mdx @@ -0,0 +1,72 @@ +--- +title: Recommend +--- + + + This feature works only with **Build it with me > Custom** and **Build it myself** [workflows](/platform/workflows). + + This feature is _not_ supported for **Basic**, **Advanced**, and **Platinum** [workflows](/platform/workflows). + + +The Unstructured Platform can offer you a recommendation for an ideal [embedding provider and model](/platform/embedding) +and [chunking strategy and settings](/platform/chunking) for your source files and data documents and records. +These recommendations are optimized to work well across a variety of +vector stores, RAG applications, and model fine-tuning scenarios. + +Unstructured's embedding provider and model and chunking strategy and settings recommendations are especially useful if you are not familiar with how the +various embedding providers and models and chunking strategies and settings can be applied for optimal results. If you are comfortable with embedding and chunking, +you can use these recommendations to help inform your current strategies. + +Unstructured makes its recommendations by using the related [source connector](/platform/sources/overview) to access, process, and analyze a +sampling of files or data documents or records from the source location, depending on the connector's type. Unstructured then recommends +an embedding provider and model and chunking strategy and settings based on this analysis. + +import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx'; + + + Performing a recommendation will result in billing to your Unstructured account. 
To make its recommendation, Unstructured + must process and analyze a sampling of up to 50 files or 50 non-file data documents or records from the source location, depending on the source connector's type. + Your Unstructured account is billed for the equivalent number of pages. + + + + +## Request a recommendation + +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click the name of the source connector that you want to use. If you do not have a source connector, + [create one](/platform/sources/overview). +4. If you're requesting a recomendation for the first time for this connector, click the **Run Recommender** button. + + If you have previously requested a recommendation for this connector, you can make another request by clicking the **Run Again** button. + This can be useful if you significantly changed the files or non-file data documents or records in the source location since you previously + requested a recommendation. + + + If the **Run Recommender** or **Run Again** button is not visible, or if they are visible but not enabled, there is likely something wrong + with your source connector. + + To fix this, try clicking the edit (pencil) icon, + make any necessary changes to the connector's previous settings, and then click **Save and Test**. + Keep repeating this step as needed until a **Successful** icon appears. When this icon appears, the + **Run Recommender** or **Run Again** button should be visible and enabled. + + +5. Two **Scheduled** statuses appear, one for **Embed** and another for **Chunk**. +6. After several minutes, the **Scheduled** statuses are replaced by **Running**. +7. After several more minutes, the **Running** statuses are replaced by **Finished**. +8. To view the recommendation, click **View**. 
+ +The **Auto Recommender Results** pane shows Unstructured's recommended embedding provider and model and chunking strategy and settings for your source +files or non-file data documents or records, depending on the source connector's type. + +## Implement an embed recommendation + +1. In the **Auto Recommender Results** pane, in the **Embed Recommendation** area, note the recommended embedding provider and model. +2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions. + +## Implement a chunking recommendation + +1. In the **Auto Recommender Results** pane, in the **Chunk Recommendation** area, note the recommended chunking strategy and settings. +2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions. \ No newline at end of file From a8eede1acce8776c3f1cece168d03d70d8ab112b Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Thu, 19 Dec 2024 17:45:01 -0800 Subject: [PATCH 2/3] Minor edit: non-file data --- platform/recommend.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/platform/recommend.mdx b/platform/recommend.mdx index 8777b3b0..1071ca0e 100644 --- a/platform/recommend.mdx +++ b/platform/recommend.mdx @@ -9,7 +9,7 @@ title: Recommend The Unstructured Platform can offer you a recommendation for an ideal [embedding provider and model](/platform/embedding) -and [chunking strategy and settings](/platform/chunking) for your source files and data documents and records. +and [chunking strategy and settings](/platform/chunking) for your source files and non-file data documents and records. These recommendations are optimized to work well across a variety of vector stores, RAG applications, and model fine-tuning scenarios. @@ -18,7 +18,7 @@ various embedding providers and models and chunking strategies and settings can you can use these recommendations to help inform your current strategies. 
Unstructured makes its recommendations by using the related [source connector](/platform/sources/overview) to access, process, and analyze a -sampling of files or data documents or records from the source location, depending on the connector's type. Unstructured then recommends +sampling of files or non-file data documents or records from the source location, depending on the connector's type. Unstructured then recommends an embedding provider and model and chunking strategy and settings based on this analysis. import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx'; From 8c378703b3a896069b742c9cc58e6e89bc70086f Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Fri, 20 Dec 2024 08:39:53 -0800 Subject: [PATCH 3/3] Auto recommender works with fsspec source connector types only --- platform/recommend.mdx | 62 ++++++++++++++++++++++-------------------- 1 file changed, 32 insertions(+), 30 deletions(-) diff --git a/platform/recommend.mdx b/platform/recommend.mdx index 1071ca0e..6ebb950d 100644 --- a/platform/recommend.mdx +++ b/platform/recommend.mdx @@ -2,30 +2,35 @@ title: Recommend --- - - This feature works only with **Build it with me > Custom** and **Build it myself** [workflows](/platform/workflows). - - This feature is _not_ supported for **Basic**, **Advanced**, and **Platinum** [workflows](/platform/workflows). - - -The Unstructured Platform can offer you a recommendation for an ideal [embedding provider and model](/platform/embedding) -and [chunking strategy and settings](/platform/chunking) for your source files and non-file data documents and records. +The Unstructured Platform can offer you recommendations for an ideal [embedding provider and model](/platform/embedding) +and [chunking strategy and settings](/platform/chunking) for your source files. These recommendations are optimized to work well across a variety of vector stores, RAG applications, and model fine-tuning scenarios. 
-Unstructured's embedding provider and model and chunking strategy and settings recommendations are especially useful if you are not familiar with how the -various embedding providers and models and chunking strategies and settings can be applied for optimal results. If you are comfortable with embedding and chunking, -you can use these recommendations to help inform your current strategies. +Unstructured's embedding and chunking recommendations are especially useful if you are not familiar with how the +various embedding and chunking strategies and settings can be applied for optimal results. However, if you are already comfortable with embedding and chunking, +these recommendations can still be useful in helping inform your current strategies. + +Unstructured's recommendations can be implemented only in **Build it with me > Custom** and **Build it myself** [workflows](/platform/workflows). +You cannot implement these recommendations in **Build it with me** > **Basic**, **Advanced**, and **Platinum** workflows, as those workflow types already have +preset embedding and chunking settings that cannot be changed. + +Unstructured makes its recommendations by using the specified [source connector](/platform/sources/overview) to access, process, and analyze a +sampling of files from the source location. Unstructured then recommends +an embedding provider and model and a chunking strategy and settings based on this analysis. + +Unstructured's embedding and chunking recommendations can be requested for the following file-based source connector types: -Unstructured makes its recommendations by using the related [source connector](/platform/sources/overview) to access, process, and analyze a -sampling of files or non-file data documents or records from the source location, depending on the connector's type. Unstructured then recommends -an embedding provider and model and chunking strategy and settings based on this analysis. 
+- [Azure](/platform/sources/azure-blob-storage) +- [Dropbox](/platform/sources/dropbox) +- [Google Cloud Storage](/platform/sources/google-cloud) +- [S3](/platform/sources/s3) import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx'; Performing a recommendation will result in billing to your Unstructured account. To make its recommendation, Unstructured - must process and analyze a sampling of up to 50 files or 50 non-file data documents or records from the source location, depending on the source connector's type. + must process and analyze a sampling of up to 50 files from the source location. Your Unstructured account is billed for the equivalent number of pages. @@ -33,40 +38,37 @@ import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx' ## Request a recommendation -1. On the sidebar, click **Connectors**. +1. In the Unstructured Platform, on the sidebar, click **Connectors**. 2. Click **Sources**. 3. Click the name of the source connector that you want to use. If you do not have a source connector, [create one](/platform/sources/overview). 4. If you're requesting a recommendation for the first time for this connector, click the **Run Recommender** button. If you have previously requested a recommendation for this connector, you can make another request by clicking the **Run Again** button. - This can be useful if you significantly changed the files or non-file data documents or records in the source location since you previously + This is useful if you significantly changed the files in the source location since you previously requested a recommendation. - - If the **Run Recommender** or **Run Again** button is not visible, or if they are visible but not enabled, there is likely something wrong - with your source connector. 
+ If the **Run Recommender** or **Run Again** button is not visible, or if they are visible but not enabled, check for the following: - To fix this, try clicking the edit (pencil) icon, - make any necessary changes to the connector's previous settings, and then click **Save and Test**. - Keep repeating this step as needed until a **Successful** icon appears. When this icon appears, the - **Run Recommender** or **Run Again** button should be visible and enabled. - + - The selected connector must be a file-based source connector. See the preceding list for supported file-based source connector types. + - The selected connector must have successfully passed a connectivity test. If the connector's details pane does not show a + **Successful** icon, then click the pencil icon, make any necessary changes to the connector's previous settings, + and then click **Save and Test**. 5. Two **Scheduled** statuses appear, one for **Embed** and another for **Chunk**. 6. After several minutes, the **Scheduled** statuses are replaced by **Running**. 7. After several more minutes, the **Running** statuses are replaced by **Finished**. -8. To view the recommendation, click **View**. +8. After **Finished** appears, to view the recommendation, click **View**. -The **Auto Recommender Results** pane shows Unstructured's recommended embedding provider and model and chunking strategy and settings for your source -files or non-file data documents or records, depending on the source connector's type. +The **Auto Recommender Results** pane shows Unstructured's recommended embedding provider and model and chunking strategy and +settings for the source files that it analyzed. ## Implement an embed recommendation 1. In the **Auto Recommender Results** pane, in the **Embed Recommendation** area, note the recommended embedding provider and model. -2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions. +2. 
To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions for your target workflow. ## Implement a chunking recommendation 1. In the **Auto Recommender Results** pane, in the **Chunk Recommendation** area, note the recommended chunking strategy and settings. -2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions. \ No newline at end of file +2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions for your target workflow. \ No newline at end of file