[Hold] Platform UI: Embed and chunk recommender #408

Draft
wants to merge 3 commits into
base: main
1 change: 1 addition & 0 deletions mint.json
@@ -508,6 +508,7 @@
},
"platform/workflows",
"platform/jobs",
"platform/recommend",
{
"group": "API",
"pages": [
6 changes: 6 additions & 0 deletions platform/chunking.mdx
@@ -2,6 +2,12 @@
title: Chunking
---

<Tip>
To get help choosing a chunking strategy and settings for your
[Custom](/platform/workflows#create-a-custom-workflow) workflows,
[request a chunk recommendation from Unstructured](/platform/recommend).

Review comment (Contributor), suggested change:
- [request a chunk recommendation from Unstructured](/platform/recommend).
+ [request a chunking configuration recommendation from Unstructured](/platform/recommend).
?

</Tip>

After partitioning, _chunking_ rearranges the resulting document elements into manageable "chunks" to stay within
the limits of an embedding model and to improve retrieval precision. The goal is to retrieve only parts of documents
that contain only the information that is relevant to a user's query. You can specify if and how Unstructured chunks
6 changes: 6 additions & 0 deletions platform/embedding.mdx
@@ -2,6 +2,12 @@
title: Embedding
---

<Tip>
To get help choosing and setting an embedding provider and model for your
[Custom](/platform/workflows#create-a-custom-workflow) workflows,
[request an embed recommendation from Unstructured](/platform/recommend).

Review comment (Contributor), suggested change:
- [request an embed recommendation from Unstructured](/platform/recommend).
+ [request an embedding model recommendation from Unstructured](/platform/recommend).
?

</Tip>

After partitioning, chunking, and summarizing, the _embedding_ step creates arrays of numbers
known as _vectors_, representing the text that is extracted by Unstructured.
These vectors are stored or _embedded_ next to the text itself. These vector embeddings are generated by an
74 changes: 74 additions & 0 deletions platform/recommend.mdx
@@ -0,0 +1,74 @@
---
title: Recommend
---

The Unstructured Platform can offer you recommendations for an ideal [embedding provider and model](/platform/embedding)
and [chunking strategy and settings](/platform/chunking) for your source files.
These recommendations are optimized to work well across a variety of
vector stores, RAG applications, and model fine-tuning scenarios.

Unstructured's embedding and chunking recommendations are especially useful if you are not familiar with how the
various embedding and chunking strategies and settings can be applied for optimal results. If you are already comfortable with embedding and chunking,
these recommendations can still help inform your current strategies.

Unstructured's recommendations can be implemented only in **Build it with me > Custom** and **Build it myself** [workflows](/platform/workflows).
You cannot implement these recommendations in **Build it with me** > **Basic**, **Advanced**, and **Platinum** workflows, as those workflow types already have
preset embedding and chunking settings that cannot be changed.

Unstructured makes its recommendations by using the specified [source connector](/platform/sources/overview) to access, process, and analyze a
sampling of files from the source location. Unstructured then recommends
an embedding provider and model and a chunking strategy and settings based on this analysis.

Unstructured's embedding and chunking recommendations can be requested for the following file-based source connector types:

- [Azure](/platform/sources/azure-blob-storage)
- [Dropbox](/platform/sources/dropbox)
- [Google Cloud Storage](/platform/sources/google-cloud)
- [S3](/platform/sources/s3)

import SharedPagesBilling from '/snippets/general-shared-text/pages-billing.mdx';

<Note>
Requesting a recommendation incurs charges to your Unstructured account. To make its recommendation, Unstructured
must process and analyze a sample of up to 50 files from the source location.
Your Unstructured account is billed for the equivalent number of pages.

<SharedPagesBilling />
</Note>
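
To get a rough sense of the cost before you request a recommendation, you can estimate an upper bound on the number of pages that might be processed. The following Python sketch is illustrative only and is not part of Unstructured. It assumes that you already know the page count of each file in the source location and that billing is a simple sum of pages across the sampled files. Because the way Unstructured selects its sample is not described here, the sketch takes the worst case: the 50 largest files. The function name `worst_case_sampled_pages` is hypothetical.

```python
MAX_SAMPLED_FILES = 50  # "up to 50 files", per the note above


def worst_case_sampled_pages(page_counts: list[int]) -> int:
    """Return an upper bound on the pages processed for one request.

    Assumptions (not stated by Unstructured): each file's page count is
    known in advance, and billing is a plain sum over the sampled files.
    The sampling method is unknown, so the largest files are assumed.
    """
    if any(count < 0 for count in page_counts):
        raise ValueError("page counts must be non-negative")
    largest_first = sorted(page_counts, reverse=True)
    return sum(largest_first[:MAX_SAMPLED_FILES])
```

For example, a source location with three files of 12, 3, and 40 pages gives an upper bound of 55 pages. For what counts as a billable page, rely on the billing text in the preceding note rather than on this estimate.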

## Request a recommendation

1. In the Unstructured Platform, on the sidebar, click **Connectors**.
2. Click **Sources**.
3. Click the name of the source connector that you want to use. If you do not have a source connector,
[create one](/platform/sources/overview).
4. If you're requesting a recommendation for the first time for this connector, click the **Run Recommender** button.

If you have previously requested a recommendation for this connector, you can make another request by clicking the **Run Again** button.
This is useful if you significantly changed the files in the source location since you previously
requested a recommendation.

If the **Run Recommender** or **Run Again** button is not visible, or is visible but not enabled, check the following:

- The selected connector must be a file-based source connector. See the preceding list for supported file-based source connector types.
- The selected connector must have successfully passed a connectivity test. If the connector's details pane does not show a
**Successful** icon, then click the pencil icon, make any necessary changes to the connector's previous settings,
and then click **Save and Test**.

5. Two **Scheduled** statuses appear, one for **Embed** and another for **Chunk**.
6. After several minutes, the **Scheduled** statuses are replaced by **Running**.
7. After several more minutes, the **Running** statuses are replaced by **Finished**.
8. After **Finished** appears, to view the recommendation, click **View**.

The **Auto Recommender Results** pane shows Unstructured's recommended embedding provider and model and chunking strategy and
settings for the source files that it analyzed.
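
The button checks and the status sequence in the preceding steps can be summarized as a small model. The following Python sketch is illustrative only: it does not call any Unstructured API, and the connector type identifiers, class name, and function names are all hypothetical. It encodes only what this page states: the two conditions for the **Run Recommender** and **Run Again** buttons, and the **Scheduled**, **Running**, **Finished** order for the **Embed** and **Chunk** statuses.

```python
from enum import Enum

# Connector types listed as supported on this page. The string identifiers
# are hypothetical; the page gives only display names.
SUPPORTED_SOURCE_TYPES = frozenset({"azure", "dropbox", "google-cloud-storage", "s3"})


class RecommenderStatus(Enum):
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    FINISHED = "Finished"


# Order in which the page says the statuses appear.
STATUS_ORDER = (
    RecommenderStatus.SCHEDULED,
    RecommenderStatus.RUNNING,
    RecommenderStatus.FINISHED,
)


def can_run_recommender(connector_type: str, connectivity_test_passed: bool) -> bool:
    """Mirror the two checks listed for the Run Recommender / Run Again buttons."""
    return connector_type in SUPPORTED_SOURCE_TYPES and connectivity_test_passed


def next_status(status: RecommenderStatus) -> RecommenderStatus:
    """Return the status that follows `status`; Finished is terminal."""
    index = STATUS_ORDER.index(status)
    return STATUS_ORDER[min(index + 1, len(STATUS_ORDER) - 1)]


def can_view(status: RecommenderStatus) -> bool:
    """View is described as available once Finished appears."""
    return status is RecommenderStatus.FINISHED


if __name__ == "__main__":
    # One status each for Embed and Chunk, both starting as Scheduled.
    statuses = {
        "Embed": RecommenderStatus.SCHEDULED,
        "Chunk": RecommenderStatus.SCHEDULED,
    }
    if can_run_recommender("s3", connectivity_test_passed=True):
        while not all(can_view(s) for s in statuses.values()):
            statuses = {name: next_status(s) for name, s in statuses.items()}
            print({name: s.value for name, s in statuses.items()})
```

The model advances both statuses together for simplicity. In the Platform, the **Embed** and **Chunk** statuses might change at different times.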

## Implement an embed recommendation

1. In the **Auto Recommender Results** pane, in the **Embed Recommendation** area, note the recommended embedding provider and model.
2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions for your target workflow.

## Implement a chunking recommendation

1. In the **Auto Recommender Results** pane, in the **Chunk Recommendation** area, note the recommended chunking strategy and settings.
2. To implement the recommendation, expand the **Next Steps** section and follow the on-screen instructions for your target workflow.