
Vyokky/dev #105

Merged: 21 commits merged on Jul 4, 2024
16 changes: 12 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -28,7 +28,7 @@
- <b>AppAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, responsible for translating actions from the HostAgent and AppAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows **UI Automation** or **Win32** API.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [Documentation](https://microsoft.github.io/UFO/).
<h1 align="center">
<img src="./assets/framework_v2.png"/>
</h1>
@@ -137,9 +137,17 @@ Optionally, you can set a backup language model (LLM) engine in the `BACKUP_AGEN
UFO also supports other LLMs and advanced configurations, such as customizing your own model; please check the [documents](https://microsoft.github.io/UFO/supported_models/overview/) for more details. Because of the limitations of model input, a lite version of the prompt is provided so that users can experience it, configured in `config_dev.yaml`.

### 📔 Step 3: Additional Setting for RAG (optional).
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.

#### RAG from Offline Help Document
We provide the following options for RAG to enhance UFO's capabilities:
- **[Offline Help Document](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_help_document/)**: Enable UFO to retrieve information from offline help documents.
- **[Online Bing Search Engine](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_bing_search/)**: Enhance UFO's capabilities by utilizing the most up-to-date online search results.
- **[Self-Experience](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/experience_learning/)**: Save task completion trajectories into UFO's memory for future reference.
- **[User-Demonstration](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_demonstration/)**: Boost UFO's capabilities through user demonstration.

Consult their respective documentation for more information on how to configure these settings.

<!-- #### RAG from Offline Help Document
Before enabling this function, you need to create an offline indexer for your help document. Please refer to the [README](./learner/README.md) to learn how to create an offline vectored database for retrieval. You can enable this function by setting the following configuration:
```bash
## RAG Configuration for the offline docs
@@ -184,7 +192,7 @@
You can enable this function by setting the following configuration:
## RAG Configuration for demonstration
RAG_DEMONSTRATION: True # Whether to use the RAG from its user demonstration.
RAG_DEMONSTRATION_RETRIEVED_TOPK: 5 # The topk for the demonstration examples.
``` -->


### 🎉 Step 4: Start UFO
16 changes: 16 additions & 0 deletions documents/docs/advanced_usage/control_filtering/icon_filtering.md
@@ -0,0 +1,16 @@
# Icon Filter

The icon control filter is a method to filter the controls based on the similarity between the control icon image and the agent's plan using the image/text embeddings.

## Configuration

To activate the icon control filtering, you need to add `ICON` to the `CONTROL_FILTER` list in the `config_dev.yaml` file. Below is the detailed icon control filter configuration in the `config_dev.yaml` file:

- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the icon control filtering, add `ICON` to the list.
- `CONTROL_FILTER_TOP_K_ICON`: The number of controls to keep after filtering.
- `CONTROL_FILTER_MODEL_ICON_NAME`: The control filter model name for icon similarity. By default, it is set to "clip-ViT-B-32".
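Putting the options above together, a fragment of `config_dev.yaml` might look like the following sketch (the top-k value here is illustrative, not a documented default):

```yaml
CONTROL_FILTER: ["ICON"]                          # filtering methods to apply
CONTROL_FILTER_TOP_K_ICON: 15                     # illustrative: number of controls to keep
CONTROL_FILTER_MODEL_ICON_NAME: "clip-ViT-B-32"   # default icon-similarity model
```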


# Reference

:::automator.ui_control.control_filter.IconControlFilter
22 changes: 22 additions & 0 deletions documents/docs/advanced_usage/control_filtering/overview.md
@@ -0,0 +1,22 @@
# Control Filtering

There may be many control items in the application that are not relevant to the task. UFO can filter out the irrelevant controls and focus only on the relevant ones. This filtering reduces the complexity of the task.

Besides configuring the control types for selection in `CONTROL_LIST` in `config_dev.yaml`, UFO also supports filtering the controls based on semantic similarity or keyword matching between the agent's plan and the control's information. We currently support the following filtering methods:

| Filtering Method | Description |
|------------------|-------------|
| [`Text`](./text_filtering.md) | Filter the controls based on the control text. |
| [`Semantic`](./semantic_filtering.md) | Filter the controls based on the semantic similarity. |
| [`Icon`](./icon_filtering.md) | Filter the controls based on the control icon image. |


## Configuration
You can activate the control filtering by setting the `CONTROL_FILTER` in the `config_dev.yaml` file. The `CONTROL_FILTER` is a list of filtering methods that you want to apply to the controls, which can be `TEXT`, `SEMANTIC`, or `ICON`.

You can configure multiple filtering methods in the `CONTROL_FILTER` list.
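For example, to apply all three methods at once, the entry in `config_dev.yaml` might look like this sketch (assuming a YAML list syntax):

```yaml
CONTROL_FILTER: ["TEXT", "SEMANTIC", "ICON"]  # apply text, semantic, and icon filtering
```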

# Reference
The implementation of the control filtering is based on the `BasicControlFilter` class located in the `ufo/automator/ui_control/control_filter.py` file. Concrete filtering classes inherit from the `BasicControlFilter` class and implement the `control_filter` method to filter the controls based on the specific filtering method.
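As an illustration of this pattern, the hypothetical `KeywordControlFilter` below subclasses a minimal stand-in for `BasicControlFilter`. The real base class and its method signatures live in `ufo/automator/ui_control/control_filter.py`, so treat this as a sketch only:

```python
class BasicControlFilter:
    """Minimal stand-in for the base class: subclasses implement control_filter."""

    def control_filter(self, controls, plan):
        raise NotImplementedError


class KeywordControlFilter(BasicControlFilter):
    """Hypothetical concrete filter: keep controls whose text appears in the plan."""

    def control_filter(self, controls, plan):
        # Split the agent's plan into lowercase keywords and keep matching controls.
        keywords = {word.lower() for word in plan.split()}
        return [c for c in controls if c["text"].lower() in keywords]


controls = [{"text": "Save"}, {"text": "Open"}, {"text": "Styles"}]
plan = "Click Save to store the document"
filtered = KeywordControlFilter().control_filter(controls, plan)
print([c["text"] for c in filtered])  # → ['Save']
```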

:::automator.ui_control.control_filter.BasicControlFilter
@@ -0,0 +1,15 @@
# Semantic Control Filter

The semantic control filter is a method to filter the controls based on the semantic similarity between the agent's plan and the control's text using their embeddings.

## Configuration

To activate the semantic control filtering, you need to add `SEMANTIC` to the `CONTROL_FILTER` list in the `config_dev.yaml` file. Below is the detailed semantic control filter configuration in the `config_dev.yaml` file:

- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the semantic control filtering, add `SEMANTIC` to the list.
- `CONTROL_FILTER_TOP_K_SEMANTIC`: The number of controls to keep after filtering.
- `CONTROL_FILTER_MODEL_SEMANTIC_NAME`: The control filter model name for semantic similarity. By default, it is set to "all-MiniLM-L6-v2".
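Assuming the keys above, the corresponding fragment of `config_dev.yaml` might look like this sketch (the top-k value is illustrative, not a documented default):

```yaml
CONTROL_FILTER: ["SEMANTIC"]                            # filtering methods to apply
CONTROL_FILTER_TOP_K_SEMANTIC: 15                       # illustrative: number of controls to keep
CONTROL_FILTER_MODEL_SEMANTIC_NAME: "all-MiniLM-L6-v2"  # default semantic-similarity model
```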

# Reference

:::automator.ui_control.control_filter.SemanticControlFilter
16 changes: 16 additions & 0 deletions documents/docs/advanced_usage/control_filtering/text_filtering.md
@@ -0,0 +1,16 @@
# Text Control Filter

The text control filter is a method to filter the controls based on the control text. The agent's plan on the current step usually contains some keywords or phrases. This method filters the controls based on the matching between the control text and the keywords or phrases in the agent's plan.

## Configuration

To activate the text control filtering, you need to add `TEXT` to the `CONTROL_FILTER` list in the `config_dev.yaml` file. Below is the detailed text control filter configuration in the `config_dev.yaml` file:

- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the text control filtering, add `TEXT` to the list.
- `CONTROL_FILTER_TOP_K_PLAN`: The number of agent's plan keywords or phrases to use for filtering the controls.
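A corresponding sketch of the `config_dev.yaml` fragment (the top-k value is illustrative, not a documented default):

```yaml
CONTROL_FILTER: ["TEXT"]       # filtering methods to apply
CONTROL_FILTER_TOP_K_PLAN: 2   # illustrative: plan keywords/phrases used for matching
```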



# Reference

:::automator.ui_control.control_filter.TextControlFilter
24 changes: 24 additions & 0 deletions documents/docs/advanced_usage/customization.md
@@ -0,0 +1,24 @@
# Customization

Sometimes, UFO may need additional context or information to complete a task. This information is important and specific to each user. UFO can ask the user for additional information and save it in the local memory for future reference. This customization feature allows UFO to provide a more personalized experience to the user.

## Scenario

Let's consider a scenario where UFO needs additional information to complete a task. UFO is tasked with booking a cab for the user. To book a cab, UFO needs to know the exact address of the user. UFO will ask the user for the address and save it in the local memory for future reference. Next time, when UFO is asked to complete a task that requires the user's address, UFO will use the saved address to complete the task, without asking the user again.


## Implementation
We currently implement the customization feature in the `HostAgent` class. When the `HostAgent` needs additional information, it will transition to the `PENDING` state and ask the user for the information. The user provides the information, and the `HostAgent` saves it in the local memory base for future reference. The saved information is stored in the `blackboard` and can be accessed by all agents in the session.

!!! note
    The customization memory base is only saved in a **local file**. This information will **not** be uploaded to the cloud or any other storage, in order to protect the user's privacy.

## Configuration

You can configure the customization feature by setting the following field in the `config_dev.yaml` file.

| Configuration Option | Description | Type | Default Value |
|------------------------|----------------------------------------------|---------|---------------------------------------|
| `USE_CUSTOMIZATION` | Whether to enable the customization. | Boolean | True |
| `QA_PAIR_FILE` | The path for the historical QA pairs. | String | "customization/historical_qa.txt" |
| `QA_PAIR_NUM` | The number of QA pairs for the customization.| Integer | 20 |
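With the defaults from the table above, the corresponding `config_dev.yaml` fragment might look like this sketch:

```yaml
USE_CUSTOMIZATION: True                          # enable the customization feature
QA_PAIR_FILE: "customization/historical_qa.txt"  # path for the historical QA pairs
QA_PAIR_NUM: 20                                  # number of QA pairs to use
```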
83 changes: 83 additions & 0 deletions documents/docs/advanced_usage/follower_mode.md
@@ -0,0 +1,83 @@
# Follower Mode

The Follower mode is a feature of UFO in which the agent follows a list of pre-defined steps in natural language to take actions on applications. Unlike the normal mode, this mode creates a `FollowerAgent` that follows the plan list provided by the user to interact with the application, instead of generating the plan itself. This mode is useful for debugging, software testing, and verification.

## Quick Start

### Step 1: Create a Plan file

Before starting the Follower mode, you need to create a plan file that contains the list of steps for the agent to follow. The plan file is a JSON file that contains the following fields:

| Field | Description | Type |
| --- | --- | --- |
| task | The task description. | String |
| steps | The list of steps for the agent to follow. | List of Strings |
| object | The application or file to interact with. | String |

Below is an example of a plan file:

```json
{
"task": "Type in a text of 'Test For Fun' with heading 1 level",
"steps":
[
"1.Type in 'Test For Fun'",
"2.Select the 'Test For Fun' text",
"3.Click 'Home' tab to show the 'Styles' ribbon tab",
"4.Click 'Styles' ribbon tab to show the style 'Heading 1'",
"5.Click 'Heading 1' style to apply the style to the selected text"
],
"object": "draft.docx"
}
```

!!! note
The `object` field is the application or file that the agent will interact with. The object **must be active** (can be minimized) when starting the Follower mode.


### Step 2: Start the Follower Mode
To start the Follower mode, run the following command:

```bash
# assume you are in the cloned UFO folder
python ufo.py --task_name {task_name} --mode follower --plan {plan_file}
```

!!! tip
Replace `{task_name}` with the name of the task and `{plan_file}` with the path to the plan file.


### Step 3: Run in Batch (Optional)

You can also run the Follower mode in batch mode by providing a folder containing multiple plan files. The agent will follow the plans in the folder one by one. To run in batch mode, run the following command:

```bash
# assume you are in the cloned UFO folder
python ufo.py --task_name {task_name} --mode follower --plan {plan_folder}
```

UFO will automatically detect the plan files in the folder and run them one by one.

!!! tip
Replace `{task_name}` with the name of the task and `{plan_folder}` with the path to the folder containing plan files.


## Evaluation
You may want to evaluate whether the `task` is completed successfully by following the plan. UFO will call the `EvaluationAgent` to evaluate the task if `EVA_SESSION` is set to `True` in the `config_dev.yaml` file.

You can check the evaluation log in the `logs/{task_name}/evaluation.log` file.

# References
The follower mode employs a `PlanReader` to parse the plan file and create a `FollowerSession` to follow the plan.
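As a rough sketch of what such a reader does, the standalone function below loads a plan file in the JSON format shown earlier. The actual `PlanReader` class (in `ufo/module/sessions/plan_reader.py`) is more involved, so this is illustrative only:

```python
import json


def read_plan(path: str):
    """Load a plan file and return its task, steps, and target object."""
    with open(path, encoding="utf-8") as f:
        plan = json.load(f)
    # The three fields described in the plan-file table above.
    return plan["task"], plan["steps"], plan["object"]
```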

## PlanReader
The `PlanReader` is located in the `ufo/module/sessions/plan_reader.py` file.

:::module.sessions.plan_reader.PlanReader

<br>
## FollowerSession

The `FollowerSession` is located in the `ufo/module/sessions/session.py` file.

:::module.sessions.session.FollowerSession
@@ -0,0 +1,65 @@
# Learning from Self-Experience

When UFO successfully completes a task, the user can choose to save the successful experience to reinforce the AppAgent. The AppAgent can learn from its own successful experiences to improve its future performance.

## Mechanism

### Step 1: Complete a Session
- **Event**: UFO completes a session

### Step 2: Ask User to Save Experience
- **Action**: The agent prompts the user with a choice to save the successful experience

<h1 align="center">
<img src="../../../img/save_ask.png" alt="Save Experience" width="100%">
</h1>

### Step 3: User Chooses to Save
- **Action**: If the user chooses to save the experience

### Step 4: Summarize and Save the Experience
- **Tool**: `ExperienceSummarizer`
- **Process**:
1. Summarize the experience into a demonstration example
2. Save the demonstration example in the `EXPERIENCE_SAVED_PATH` as specified in the `config_dev.yaml` file
3. The demonstration example includes similar [fields](../../prompts/examples_prompts.md) as those used in the AppAgent's prompt

### Step 5: Retrieve and Utilize Saved Experience
- **When**: The AppAgent encounters a similar task in the future
- **Action**: Retrieve the saved experience from the experience database
- **Outcome**: Use the retrieved experience to generate a plan

### Workflow Diagram
```mermaid
graph TD;
A[Complete Session] --> B[Ask User to Save Experience]
B --> C[User Chooses to Save]
C --> D[Summarize with ExperienceSummarizer]
D --> E[Save in EXPERIENCE_SAVED_PATH]
F[AppAgent Encounters Similar Task] --> G[Retrieve Saved Experience]
G --> H[Generate Plan]
```

## Activate the Learning from Self-Experience

### Step 1: Configure the AppAgent
Configure the following parameters to allow UFO to use the RAG from its self-experience:

| Configuration Option | Description | Type | Default Value |
|----------------------|-------------|------|---------------|
| `RAG_EXPERIENCE` | Whether to use the RAG from its self-experience | Boolean | False |
| `RAG_EXPERIENCE_RETRIEVED_TOPK` | The topk for the offline retrieved documents | Integer | 5 |
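Using the fields from the table above, a sketch of the corresponding configuration fragment enabling this feature might look like:

```yaml
RAG_EXPERIENCE: True              # default is False; enable self-experience retrieval
RAG_EXPERIENCE_RETRIEVED_TOPK: 5  # top-k retrieved demonstration examples
```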

# Reference

## Experience Summarizer
The `ExperienceSummarizer` class is located in the `ufo/experience/experience_summarizer.py` file. The `ExperienceSummarizer` class provides the following methods to summarize the experience:

:::experience.summarizer.ExperienceSummarizer

<br>

## Experience Retriever
The `ExperienceRetriever` class is located in the `ufo/rag/retriever.py` file. The `ExperienceRetriever` class provides the following methods to retrieve the experience:

:::rag.retriever.ExperienceRetriever
@@ -0,0 +1,29 @@
# Learning from Bing Search

UFO provides the capability to reinforce the AppAgent by searching for information on Bing to obtain up-to-date knowledge for niche tasks or applications beyond the `AppAgent`'s knowledge.

## Mechanism
Upon receiving a request, the `AppAgent` constructs a Bing search query based on the request and retrieves the search results from Bing. It then extracts the relevant information from the top-k results and generates a plan based on the retrieved information.


## Activate the Learning from Bing Search


### Step 1: Obtain Bing API Key
To use the Bing search, you need to obtain a Bing API key. You can follow the instructions on the [Microsoft Azure Bing Search API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api) to get the API key.


### Step 2: Configure the AppAgent

Configure the following parameters to allow UFO to use online Bing search for the decision-making process:

| Configuration Option | Description | Type | Default Value |
|----------------------|-------------|------|---------------|
| `RAG_ONLINE_SEARCH` | Whether to use the Bing search | Boolean | False |
| `BING_API_KEY` | The Bing search API key | String | "" |
| `RAG_ONLINE_SEARCH_TOPK` | The topk for the online search | Integer | 5 |
| `RAG_ONLINE_RETRIEVED_TOPK` | The topk for the online retrieved searched results | Integer | 1 |
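Using the fields from the table above, a sketch of the corresponding configuration fragment enabling Bing search might look like this (the API key placeholder is yours to fill in):

```yaml
RAG_ONLINE_SEARCH: True              # default is False; enable Bing search
BING_API_KEY: "<your-bing-api-key>"  # obtained in Step 1
RAG_ONLINE_SEARCH_TOPK: 5            # top-k Bing results to fetch
RAG_ONLINE_RETRIEVED_TOPK: 1         # top-k retrieved results to use
```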

# Reference

:::rag.retriever.OnlineDocRetriever