diff --git a/README.md b/README.md index ac7c648a..97926293 100644 --- a/README.md +++ b/README.md @@ -28,7 +28,7 @@ - AppAgent 👾, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application. - Control Interaction 🎮, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** or **Win32** API. -Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939). +Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [Documentation](https://microsoft.github.io/UFO/).

@@ -137,9 +137,17 @@ Optionally, you can set a backup language model (LLM) engine in the `BACKUP_AGEN UFO also supports other LLMs and advanced configurations, such as customize your own model, please check the [documents](https://microsoft.github.io/UFO/supported_models/overview/) for more details. Because of the limitations of model input, a lite version of the prompt is provided to allow users to experience it, which is configured in `config_dev.yaml`. ### 📔 Step 3: Additional Setting for RAG (optional). -If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file. +If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file. -#### RAG from Offline Help Document +We provide the following options for RAG to enhance UFO's capabilities: +- **[Offline Help Document](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_help_document/)**: Enable UFO to retrieve information from offline help documents. +- **[Online Bing Search Engine](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_bing_search/)**: Enhance UFO's capabilities by utilizing the most up-to-date online search results. +- **[Self-Experience](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/experience_learning/)**: Save task completion trajectories into UFO's memory for future reference. +- **[User-Demonstration](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_demonstration/)**: Boost UFO's capabilities through user demonstration. + +Consult their respective documentation for more information on how to configure these settings. + + ### 🎉 Step 4: Start UFO diff --git a/documents/docs/advanced_usage/control_filtering/icon_filtering.md b/documents/docs/advanced_usage/control_filtering/icon_filtering.md new file mode 100644 index 00000000..422fecbf --- /dev/null +++ b/documents/docs/advanced_usage/control_filtering/icon_filtering.md @@ -0,0 +1,16 @@ +# Icon Filter + +The icon control filter is a method to filter the controls based on the similarity between the control icon image and the agent's plan using the image/text embeddings. + +## Configuration + +To activate the icon control filtering, you need to add `ICON` to the `CONTROL_FILTER` list in the `config_dev.yaml` file. Below is the detailed icon control filter configuration in the `config_dev.yaml` file: + +- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the icon control filtering, add `ICON` to the list. +- `CONTROL_FILTER_TOP_K_ICON`: The number of controls to keep after filtering. +- `CONTROL_FILTER_MODEL_ICON_NAME`: The control filter model name for icon similarity. By default, it is set to "clip-ViT-B-32". + + +# Reference + +:::automator.ui_control.control_filter.IconControlFilter \ No newline at end of file diff --git a/documents/docs/advanced_usage/control_filtering/overview.md b/documents/docs/advanced_usage/control_filtering/overview.md new file mode 100644 index 00000000..98daa4ad --- /dev/null +++ b/documents/docs/advanced_usage/control_filtering/overview.md @@ -0,0 +1,22 @@ +# Control Filtering + +There may be many controls items in the application, which may not be relevant to the task. 
UFO can filter out the irrelevant controls and only focus on the relevant ones. This filtering process can reduce the complexity of the task.
+
+In addition to configuring the control types for selection in `CONTROL_LIST` in `config_dev.yaml`, UFO also supports filtering the controls based on semantic similarity or keyword matching between the agent's plan and the control's information. We currently support the following filtering methods:
+
+| Filtering Method | Description |
+|------------------|-------------|
+| [`Text`](./text_filtering.md) | Filter the controls based on the control text. |
+| [`Semantic`](./semantic_filtering.md) | Filter the controls based on semantic similarity. |
+| [`Icon`](./icon_filtering.md) | Filter the controls based on the control icon image. |
+
+
+## Configuration
+You can activate the control filtering by setting the `CONTROL_FILTER` in the `config_dev.yaml` file. The `CONTROL_FILTER` is a list of filtering methods that you want to apply to the controls, which can be `TEXT`, `SEMANTIC`, or `ICON`.
+
+You can configure multiple filtering methods in the `CONTROL_FILTER` list.
+
+# Reference
+The implementation of the control filtering is based on the `BasicControlFilter` class located in the `ufo/automator/ui_control/control_filter.py` file. Concrete filtering classes inherit from the `BasicControlFilter` class and implement the `control_filter` method to filter the controls based on the specific filtering method.
+
+:::automator.ui_control.control_filter.BasicControlFilter
diff --git a/documents/docs/advanced_usage/control_filtering/semantic_filtering.md b/documents/docs/advanced_usage/control_filtering/semantic_filtering.md
new file mode 100644
index 00000000..078bfa03
--- /dev/null
+++ b/documents/docs/advanced_usage/control_filtering/semantic_filtering.md
@@ -0,0 +1,15 @@
+# Semantic Control Filter
+
+The semantic control filter is a method to filter the controls based on the semantic similarity between the agent's plan and the control's text using their embeddings.
+
+## Configuration
+
+To activate the semantic control filtering, you need to add `SEMANTIC` to the `CONTROL_FILTER` list in the `config_dev.yaml` file. Below is the detailed semantic control filter configuration in the `config_dev.yaml` file:
+
+- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the semantic control filtering, add `SEMANTIC` to the list.
+- `CONTROL_FILTER_TOP_K_SEMANTIC`: The number of controls to keep after filtering.
+- `CONTROL_FILTER_MODEL_SEMANTIC_NAME`: The control filter model name for semantic similarity. By default, it is set to "all-MiniLM-L6-v2".
+
+# Reference
+
+:::automator.ui_control.control_filter.SemanticControlFilter
\ No newline at end of file
diff --git a/documents/docs/advanced_usage/control_filtering/text_filtering.md b/documents/docs/advanced_usage/control_filtering/text_filtering.md
new file mode 100644
index 00000000..53f84500
--- /dev/null
+++ b/documents/docs/advanced_usage/control_filtering/text_filtering.md
@@ -0,0 +1,16 @@
+# Text Control Filter
+
+The text control filter is a method to filter the controls based on the control text. The agent's plan on the current step usually contains some keywords or phrases. This method filters the controls based on the matching between the control text and the keywords or phrases in the agent's plan.
+
+## Configuration
+
+To activate the text control filtering, you need to add `TEXT` to the `CONTROL_FILTER` list in the `config_dev.yaml` file.
Below is the detailed text control filter configuration in the `config_dev.yaml` file:
+
+- `CONTROL_FILTER`: A list of filtering methods that you want to apply to the controls. To activate the text control filtering, add `TEXT` to the list.
+- `CONTROL_FILTER_TOP_K_PLAN`: The number of the agent's plan keywords or phrases to use for filtering the controls.
+
+
+
+# Reference
+
+:::automator.ui_control.control_filter.TextControlFilter
\ No newline at end of file
diff --git a/documents/docs/advanced_usage/customization.md b/documents/docs/advanced_usage/customization.md
new file mode 100644
index 00000000..8c38259f
--- /dev/null
+++ b/documents/docs/advanced_usage/customization.md
@@ -0,0 +1,24 @@
+# Customization
+
+Sometimes, UFO may need additional context or information to complete a task. This information is important and customized for each user. UFO can ask the user for additional information and save it in the local memory for future reference. This customization feature allows UFO to provide a more personalized experience to the user.
+
+## Scenario
+
+Let's consider a scenario where UFO needs additional information to complete a task. UFO is tasked with booking a cab for the user. To book a cab, UFO needs to know the exact address of the user. UFO will ask the user for the address and save it in the local memory for future reference. Next time, when UFO is asked to complete a task that requires the user's address, UFO will use the saved address to complete the task without asking the user again.
+
+
+## Implementation
+We currently implement the customization feature in the `HostAgent` class. When the `HostAgent` needs additional information, it will transition to the `PENDING` state and ask the user for the information. The user will provide the information, and the `HostAgent` will save it in the local memory base for future reference. The saved information is stored in the `blackboard` and can be accessed by all agents in the session.
+
+!!! note
+    The customization memory base is only saved in a **local file**. This information will **not** be uploaded to the cloud or any other storage, in order to protect the user's privacy.
+
+## Configuration
+
+You can configure the customization feature by setting the following fields in the `config_dev.yaml` file.
+
+| Configuration Option | Description | Type | Default Value |
+|------------------------|----------------------------------------------|---------|---------------------------------------|
+| `USE_CUSTOMIZATION` | Whether to enable the customization. | Boolean | True |
+| `QA_PAIR_FILE` | The path for the historical QA pairs. | String | "customization/historical_qa.txt" |
+| `QA_PAIR_NUM` | The number of QA pairs for the customization.| Integer | 20 |
diff --git a/documents/docs/advanced_usage/follower_mode.md b/documents/docs/advanced_usage/follower_mode.md
new file mode 100644
index 00000000..44899572
--- /dev/null
+++ b/documents/docs/advanced_usage/follower_mode.md
@@ -0,0 +1,83 @@
+# Follower Mode
+
+The Follower mode is a feature of UFO in which the agent follows a list of pre-defined steps in natural language to take actions on applications. Different from the normal mode, this mode creates a `FollowerAgent` that follows the plan list provided by the user to interact with the application, instead of generating the plan itself. This mode is useful for debugging, software testing, and verification.
+ +## Quick Start + +### Step 1: Create a Plan file + +Before starting the Follower mode, you need to create a plan file that contains the list of steps for the agent to follow. The plan file is a JSON file that contains the following fields: + +| Field | Description | Type | +| --- | --- | --- | +| task | The task description. | String | +| steps | The list of steps for the agent to follow. | List of Strings | +| object | The application or file to interact with. | String | + +Below is an example of a plan file: + +```json +{ + "task": "Type in a text of 'Test For Fun' with heading 1 level", + "steps": + [ + "1.type in 'Test For Fun'", + "2.Select the 'Test For Fun' text", + "3.Click 'Home' tab to show the 'Styles' ribbon tab", + "4.Click 'Styles' ribbon tab to show the style 'Heading 1'", + "5.Click 'Heading 1' style to apply the style to the selected text" + ], + "object": "draft.docx" +} +``` + +!!! note + The `object` field is the application or file that the agent will interact with. The object **must be active** (can be minimized) when starting the Follower mode. + + +### Step 2: Start the Follower Mode +To start the Follower mode, run the following command: + +```bash +# assume you are in the cloned UFO folder +python ufo.py --task_name {task_name} --mode follower --plan {plan_file} +``` + +!!! tip + Replace `{task_name}` with the name of the task and `{plan_file}` with the path to the plan file. + + +### Step 3: Run in Batch (Optional) + +You can also run the Follower mode in batch mode by providing a folder containing multiple plan files. The agent will follow the plans in the folder one by one. To run in batch mode, run the following command: + +```bash +# assume you are in the cloned UFO folder +python ufo.py --task_name {task_name} --mode follower --plan {plan_folder} +``` + +UFO will automatically detect the plan files in the folder and run them one by one. + +!!! tip + Replace `{task_name}` with the name of the task and `{plan_folder}` with the path to the folder containing plan files. + + +## Evaluation +You may want to evaluate the `task` is completed successfully or not by following the plan. UFO will call the `EvaluationAgent` to evaluate the task if `EVA_SESSION` is set to `True` in the `config_dev.yaml` file. + +You can check the evaluation log in the `logs/{task_name}/evaluation.log` file. + +# References +The follower mode employs a `PlanReader` to parse the plan file and create a `FollowerSession` to follow the plan. + +## PlanReader +The `PlanReader` is located in the `ufo/module/sessions/plan_reader.py` file. + +:::module.sessions.plan_reader.PlanReader + +
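+
+The `PlanReader` expects the schema described in Step 1. Before handing a plan file (or a folder of plan files) to the Follower mode, you can sanity-check each JSON file with a small standalone script. The sketch below is illustrative only and is not part of UFO; the helper name and file path are hypothetical.
+
+```python
+import json
+from pathlib import Path
+
+# Expected top-level fields of a Follower-mode plan file (see Step 1 above).
+REQUIRED_FIELDS = {"task": str, "steps": list, "object": str}
+
+
+def validate_plan(plan_path: str) -> None:
+    """Raise ValueError if the plan file does not match the expected schema."""
+    plan = json.loads(Path(plan_path).read_text(encoding="utf-8"))
+    for field, expected_type in REQUIRED_FIELDS.items():
+        if field not in plan:
+            raise ValueError(f"{plan_path}: missing required field '{field}'")
+        if not isinstance(plan[field], expected_type):
+            raise ValueError(f"{plan_path}: '{field}' should be a {expected_type.__name__}")
+    if not all(isinstance(step, str) for step in plan["steps"]):
+        raise ValueError(f"{plan_path}: every entry in 'steps' should be a string")
+
+
+if __name__ == "__main__":
+    validate_plan("plans/draft_task.json")  # hypothetical path
+```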
+## FollowerSession + +The `FollowerSession` is also located in the `ufo/module/sessions/session.py` file. + +:::module.sessions.session.FollowerSession \ No newline at end of file diff --git a/documents/docs/advanced_usage/reinforce_appagent/experience_learning.md b/documents/docs/advanced_usage/reinforce_appagent/experience_learning.md new file mode 100644 index 00000000..7c5cc018 --- /dev/null +++ b/documents/docs/advanced_usage/reinforce_appagent/experience_learning.md @@ -0,0 +1,65 @@ +# Learning from Self-Experience + +When UFO successfully completes a task, user can choose to save the successful experience to reinforce the AppAgent. The AppAgent can learn from its own successful experiences to improve its performance in the future. + +## Mechanism + +### Step 1: Complete a Session +- **Event**: UFO completes a session + +### Step 2: Ask User to Save Experience +- **Action**: The agent prompts the user with a choice to save the successful experience + +

+ Save Experience +

+ +### Step 3: User Chooses to Save +- **Action**: If the user chooses to save the experience + +### Step 4: Summarize and Save the Experience +- **Tool**: `ExperienceSummarizer` +- **Process**: + 1. Summarize the experience into a demonstration example + 2. Save the demonstration example in the `EXPERIENCE_SAVED_PATH` as specified in the `config_dev.yaml` file + 3. The demonstration example includes similar [fields](../../prompts/examples_prompts.md) as those used in the AppAgent's prompt + +### Step 5: Retrieve and Utilize Saved Experience +- **When**: The AppAgent encounters a similar task in the future +- **Action**: Retrieve the saved experience from the experience database +- **Outcome**: Use the retrieved experience to generate a plan + +### Workflow Diagram +```mermaid +graph TD; + A[Complete Session] --> B[Ask User to Save Experience] + B --> C[User Chooses to Save] + C --> D[Summarize with ExperienceSummarizer] + D --> E[Save in EXPERIENCE_SAVED_PATH] + F[AppAgent Encounters Similar Task] --> G[Retrieve Saved Experience] + G --> H[Generate Plan] +``` + +## Activate the Learning from Self-Experience + +### Step 1: Configure the AppAgent +Configure the following parameters to allow UFO to use the RAG from its self-experience: + +| Configuration Option | Description | Type | Default Value | +|----------------------|-------------|------|---------------| +| `RAG_EXPERIENCE` | Whether to use the RAG from its self-experience | Boolean | False | +| `RAG_EXPERIENCE_RETRIEVED_TOPK` | The topk for the offline retrieved documents | Integer | 5 | + +# Reference + +## Experience Summarizer +The `ExperienceSummarizer` class is located in the `ufo/experience/experience_summarizer.py` file. The `ExperienceSummarizer` class provides the following methods to summarize the experience: + +:::experience.summarizer.ExperienceSummarizer + +
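+
+For a quick start, the switches from the configuration table above can be set in your configuration file roughly as follows. This is only a sketch; the exact file (`config.yaml` vs. `config_dev.yaml`) and the surrounding keys depend on your setup.
+
+```yaml
+# Sketch: enable retrieval from the agent's own saved experience
+RAG_EXPERIENCE: True                # use RAG from self-experience
+RAG_EXPERIENCE_RETRIEVED_TOPK: 5    # number of experience records to retrieve
+```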
+
+## Experience Retriever
+The `ExperienceRetriever` class is located in the `ufo/rag/retriever.py` file. The `ExperienceRetriever` class provides the following methods to retrieve the experience:
+
+:::rag.retriever.ExperienceRetriever
diff --git a/documents/docs/advanced_usage/reinforce_appagent/learning_from_bing_search.md b/documents/docs/advanced_usage/reinforce_appagent/learning_from_bing_search.md
new file mode 100644
index 00000000..6ee9a5bb
--- /dev/null
+++ b/documents/docs/advanced_usage/reinforce_appagent/learning_from_bing_search.md
@@ -0,0 +1,29 @@
+# Learning from Bing Search
+
+UFO provides the capability to reinforce the AppAgent by searching for information on Bing to obtain up-to-date knowledge for niche tasks or applications that are beyond the `AppAgent`'s existing knowledge.
+
+## Mechanism
+Upon receiving a request, the `AppAgent` constructs a Bing search query based on the request and retrieves the search results from Bing. The `AppAgent` then extracts the relevant information from the top-k search results and generates a plan based on the retrieved information.
+
+
+## Activate the Learning from Bing Search
+
+
+### Step 1: Obtain Bing API Key
+To use the Bing search, you need to obtain a Bing API key. You can follow the instructions on the [Microsoft Azure Bing Search API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api) page to get the API key.
+
+
+### Step 2: Configure the AppAgent
+
+Configure the following parameters to allow UFO to use online Bing search for the decision-making process:
+
+| Configuration Option | Description | Type | Default Value |
+|----------------------|-------------|------|---------------|
+| `RAG_ONLINE_SEARCH` | Whether to use the Bing search | Boolean | False |
+| `BING_API_KEY` | The Bing search API key | String | "" |
+| `RAG_ONLINE_SEARCH_TOPK` | The topk for the online search | Integer | 5 |
+| `RAG_ONLINE_RETRIEVED_TOPK` | The topk for the retrieved online search results | Integer | 1 |
+
+# Reference
+
+:::rag.retriever.OnlineDocRetriever
\ No newline at end of file
diff --git a/documents/docs/advanced_usage/reinforce_appagent/learning_from_demonstration.md b/documents/docs/advanced_usage/reinforce_appagent/learning_from_demonstration.md
new file mode 100644
index 00000000..280b5f7d
--- /dev/null
+++ b/documents/docs/advanced_usage/reinforce_appagent/learning_from_demonstration.md
@@ -0,0 +1,76 @@
+# Learning from User Demonstration
+
+For complex tasks, users can demonstrate the task using [Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record the action trajectories. UFO can learn from these user demonstrations to improve the AppAgent's performance.
+
+
+
+## Mechanism
+
+### Step 1: Record the Task
+- **Tool**: [Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47)
+- **Output**: Zip file containing the task description and action trajectories
+
+### Step 2: Save the Demonstration
+- **Action**: Save the recorded demonstration as a zip file
+
+### Step 3: Extract and Summarize the Demonstration
+- **Tool**: `DemonstrationSummarizer`
+- **Process**:
+  1. Extract the zip file
+  2. 
Summarize the demonstration +- **Configuration**: Save the summarized demonstration in the `DEMONSTRATION_SAVED_PATH` as specified in the `config_dev.yaml` file + +### Step 4: Retrieve and Utilize the Demonstration +- **When**: AppAgent encounters a similar task +- **Action**: Retrieve the saved demonstration from the demonstration database +- **Tool**: `DemonstrationRetriever` +- **Outcome**: Generate a plan based on the retrieved demonstration + +### Demonstration Workflow Diagram +```mermaid +graph TD; + A[User Records Task] --> B[Save as Zip File] + B --> C[Extract Zip File] + C --> D[Summarize with DemonstrationSummarizer] + D --> E[Save in DEMONSTRATION_SAVED_PATH] + F[AppAgent Encounters Similar Task] --> G[Retrieve Demonstration from Database] + G --> H[Generate Plan] +``` + +You can find a demo video of learning from user demonstrations: + + + +
+ +
+ +## Activating Learning from User Demonstrations + +### Step 1: User Demonstration +Please follow the steps in the [User Demonstration Provision](../../creating_app_agent/demonstration_provision.md) document to provide help documents to the AppAgent. + +### Step 2: Configure the AppAgent +Configure the following parameters to allow UFO to use RAG from user demonstrations: + +| Configuration Option | Description | Type | Default Value | +|----------------------|-------------|------|---------------| +| `RAG_DEMONSTRATION` | Whether to use RAG from user demonstrations | Boolean | False | +| `RAG_DEMONSTRATION_RETRIEVED_TOPK` | The top K documents to retrieve offline | Integer | 5 | +| `RAG_DEMONSTRATION_COMPLETION_N` | The number of completion choices for the demonstration result | Integer | 3 | + +## Reference + +### Demonstration Summarizer +The `DemonstrationSummarizer` class is located in the `record_processor/summarizer/summarizer.py` file. The `DemonstrationSummarizer` class provides methods to summarize the demonstration: + +:::summarizer.summarizer.DemonstrationSummarizer + +
+ +### Demonstration Retriever +The `DemonstrationRetriever` class is located in the `rag/retriever.py` file. The `DemonstrationRetriever` class provides methods to retrieve the demonstration: + +:::rag.retriever.DemonstrationRetriever \ No newline at end of file diff --git a/documents/docs/advanced_usage/reinforce_appagent/learning_from_help_document.md b/documents/docs/advanced_usage/reinforce_appagent/learning_from_help_document.md new file mode 100644 index 00000000..aca82a56 --- /dev/null +++ b/documents/docs/advanced_usage/reinforce_appagent/learning_from_help_document.md @@ -0,0 +1,31 @@ +# Learning from Help Documents + +User or applications can provide help documents to the AppAgent to reinforce its capabilities. The AppAgent can retrieve knowledge from these documents to improve its understanding of the task, generate high-quality plans, and interact more efficiently with the application. You can find how to provide help documents to the AppAgent in the [Help Document Provision](../../creating_app_agent/help_document_provision.md) section. + + +## Mechanism +The help documents are provided in a format of **task-solution pairs**. Upon receiving a request, the AppAgent retrieves the relevant help documents by matching the request with the task descriptions in the help documents and generates a plan based on the retrieved solutions. + +!!! note + Since the retrieved help documents may not be relevant to the request, the `AppAgent` will only take them as references to generate the plan. + +## Activate the Learning from Help Documents + +Follow the steps below to activate the learning from help documents: + +### Step 1: Provide Help Documents +Please follow the steps in the [Help Document Provision](../../creating_app_agent/help_document_provision.md) document to provide help documents to the AppAgent. + +### Step 2: Configure the AppAgent + +Configure the following parameters in the `config.yaml` file to activate the learning from help documents: + +| Configuration Option | Description | Type | Default Value | +|----------------------|-------------|------|---------------| +| `RAG_OFFLINE_DOCS` | Whether to use the offline RAG | Boolean | False | +| `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` | The topk for the offline retrieved documents | Integer | 1 | + + +# Reference + +:::rag.retriever.OfflineDocRetriever \ No newline at end of file diff --git a/documents/docs/advanced_usage/reinforce_appagent/overview.md b/documents/docs/advanced_usage/reinforce_appagent/overview.md new file mode 100644 index 00000000..6ebf072f --- /dev/null +++ b/documents/docs/advanced_usage/reinforce_appagent/overview.md @@ -0,0 +1,63 @@ +# Reinforcing AppAgent + +UFO provides versatile mechanisms to reinforce the AppAgent's capabilities through RAG (Retrieval-Augmented Generation) and other techniques. These enhance the AppAgent's understanding of the task, improving the quality of the generated plans, and increasing the efficiency of the AppAgent's interactions with the application. + +We currently support the following reinforcement methods: + +| Reinforcement Method | Description | +|----------------------|-------------| +| [Learning from Help Documents](./learning_from_help_document.md) | Reinforce the AppAgent by retrieving knowledge from help documents. | +| [Learning from Bing Search](./learning_from_bing_search.md) | Reinforce the AppAgent by searching for information on Bing to obtain up-to-date knowledge. 
| +| [Learning from Self-Experience](./experience_learning.md) | Reinforce the AppAgent by learning from its own successful experiences. | +| [Learning from User Demonstrations](./learning_from_demonstration.md) | Reinforce the AppAgent by learning from action trajectories demonstrated by users. | + +## Knowledge Provision + +UFO provides the knowledge to the AppAgent through a `context_provision` method defined in the `AppAgent` class: + +```python +def context_provision(self, request: str = "") -> None: + """ + Provision the context for the app agent. + :param request: The Bing search query. + """ + + # Load the offline document indexer for the app agent if available. + if configs["RAG_OFFLINE_DOCS"]: + utils.print_with_color( + "Loading offline help document indexer for {app}...".format( + app=self._process_name + ), + "magenta", + ) + self.build_offline_docs_retriever() + + # Load the online search indexer for the app agent if available. + + if configs["RAG_ONLINE_SEARCH"] and request: + utils.print_with_color("Creating a Bing search indexer...", "magenta") + self.build_online_search_retriever( + request, configs["RAG_ONLINE_SEARCH_TOPK"] + ) + + # Load the experience indexer for the app agent if available. + if configs["RAG_EXPERIENCE"]: + utils.print_with_color("Creating an experience indexer...", "magenta") + experience_path = configs["EXPERIENCE_SAVED_PATH"] + db_path = os.path.join(experience_path, "experience_db") + self.build_experience_retriever(db_path) + + # Load the demonstration indexer for the app agent if available. + if configs["RAG_DEMONSTRATION"]: + utils.print_with_color("Creating an demonstration indexer...", "magenta") + demonstration_path = configs["DEMONSTRATION_SAVED_PATH"] + db_path = os.path.join(demonstration_path, "demonstration_db") + self.build_human_demonstration_retriever(db_path) +``` + +The `context_provision` method loads the offline document indexer, online search indexer, experience indexer, and demonstration indexer for the AppAgent based on the configuration settings in the `config_dev.yaml` file. + +# Reference +UFO employs the `Retriever` class located in the `ufo/rag/retriever.py` file to retrieve knowledge from various sources. The `Retriever` class provides the following methods to retrieve knowledge: + +:::rag.retriever.Retriever diff --git a/documents/docs/agents/app_agent.md b/documents/docs/agents/app_agent.md new file mode 100644 index 00000000..d2520f8c --- /dev/null +++ b/documents/docs/agents/app_agent.md @@ -0,0 +1,174 @@ +# AppAgent 👾 + +An `AppAgent` is responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application. The `AppAgent` is created by the `HostAgent` to fulfill a sub-task within a `Round`. The `AppAgent` is responsible for executing the necessary actions within the application to fulfill the user's request. The `AppAgent` has the following features: + +1. **[ReAct](https://arxiv.org/abs/2210.03629) with the Application** - The `AppAgent` recursively interacts with the application in a workflow of observation->thought->action, leveraging the multi-modal capabilities of Visual Language Models (VLMs) to comprehend the application UI and fulfill the user's request. +2. **Comprehension Enhancement** - The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases, and demonstration libraries, making the agent an application "expert". +3. 
**Versatile Skill Set** - The `AppAgent` is equipped with a diverse set of skills to support comprehensive automation, such as mouse and keyboard operations, native APIs, and "Copilot".
+
+!!! tip
+    You can find how to enhance the `AppAgent` with external knowledge bases and demonstration libraries in the [Reinforcing AppAgent](../advanced_usage/reinforce_appagent/overview.md) documentation.
+
+
+We show the framework of the `AppAgent` in the following diagram:
+

+ AppAgent Image +

+ +## AppAgent Input + +To interact with the application, the `AppAgent` receives the following inputs: + +| Input | Description | Type | +| --- | --- | --- | +| User Request | The user's request in natural language. | String | +| Sub-Task | The sub-task description to be executed by the `AppAgent`, assigned by the `HostAgent`. | String | +| Current Application | The name of the application to be interacted with. | String | +| Control Information | Index, name and control type of available controls in the application. | List of Dictionaries | +| Application Screenshots | Screenshots of the application, including a clean screenshot, an annotated screenshot with labeled controls, and a screenshot with a rectangle around the selected control at the previous step (optional). | List of Strings | +| Previous Sub-Tasks | The previous sub-tasks and their completion status. | List of Strings | +| Previous Plan | The previous plan for the following steps. | List of Strings | +| HostAgent Message | The message from the `HostAgent` for the completion of the sub-task. | String | +| Retrived Information | The retrieved information from external knowledge bases or demonstration libraries. | String | +| Blackboard | The shared memory space for storing and sharing information among the agents. | Dictionary | + + +Below is an example of the annotated application screenshot with labeled controls. This follow the [Set-of-Mark](https://arxiv.org/pdf/2310.11441) paradigm. +

+ AppAgent Image +

+ + +By processing these inputs, the `AppAgent` determines the necessary actions to fulfill the user's request within the application. + +!!! tip + Whether to concatenate the clean screenshot and annotated screenshot can be configured in the `CONCAT_SCREENSHOT` field in the `config_dev.yaml` file. + +!!! tip + Whether to include the screenshot with a rectangle around the selected control at the previous step can be configured in the `INCLUDE_LAST_SCREENSHOT` field in the `config_dev.yaml` file. + + +## AppAgent Output + +With the inputs provided, the `AppAgent` generates the following outputs: + +| Output | Description | Type | +| --- | --- | --- | +| Observation | The observation of the current application screenshots. | String | +| Thought | The logical reasoning process of the `AppAgent`. | String | +| ControlLabel | The index of the selected control to interact with. | String | +| ControlText | The name of the selected control to interact with. | String | +| Function | The function to be executed on the selected control. | String | +| Args | The arguments required for the function execution. | List of Strings | +| Status | The status of the agent, mapped to the `AgentState`. | String | +| Plan | The plan for the following steps after the current action. | List of Strings | +| Comment | Additional comments or information provided to the user. | String | +| SaveScreenshot | The flag to save the screenshot of the application to the `blackboard` for future reference. | Boolean | + +Below is an example of the `AppAgent` output: + +```json +{ + "Observation": "Application screenshot", + "Thought": "Logical reasoning process", + "ControlLabel": "Control index", + "ControlText": "Control name", + "Function": "Function name", + "Args": ["arg1", "arg2"], + "Status": "AgentState", + "Plan": ["Step 1", "Step 2"], + "Comment": "Additional comments", + "SaveScreenshot": true +} +``` + +!!! info + The `AppAgent` output is formatted as a JSON object by LLMs and can be parsed by the `json.loads` method in Python. + + +## AppAgent State +The `AppAgent` state is managed by a state machine that determines the next action to be executed based on the current state, as defined in the `ufo/agents/states/app_agent_states.py` module. The states include: + +| State | Description | +| --- | --- | +| `CONTINUE` | The `AppAgent` continues executing the current action. | +| `FINISH` | The `AppAgent` has completed the current sub-task. | +| `ERROR` | The `AppAgent` encountered an error during execution. | +| `FAIL` | The `AppAgent` believes the current sub-task is unachievable. | + +| `CONFIRM` | The `AppAgent` is confirming the user's input or action. | +| `SCREENSHOT` | The `AppAgent` believes the current screenshot is not clear in annotating the control and requests a new screenshot. | + +The state machine diagram for the `AppAgent` is shown below: +

+ +

+ +The `AppAgent` progresses through these states to execute the necessary actions within the application and fulfill the sub-task assigned by the `HostAgent`. + + +## Knowledge Enhancement +The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases and demonstration libraries. The `AppAgent` leverages this knowledge to enhance its comprehension of the application and learn from demonstrations to improve its performance. + +### Learning from Help Documents +User can provide help documents to the `AppAgent` to enhance its comprehension of the application and improve its performance in the `config.yaml` file. + +!!! tip + Please find details configuration in the [documentation](../configurations/user_configuration.md). +!!! tip + You may also refer to the [here]() for how to provide help documents to the `AppAgent`. + + +In the `AppAgent`, it calls the `build_offline_docs_retriever` to build a help document retriever, and uses the `retrived_documents_prompt_helper` to contruct the prompt for the `AppAgent`. + + + +### Learning from Bing Search +Since help documents may not cover all the information or the information may be outdated, the `AppAgent` can also leverage Bing search to retrieve the latest information. You can activate Bing search and configure the search engine in the `config.yaml` file. + +!!! tip + Please find details configuration in the [documentation](../configurations/user_configuration.md). +!!! tip + You may also refer to the [here]() for the implementation of Bing search in the `AppAgent`. + +In the `AppAgent`, it calls the `build_online_search_retriever` to build a Bing search retriever, and uses the `retrived_documents_prompt_helper` to contruct the prompt for the `AppAgent`. + + +### Learning from Self-Demonstrations +You may save successful action trajectories in the `AppAgent` to learn from self-demonstrations and improve its performance. After the completion of a `session`, the `AppAgent` will ask the user whether to save the action trajectories for future reference. You may configure the use of self-demonstrations in the `config.yaml` file. + +!!! tip + You can find details of the configuration in the [documentation](../configurations/user_configuration.md). + +!!! tip + You may also refer to the [here]() for the implementation of self-demonstrations in the `AppAgent`. + +In the `AppAgent`, it calls the `build_experience_retriever` to build a self-demonstration retriever, and uses the `rag_experience_retrieve` to retrieve the demonstration for the `AppAgent`. + +### Learning from Human Demonstrations +In addition to self-demonstrations, you can also provide human demonstrations to the `AppAgent` to enhance its performance by using the [Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) tool built in the Windows OS. The `AppAgent` will learn from the human demonstrations to improve its performance and achieve better personalization. The use of human demonstrations can be configured in the `config.yaml` file. + +!!! tip + You can find details of the configuration in the [documentation](../configurations/user_configuration.md). +!!! tip + You may also refer to the [here]() for the implementation of human demonstrations in the `AppAgent`. 
+ +In the `AppAgent`, it calls the `build_human_demonstration_retriever` to build a human demonstration retriever, and uses the `rag_experience_retrieve` to retrieve the demonstration for the `AppAgent`. + + +## Skill Set for Automation +The `AppAgent` is equipped with a versatile skill set to support comprehensive automation within the application by calling the `create_puppteer_interface` method. The skills include: + +| Skill | Description | +| --- | --- | +| UI Automation | Mimicking user interactions with the application UI controls using the `UI Automation` and `Win32` API. | +| Native API | Accessing the application's native API to execute specific functions and actions. | +| In-App Agent | Leveraging the in-app agent to interact with the application's internal functions and features. | + +By utilizing these skills, the `AppAgent` can efficiently interact with the application and fulfill the user's request. You can find more details in the [Automator](../automator/overview.md) documentation and the code in the `ufo/automator` module. + + +# Reference + +:::agents.agent.app_agent.AppAgent \ No newline at end of file diff --git a/documents/docs/agents/design/blackboard.md b/documents/docs/agents/design/blackboard.md new file mode 100644 index 00000000..dd40719f --- /dev/null +++ b/documents/docs/agents/design/blackboard.md @@ -0,0 +1,55 @@ +# Agent Blackboard + +The `Blackboard` is a shared memory space that is visible to all agents in the UFO framework. It stores information required for agents to interact with the user and applications at every step. The `Blackboard` is a key component of the UFO framework, enabling agents to share information and collaborate to fulfill user requests. The `Blackboard` is implemented as a class in the `ufo/agents/memory/blackboard.py` file. + +## Components + +The `Blackboard` consists of the following data components: + +| Component | Description | +| --- | --- | +| `questions` | A list of questions that UFO asks the user, along with their corresponding answers. | +| `requests` | A list of historical user requests received in previous `Round`. | +| `trajectories` | A list of step-wise trajectories that record the agent's actions and decisions at each step. | +| `screenshots` | A list of screenshots taken by the agent when it believes the current state is important for future reference. | + +!!! tip + The keys stored in the `trajectories` are configured as `HISTORY_KEYS` in the `config_dev.yaml` file. You can customize the keys based on your requirements and the agent's logic. + +!!! tip + Whether to save the screenshots is determined by the `AppAgent`. You can enable or disable screenshot capture by setting the `SCREENSHOT_TO_MEMORY` flag in the `config_dev.yaml` file. + +## Blackboard to Prompt + +Data in the `Blackboard` is based on the `MemoryItem` class. It has a method `blackboard_to_prompt` that converts the information stored in the `Blackboard` to a string prompt. Agents call this method to construct the prompt for the LLM's inference. The `blackboard_to_prompt` method is defined as follows: + +```python +def blackboard_to_prompt(self) -> List[str]: + """ + Convert the blackboard to a prompt. + :return: The prompt. 
+ """ + prefix = [ + { + "type": "text", + "text": "[Blackboard:]", + } + ] + + blackboard_prompt = ( + prefix + + self.texts_to_prompt(self.questions, "[Questions & Answers:]") + + self.texts_to_prompt(self.requests, "[Request History:]") + + self.texts_to_prompt(self.trajectories, "[Step Trajectories:]") + + self.screenshots_to_prompt() + ) + + return blackboard_prompt +``` + +## Reference + +:::agents.memory.blackboard.Blackboard + +!!!note + You can customize the class to tailor the `Blackboard` to your requirements. \ No newline at end of file diff --git a/documents/docs/agents/design/memory.md b/documents/docs/agents/design/memory.md new file mode 100644 index 00000000..484412aa --- /dev/null +++ b/documents/docs/agents/design/memory.md @@ -0,0 +1,24 @@ +# Agent Memory + +The `Memory` manages the memory of the agent and stores the information required for the agent to interact with the user and applications at every step. Parts of elements in the `Memory` will be visible to the agent for decision-making. + + +## MemoryItem +A `MemoryItem` is a `dataclass` that represents a single step in the agent's memory. The fields of a `MemoryItem` is flexible and can be customized based on the requirements of the agent. The `MemoryItem` class is defined as follows: + +::: agents.memory.memory.MemoryItem + +!!!info + At each step, an instance of `MemoryItem` is created and stored in the `Memory` to record the information of the agent's interaction with the user and applications. + + +## Memory +The `Memory` class is responsible for managing the memory of the agent. It stores a list of `MemoryItem` instances that represent the agent's memory at each step. The `Memory` class is defined as follows: + +::: agents.memory.memory.Memory + +!!!info + Each agent has its own `Memory` instance to store their information. + +!!!info + Not all information in the `Memory` are provided to the agent for decision-making. The agent can access parts of the memory based on the requirements of the agent's logic. \ No newline at end of file diff --git a/documents/docs/agents/design/processor.md b/documents/docs/agents/design/processor.md new file mode 100644 index 00000000..c91020c5 --- /dev/null +++ b/documents/docs/agents/design/processor.md @@ -0,0 +1,29 @@ +# Agents Processor + +The `Processor` is a key component of the agent to process the core logic of the agent to process the user's request. The `Processor` is implemented as a class in the `ufo/agents/processors` folder. Each agent has its own `Processor` class withing the folder. + +## Core Process +Once called, an agent follows a series of steps to process the user's request defined in the `Processor` class by calling the `process` method. The workflow of the `process` is as follows: + +| Step | Description | Function | +| --- | --- | --- | +| 1 | Print the step information. | `print_step_info` | +| 2 | Capture the screenshot of the application. | `capture_screenshot` | +| 3 | Get the control information of the application. | `get_control_info` | +| 4 | Get the prompt message for the LLM. | `get_prompt_message` | +| 5 | Generate the response from the LLM. | `get_response` | +| 6 | Update the cost of the step. | `update_cost` | +| 7 | Parse the response from the LLM. | `parse_response` | +| 8 | Execute the action based on the response. | `execute_action` | +| 9 | Update the memory and blackboard. | `update_memory` | +| 10 | Update the status of the agent. | `update_status` | +| 11 | Update the step information. 
| `update_step` |
+
+At each step, the `Processor` processes the user's request by invoking the corresponding methods sequentially to execute the necessary actions.
+
+
+The process may be paused. It can be resumed based on the agent's logic and the user's request using the `resume` method.
+
+## Reference
+Below is the basic structure of the `Processor` class:
+:::agents.processors.basic.BaseProcessor
\ No newline at end of file
diff --git a/documents/docs/agents/design/prompter.md b/documents/docs/agents/design/prompter.md
new file mode 100644
index 00000000..aefd4a7b
--- /dev/null
+++ b/documents/docs/agents/design/prompter.md
@@ -0,0 +1,47 @@
+# Agent Prompter
+
+The `Prompter` is a key component of the UFO framework, responsible for constructing prompts for the LLM to generate responses. The `Prompter` is implemented in the `ufo/prompts` folder. Each agent has its own `Prompter` class that defines the structure of the prompt and the information to be fed to the LLM.
+
+## Components
+
+A prompt fed to the LLM is usually a list of dictionaries, where each dictionary contains the following keys:
+
+| Key | Description |
+| --- | --- |
+| `role` | The role of the text in the prompt, which can be `system`, `user`, or `assistant`. |
+| `content` | The content of the text for the specific role. |
+
+!!!tip
+    You may find the [official documentation](https://help.openai.com/en/articles/7042661-moving-from-completions-to-chat-completions-in-the-openai-api) helpful for constructing the prompt.
+
+In the `__init__` method of the `Prompter` class, you can define the template of the prompt for each component, and the final prompt message is constructed by combining the templates of each component using the `prompt_construction` method.
+
+### System Prompt
+The system prompt uses the template configured in the `config_dev.yaml` file for each agent. It usually contains the instructions for the agent's role, actions, tips, response format, etc.
+You need to use the `system_prompt_construction` method to construct the system prompt.
+
+Prompts on the API instructions and demonstration examples are also included in the system prompt; they are constructed by the `api_prompt_helper` and `examples_prompt_helper` methods, respectively. Below are the sub-components of the system prompt:
+
+| Component | Description | Method |
+| --- | --- | --- |
+| `apis` | The API instructions for the agent. | `api_prompt_helper` |
+| `examples` | The demonstration examples for the agent. | `examples_prompt_helper` |
+
+### User Prompt
+The user prompt is constructed based on the information from the agent's observation, external knowledge, and the `Blackboard`. You can use the `user_prompt_construction` method to construct the user prompt. Below are the sub-components of the user prompt:
+
+| Component | Description | Method |
+| --- | --- | --- |
+| `observation` | The observation of the agent. | `user_content_construction` |
+| `retrieved_docs` | The knowledge retrieved from the external knowledge base. | `retrived_documents_prompt_helper` |
+| `blackboard` | The information stored in the `Blackboard`. | `blackboard_to_prompt` |
+
+
+# Reference
+You can find the implementation of the `Prompter` in the `ufo/prompts` folder. Below is the basic structure of the `Prompter` class:
+
+:::prompter.basic.BasicPrompter
+
+
+!!!tip
+    You can customize the `Prompter` class to tailor the prompt to your requirements.
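+
+For illustration, the final prompt message handed to the LLM is a list of role/content dictionaries as described in the Components section above. The sketch below is hand-written to show that shape only; the helper function is hypothetical and is not UFO's `prompt_construction` API.
+
+```python
+from typing import Dict, List
+
+
+def build_messages(system_text: str, user_text: str) -> List[Dict[str, str]]:
+    """Assemble a chat-style prompt from a system segment and a user segment."""
+    return [
+        {"role": "system", "content": system_text},  # agent role, tips, response format, ...
+        {"role": "user", "content": user_text},      # observation, retrieved docs, blackboard, ...
+    ]
+
+
+messages = build_messages(
+    system_text="You are an agent operating a Windows application...",
+    user_text="[Observation:] ... [Request:] Insert a heading into the document.",
+)
+```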
\ No newline at end of file diff --git a/documents/docs/agents/design/state.md b/documents/docs/agents/design/state.md new file mode 100644 index 00000000..a1495711 --- /dev/null +++ b/documents/docs/agents/design/state.md @@ -0,0 +1,122 @@ +# Agent State + +The `State` class is a fundamental component of the UFO agent framework. It represents the current state of the agent and determines the next action and agent to handle the request. Each agent has a specific set of states that define the agent's behavior and workflow. + + + +## AgentStatus +The set of states for an agent is defined in the `AgentStatus` class: + +```python +class AgentStatus(Enum): + """ + The status class for the agent. + """ + + ERROR = "ERROR" + FINISH = "FINISH" + CONTINUE = "CONTINUE" + FAIL = "FAIL" + PENDING = "PENDING" + CONFIRM = "CONFIRM" + SCREENSHOT = "SCREENSHOT" +``` + +Each agent implements its own set of `AgentStatus` to define the states of the agent. + + +## AgentStateManager + +The class `AgentStateManager` manages the state mapping from a string to the corresponding state class. Each state class is registered with the `AgentStateManager` using the `register` decorator to associate the state class with a specific agent, e.g., + +```python +@AgentStateManager.register +class SomeAgentState(AgentState): + """ + The state class for the some agent. + """ +``` + +!!! tip + You can find examples on how to register the state class for the `AppAgent` in the `ufo/agents/states/app_agent_state.py` file. + +Below is the basic structure of the `AgentStateManager` class: +```python +class AgentStateManager(ABC, metaclass=SingletonABCMeta): + """ + A abstract class to manage the states of the agent. + """ + + _state_mapping: Dict[str, Type[AgentState]] = {} + + def __init__(self): + """ + Initialize the state manager. + """ + + self._state_instance_mapping: Dict[str, AgentState] = {} + + def get_state(self, status: str) -> AgentState: + """ + Get the state for the status. + :param status: The status string. + :return: The state object. + """ + + # Lazy load the state class + if status not in self._state_instance_mapping: + state_class = self._state_mapping.get(status) + if state_class: + self._state_instance_mapping[status] = state_class() + else: + self._state_instance_mapping[status] = self.none_state + + state = self._state_instance_mapping.get(status, self.none_state) + + return state + + def add_state(self, status: str, state: AgentState) -> None: + """ + Add a new state to the state mapping. + :param status: The status string. + :param state: The state object. + """ + self.state_map[status] = state + + @property + def state_map(self) -> Dict[str, AgentState]: + """ + The state mapping of status to state. + :return: The state mapping. + """ + return self._state_instance_mapping + + @classmethod + def register(cls, state_class: Type[AgentState]) -> Type[AgentState]: + """ + Decorator to register the state class to the state manager. + :param state_class: The state class to be registered. + :return: The state class. + """ + cls._state_mapping[state_class.name()] = state_class + return state_class + + @property + @abstractmethod + def none_state(self) -> AgentState: + """ + The none state of the state manager. + """ + pass +``` + +## AgentState +Each state class inherits from the `AgentState` class and must implement the method of `handle` to process the action in the state. In addition, the `next_state` and `next_agent` methods are used to determine the next state and agent to handle the transition. 
Please find below the reference for the `State` class in UFO. + +::: agents.states.basic.AgentState + +!!!tip + The state machine diagrams for the `HostAgent` and `AppAgent` are shown in their respective documents. + +!!!tip + A `Round` calls the `handle`, `next_state`, and `next_agent` methods of the current state to process the user request and determine the next state and agent to handle the request, and orchestrates the agents to execute the necessary actions. diff --git a/documents/docs/agents/evaluation_agent.md b/documents/docs/agents/evaluation_agent.md new file mode 100644 index 00000000..b75e726d --- /dev/null +++ b/documents/docs/agents/evaluation_agent.md @@ -0,0 +1,67 @@ +# EvaluationAgent 🧐 + +The objective of the `EvaluationAgent` is to evaluate whether a `Session` or `Round` has been successfully completed. The `EvaluationAgent` assesses the performance of the `HostAgent` and `AppAgent` in fulfilling the request. You can configure whether to enable the `EvaluationAgent` in the `config_dev.yaml` file and the detailed documentation can be found [here](../configurations/developer_configuration.md). +!!! note + The `EvaluationAgent` is fully LLM-driven and conducts evaluations based on the action trajectories and screenshots. It may not by 100% accurate since LLM may make mistakes. + + +## Configuration +To enable the `EvaluationAgent`, you can configure the following parameters in the `config_dev.yaml` file to evaluate the task completion status at different levels: + +| Configuration Option | Description | Type | Default Value | +|---------------------------|-----------------------------------------------|---------|---------------| +| `EVA_SESSION` | Whether to include the session in the evaluation. | Boolean | True | +| `EVA_ROUND` | Whether to include the round in the evaluation. | Boolean | False | +| `EVA_ALL_SCREENSHOTS` | Whether to include all the screenshots in the evaluation. | Boolean | True | + + +## Evaluation Inputs +The `EvaluationAgent` takes the following inputs for evaluation: + +| Input | Description | Type | +| --- | --- | --- | +| User Request | The user's request to be evaluated. | String | +| APIs Description | The description of the APIs used in the execution. | List of Strings | +| Action Trajectories | The action trajectories executed by the `HostAgent` and `AppAgent`. | List of Strings | +| Screenshots | The screenshots captured during the execution. | List of Images | + +For more details on how to construct the inputs, please refer to the `EvaluationAgentPrompter` class in `ufo/prompter/eva_prompter.py`. + +!!! tip + You can configure whether to use all screenshots or only the first and last screenshot for evaluation in the `EVA_ALL_SCREENSHOTS` of the `config_dev.yaml` file. + + +## Evaluation Outputs +The `EvaluationAgent` generates the following outputs after evaluation: + +| Output | Description | Type | +| --- | --- | --- | +| reason | The detailed reason for your judgment, by observing the screenshot differences and the . | String | +| sub_scores | The sub-score of the evaluation in decomposing the evaluation into multiple sub-goals. | List of Dictionaries | +| complete | The completion status of the evaluation, can be `yes`, `no`, or `unsure`. | String | + +Below is an example of the evaluation output: + +```json +{ + "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. + The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. 
+ The agent then focused on the chat window, input the message 'hello', and clicked the Send button. + The final screenshot confirms that the message 'hello' was sent to Zac.", + "sub_scores": { + "correct application focus": "yes", + "correct message input": "yes", + "message sent successfully": "yes" + }, + "complete": "yes"} +``` + +!!!info + The log of the evaluation results will be saved in the `logs/{task_name}/evaluation.log` file. + +The `EvaluationAgent` employs the CoT mechanism to first decompose the evaluation into multiple sub-goals and then evaluate each sub-goal separately. The sub-scores are then aggregated to determine the overall completion status of the evaluation. + +# Reference + +:::agents.agent.evaluation_agent.EvaluationAgent + diff --git a/documents/docs/agents/follower_agent.md b/documents/docs/agents/follower_agent.md new file mode 100644 index 00000000..4855366f --- /dev/null +++ b/documents/docs/agents/follower_agent.md @@ -0,0 +1,28 @@ +# Follower Agent 🚶🏽‍♂️ + +The `FollowerAgent` is inherited from the `AppAgent` and is responsible for following the user's instructions to perform specific tasks within the application. The `FollowerAgent` is designed to execute a series of actions based on the user's guidance. It is particularly useful for software testing, when clear instructions are provided to validate the application's behavior. + + +## Different from the AppAgent +The `FollowerAgent` shares most of the functionalities with the `AppAgent`, but it is designed to follow the step-by-step instructions provided by the user, instead of does its own reasoning to determine the next action. + + +## Usage +The `FollowerAgent` is available in `follower` mode. You can find more details in the [documentation](). It also uses differnt `Session` and `Processor` to handle the user's instructions. The step-wise instructions are provided by the user in the in a json file, which is then parsed by the `FollowerAgent` to execute the actions. An example of the json file is shown below: + +```json +{ + "task": "Type in a bold text of 'Test For Fun'", + "steps": + [ + "1.type in 'Test For Fun'", + "2.select the text of 'Test For Fun'", + "3.click on the bold" + ], + "object": "draft.docx" +} +``` + +# Reference + +:::agents.agent.follower_agent.FollowerAgent \ No newline at end of file diff --git a/documents/docs/agents/host_agent.md b/documents/docs/agents/host_agent.md new file mode 100644 index 00000000..2c29f9d3 --- /dev/null +++ b/documents/docs/agents/host_agent.md @@ -0,0 +1,154 @@ +# HostAgent 🤖 + +The `HostAgent` assumes three primary responsibilities: + +1. **User Engagement**: The `HostAgent` engages with the user to understand their request and analyze their intent. It also conversates with the user to gather additional information when necessary. +2. **AppAgent Management**: The `HostAgent` manages the creation and registration of `AppAgents` to fulfill the user's request. It also orchestrates the interaction between the `AppAgents` and the application. +3. **Task Management**: The `HostAgent` analyzes the user's request, to decompose it into sub-tasks and distribute them among the `AppAgents`. It also manages the scheduling, orchestration, coordination, and monitoring of the `AppAgents` to ensure the successful completion of the user's request. +4. **Communication**: The `HostAgent` communicates with the `AppAgents` to exchange information. It also manages the `Blackboard` to store and share information among the agents, as shown below: + +

+ Blackboard Image +

+ + +The `HostAgent` activates its `Processor` to process the user's request and decompose it into sub-tasks. Each sub-task is then assigned to an `AppAgent` for execution. The `HostAgent` monitors the progress of the `AppAgents` and ensures the successful completion of the user's request. + +## HostAgent Input + +The `HostAgent` receives the following inputs: + +| Input | Description | Type | +| --- | --- | --- | +| User Request | The user's request in natural language. | String | +| Application Information | Information about the existing active applications. | List of Strings | +| Desktop Screenshots | Screenshots of the desktop to provide context to the `HostAgent`. | Image | +| Previous Sub-Tasks | The previous sub-tasks and their completion status. | List of Strings | +| Previous Plan | The previous plan for the following sub-tasks. | List of Strings | +| Blackboard | The shared memory space for storing and sharing information among the agents. | Dictionary | + +By processing these inputs, the `HostAgent` determines the appropriate application to fulfill the user's request and orchestrates the `AppAgents` to execute the necessary actions. + +## HostAgent Output + +With the inputs provided, the `HostAgent` generates the following outputs: + +| Output | Description | Type | +| --- | --- | --- | +| Observation | The observation of current desktop screenshots. | String | +| Thought | The logical reasoning process of the `HostAgent`. | String | +| Current Sub-Task | The current sub-task to be executed by the `AppAgent`. | String | +| Message | The message to be sent to the `AppAgent` for the completion of the sub-task. | String | +| ControlLabel | The index of the selected application to execute the sub-task. | String | +| ControlText | The name of the selected application to execute the sub-task. | String | +| Plan | The plan for the following sub-tasks after the current sub-task. | List of Strings | +| Status | The status of the agent, mapped to the `AgentState`. | String | +| Comment | Additional comments or information provided to the user. | String | +| Questions | The questions to be asked to the user for additional information. | List of Strings | +| AppsToOpen | The application to be opened to execute the sub-task if it is not already open. | Dictionary | + + +Below is an example of the `HostAgent` output: + +```json +{ + "Observation": "Desktop screenshot", + "Thought": "Logical reasoning process", + "Current Sub-Task": "Sub-task description", + "Message": "Message to AppAgent", + "ControlLabel": "Application index", + "ControlText": "Application name", + "Plan": ["Sub-task 1", "Sub-task 2"], + "Status": "AgentState", + "Comment": "Additional comments", + "Questions": ["Question 1", "Question 2"], + "AppsToOpen": {"APP": "powerpnt", "file_path": ""} +} +``` + +!!! info + The `HostAgent` output is formatted as a JSON object by LLMs and can be parsed by the `json.loads` method in Python. + + +## HostAgent State + +The `HostAgent` progresses through different states, as defined in the `ufo/agents/states/host_agent_states.py` module. The states include: + +| State | Description | +| --- | --- | +| `CONTINUE` | The `HostAgent` is ready to process the user's request and emloy the `Processor` to decompose it into sub-tasks and assign them to the `AppAgents`. | +| `FINISH` | The overall task is completed, and the `HostAgent` is ready to return the results to the user. | +| `ERROR` | An error occurred during the processing of the user's request, and the `HostAgent` is unable to proceed. 
| +| `FAIL` | The `HostAgent` believes the task is unachievable and cannot proceed further. | +| `PENDING` | The `HostAgent` is waiting for additional information from the user to proceed. | + + +The state machine diagram for the `HostAgent` is shown below: +

+ +

+ + +The `HostAgent` transitions between these states based on the user's request, the application information, and the progress of the `AppAgents` in executing the sub-tasks. + + +## Task Decomposition +Upon receiving the user's request, the `HostAgent` decomposes it into sub-tasks and assigns each sub-task to an `AppAgent` for execution. The `HostAgent` determines the appropriate application for each sub-task based on the available application information. It then orchestrates the `AppAgents` to execute the necessary actions to complete the sub-tasks. We show the task decomposition process in the following figure: + +

+ Task Decomposition Image +

+ +## Creating and Registering AppAgents +When the `HostAgent` determines the need for a new `AppAgent` to fulfill a sub-task, it creates an instance of the `AppAgent` and registers it with the `HostAgent`, by calling the `create_subagent` method: + +```python +def create_subagent( + self, + agent_type: str, + agent_name: str, + process_name: str, + app_root_name: str, + is_visual: bool, + main_prompt: str, + example_prompt: str, + api_prompt: str, + *args, + **kwargs, + ) -> BasicAgent: + """ + Create an SubAgent hosted by the HostAgent. + :param agent_type: The type of the agent to create. + :param agent_name: The name of the SubAgent. + :param process_name: The process name of the app. + :param app_root_name: The root name of the app. + :param is_visual: The flag indicating whether the agent is visual or not. + :param main_prompt: The main prompt file path. + :param example_prompt: The example prompt file path. + :param api_prompt: The API prompt file path. + :return: The created SubAgent. + """ + app_agent = self.agent_factory.create_agent( + agent_type, + agent_name, + process_name, + app_root_name, + is_visual, + main_prompt, + example_prompt, + api_prompt, + *args, + **kwargs, + ) + self.appagent_dict[agent_name] = app_agent + app_agent.host = self + self._active_appagent = app_agent + + return app_agent +``` + +The `HostAgent` then assigns the sub-task to the `AppAgent` for execution and monitors its progress. + +# Reference + +:::agents.agent.host_agent.HostAgent diff --git a/documents/docs/agents/overview.md b/documents/docs/agents/overview.md index e69de29b..1c64e609 100644 --- a/documents/docs/agents/overview.md +++ b/documents/docs/agents/overview.md @@ -0,0 +1,37 @@ +# Agents + +In UFO, there are four types of agents: `HostAgent`, `AppAgent`, `FollowerAgent`, and `EvaluationAgent`. Each agent has a specific role in the UFO system and is responsible for different aspects of the user interaction process: + +| Agent | Description | +| --- | --- | +| [`HostAgent`](../agents/host_agent.md) | Decomposes the user request into sub-tasks and selects the appropriate application to fulfill the request. | +| [`AppAgent`](../agents/app_agent.md) | Executes actions on the selected application. | +| [`FollowerAgent`](../agents/follower_agent.md) | Follows the user's instructions to complete the task. | +| [`EvaluationAgent`](../agents/evaluation_agent.md) | Evaluates the completeness of a session or a round. | + +In the normal workflow, only the `HostAgent` and `AppAgent` are involved in the user interaction process. The `FollowerAgent` and `EvaluationAgent` are used for specific tasks. + +Please see below the orchestration of the agents in UFO: + +

+ +

+ +## Main Components + +An agent in UFO is composed of the following main components to fulfill its role in the UFO system: + +| Component | Description | +| --- | --- | +| [`State`](../agents/design/state.md) | Represents the current state of the agent and determines the next action and agent to handle the request. | +| [`Memory`](../agents/design/memory.md) | Stores information about the user request, application state, and other relevant data. | +| [`Blackboard`](../agents/design/blackboard.md) | Stores information shared between agents. | +| [`Prompter`](../agents/design/prompter.md) | Generates prompts for the language model based on the user request and application state. | +| [`Processor`](../agents/design/processor.md) | Processes the workflow of the agent, including handling user requests, executing actions, and memory management. | + +## Reference + +Below is the reference for the `Agent` class in UFO. All agents in UFO inherit from the `Agent` class and implement necessary methods to fulfill their roles in the UFO system. + +::: agents.agent.basic.BasicAgent + diff --git a/documents/docs/automator/ai_tool_automator.md b/documents/docs/automator/ai_tool_automator.md new file mode 100644 index 00000000..f29ae42a --- /dev/null +++ b/documents/docs/automator/ai_tool_automator.md @@ -0,0 +1,26 @@ +# AI Tool Automator + +The AI Tool Automator is a component of the UFO framework that enables the agent to interact with AI tools based on large language models (LLMs). The AI Tool Automator is designed to facilitate the integration of LLM-based AI tools into the UFO framework, enabling the agent to leverage the capabilities of these tools to perform complex tasks. + +!!! note + UFO can also call in-app AI tools, such as `Copilot`, to assist with the automation process. This is achieved by using either `UI Automation` or `API` to interact with the in-app AI tool. These in-app AI tools differ from the AI Tool Automator, which is designed to interact with external AI tools based on LLMs that are not integrated into the application. + +## Configuration +The AI Tool Automator shares the same prompt configuration options as the UI Automator: + +| Configuration Option | Description | Type | Default Value | +|-------------------------|---------------------------------------------------------------------------------------------------------|----------|---------------| +| `API_PROMPT` | The prompt for the UI automation API. | String | "ufo/prompts/share/base/api.yaml" | + + +## Receiver +The AI Tool Automator shares the same receiver structure as the UI Automator. Please refer to the [UI Automator Receiver](./ui_automator.md#receiver) section for more details. + +## Command +The command of the AI Tool Automator shares the same structure as the UI Automator. Please refer to the [UI Automator Command](./ui_automator.md#command) section for more details. The list of available commands in the AI Tool Automator is shown below: + +| Command Name | Function Name | Description | +|--------------|---------------|-------------| +| `AnnotationCommand` | `annotation` | Annotate the control items on the screenshot. | +| `SummaryCommand` | `summary` | Summarize the observation of the current application window. 
| + diff --git a/documents/docs/automator/overview.md b/documents/docs/automator/overview.md new file mode 100644 index 00000000..d70a5d67 --- /dev/null +++ b/documents/docs/automator/overview.md @@ -0,0 +1,64 @@ +# Application Automator + +The Automator application is a tool that allows UFO to automate and take actions on applications. Currently, UFO supports two types of actions: `UI Automation` and `API`. + +!!! note + UFO can also call in-app AI tools, such as `Copilot`, to assist with the automation process. This is achieved by using either `UI Automation` or `API` to interact with the in-app AI tool. + +- [UI Automator](./ui_automator.md) - This action type is used to interact with the application's UI controls, such as buttons, text boxes, and menus. UFO uses the **UIA** or **Win32** APIs to interact with the application's UI controls. +- [API](./wincom_automator.md) - This action type is used to interact with the application's native API. Users and app developers can create their own API actions to interact with specific applications. +- [Web](./web_automator.md) - This action type is used to interact with web applications. UFO uses the [**crawl4ai**](https://github.com/unclecode/crawl4ai) library to extract information from web pages. +- [AI Tool](./ai_tool_automator.md) - This action type is used to interact with the LLM-based AI tools. + +## Action Design Patterns + +Actions in UFO are implemented using the [command](https://refactoring.guru/design-patterns/command) design pattern, which encapsulates a receiver, a command, and an invoker. The receiver is the object that performs the action, the command is the object that encapsulates the action, and the invoker is the object that triggers the action. + +The basic classes for implementing actions in UFO are as follows: + +| Role | Class | Description | +| --- | --- | --- | +| Receiver | `ufo.automator.basic.ReceiverBasic` | The base class for all receivers in UFO. Receivers are objects that perform actions on applications. | +| Command | `ufo.automator.basic.CommandBasic` | The base class for all commands in UFO. Commands are objects that encapsulate actions to be performed by receivers. | +| Invoker | `ufo.automator.puppeteer.AppPuppeteer` | The base class for the invoker in UFO. Invokers are objects that trigger commands to be executed by receivers. | + +The advantage of using the command design pattern in the agent framework is that it allows for the decoupling of the sender and receiver of the action. This decoupling enables the agent to execute actions on different objects without knowing the details of the object or the action being performed, making the agent more flexible and extensible for new actions. + +## Receiver + +The `Receiver` is a central component in the Automator application that performs actions on the application. It provides functionalities to interact with the application and execute the action. All available actions are registered in the `Receiver` with the `ReceiverManager` class. + +You can find the reference for a basic `Receiver` class below: + +::: automator.basic.ReceiverBasic + +
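+To make the pattern concrete, below is a minimal, illustrative sketch of a custom receiver. The `NotepadReceiver` class, its constructor argument, and its `type_text` method are hypothetical and not part of UFO; a real receiver may also need to implement additional members required by `ReceiverBasic`.
+
+```python
+# Illustrative sketch only: class, constructor argument, and method names are hypothetical.
+from ufo.automator.basic import ReceiverBasic
+
+
+class NotepadReceiver(ReceiverBasic):
+    """
+    A hypothetical receiver that performs actions on a Notepad-like application.
+    """
+
+    def __init__(self, window) -> None:
+        # The object that actually executes the actions, e.g., a UI window wrapper.
+        self.window = window
+
+    def type_text(self, text: str) -> None:
+        """
+        Type the given text into the application window.
+        :param text: The text to type.
+        """
+        self.window.type_keys(text)
+```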
+ +## Command + +The `Command` is a specific action that the `Receiver` can perform on the application. It encapsulates the function and parameters required to execute the action. The `Command` class is a base class for all commands in the Automator application. + +You can find the reference for a basic `Command` class below: + +::: automator.basic.CommandBasic + +
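+Continuing the illustrative sketch from the Receiver section above (all names remain hypothetical), a command wraps a single receiver method and exposes it under an atomic name:
+
+```python
+# Illustrative sketch only: TypeTextCommand and NotepadReceiver are hypothetical names.
+from ufo.automator.basic import CommandBasic
+
+
+@NotepadReceiver.register
+class TypeTextCommand(CommandBasic):
+    """
+    A hypothetical command that types text through its receiver.
+    """
+
+    def __init__(self, receiver: NotepadReceiver, params=None) -> None:
+        self.receiver = receiver
+        self.params = params if params is not None else {}
+
+    def execute(self):
+        # Delegate the actual work to the receiver.
+        return self.receiver.type_text(self.params.get("text", ""))
+
+    @classmethod
+    def name(cls) -> str:
+        # The atomic command name used to look up this command.
+        return "type_text"
+```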
+ +!!! note + Each command must be registered with a specific `Receiver` using the `register` decorator so that it can be executed. For example: + @ReceiverExample.register + class CommandExample(CommandBasic): + ... + + +## Invoker + +The `AppPuppeteer` plays the role of the invoker in the Automator application. It triggers the commands to be executed by the receivers. The `AppPuppeteer` equips the `AppAgent` with the capability to interact with the application's UI controls. It provides functionalities to translate action strings into specific actions and execute them. All available actions are registered in the `Puppeteer` with the `ReceiverManager` class. + +You can find the implementation of the `AppPuppeteer` class in the `ufo/automator/puppeteer.py` file, and its reference is shown below. + +::: automator.puppeteer.AppPuppeteer + +
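+To illustrate the invoker role itself, here is a small, generic command-pattern sketch in plain Python. It is not UFO's actual `AppPuppeteer` implementation or API; it only shows how an invoker can resolve an atomic command name to a command class, bind it to a receiver, and execute it.
+
+```python
+# Generic command-pattern sketch of the invoker role (hypothetical classes, not UFO's AppPuppeteer API).
+from typing import Any, Dict, Type
+
+
+class SimpleInvoker:
+    """A hypothetical invoker that maps atomic command names to command classes."""
+
+    def __init__(self, receiver: Any, command_registry: Dict[str, Type]) -> None:
+        self.receiver = receiver
+        self.command_registry = command_registry
+
+    def execute(self, command_name: str, params: Dict[str, Any]) -> Any:
+        # Resolve the command class by name, bind it to the receiver, and run it.
+        command_cls = self.command_registry[command_name]
+        return command_cls(self.receiver, params).execute()
+
+
+# Example usage with the hypothetical receiver and command sketched in the sections above:
+# invoker = SimpleInvoker(NotepadReceiver(window), {"type_text": TypeTextCommand})
+# invoker.execute("type_text", {"text": "Hello, UFO!"})
+```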
+ +For further details, refer to the specific documentation for each component and class in the Automator module. \ No newline at end of file diff --git a/documents/docs/automator/ui_automator.md b/documents/docs/automator/ui_automator.md new file mode 100644 index 00000000..87a05135 --- /dev/null +++ b/documents/docs/automator/ui_automator.md @@ -0,0 +1,74 @@ +# UI Automator + +The UI Automator enables to mimic the operations of mouse and keyboard on the application's UI controls. UFO uses the **UIA** or **Win32** APIs to interact with the application's UI controls, such as buttons, edit boxes, and menus. + + +## Configuration + +There are several configurations that need to be set up before using the UI Automator in the `config_dev.yaml` file. Below is the list of configurations related to the UI Automator: + +| Configuration Option | Description | Type | Default Value | +|-------------------------|---------------------------------------------------------------------------------------------------------|----------|---------------| +| `CONTROL_BACKEND` | The backend for control action, currently supporting `uia` and `win32`. | String | "uia" | +| `CONTROL_LIST` | The list of widgets allowed to be selected. | List | ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton", "DataItem"] | +| `ANNOTATION_COLORS` | The colors assigned to different control types for annotation. | Dictionary | {"Button": "#FFF68F", "Edit": "#A5F0B5", "TabItem": "#A5E7F0", "Document": "#FFD18A", "ListItem": "#D9C3FE", "MenuItem": "#E7FEC3", "ScrollBar": "#FEC3F8", "TreeItem": "#D6D6D6", "Hyperlink": "#91FFEB", "ComboBox": "#D8B6D4"} | +| `API_PROMPT` | The prompt for the UI automation API. | String | "ufo/prompts/share/base/api.yaml" | +| `CLICK_API` | The API used for click action, can be `click_input` or `click`. | String | "click_input" | +| `INPUT_TEXT_API` | The API used for input text action, can be `type_keys` or `set_text`. | String | "type_keys" | +| `INPUT_TEXT_ENTER` | Whether to press enter after typing the text. | Boolean | False | + + + +## Receiver + +The receiver of the UI Automator is the `ControlReceiver` class defined in the `ufo/automator/ui_control/controller/control_receiver` module. It is initialized with the application's window handle and control wrapper that executes the actions. . The `ControlReceiver` provides functionalities to interact with the application's UI controls. Below is the reference for the `ControlReceiver` class: + +::: automator.ui_control.controller.ControlReceiver + +
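+For intuition only, the interactions performed by the `ControlReceiver` correspond to standard Windows UI Automation calls on control wrappers, for example via the [pywinauto](https://pywinauto.readthedocs.io/) library. The sketch below is an assumption-level illustration, not UFO's `ControlReceiver` code; it shows the kind of underlying calls selected by the `CLICK_API` and `INPUT_TEXT_API` options above (the window title and control are hypothetical):
+
+```python
+# Illustrative only: roughly how such calls look with pywinauto; this is not UFO's ControlReceiver code.
+from pywinauto import Application
+
+# Attach to a running application (the window title below is hypothetical).
+app = Application(backend="uia").connect(title_re=".*Notepad.*")
+window = app.top_window()
+edit = window.child_window(control_type="Edit")
+
+edit.click_input()                                # corresponds to CLICK_API: "click_input"
+edit.type_keys("Hello, UFO!", with_spaces=True)   # corresponds to INPUT_TEXT_API: "type_keys"
+```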
+ +## Command + +The command of the UI Automator is the `ControlCommand` class defined in the `ufo/automator/ui_control/controller/ControlCommand` module. It encapsulates the function and parameters required to execute the action. The `ControlCommand` class is a base class for all commands in the UI Automator application. Below is an example of a `ClickInputCommand` class that inherits from the `ControlCommand` class: + +```python +@ControlReceiver.register +class ClickInputCommand(ControlCommand): + """ + The click input command class. + """ + + def execute(self) -> str: + """ + Execute the click input command. + :return: The result of the click input command. + """ + return self.receiver.click_input(self.params) + + @classmethod + def name(cls) -> str: + """ + Get the name of the atomic command. + :return: The name of the atomic command. + """ + return "click_input" +``` + +!!! note + The concrete command classes must implement the `execute` method to execute the action and the `name` method to return the name of the atomic command. + +!!! note + Each command must register with a specific `ControlReceiver` to be executed using the `@ControlReceiver.register` decorator. + +Below is the list of available commands in the UI Automator that are currently supported by UFO: + +| Command Name | Function Name | Description | +|--------------|---------------|-------------| +| `ClickInputCommand` | `click_input` | Click the control item with the mouse. | +| `SetEditTextCommand` | `set_edit_text` | Add new text to the control item. | +| `GetTextsCommand` | `texts` | Get the text of the control item. | +| `WheelMouseInputCommand` | `wheel_mouse_input` | Scroll the control item. | +| `KeyboardInputCommand` | `keyboard_input` | Simulate the keyboard input. | + +!!! tip + You can customize the commands by adding new command classes to the `ufo/automator/ui_control/controller/ControlCommand` module. diff --git a/documents/docs/automator/web_automator.md b/documents/docs/automator/web_automator.md new file mode 100644 index 00000000..cb32aa61 --- /dev/null +++ b/documents/docs/automator/web_automator.md @@ -0,0 +1 @@ +# Web Automator \ No newline at end of file diff --git a/documents/docs/automator/wincom_automator.md b/documents/docs/automator/wincom_automator.md new file mode 100644 index 00000000..97ce4cd0 --- /dev/null +++ b/documents/docs/automator/wincom_automator.md @@ -0,0 +1,80 @@ +# API Automator + +UFO currently support the use of [`Win32 API`](https://learn.microsoft.com/en-us/windows/win32/api/) API automator to interact with the application's native API. We implement them in python using the [`pywin32`](https://pypi.org/project/pywin32/) library. The API automator now supports `Word` and `Excel` applications, and we are working on extending the support to other applications. + +## Configuration + +There are several configurations that need to be set up before using the API Automator in the `config_dev.yaml` file. Below is the list of configurations related to the API Automator: + +| Configuration Option | Description | Type | Default Value | +|-------------------------|---------------------------------------------------------------------------------------------------------|----------|---------------| +| `USE_APIS` | Whether to allow the use of application APIs. | Boolean | True | +| `API_PROMPT` | The prompt for the UI automation API. | String | "ufo/prompts/share/base/api.yaml" | +| `WORD_API_PROMPT` | The prompt for the Word APIs. 
| String | "ufo/prompts/apps/word/api.yaml" | +| `EXCEL_API_PROMPT` | The prompt for the Excel APIs. | String | "ufo/prompts/apps/excel/api.yaml" | + +## Receiver +The base class for the receiver of the API Automator is the `WinCOMReceiverBasic` class defined in the `ufo/automator/app_apis/basic` module. It is initialized with the application's win32 com object and provides functionalities to interact with the application's native API. Below is the reference for the `WinCOMReceiverBasic` class: + +::: automator.app_apis.basic.WinCOMReceiverBasic + +The receiver of `Word` and `Excel` applications inherit from the `WinCOMReceiverBasic` class. The `WordReceiver` and `ExcelReceiver` classes are defined in the `ufo/automator/app_apis/word` and `ufo/automator/app_apis/excel` modules, respectively: + + +## Command + +The command of the API Automator for the `Word` and `Excel` applications in located in the `client` module in the `ufo/automator/app_apis/{app_name}` folder inheriting from the `WinCOMCommand` class. It encapsulates the function and parameters required to execute the action. Below is an example of a `WordCommand` class that inherits from the `SelectTextCommand` class: + +```python +@WordWinCOMReceiver.register +class SelectTextCommand(WinCOMCommand): + """ + The command to select text. + """ + + def execute(self): + """ + Execute the command to select text. + :return: The selected text. + """ + return self.receiver.select_text(self.params.get("text")) + + @classmethod + def name(cls) -> str: + """ + The name of the command. + """ + return "select_text" +``` + +!!! note + The concrete command classes must implement the `execute` method to execute the action and the `name` method to return the name of the atomic command. + +!!! note + Each command must register with a concrete `WinCOMReceiver` to be executed using the `register` decorator. + +Below is the list of available commands in the API Automator that are currently supported by UFO: + +### Word API Commands + +| Command Name | Function Name | Description | +|--------------|---------------|-------------| +| `InsertTableCommand` | `insert_table` | Insert a table to a Word document. | +| `SelectTextCommand` | `select_text` | Select the text in a Word document. | +| `SelectTableCommand` | `select_table` | Select a table in a Word document. | + + +### Excel API Commands + +| Command Name | Function Name | Description | +|--------------|---------------|-------------| +| `GetSheetContentCommand` | `get_sheet_content` | Get the content of a sheet in the Excel app. | +| `Table2MarkdownCommand` | `table2markdown` | Convert the table content in a sheet of the Excel app to markdown format. | +| `InsertExcelTableCommand` | `insert_excel_table` | Insert a table to the Excel sheet. | + + +!!! tip + You can customize the commands by adding new command classes to the `ufo/automator/app_apis/{app_name}/` module. + + + \ No newline at end of file diff --git a/documents/docs/configurations/developer_configuration.md b/documents/docs/configurations/developer_configuration.md index d7af9097..2d983e9c 100644 --- a/documents/docs/configurations/developer_configuration.md +++ b/documents/docs/configurations/developer_configuration.md @@ -14,7 +14,7 @@ The following parameters are included in the system configuration of the UFO age | `RECTANGLE_TIME` | The time in seconds for the rectangle display around the selected control. | Integer | 1 | | `SAFE_GUARD` | Whether to use the safe guard to ask for user confirmation before performing sensitive operations. 
| Boolean | True | | `CONTROL_LIST` | The list of widgets allowed to be selected. | List | ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton", "DataItem"] | -| `HISTORY_KEYS` | The keys of the step history added to the blackboard for agent decision-making. | List | ["Step", "Thought", "ControlText", "Subtask", "Action", "Comment", "Results", "UserConfirm"] | +| `HISTORY_KEYS` | The keys of the step history added to the [`Blackboard`](../agents/design/blackboard.md) for agent decision-making. | List | ["Step", "Thought", "ControlText", "Subtask", "Action", "Comment", "Results", "UserConfirm"] | | `ANNOTATION_COLORS` | The colors assigned to different control types for annotation. | Dictionary | {"Button": "#FFF68F", "Edit": "#A5F0B5", "TabItem": "#A5E7F0", "Document": "#FFD18A", "ListItem": "#D9C3FE", "MenuItem": "#E7FEC3", "ScrollBar": "#FEC3F8", "TreeItem": "#D6D6D6", "Hyperlink": "#91FFEB", "ComboBox": "#D8B6D4"} | | `PRINT_LOG` | Whether to print the log in the console. | Boolean | False | | `CONCAT_SCREENSHOT` | Whether to concatenate the screenshots into a single image for the LLM input. | Boolean | False | @@ -24,7 +24,7 @@ The following parameters are included in the system configuration of the UFO age | `USE_APIS` | Whether to allow the use of application APIs. | Boolean | True | | `ALLOW_OPENAPP` | Whether to allow the open app action in `HostAgent`. | Boolean | False | | `LOG_XML` | Whether to log the XML file at every step. | Boolean | False | -| `SCREENSHOT_TO_MEMORY` | Whether to allow the screenshot to memory for the agent's decision making. | Boolean | True | +| `SCREENSHOT_TO_MEMORY` | Whether to allow the screenshot to [`Blackboard`](../agents/design/blackboard.md) for the agent's decision making. | Boolean | True | ## Main Prompt Configuration diff --git a/documents/docs/creating_app_agent/demonstration_provision.md b/documents/docs/creating_app_agent/demonstration_provision.md new file mode 100644 index 00000000..247e5dcb --- /dev/null +++ b/documents/docs/creating_app_agent/demonstration_provision.md @@ -0,0 +1,71 @@ +## Provide Human Demonstrations to the AppAgent + +Users or application developers can provide human demonstrations to the `AppAgent` to guide it in executing similar tasks in the future. The `AppAgent` uses these demonstrations to understand the context of the task and the steps required to execute it, effectively becoming an expert in the application. + +### How to Prepare Human Demonstrations for the AppAgent? + +Currently, UFO supports learning from user trajectories recorded by [Steps Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) integrated within Windows. More tools will be supported in the future. + +### Step 1: Recording User Demonstrations + +Follow the [official guidance](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to use Steps Recorder to record user demonstrations. + +### Step 2: Add Additional Information or Comments as Needed + +Include any specific details or instructions for UFO to notice by adding comments. Since Steps Recorder doesn't capture typed text, include any necessary typed content in the comments as well. + +

+ Adding Comments in Steps Recorder +

+ + + +### Step 3: Review and Save the Recorded Demonstrations + +Review the recorded steps and save them to a ZIP file. Refer to the [sample_record.zip](https://github.com/microsoft/UFO/blob/main/record_processor/example/sample_record.zip) for an example of recorded steps for a specific request, such as "sending an email to example@gmail.com to say hi." + +### Step 4: Create an Action Trajectory Indexer + +Once you have your demonstration record ZIP file ready, you can parse it as an example to support RAG for UFO. Follow these steps: + +```bash +# Assume you are in the cloned UFO folder +python -m record_processor -r "" -p "" +``` + +- Replace `` with the specific request, such as "sending an email to example@gmail.com to say hi." +- Replace `` with the full path to the ZIP file you just created. + +This command will parse the record and summarize it into an execution plan. You'll see a confirmation message similar to the following: + +```bash +Here are the plans summarized from your demonstration: +Plan [1] +(1) Input the email address 'example@gmail.com' in the 'To' field. +(2) Input the subject of the email. I need to input 'Greetings'. +(3) Input the content of the email. I need to input 'Hello,\nI hope this message finds you well. I am writing to send you a warm greeting and to wish you a great day.\nBest regards.' +(4) Click the Send button to send the email. +Plan [2] +(1) *** +(2) *** +(3) *** +Plan [3] +(1) *** +(2) *** +(3) *** +Would you like to save any one of them as a future reference for the agent? Press [1] [2] [3] to save the corresponding plan, or press any other key to skip. +``` + +Press `1` to save the plan into its memory for future reference. A sample can be found [here](https://github.com/microsoft/UFO/blob/main/vectordb/demonstration/example.yaml). + +You can view a demonstration video below: + + + +
+ +### How to Use Human Demonstrations to Enhance the AppAgent? + +After creating the offline indexer, refer to the [Learning from User Demonstrations](../advanced_usage/reinforce_appagent/learning_from_demonstration.md) section for guidance on how to use human demonstrations to enhance the AppAgent. + +--- \ No newline at end of file diff --git a/documents/docs/creating_app_agent/help_document_provision.md b/documents/docs/creating_app_agent/help_document_provision.md new file mode 100644 index 00000000..254f7faa --- /dev/null +++ b/documents/docs/creating_app_agent/help_document_provision.md @@ -0,0 +1,42 @@ +# Providing Help Documents to the AppAgent + +Help documents provide guidance to the `AppAgent` in executing specific tasks. The `AppAgent` uses these documents to understand the context of the task and the steps required to execute it, effectively becoming an expert in the application. + +## How to Provide Help Documents to the AppAgent? + +### Step 1: Prepare Help Documents and Metadata + +Currently, UFO supports processing help documents in XML format, which is the default format for official help documents of Microsoft apps. More formats will be supported in the future. + +To create a dedicated document for a specific task of an app, save it in a file named, for example, `task.xml`. This document should be accompanied by a metadata file with the same prefix but with the `.meta` extension, such as `task.xml.meta`. The metadata file should include: + +- `title`: Describes the task at a high level. +- `Content-Summary`: Summarizes the content of the help document. + +These two files are used for similarity search with user requests, so it is important to write them carefully. Examples of a help document and its metadata can be found [here](https://github.com/microsoft/UFO/blob/main/learner/doc_example/ppt-copilot.xml) and [here](https://github.com/microsoft/UFO/blob/main/learner/doc_example/ppt-copilot.xml.meta). + +### Step 2: Place Help Documents in the AppAgent Directory + +Once you have prepared all help documents and their metadata, place them into a folder. Sub-folders for the help documents are allowed, but ensure that each help document and its corresponding metadata are placed in the same directory. + +### Step 3: Create a Help Document Indexer + +After organizing your documents in a folder named `path_of_the_docs`, you can create an offline indexer to support RAG for UFO. Follow these steps: + +```bash +# Assume you are in the cloned UFO folder +python -m learner --app --docs +``` + +- Replace `` with the name of the application, such as PowerPoint or WeChat. +- Replace `` with the full path to the folder containing all your documents. + +This command will create an offline indexer for all documents in the `path_of_the_docs` folder using Faiss and embedding with sentence transformer (additional embeddings will be supported soon). By default, the created index will be placed [here](https://github.com/microsoft/UFO/tree/main/vectordb/docs). + +!!! note + Ensure the `app_name` is accurately defined, as it is used to match the offline indexer in online RAG. + + +### How to Use Help Documents to Enhance the AppAgent? + +After creating the offline indexer, you can find the guidance on how to use the help documents to enhance the AppAgent in the [Learning from Help Documents](../advanced_usage/reinforce_appagent/learning_from_help_document.md) section. 
\ No newline at end of file diff --git a/documents/docs/creating_app_agent/overview.md b/documents/docs/creating_app_agent/overview.md new file mode 100644 index 00000000..a6b9da29 --- /dev/null +++ b/documents/docs/creating_app_agent/overview.md @@ -0,0 +1,11 @@ +# Creating Your AppAgent + +UFO provides a flexible framework and SDK for application developers to empower their applications with AI capabilities by wrapping them into an `AppAgent`. By creating an `AppAgent`, you can leverage the power of UFO to interact with your application and automate tasks. + +To create an `AppAgent`, you can provide the following components: + +| Component | Description | Usage Documentation | +| --- | --- | --- | +| [Help Documents](./help_document_provision.md) | The help documents for the application to guide the `AppAgent` in executing tasks. | [Learning from Help Documents](../advanced_usage/reinforce_appagent/learning_from_help_document.md) | +| [User Demonstrations](./demonstration_provision.md) | The user demonstrations for the application to guide the `AppAgent` in executing tasks. | [Learning from User Demonstrations](../advanced_usage/reinforce_appagent/learning_from_demonstration.md) | +| [Native API Wrappers](./warpping_app_native_api.md) | The native API wrappers for the application to interact with the application. | [Automator](../automator/overview.md) | \ No newline at end of file diff --git a/documents/docs/creating_app_agent/warpping_app_native_api.md b/documents/docs/creating_app_agent/warpping_app_native_api.md new file mode 100644 index 00000000..21b6d0a9 --- /dev/null +++ b/documents/docs/creating_app_agent/warpping_app_native_api.md @@ -0,0 +1,317 @@ +# Wrapping Your App's Native API + +UFO takes actions on applications based on UI controls, but providing native API to its toolboxes can enhance the efficiency and accuracy of the actions. This document provides guidance on how to wrap your application's native API into UFO's toolboxes. + +## How to Wrap Your App's Native API? + +Before developing the native API wrappers, we strongly recommend that you read the design of the [Automator](../automator/overview.md). + +### Step 1: Create a Receiver for the Native API + +The `Receiver` is a class that receives the native API calls from the `AppAgent` and executes them. To wrap your application's native API, you need to create a `Receiver` class that contains the methods to execute the native API calls. + +To create a `Receiver` class, follow these steps: + +1. **Create a Folder for Your Application:** + + - Navigate to the `ufo/automator/app_api/` directory. + - Create a folder named after your application. + +2. **Create a Python File:** + + - Inside the folder you just created, add a Python file named after your application. + +3. **Define the Receiver Class:** + + - In the Python file, define a class named `{Your_Receiver}`, inheriting from the `ReceiverBasic` class located in `ufo/automator/basic.py`. + - Initialize the `Your_Receiver` class with the object that executes the native API calls. For example, if your API is based on a `com` object, initialize the `com` object in the `__init__` method of the `Your_Receiver` class. + +Example of `WinCOMReceiverBasic` class: +```python +class WinCOMReceiverBasic(ReceiverBasic): + """ + The base class for Windows COM client. + """ + + _command_registry: Dict[str, Type[CommandBasic]] = {} + + def __init__(self, app_root_name: str, process_name: str, clsid: str) -> None: + """ + Initialize the Windows COM client. 
+ :param app_root_name: The app root name. + :param process_name: The process name. + :param clsid: The CLSID of the COM object. + """ + + self.app_root_name = app_root_name + self.process_name = process_name + self.clsid = clsid + self.client = win32com.client.Dispatch(self.clsid) + self.com_object = self.get_object_from_process_name() +``` + +4. **Define Methods to Execute Native API Calls:** + + - Define the methods in the `Your_Receiver` class to execute the native API calls. + +Example of `ExcelWinCOMReceiver` class: +```python +def table2markdown(self, sheet_name: str) -> str: + """ + Convert the table in the sheet to a markdown table string. + :param sheet_name: The sheet name. + :return: The markdown table string. + """ + + sheet = self.com_object.Sheets(sheet_name) + data = sheet.UsedRange() + df = pd.DataFrame(data[1:], columns=data[0]) + df = df.dropna(axis=0, how="all") + df = df.applymap(self.format_value) + + return df.to_markdown(index=False) +``` + +5. **Create a Factory Class:** + + - Create a Factory class to manage multiple `Receiver` classes. The Factory class should have a method to create the `Receiver` class based on the application name. + +Example of `COMReceiverFactory` class: +```python +class COMReceiverFactory(ReceiverFactory): + """ + The factory class for the COM receiver. + """ + + def create_receiver(self, app_root_name: str, process_name: str) -> WinCOMReceiverBasic: + """ + Create the wincom receiver. + :param app_root_name: The app root name. + :param process_name: The process name. + :return: The receiver. + """ + + com_receiver = self.__com_client_mapper(app_root_name) + clsid = self.__app_root_mappping(app_root_name) + + if clsid is None or com_receiver is None: + print_with_color(f"Warning: Win32COM API is not supported for {process_name}.", "yellow") + return None + + return com_receiver(app_root_name, process_name, clsid) + + def __com_client_mapper(self, app_root_name: str) -> Type[WinCOMReceiverBasic]: + """ + Map the app root to the corresponding COM client. + :param app_root_name: The app root name. + :return: The COM client. + """ + win_com_client_mapping = { + "WINWORD.EXE": WordWinCOMReceiver, + "EXCEL.EXE": ExcelWinCOMReceiver, + } + + return win_com_client_mapping.get(app_root_name, None) +``` + +6. **Register the ReceiverFactory in ReceiverManager:** + + - Register your `ReceiverFactory` in the `ReceiverManager` class in the `ufo/automator/puppeteer.py` file. + +Example: +```python +class ReceiverManager: + """ + The class for the receiver manager. + """ + + def __init__(self): + """ + Initialize the receiver manager. + """ + self.receiver_registry = {} + self.receiver_factories = {} + self.ui_control_receiver: Optional[ControlReceiver] = None + self.com_receiver: Optional[WinCOMReceiverBasic] = None + + self.load_receiver_factories() + + def load_receiver_factories(self) -> None: + """ + Load the receiver factories. Now we have two types of receiver factories: UI control receiver factory and COM receiver factory. + A receiver factory is responsible for creating the receiver for the specific type of receiver. + """ + self.__register_receiver_factory("UIControl", UIControlReceiverFactory()) + self.__register_receiver_factory("COM", COMReceiverFactory()) + + def __update_receiver_registry(self) -> None: + """ + Update the receiver registry. A receiver registry is a dictionary that maps the command name to the receiver. 
+ """ + if self.ui_control_receiver is not None: + self.receiver_registry.update(self.ui_control_receiver.self_command_mapping()) + if self.com_receiver is not None: + self.receiver_registry.update(self.com_receiver.self_command_mapping()) + + def create_com_receiver(self, app_root_name, process_name) -> WinCOMReceiverBasic: + """ + Get the COM client. + :param app_root_name: The app root name. + :param process_name: The process name. + """ + factory = self.receiver_factories.get("COM") + self.com_receiver = factory.create_receiver(app_root_name, process_name) + self.__update_receiver_registry() + return self.com_receiver +``` + +The `Receiver` class is now ready to receive the native API calls from the `AppAgent`. + +### Step 2: Create a Command for the Native API + +Commands are the actions that the `AppAgent` can execute on the application. To create a command for the native API, you need to create a `Command` class that contains the method to execute the native API calls. + +1. **Create a Command Class:** + + - Create a `Command` class in the same Python file where the `Receiver` class is located. The `Command` class should inherit from the `CommandBasic` class located in `ufo/automator/basic.py`. + +Example: +```python +class WinCOMCommand(CommandBasic): + """ + The abstract command interface. + """ + + def __init__(self, receiver: WinCOMReceiverBasic, params=None) -> None: + """ + Initialize the command. + :param receiver: The receiver of the command. + """ + self.receiver = receiver + self.params = params if params is not None else {} + + @abstractmethod + def execute(self): + pass + + @classmethod + def name(cls) -> str: + """ + Get the name of the command. + :return: The name of the command. + """ + return cls.__name__ +``` + +2. **Define the Execute Method:** + + - Define the `execute` method in the `Command` class to call the receiver to execute the native API calls. + +Example: +```python +def execute(self): + """ + Execute the command to insert a table. + :return: The inserted table. + """ + return self.receiver.insert_excel_table( + sheet_name=self.params.get("sheet_name", 1), + table=self.params.get("table"), + start_row=self.params.get("start_row", 1), + start_col=self.params.get("start_col", 1), + ) +``` + +3. **Register the Command Class:** + + - Register the `Command` class in the corresponding `Receiver` class using the `@your_receiver.register` decorator. + +Example: +```python +@ExcelWinCOMReceiver.register +class InsertExcelTable(WinCOMCommand): + ... +``` + +The `Command` class is now registered in the `Receiver` class and available for the `AppAgent` to execute the native API calls. + +### Step 3: Provide Prompt Descriptions for the Native API + +To let the `AppAgent` know the usage of the native API calls, you need to provide prompt descriptions. + +1. **Create an api.yaml File:** + + - Create an `api.yaml` file in the `ufo/prompts/apps/{your_app_name}` directory. + +2. **Define Prompt Descriptions:** + + - Define the prompt descriptions for the native API calls in the `api.yaml` file. + +Example: +```yaml +table2markdown: + summary: |- + "table2markdown" is to get the table content in a sheet of the Excel app and convert it to markdown format. + class_name: |- + GetSheetContent + usage: |- + [1] + + API call: table2markdown(sheet_name: str) + [2] Args: + - sheet_name: The name of the sheet in the Excel app. + [3] Example: table2markdown(sheet_name="Sheet1") + [4] Available control item: Any control item in the Excel app. 
+ [5] Return: the markdown format string of the table content of the sheet. +``` + +!!! note + The `table2markdown` is the name of the native API call. It `MUST` match the `name()` defined in the corresponding `Command` class! + +3. **Configure the api.yaml File in config_dev.yaml:** + + - Configure the address of the `api.yaml` file in the `config_dev.yaml` file with the application name as the key and the path to the `api.yaml` file as the value. + +Example: +```yaml +EXCEL_API_PROMPT: "ufo/prompts/apps/excel/api.yaml" +``` + +4. **Register the Prompt Address in APIPromptLoader:** + + - Register the prompt address in the `APIPromptLoader.load_com_api_prompt` method in the `ufo/prompter/agent_prompter.py` file. + +Example: +```python +def load_com_api_prompt(self) -> Dict[str, str]: + """ + Load the prompt template for COM APIs. + :return: The prompt template for COM APIs. + """ + app2configkey_mapper = { + "WINWORD.EXE": "WORD_API_PROMPT", + "EXCEL.EXE": "EXCEL_API_PROMPT", + "POWERPNT.EXE": "POWERPOINT_API_PROMPT", + "olk.exe": "OUTLOOK_API_PROMPT", + } + + config_key = app2configkey_mapper.get(self.root_name, None) + prompt_address = configs.get(config_key, None) + + if prompt_address: + return AppAgentPrompter.load_prompt_template(prompt_address, None) + else: + return {} +``` + +You shoud add the following data to the `app2configkey_mapper` dictionary: + +```python +"your_application_program_name": "YOUR_APPLICATION_API_PROMPT", +``` + +The `AppAgent` can now use the prompt descriptions to understand the usage of the native API calls. + +--- + +By following these steps, you will have successfully wrapped the native API of your application into UFO's toolboxes, allowing the `AppAgent` to execute the native API calls on the application! diff --git a/documents/docs/about/faq.md b/documents/docs/faq.md similarity index 82% rename from documents/docs/about/faq.md rename to documents/docs/faq.md index 8eae129d..9f0ca0e7 100644 --- a/documents/docs/about/faq.md +++ b/documents/docs/faq.md @@ -24,7 +24,8 @@ A: Yes, you can host your custom LLM endpoint and configure UFO to use it. Check ## Q7: Can I use non-English requests in UFO? A: It depends on the language model you are using. Most of LLMs support multiple languages, and you can specify the language in the request. However, the performance may vary for different languages. -## Q8: It shows the error `Error making API request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))` when I run UFO. What should I do? +## Q8: Why it shows the error `Error making API request: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))`? A: This means the LLM endpoint is not accessible. You can check the network connection (e.g. VPN) and the status of the LLM endpoint. -To get more support, please submit an issue on the [GitHub Issues](https://github.com/microsoft/UFO/issues), or send an email to [ufo-agent@microsoft.com](mailto:ufo-agent@microsoft.com). \ No newline at end of file +!!! info + To get more support, please submit an issue on the [GitHub Issues](https://github.com/microsoft/UFO/issues), or send an email to [ufo-agent@microsoft.com](mailto:ufo-agent@microsoft.com). 
\ No newline at end of file diff --git a/documents/docs/getting_started/more_guidance.md b/documents/docs/getting_started/more_guidance.md index 0d3b714f..9c733f63 100644 --- a/documents/docs/getting_started/more_guidance.md +++ b/documents/docs/getting_started/more_guidance.md @@ -8,6 +8,6 @@ For instance, except for configuring the `HOST_AGENT` and `APP_AGENT`, you can a ## For Developers If you are a developer who wants to contribute to UFO, you can take a look at the [Developer Configuration](../configurations/developer_configuration.md) to explore the development environment setup and the development workflow. -You can also refer to the [Project Structure](../project_directory_structure.md) to understand the project structure and the role of each component in UFO, and use the rest of the documentation to understand the architecture and design of UFO. Taking a look at the [Session](../project/session.md) and [Round](../project/round.md) can help you understand the core logic of UFO. +You can also refer to the [Project Structure](../project_directory_structure.md) to understand the project structure and the role of each component in UFO, and use the rest of the documentation to understand the architecture and design of UFO. Taking a look at the [Session](../modules/session.md) and [Round](../modules/round.md) can help you understand the core logic of UFO. For debugging and testing, it is recommended to check the log files in the `ufo/logs` directory to track the execution of UFO and identify any issues that may arise. \ No newline at end of file diff --git a/documents/docs/getting_started/quick_start.md b/documents/docs/getting_started/quick_start.md index d5b4168b..6f0b6c65 100644 --- a/documents/docs/getting_started/quick_start.md +++ b/documents/docs/getting_started/quick_start.md @@ -55,8 +55,23 @@ Optionally, you can set a backup language model (LLM) engine in the `BACKUP_AGEN !!! note UFO also supports other LLMs and advanced configurations, such as customize your own model, please check the [documents](../supported_models/overview.md) for more details. Because of the limitations of model input, a lite version of the prompt is provided to allow users to experience it, which is configured in `config_dev.yaml`. +### 📔 Step 3: Additional Setting for RAG (optional). +If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file. -### 🎉 Step 3: Start UFO +We provide the following options for RAG to enhance UFO's capabilities: + +- **[Offline Help Document](../advanced_usage/reinforce_appagent/learning_from_help_document.md)**: Enable UFO to retrieve information from offline help documents. + +- **[Online Bing Search Engine](../advanced_usage/reinforce_appagent/learning_from_bing_search.md)**: Enhance UFO's capabilities by utilizing the most up-to-date online search results. + +- **[Self-Experience](../advanced_usage/reinforce_appagent/experience_learning.md)**: Save task completion trajectories into UFO's memory for future reference. + +- **[User-Demonstration](../advanced_usage/reinforce_appagent/learning_from_demonstration.md)**: Boost UFO's capabilities through user demonstration. + +!!!tip + Consult their respective documentation for more information on how to configure these settings. 
+ +### 🎉 Step 4: Start UFO #### ⌨️ You can execute the following on your Windows command Line (CLI): @@ -79,7 +94,7 @@ Please enter your request to be completed🛸: ``` -### Step 4 🎥: Execution Logs +### Step 5 🎥: Execution Logs You can find the screenshots taken and request & response logs in the following folder: ``` @@ -88,5 +103,7 @@ You can find the screenshots taken and request & response logs in the following You may use them to debug, replay, or analyze the agent output. !!! note - - Before UFO executing your request, please make sure the targeted applications are active on the system. - - The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to [DISCLAIMER.md](https://github.com/microsoft/UFO/blob/vyokky/dev/DISCLAIMER.md). \ No newline at end of file + Before UFO executing your request, please make sure the targeted applications are active on the system. + +!!! note + The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to [DISCLAIMER.md](https://github.com/microsoft/UFO/blob/vyokky/dev/DISCLAIMER.md). \ No newline at end of file diff --git a/documents/docs/img/action_step2.png b/documents/docs/img/action_step2.png new file mode 100644 index 00000000..0a8d93ab Binary files /dev/null and b/documents/docs/img/action_step2.png differ diff --git a/documents/docs/img/action_step2_annotated.png b/documents/docs/img/action_step2_annotated.png new file mode 100644 index 00000000..b3edb8da Binary files /dev/null and b/documents/docs/img/action_step2_annotated.png differ diff --git a/documents/docs/img/action_step2_concat.png b/documents/docs/img/action_step2_concat.png new file mode 100644 index 00000000..fba26a1b Binary files /dev/null and b/documents/docs/img/action_step2_concat.png differ diff --git a/documents/docs/img/action_step2_selected_controls.png b/documents/docs/img/action_step2_selected_controls.png new file mode 100644 index 00000000..f2d0e11d Binary files /dev/null and b/documents/docs/img/action_step2_selected_controls.png differ diff --git a/documents/docs/img/record_processor/add_comment.png b/documents/docs/img/add_comment.png similarity index 100% rename from documents/docs/img/record_processor/add_comment.png rename to documents/docs/img/add_comment.png diff --git a/documents/docs/img/app_agent_creation.png b/documents/docs/img/app_agent_creation.png new file mode 100644 index 00000000..393d4556 Binary files /dev/null and b/documents/docs/img/app_agent_creation.png differ diff --git a/documents/docs/img/app_state_machine.png b/documents/docs/img/app_state_machine.png new file mode 100644 index 00000000..64f0a60f Binary files /dev/null and b/documents/docs/img/app_state_machine.png differ diff --git a/documents/docs/img/appagent.png b/documents/docs/img/appagent.png new file mode 100644 index 00000000..83b078e4 Binary files /dev/null and b/documents/docs/img/appagent.png differ diff --git a/documents/docs/img/blackboard.png b/documents/docs/img/blackboard.png new file mode 100644 index 00000000..60e7da1c Binary files /dev/null and b/documents/docs/img/blackboard.png differ diff --git a/documents/docs/img/desomposition.png b/documents/docs/img/desomposition.png new file mode 100644 index 00000000..6fa7b9d1 Binary files /dev/null and 
b/documents/docs/img/desomposition.png differ diff --git a/documents/docs/img/eva.png b/documents/docs/img/eva.png new file mode 100644 index 00000000..a1213603 Binary files /dev/null and b/documents/docs/img/eva.png differ diff --git a/documents/docs/img/host_state_machine.png b/documents/docs/img/host_state_machine.png new file mode 100644 index 00000000..88e670aa Binary files /dev/null and b/documents/docs/img/host_state_machine.png differ diff --git a/documents/docs/img/save_ask.png b/documents/docs/img/save_ask.png new file mode 100644 index 00000000..e22ee59b Binary files /dev/null and b/documents/docs/img/save_ask.png differ diff --git a/documents/docs/img/screenshots.png b/documents/docs/img/screenshots.png new file mode 100644 index 00000000..18048e9d Binary files /dev/null and b/documents/docs/img/screenshots.png differ diff --git a/documents/docs/img/session.png b/documents/docs/img/session.png new file mode 100644 index 00000000..e2e00f26 Binary files /dev/null and b/documents/docs/img/session.png differ diff --git a/documents/docs/index.md b/documents/docs/index.md index 49f81a0a..2ca81f32 100644 --- a/documents/docs/index.md +++ b/documents/docs/index.md @@ -65,7 +65,7 @@ UFO sightings have garnered attention from various media outlets, including: ## ❓Get help * ❔GitHub Issues (prefered) -* For other communications, please contact ufo-agent@microsoft.com +* For other communications, please contact [ufo-agent@microsoft.com](mailto:ufo-agent@microsoft.com) --- ## 🎬 Demo Examples @@ -99,4 +99,14 @@ If you use UFO in your research, please cite our paper: ## 🎨 Related Project -You may also find [TaskWeaver](https://github.com/microsoft/TaskWeaver?tab=readme-ov-file) useful, a code-first LLM agent framework for seamlessly planning and executing data analytics tasks. \ No newline at end of file +You may also find [TaskWeaver](https://github.com/microsoft/TaskWeaver?tab=readme-ov-file) useful, a code-first LLM agent framework for seamlessly planning and executing data analytics tasks. + + + + \ No newline at end of file diff --git a/documents/docs/logs/evaluation_logs.md b/documents/docs/logs/evaluation_logs.md new file mode 100644 index 00000000..05ee396d --- /dev/null +++ b/documents/docs/logs/evaluation_logs.md @@ -0,0 +1,14 @@ +# Evaluation Logs + +The evaluation logs store the evaluation results from the `EvaluationAgent`. The evaluation log contains the following information: + +| Field | Description | Type | +| --- | --- | --- | +| Reason | The detailed reason for your judgment, by observing the screenshot differences and the . | String | +| Sub-score | The sub-score of the evaluation in decomposing the evaluation into multiple sub-goals. | List of Dictionaries | +| Complete | The completion status of the evaluation, can be `yes`, `no`, or `unsure`. | String | +| level | The level of the evaluation. | String | +| request | The request sent to the `EvaluationAgent`. | Dictionary | +| id | The ID of the evaluation. | Integer | + + diff --git a/documents/docs/logs/overview.md b/documents/docs/logs/overview.md new file mode 100644 index 00000000..b5200ca5 --- /dev/null +++ b/documents/docs/logs/overview.md @@ -0,0 +1,12 @@ +# UFO Logs + +Logs are essential for debugging and understanding the behavior of the UFO framework. There are three types of logs generated by UFO: + +| Log Type | Description | Location | Level | +| --- | --- | --- | --- | +| [Request Log](./request_logs.md) | Contains the prompt requests to LLMs. 
| `logs/{task_name}/request.log` | Info |
+| [Step Log](./step_logs.md) | Contains the agent's response to the user's request and additional information at every step. | `logs/{task_name}/response.log` | Info |
+| [Evaluation Log](./evaluation_logs.md) | Contains the evaluation results from the `EvaluationAgent`. | `logs/{task_name}/evaluation.log` | Info |
+| [Screenshots](./screenshots_logs.md) | Contains the screenshots of the application UI. | `logs/{task_name}/` | - |
+
+All logs are stored in the `logs/{task_name}` directory. \ No newline at end of file
diff --git a/documents/docs/logs/request_logs.md new file mode 100644 index 00000000..04168942 --- /dev/null +++ b/documents/docs/logs/request_logs.md @@ -0,0 +1,20 @@
+# Request Logs
+
+The request log records the prompts sent to the LLMs. It is stored in the `request.log` file and contains the following information for each step:
+
+| Field | Description |
+| --- | --- |
+| `step` | The step number of the session. |
+| `prompt` | The prompt message sent to the LLMs. |
+
+The request log is stored at the `debug` level. You can configure the logging level in the `LOG_LEVEL` field in the `config_dev.yaml` file.
+
+!!! tip
+    You can use the following Python code to read the request log:
+
+        import json
+
+        with open('logs/{task_name}/request.log', 'r') as f:
+            for line in f:
+                log = json.loads(line)
+ \ No newline at end of file
diff --git a/documents/docs/logs/screenshots_logs.md new file mode 100644 index 00000000..9990e7a8 --- /dev/null +++ b/documents/docs/logs/screenshots_logs.md @@ -0,0 +1,48 @@
+# Screenshot Logs
+
+UFO also saves desktop or application screenshots for debugging and evaluation purposes. The screenshot logs are stored in the `logs/{task_name}/` directory.
+
+There are 4 types of screenshot logs generated by UFO, as detailed below.
+
+
+## Clean Screenshots
+At each step, UFO saves a clean screenshot of the desktop or application. The clean screenshot is saved in the `action_step{step_number}.png` file. In addition, the clean screenshots are also saved when a sub-task, round or session is completed. The clean screenshots are saved in the `action_round_{round_id}_sub_round_{sub_task_id}_final.png`, `action_round_{round_id}_final.png` and `action_step_final.png` files, respectively.
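If you need to locate these files programmatically, the naming convention above can be assembled into paths with a small helper. The sketch below is illustrative only and assumes the default `logs/{task_name}/` layout; the helper functions are not part of UFO itself:

```python
from pathlib import Path
from typing import Optional


def clean_screenshot_path(task_name: str, step: int, log_root: str = "logs") -> Path:
    # Per-step clean screenshot, e.g. logs/<task_name>/action_step3.png (hypothetical helper).
    return Path(log_root) / task_name / f"action_step{step}.png"


def final_screenshot_path(
    task_name: str, round_id: int, sub_task_id: Optional[int] = None, log_root: str = "logs"
) -> Path:
    # Final clean screenshot for a sub-task (if sub_task_id is given) or for a whole round.
    if sub_task_id is not None:
        name = f"action_round_{round_id}_sub_round_{sub_task_id}_final.png"
    else:
        name = f"action_round_{round_id}_final.png"
    return Path(log_root) / task_name / name
```

Below is an example of a clean screenshot. + +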

+ AppAgent Image +

+ + +## Annotation Screenshots +UFO also saves annotated screenshots of the application, with each control item is annotated with a number, following the [Set-of-Mark](https://arxiv.org/pdf/2310.11441) paradigm. The annotated screenshots are saved in the `action_step{step_number}_annotated.png` file. Below is an example of an annotated screenshot. + +

+ AppAgent Image +

+ +!!!info + Only selected types of controls are annotated in the screenshots. They are configured in the `config_dev.yaml` file under the `CONTROL_LIST` field. + +!!!tip + Different types of controls are annotated with different colors. You can configure the colors in the `config_dev.yaml` file under the `ANNOTATION_COLORS` field. + + +## Concatenated Screenshots +UFO also saves concatenated screenshots of the application, with clean and annotated screenshots concatenated side by side. The concatenated screenshots are saved in the `action_step{step_number}_concat.png` file. Below is an example of a concatenated screenshot. + +

+ AppAgent Image +

+ +!!!info + You can configure whether to feed the concatenated screenshots to the LLMs, or separate clean and annotated screenshots, in the `config_dev.yaml` file under the `CONCAT_SCREENSHOT` field. + +## Selected Control Screenshots +UFO saves screenshots of the selected control item for operation. The selected control screenshots are saved in the `action_step{step_number}_selected_controls.png` file. Below is an example of a selected control screenshot. + +

+ AppAgent Image +

+ +!!!info + You can configure whether to feed LLM with the selected control screenshots at the previous step to enhance the context, in the `config_dev.yaml` file under the `INCLUDE_LAST_SCREENSHOT` field. \ No newline at end of file diff --git a/documents/docs/logs/step_logs.md b/documents/docs/logs/step_logs.md new file mode 100644 index 00000000..31efb368 --- /dev/null +++ b/documents/docs/logs/step_logs.md @@ -0,0 +1,97 @@ +# Step Logs + +The step log contains the agent's response to the user's request and additional information at every step. The step log is stored in the `response.log` file. The log fields are different for `HostAgent` and `AppAgent`. The step log is at the `info` level. +## HostAgent Logs + +The `HostAgent` logs contain the following fields: + + +### LLM Output + +| Field | Description | Type | +| --- | --- | --- | +| Observation | The observation of current desktop screenshots. | String | +| Thought | The logical reasoning process of the `HostAgent`. | String | +| Current Sub-Task | The current sub-task to be executed by the `AppAgent`. | String | +| Message | The message to be sent to the `AppAgent` for the completion of the sub-task. | String | +| ControlLabel | The index of the selected application to execute the sub-task. | String | +| ControlText | The name of the selected application to execute the sub-task. | String | +| Plan | The plan for the following sub-tasks after the current sub-task. | List of Strings | +| Status | The status of the agent, mapped to the `AgentState`. | String | +| Comment | Additional comments or information provided to the user. | String | +| Questions | The questions to be asked to the user for additional information. | List of Strings | +| AppsToOpen | The application to be opened to execute the sub-task if it is not already open. | Dictionary | + + +### Additional Information + +| Field | Description | Type | +| --- | --- | --- | +| Step | The step number of the session. | Integer | +| RoundStep | The step number of the current round. | Integer | +| AgentStep | The step number of the `HostAgent`. | Integer | +| Round | The round number of the session. | Integer | +| ControlLabel | The index of the selected application to execute the sub-task. | Integer | +| ControlText | The name of the selected application to execute the sub-task. | String | +| Request | The user request. | String | +| Agent | The agent that executed the step, set to `HostAgent`. | String | +| AgentName | The name of the agent. | String | +| Application | The application process name. | String | +| Cost | The cost of the step. | Float | +| Results | The results of the step, set to an empty string. | String | +| CleanScreenshot | The image path of the desktop screenshot. | String | + + + +## AppAgent Logs + +The `AppAgent` logs contain the following fields: + +### LLM Output + +| Field | Description | Type | +| --- | --- | --- | +| Observation | The observation of the current application screenshots. | String | +| Thought | The logical reasoning process of the `AppAgent`. | String | +| ControlLabel | The index of the selected control to interact with. | String | +| ControlText | The name of the selected control to interact with. | String | +| Function | The function to be executed on the selected control. | String | +| Args | The arguments required for the function execution. | List of Strings | +| Status | The status of the agent, mapped to the `AgentState`. | String | +| Plan | The plan for the following steps after the current action. 
| List of Strings | +| Comment | Additional comments or information provided to the user. | String | +| SaveScreenshot | The flag to save the screenshot of the application to the `blackboard` for future reference. | Boolean | + +### Additional Information + +| Field | Description | Type | +| --- | --- | --- | +| Step | The step number of the session. | Integer | +| RoundStep | The step number of the current round. | Integer | +| AgentStep | The step number of the `AppAgent`. | Integer | +| Round | The round number of the session. | Integer | +| Subtask | The sub-task to be executed by the `AppAgent`. | String | +| SubtaskIndex | The index of the sub-task in the current round. | Integer | +| Action | The action to be executed by the `AppAgent`. | String | +| ActionType | The type of the action to be executed. | String | +| Request | The user request. | String | +| Agent | The agent that executed the step, set to `AppAgent`. | String | +| AgentName | The name of the agent. | String | +| Application | The application process name. | String | +| Cost | The cost of the step. | Float | +| Results | The results of the step. | String | +| CleanScreenshot | The image path of the desktop screenshot. | String | +| AnnotatedScreenshot | The image path of the annotated application screenshot. | String | +| ConcatScreenshot | The image path of the concatenated application screenshot. | String | + +!!! tip + You can use the following python code to read the request log: + + import json + + with open('logs/{task_name}/request.log', 'r') as f: + for line in f: + log = json.loads(line) + +!!! info + The `FollowerAgent` logs share the same fields as the `AppAgent` logs. \ No newline at end of file diff --git a/documents/docs/project/context.md b/documents/docs/modules/context.md similarity index 100% rename from documents/docs/project/context.md rename to documents/docs/modules/context.md diff --git a/documents/docs/project/round.md b/documents/docs/modules/round.md similarity index 97% rename from documents/docs/project/round.md rename to documents/docs/modules/round.md index 3cd5454b..3f53f7be 100644 --- a/documents/docs/project/round.md +++ b/documents/docs/modules/round.md @@ -29,7 +29,7 @@ def run(self) -> None: # If the subtask ends, capture the last snapshot of the application. if self.state.is_subtask_end(): - time.sleep(3) + time.sleep(configs["SLEEP_TIME"]) self.capture_last_snapshot(sub_round_id=self.subtask_amount) self.subtask_amount += 1 diff --git a/documents/docs/project/session.md b/documents/docs/modules/session.md similarity index 90% rename from documents/docs/project/session.md rename to documents/docs/modules/session.md index 27a0fe23..1473ba76 100644 --- a/documents/docs/project/session.md +++ b/documents/docs/modules/session.md @@ -1,6 +1,10 @@ # Session -A `Session` is a conversation instance between the user and UFO. It is a continuous interaction that starts when the user initiates a request and ends when the request is completed. UFO supports multiple requests within the same session. Each request is processed sequentially, by a `Round` of interaction, until the user's request is fulfilled. +A `Session` is a conversation instance between the user and UFO. It is a continuous interaction that starts when the user initiates a request and ends when the request is completed. UFO supports multiple requests within the same session. Each request is processed sequentially, by a `Round` of interaction, until the user's request is fulfilled. 
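The following is a minimal schematic sketch of this relationship; the class and method names are simplified illustrations rather than UFO's actual implementation:

```python
from typing import List


class Round:
    """One round of interaction, handling a single user request (schematic)."""

    def __init__(self, request: str) -> None:
        self.request = request

    def run(self) -> None:
        # Iterate agent steps until this single request is fulfilled (details omitted).
        ...


class Session:
    """A conversation instance that may span multiple user requests (schematic)."""

    def __init__(self) -> None:
        self.rounds: List[Round] = []

    def handle(self, request: str) -> None:
        current_round = Round(request)   # one Round per user request
        self.rounds.append(current_round)
        current_round.run()              # rounds are processed sequentially
```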
We show the relationship between `Session` and `Round` in the following figure: + +

+ Session and Round Image +

## Session Lifecycle diff --git a/documents/docs/prompts/api_prompts.md b/documents/docs/prompts/api_prompts.md new file mode 100644 index 00000000..86c5d752 --- /dev/null +++ b/documents/docs/prompts/api_prompts.md @@ -0,0 +1,36 @@ +# API Prompts + +The API prompts provide the description and usage of the APIs used in UFO. Shared APIs and app-specific APIs are stored in different directories: + +| Directory | Description | +| --- | --- | +| `ufo/prompts/share/base/api.yaml` | Shared APIs used by multiple applications | +| `ufo/prompts/{app_name}` | APIs specific to an application | + +!!! info + You can configure the API prompt used in the `config.yaml` file. You can find more information about the configuration file [here](../configurations/developer_configuration.md). +!!! tip + You may customize the API prompt for a specific application by adding the API prompt in the application's directory. + + +## Example API Prompt + +Below is an example of an API prompt: + +```yaml +click_input: + summary: |- + "click_input" is to click the control item with mouse. + class_name: |- + ClickInputCommand + usage: |- + [1] API call: click_input(button: str, double: bool) + [2] Args: + - button: 'The mouse button to click. One of ''left'', ''right'', ''middle'' or ''x'' (Default: ''left'')' + - double: 'Whether to perform a double click or not (Default: False)' + [3] Example: click_input(button="left", double=False) + [4] Available control item: All control items. + [5] Return: None +``` + +To create a new API prompt, follow the template above and add it to the appropriate directory. \ No newline at end of file diff --git a/documents/docs/prompts/basic_template.md b/documents/docs/prompts/basic_template.md new file mode 100644 index 00000000..b213b163 --- /dev/null +++ b/documents/docs/prompts/basic_template.md @@ -0,0 +1,18 @@ +# Basic Prompt Template + +The basic prompt template is a fixed format that is used to generate prompts for the `HostAgent`, `AppAgent`, `FollowerAgent`, and `EvaluationAgent`. It include the template for the `system` and `user` roles to construct the agent's prompt. + +Below is the default file path for the basic prompt template: + +| Agent | File Path | Version | +| --- | --- | --- | +| HostAgent | [ufo/prompts/share/base/host_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/base/host_agent.yaml) | base | +| HostAgent | [ufo/prompts/share/lite/host_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/lite/host_agent.yaml) | lite | +| AppAgent | [ufo/prompts/share/base/app_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/base/app_agent.yaml) | base | +| AppAgent | [ufo/prompts/share/lite/app_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/lite/app_agent.yaml) | lite | +| FollowerAgent | [ufo/prompts/share/base/app_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/base/app_agent.yaml) | base | +| FollowerAgent | [ufo/prompts/share/lite/app_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/share/lite/app_agent.yaml) | lite | +| EvaluationAgent | [ufo/prompts/evaluation/evaluation_agent.yaml](https://github.com/microsoft/UFO/blob/main/ufo/prompts/evaluation/evaluate.yaml) | - | + +!!! info + You can configure the prompt template used in the `config.yaml` file. You can find more information about the configuration file [here](../configurations/developer_configuration.md). 
diff --git a/documents/docs/prompts/examples_prompts.md b/documents/docs/prompts/examples_prompts.md new file mode 100644 index 00000000..aa6b1af2 --- /dev/null +++ b/documents/docs/prompts/examples_prompts.md @@ -0,0 +1,86 @@ +# Example Prompts + +The example prompts are used to generate textual demonstration examples for in-context learning. The examples are stored in the `ufo/prompts/examples` directory, with the following subdirectories: + +| Directory | Description | +| --- | --- | +| `lite` | Lite version of demonstration examples | +| `non-visual` | Examples for non-visual LLMs | +| `visual` | Examples for visual LLMs | + +!!!info + You can configure the example prompt used in the `config.yaml` file. You can find more information about the configuration file [here](../configurations/developer_configuration.md). + + +## Example Prompts + +Below are examples for the `HostAgent` and `AppAgent`: + +- **HostAgent**: + +```yaml +Request: |- + Summarize and add all to do items on Microsoft To Do from the meeting notes email, and write a summary on the meeting_notes.docx. +Response: + Observation: |- + The current screenshot shows the Microsoft To Do application is visible, and outlook application and the meeting_notes.docx are available in the list of applications. + Thought: |- + The user request can be decomposed into three sub-tasks: (1) Summarize all to do items on Microsoft To Do from the meeting_notes email, (2) Add all to do items to Microsoft To Do, and (3) Write a summary on the meeting_notes.docx. I need to open the Microsoft To Do application to complete the first two sub-tasks. + Each sub-task will be completed in individual applications sequentially. + CurrentSubtask: |- + Summarized all to do items from the meeting notes email in Outlook. + Message: + - (1) You need to first search for the meeting notes email in Outlook to summarize. + - (2) Only summarize the to do items from the meeting notes email, without any redundant information. + ControlLabel: |- + 16 + ControlText: |- + Mail - Outlook - Jim + Status: |- + CONTINUE + Plan: + - Add all to do items previously summarized from the meeting notes email to one-by-one Microsoft To Do. + - Write a summary about the meeting notes email on the meeting_notes.docx. + Comment: |- + I plan to first summarize all to do items from the meeting notes email in Outlook. + Questions: [] + AppsToOpen: |- + None +``` + +- **AppAgent**: + +```yaml +Request: |- + How many stars does the Imdiffusion repo have? +Sub-task: |- + Google search for the Imdiffusion repo on github and summarize the number of stars the Imdiffusion repo page visually. +Response: + Observation: |- + I observe that the Edge browser is visible in the screenshot, with the Google search page opened. + Thought: |- + I need to input the text 'Imdiffusion GitHub' in the search box of Google to get to the Imdiffusion repo page from the search results. The search box is usually in a type of ComboBox. + ControlLabel: |- + 36 + ControlText: |- + 搜索 + Function: |- + set_edit_text + Args: + {"text": "Imdiffusion GitHub"} + Status: |- + CONTINUE + Plan: + - (1) After input 'Imdiffusion GitHub', click Google Search to search for the Imdiffusion repo on github. + - (2) Once the searched results are visible, click the Imdiffusion repo Hyperlink in the searched results to open the repo page. + - (3) Observing and summarize the number of stars the Imdiffusion repo page, and reply to the user request. 
+  Comment: |-
+    I plan to use Google search for the Imdiffusion repo on github and summarize the number of stars the Imdiffusion repo page visually.
+  SaveScreenshot:
+    {"save": false, "reason": ""}
+Tips: |-
+  - The search box is usually in a type of ComboBox.
+  - The number of stars of a Github repo page can be found in the repo page visually.
+```
+
+These examples regulate the output format of the agent's response and provide a structured way to generate demonstration examples for in-context learning. \ No newline at end of file
diff --git a/documents/docs/prompts/overview.md new file mode 100644 index 00000000..98106c26 --- /dev/null +++ b/documents/docs/prompts/overview.md @@ -0,0 +1,48 @@
+# Prompts
+
+All prompts used in UFO are stored in the `ufo/prompts` directory. The folder structure is as follows:
+
+```bash
+📦prompts
+ ┣ 📂apps # Stores API prompts for specific applications
+ ┣ 📂excel # Stores API prompts for Excel
+ ┣ 📂word # Stores API prompts for Word
+ ┗ ...
+ ┣ 📂demonstration # Stores prompts for summarizing demonstrations from humans using Step Recorder
+ ┣ 📂experience # Stores prompts for summarizing the agent's self-experience
+ ┣ 📂evaluation # Stores prompts for the EvaluationAgent
+ ┣ 📂examples # Stores demonstration examples for in-context learning
+ ┣ 📂lite # Lite version of demonstration examples
+ ┣ 📂non-visual # Examples for non-visual LLMs
+ ┗ 📂visual # Examples for visual LLMs
+ ┗ 📂share # Stores shared prompts
+ ┣ 📂lite # Lite version of shared prompts
+ ┗ 📂base # Basic version of shared prompts
+ ┣ 📜api.yaml # Basic API prompt
+ ┣ 📜app_agent.yaml # Basic AppAgent prompt template
+ ┗ 📜host_agent.yaml # Basic HostAgent prompt template
+```
+
+!!! note
+    The `lite` version of prompts is a simplified version of the full prompts, which is used for LLMs that have a limited token budget. However, the `lite` version is not fully optimized and may lead to **suboptimal** performance.
+
+!!! note
+    The `non-visual` and `visual` folders contain examples for non-visual and visual LLMs, respectively.
+
+## Agent Prompts
+
+Prompts used by an agent usually contain the following information:
+
+| Prompt | Description |
+| --- | --- |
+| `Basic template` | A basic template for the agent prompt. |
+| `API` | A prompt for all skills and APIs used by the agent. |
+| `Examples` | Demonstration examples for the agent for in-context learning. |
+
+You can find these prompts in the `share` directory. The prompts for specific applications are stored in the `apps` directory.
+
+
+!!! tip
+    All information is constructed using the agent's `Prompter` class. You can find more details about the `Prompter` class in the documentation [here](../agents/design/prompter.md).
+
+
diff --git a/documents/docs/supported_models/overview.md b/documents/docs/supported_models/overview.md index 1fab7746..d38c7fb7 100644 --- a/documents/docs/supported_models/overview.md +++ b/documents/docs/supported_models/overview.md @@ -11,4 +11,8 @@ Please refer to the following sections for more information on the supported mod
 | `Gemini` | [Gemini API](./gemini.md) |
 | `QWEN` | [QWEN API](./qwen.md) |
 | `Ollama` | [Ollama API](./ollama.md) |
-| `Custom` | [Custom API](./custom_model.md) | \ No newline at end of file
+| `Custom` | [Custom API](./custom_model.md) |
+
+
+!!! 
info + Each model is implemented as a separate class in the `ufo/llm` directory, and uses the functions `chat_completion` defined in the `BaseService` class of the `ufo/llm/base.py` file to obtain responses from the model. \ No newline at end of file diff --git a/documents/mkdocs.yml b/documents/mkdocs.yml index ca2a38cd..2947a25e 100644 --- a/documents/mkdocs.yml +++ b/documents/mkdocs.yml @@ -1,4 +1,6 @@ site_name: UFO Documentation + + nav: - Home: index.md - Project Directory Structure: project_directory_structure.md @@ -6,9 +8,9 @@ nav: - Quick Start: getting_started/quick_start.md - More Guidance: getting_started/more_guidance.md - Basic Modules: - - Session: project/session.md - - Round: project/round.md - - Context: project/context.md + - Session: modules/session.md + - Round: modules/round.md + - Context: modules/context.md - Configurations: - User Configuration: configurations/user_configuration.md - Developer Configuration: configurations/developer_configuration.md @@ -32,62 +34,76 @@ nav: - HostAgent: agents/host_agent.md - AppAgent: agents/app_agent.md - FollowerAgent: agents/follower_agent.md - - EvaluationAgent: agents/evaluation.md + - EvaluationAgent: agents/evaluation_agent.md - Prompts: - Overview: prompts/overview.md - - Basic Prompts: prompts/basic_prompts.md + - Basic Prompts: prompts/basic_template.md - Examples Prompts: prompts/examples_prompts.md - API Prompts: prompts/api_prompts.md - Automator: - Overview: automator/overview.md - UI Automator: automator/ui_automator.md + - API Automator: automator/wincom_automator.md - Web Automator: automator/web_automator.md - - WinCom Automator: automator/wincom_automator.md - - Evaluation: - - Overview: evaluation/overview.md - - Evaluation Process: evaluation/evaluation_process.md + - AI Tool: automator/ai_tool_automator.md - Logs: - Overview: logs/overview.md - Request Logs: logs/request_logs.md - Step Logs: logs/step_logs.md - Evaluation Logs: logs/evaluation_logs.md - - Screenshots: logs/screenshots.md + - Screenshots: logs/screenshots_logs.md - Advanced Usage: - - Reinforcing AppAgent: - - Learning from Demonstration: advanced_usage/learning_from_demonstration.md - - Learning from Bing Search: advanced_usage/learning_from_bing_search.md - - Experience Learning: advanced_usage/experience_learning.md - - Learning from Help Document: advanced_usage/learning_from_help_document.md - - Follower Mode: advanced_usage/follower_model.md - - Control Filtering: advanced_usage/control_filtering.md + - Reinforcing AppAgent: + - Overview: advanced_usage/reinforce_appagent/overview.md + - Learning from Help Document: advanced_usage/reinforce_appagent/learning_from_help_document.md + - Learning from Bing Search: advanced_usage/reinforce_appagent/learning_from_bing_search.md + - Experience Learning: advanced_usage/reinforce_appagent/experience_learning.md + - Learning from User Demonstration: advanced_usage/reinforce_appagent/learning_from_demonstration.md + - Follower Mode: advanced_usage/follower_mode.md + - Control Filtering: + - Overview: advanced_usage/control_filtering/overview.md + - Text Filtering: advanced_usage/control_filtering/text_filtering.md + - Semantic Filtering: advanced_usage/control_filtering/semantic_filtering.md + - Icon Filtering: advanced_usage/control_filtering/icon_filtering.md - Customization: advanced_usage/customization.md - Creating Your AppAgent: + - Overview: creating_app_agent/overview.md - Help Document Provision: creating_app_agent/help_document_provision.md - Demonstration Provision: 
creating_app_agent/demonstration_provision.md - Warpping App-Native API: creating_app_agent/warpping_app_native_api.md - About: - Contributing: about/CONTRIBUTING.md - License: about/LICENSE.md + - Code of Conduct: about/CODE_OF_CONDUCT.md - Disclaimer: about/DISCLAIMER.md - Support: about/SUPPORT.md - - FAQ: about/faq.md + - FAQ: faq.md markdown_extensions: - pymdownx.tasklist - admonition # theme: -# name: readthedocs +# name: material +# palette: +# primary: blue +# accent: light-blue +# font: +# text: Roboto +# code: Roboto Mono theme: name: readthedocs + analytics: + - gtag: G-FX17ZGJYGC + plugins: - search - mkdocstrings: handlers: python: - paths: ["../ufo"] + paths: ["../ufo", "../record_processor"] options: docstring_style: sphinx docstring_section_style: list @@ -95,14 +111,7 @@ plugins: show_docstring_returns: true -# theme: -# name: material -# palette: -# primary: blue -# accent: light-blue -# font: -# text: Roboto -# code: Roboto Mono + # logo: ./assets/ufo_blue.png favicon: ./assets/ufo_blue.png diff --git a/ufo/agents/agent/app_agent.py b/ufo/agents/agent/app_agent.py index 79113a9e..2dcf9ef1 100644 --- a/ufo/agents/agent/app_agent.py +++ b/ufo/agents/agent/app_agent.py @@ -143,7 +143,7 @@ def message_constructor( def print_response(self, response_dict: Dict) -> None: """ Print the response. - :param response: The response dictionary. + :param response_dict: The response dictionary to print. """ control_text = response_dict.get("ControlText") @@ -349,7 +349,7 @@ def build_human_demonstration_retriever(self, db_path: str) -> None: def context_provision(self, request: str = "") -> None: """ Provision the context for the app agent. - :param app_agent: The app agent to provision the context. + :param request: The request sent to the Bing search retriever. """ # Load the offline document indexer for the app agent if available. diff --git a/ufo/agents/agent/basic.py b/ufo/agents/agent/basic.py index 2731d9fd..0961c57d 100644 --- a/ufo/agents/agent/basic.py +++ b/ufo/agents/agent/basic.py @@ -142,7 +142,9 @@ def get_response( ) -> str: """ Get the response for the prompt. - :param prompt: The prompt. + :param message: The message for LLMs. + :param namescope: The namescope for the LLMs. + :param use_backup_engine: Whether to use the backup engine. :return: The response. """ response_string, cost = llm_call.get_completion( @@ -309,7 +311,6 @@ def build_human_demonstration_retriever(self) -> None: def print_response(self) -> None: """ Print the response. - :param response: The response. """ pass diff --git a/ufo/agents/agent/host_agent.py b/ufo/agents/agent/host_agent.py index 5134fe37..dabe4ff4 100644 --- a/ufo/agents/agent/host_agent.py +++ b/ufo/agents/agent/host_agent.py @@ -244,7 +244,7 @@ def process(self, context: Context) -> None: def print_response(self, response_dict: Dict) -> None: """ Print the response. - :param response: The response. + :param response_dict: The response dictionary to print. """ application = response_dict.get("ControlText") diff --git a/ufo/agents/memory/blackboard.py b/ufo/agents/memory/blackboard.py index 324d5cce..def3b1da 100644 --- a/ufo/agents/memory/blackboard.py +++ b/ufo/agents/memory/blackboard.py @@ -136,7 +136,6 @@ def add_image( """ Add the image to the blackboard. :param screenshot_path: The path of the image. - :param screenshot_str: The string of the image, optional. :param metadata: The metadata of the image. 
""" diff --git a/ufo/agents/memory/memory.py b/ufo/agents/memory/memory.py index 5fde8e5f..e6102426 100644 --- a/ufo/agents/memory/memory.py +++ b/ufo/agents/memory/memory.py @@ -99,7 +99,7 @@ class Memory: def load(self, content: List[MemoryItem]) -> None: """ Load the data from the memory. - :param key: The key of the data. + :param content: The content to load. """ self._content = content diff --git a/ufo/agents/states/host_agent_state.py b/ufo/agents/states/host_agent_state.py index fc4464ed..a4d8172b 100644 --- a/ufo/agents/states/host_agent_state.py +++ b/ufo/agents/states/host_agent_state.py @@ -301,6 +301,14 @@ def is_round_end(self) -> bool: """ return True + def next_state(self, agent: HostAgent) -> AgentState: + """ + Get the next state of the agent. + :param agent: The current agent. + :return: The state for the next step. + """ + return FinishHostAgentState() + @classmethod def name(cls) -> str: """ @@ -323,6 +331,14 @@ def is_round_end(self) -> bool: """ return True + def next_state(self, agent: HostAgent) -> AgentState: + """ + Get the next state of the agent. + :param agent: The current agent. + :return: The state for the next step. + """ + return FinishHostAgentState() + @classmethod def name(cls) -> str: """ diff --git a/ufo/automator/app_apis/basic.py b/ufo/automator/app_apis/basic.py index c1c7e5b2..8f935959 100644 --- a/ufo/automator/app_apis/basic.py +++ b/ufo/automator/app_apis/basic.py @@ -37,7 +37,6 @@ def __init__(self, app_root_name: str, process_name: str, clsid: str) -> None: def get_object_from_process_name(self) -> win32com.client.CDispatch: """ Get the object from the process name. - :param process_name: The process name. """ pass diff --git a/ufo/automator/basic.py b/ufo/automator/basic.py index d30cb25b..ca4209b2 100644 --- a/ufo/automator/basic.py +++ b/ufo/automator/basic.py @@ -49,7 +49,7 @@ def self_command_mapping(self) -> Dict[str, CommandBasic]: def register(cls, command_class: Type[CommandBasic]) -> Type[CommandBasic]: """ Decorator to register the state class to the state manager. - :param state_class: The state class to be registered. + :param command_class: The state class to be registered. """ cls._command_registry[command_class.name()] = command_class return command_class @@ -75,12 +75,21 @@ def __init__(self, receiver: ReceiverBasic, params: Dict = None) -> None: @abstractmethod def execute(self): + """ + Execute the command. + """ pass def undo(self): + """ + Undo the command. + """ pass def redo(self): + """ + Redo the command. + """ self.execute() @classmethod diff --git a/ufo/automator/puppeteer.py b/ufo/automator/puppeteer.py index 40cc6816..2545b359 100644 --- a/ufo/automator/puppeteer.py +++ b/ufo/automator/puppeteer.py @@ -26,7 +26,6 @@ def __init__(self, process_name: str, app_root_name: str) -> None: Initialize the app puppeteer. :param process_name: The process name of the app. :param app_root_name: The app root name, e.g., WINWORD.EXE. - :param ui_control_interface: The UI control interface instance in pywinauto. """ self._process_name = process_name diff --git a/ufo/automator/ui_control/control_filter.py b/ufo/automator/ui_control/control_filter.py index 613d79dc..f71def1c 100644 --- a/ufo/automator/ui_control/control_filter.py +++ b/ufo/automator/ui_control/control_filter.py @@ -37,13 +37,9 @@ def inplace_append_filtered_annotation_dict( """ Appends the given control_info to the filtered_control_dict if it is not already present. For example, if the filtered_control_dict is empty, it will be updated with the control_info. 
The operation is performed in place. - - Args: - filtered_control_dict (dict): The dictionary of filtered control information. - control_dicts (dict): The control information to be appended. - - Returns: - dict: The updated filtered_control_dict dictionary. + :param filtered_control_dict: The dictionary of filtered control information. + :param control_dicts: The control information to be appended. + :return: The updated filtered_control_dict dictionary. """ if control_dicts: filtered_control_dict.update( @@ -59,13 +55,9 @@ def inplace_append_filtered_annotation_dict( def get_plans(plan: List[str], topk_plan: int) -> List[str]: """ Parses the given plan and returns a list of plans up to the specified topk_plan. - - Args: - plan (str): The plan to be parsed. - topk_plan (int): The maximum number of plans to be returned. - - Returns: - list: A list of plans up to the specified topk_plan. + :param plan: The plan to be parsed. + :param topk_plan: The maximum number of plans to be returned. + :return: A list of plans up to the specified topk_plan. """ return plan[:topk_plan] @@ -80,10 +72,8 @@ class BasicControlFilter: def __new__(cls, model_path): """ Creates a new instance of BasicControlFilter. - Args: - model_path (str): The path to the model. - Returns: - BasicControlFilter: The BasicControlFilter instance. + :param model_path: The path to the model. + :return: The BasicControlFilter instance. """ if model_path not in cls._instances: instance = super(BasicControlFilter, cls).__new__(cls) @@ -95,10 +85,8 @@ def __new__(cls, model_path): def load_model(model_path): """ Loads the model from the given model path. - Args: - model_path (str): The path to the model. - Returns: - SentenceTransformer: The loaded SentenceTransformer model. + :param model_path: The path to the model. + :return: The loaded model. """ import sentence_transformers @@ -107,34 +95,28 @@ def load_model(model_path): def get_embedding(self, content): """ Encodes the given object into an embedding. - Args: - content: The content to encode. - Returns: - The embedding of the object. + :param content: The content to encode. + :return: The embedding of the object. """ + return self.model.encode(content) @abstractmethod def control_filter(self, control_dicts, plans, **kwargs): """ Calculates the cosine similarity between the embeddings of the given keywords and the control item. - Args: - control_dicts (dic): The control item to be compared with the plans. - plans (str): The plans to be used for calculating the similarity. - Returns: - float: The cosine similarity between the embeddings of the keywords and the control item. + :param control_dicts: The control item to be compared with the plans. + :param plans: The plans to be used for calculating the similarity. + :return: The filtered control items. """ pass @staticmethod def plans_to_keywords(plans: List[str]) -> List[str]: """ - Gets keywords from the plan. - We only consider the words in the plan that are alphabetic or Chinese characters. - Args: - plans (list): The plan to be parsed. - Returns: - list: A list of keywords extracted from the plan. + Gets keywords from the plan. We only consider the words in the plan that are alphabetic or Chinese characters. + :param plans: The plan to be parsed. + :return: A list of keywords extracted from the plan. """ keywords = [] @@ -151,13 +133,9 @@ def plans_to_keywords(plans: List[str]) -> List[str]: @staticmethod def remove_stopwords(keywords): """ - Removes stopwords from the given list of keywords. 
- Note: - If you are using stopwords for the first time, you need to download them using nltk.download('stopwords'). - Args: - keywords (list): A list of keywords. - Returns: - list: A list of keywords with the stopwords removed. + Removes stopwords from the given list of keywords. If you are using stopwords for the first time, you need to download them using nltk.download('stopwords'). + :param keywords: The list of keywords to be filtered. + :return: The list of keywords with the stopwords removed. """ try: @@ -173,9 +151,12 @@ def remove_stopwords(keywords): return [keyword for keyword in keywords if keyword in stopwords_list] @staticmethod - def cos_sim(embedding1, embedding2): + def cos_sim(embedding1, embedding2) -> float: """ Computes the cosine similarity between two embeddings. + :param embedding1: The first embedding. + :param embedding2: The second embedding. + :return: The cosine similarity between the two embeddings. """ import sentence_transformers @@ -191,9 +172,9 @@ class TextControlFilter: def control_filter(control_dicts: Dict, plans: List[str]) -> Dict: """ Filters control items based on keywords. - Args: - control_dicts (dict): A dictionary of control items to be filtered. - plans (list): A list of plans for the following steps. + :param control_dicts: The dictionary of control items to be filtered. + :param plans: The list of plans to be used for filtering. + :return: The filtered control items. """ filtered_control_dict = {} @@ -216,12 +197,11 @@ class SemanticControlFilter(BasicControlFilter): def control_filter_score(self, control_text, plans): """ Calculates the score for a control item based on the similarity between its text and a set of keywords. - Args: - control_text (str): The text of the control item. - plans (list): The plan to be used for calculating the similarity. - Returns: - float: The score (0-1) indicating the similarity between the control text and the keywords. + :param control_text: The text of the control item. + :param plans: The plan to be used for calculating the similarity. + :return: The score (0-1) indicating the similarity between the control text and the keywords. """ + plan_embedding = self.get_embedding(plans) control_text_embedding = self.get_embedding(control_text) return max(self.cos_sim(control_text_embedding, plan_embedding).tolist()[0]) @@ -229,10 +209,10 @@ def control_filter_score(self, control_text, plans): def control_filter(self, control_dicts, plans, top_k): """ Filters control items based on their similarity to a set of keywords. - Args: - control_dicts (dict): A dictionary of control items to be filtered. - plans (list): A list of plans. - top_k (int): The number of top control items to be selected. + :param control_dicts: The dictionary of control items to be filtered. + :param plans: The list of plans to be used for filtering. + :param top_k: The number of top control items to return. + :return: The filtered control items. """ scores_items = [] filtered_control_dict = {} @@ -260,12 +240,11 @@ class IconControlFilter(BasicControlFilter): def control_filter_score(self, control_icon, plans): """ Calculates the score of a control icon based on its similarity to the given keywords. - Args: - control_icon: The control icon image. - plan: The plan to compare the control icon against. - Returns: - The maximum similarity score between the control icon and the keywords. + :param control_icon: The control icon image. + :param plans: The plan to compare the control icon against. 
+ :return: The maximum similarity score between the control icon and the keywords. """ + plans_embedding = self.get_embedding(plans) control_icon_embedding = self.get_embedding(control_icon) return max(self.cos_sim(control_icon_embedding, plans_embedding).tolist()[0]) @@ -273,14 +252,13 @@ def control_filter_score(self, control_icon, plans): def control_filter(self, control_dicts, cropped_icons_dict, plans, top_k): """ Filters control items based on their scores and returns the top-k items. - Args: - control_dicts: The dictionary of all control items. - cropped_icons: The dictionary of the cropped icons. - plans: The plans to compare the control icons against. - top_k: The number of top items to return. - Returns: - The list of top-k control items based on their scores. + :param control_dicts: The dictionary of all control items. + :param cropped_icons_dict: The dictionary of the cropped icons. + :param plans: The plans to compare the control icons against. + :param top_k: The number of top items to return. + :return: The list of top-k control items based on their scores. """ + scores_items = [] filtered_control_dict = {} diff --git a/ufo/automator/ui_control/controller.py b/ufo/automator/ui_control/controller.py index 757e724d..197ad8db 100644 --- a/ufo/automator/ui_control/controller.py +++ b/ufo/automator/ui_control/controller.py @@ -44,8 +44,7 @@ def type_name(self): def atomic_execution(self, method_name: str, params: Dict[str, Any]) -> str: """ Atomic execution of the action on the control elements. - :param control: The control element to execute the action. - :param method: The method to execute. + :param method_name: The name of the method to execute. :param params: The arguments of the method. :return: The result of the action. """ @@ -137,7 +136,6 @@ def keyboard_input(self, params: Dict[str, str]) -> str: def texts(self) -> str: """ Get the text of the control element. - :param args: The arguments of the text method. :return: The text of the control element. """ return self.control.texts() diff --git a/ufo/experience/summarizer.py b/ufo/experience/summarizer.py index 07b8fb85..be8652a6 100644 --- a/ufo/experience/summarizer.py +++ b/ufo/experience/summarizer.py @@ -42,8 +42,8 @@ def __init__( def build_prompt(self, log_partition: dict) -> list: """ Build the prompt. - :param logs: The logs. - :param user_request: The user request. + :param log_partition: The log partition. + return: The prompt. """ experience_prompter = ExperiencePrompter( self.is_visual, diff --git a/ufo/module/basic.py b/ufo/module/basic.py index a89da776..adb5fe1c 100644 --- a/ufo/module/basic.py +++ b/ufo/module/basic.py @@ -109,7 +109,7 @@ def run(self) -> None: # If the subtask ends, capture the last snapshot of the application. if self.state.is_subtask_end(): - time.sleep(3) + time.sleep(configs["SLEEP_TIME"]) self.capture_last_snapshot(sub_round_id=self.subtask_amount) self.subtask_amount += 1 diff --git a/ufo/module/sessions/session.py b/ufo/module/sessions/session.py index d1e1fc40..e60b0ab6 100644 --- a/ufo/module/sessions/session.py +++ b/ufo/module/sessions/session.py @@ -176,7 +176,7 @@ def __init__( """ Initialize a session. :param task: The name of current task. - :param plan_dir: The path of the plan file to follow. + :param plan_file: The path of the plan file to follow. :param should_evaluate: Whether to evaluate the session. :param id: The id of the session. 
""" diff --git a/ufo/prompter/basic.py b/ufo/prompter/basic.py index d4fb6859..8e6e174f 100644 --- a/ufo/prompter/basic.py +++ b/ufo/prompter/basic.py @@ -107,23 +107,38 @@ def retrived_documents_prompt_helper( @abstractmethod def system_prompt_construction(self) -> str: + """ + Construct the system prompt for LLM. + """ pass @abstractmethod def user_prompt_construction(self) -> str: + """ + Construct the textual user prompt for LLM based on the `user` field in the prompt template. + """ pass @abstractmethod def user_content_construction(self) -> str: + """ + Construct the full user content for LLM, including the user prompt and images. + """ pass def examples_prompt_helper(self) -> str: + """ + A helper function to construct the examples prompt for in-context learning. + """ pass def api_prompt_helper(self) -> str: + """ + A helper function to construct the API list and descriptions for the prompt. + """ pass diff --git a/ufo/rag/retriever.py b/ufo/rag/retriever.py index 44d831b8..509951dd 100644 --- a/ufo/rag/retriever.py +++ b/ufo/rag/retriever.py @@ -135,14 +135,14 @@ class ExperienceRetriever(Retriever): def __init__(self, db_path) -> None: """ Create a new ExperienceRetriever. - :appname: The name of the application. + :param db_path: The path to the database. """ self.indexer = self.get_indexer(db_path) def get_indexer(self, db_path: str): """ Create an experience indexer. - :param query: The query to create an indexer for. + :param db_path: The path to the database. """ try: