Merge pull request #67 from microsoft/vyokky/dev
Minor fix for releasing new version
vyokky authored May 8, 2024
2 parents 640bd0e + 12f641d commit 1fad66b
Showing 30 changed files with 610 additions and 686 deletions.
19 changes: 12 additions & 7 deletions README.md
@@ -22,17 +22,22 @@

## 🕌 Framework
<b>UFO</b> <img src="./assets/ufo_blue.png" alt="UFO Image" width="24"> operates as a dual-agent framework, encompassing:
- <b>AppAgent 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>ActAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, is tasked with translating actions from AppAgent and ActAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** API.
- <b>HostAgent (Previously AppAgent) 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>AppAgent (Previously ActAgent) 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** API.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939).
<h1 align="center">
<img src="./assets/framework.png"/>
<img src="./assets/framework_v2.png"/>
</h1>


## 📢 News
- 📅 2024-05-07: **New Release for v0.1.1!** We've made some significant updates! Previously known as AppAgent and ActAgent, we've rebranded them to HostAgent and AppAgent to better align with their functionalities. Explore the latest enhancements:
1. **Learning from Human Demonstration:** UFO now supports learning from human demonstration! Utilize the [Windows Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record your steps and demonstrate them for UFO. Refer to our detailed guide in [README.md](/record_processor/README.md) for more information.
2. **Win32 Support:** We've incorporated support for [Win32](https://learn.microsoft.com/en-us/windows/win32/controls/window-controls) as a control backend, enhancing our UI automation capabilities.
3. **Extended Application Interaction:** UFO now goes beyond UI controls, allowing interaction with your application through keyboard inputs and native APIs! Presently, we support Word ([examples](/ufo/prompts/apps/word/api.yaml)), with more to come soon. Customize and build your own interactions.
4. **Control Filtering:** Streamline LLM's action process by using control filters to remove irrelevant control items. Enable them in [config_dev.yaml](/ufo/config/config_dev.yaml) under the `control filtering` section at the bottom.
- 📅 2024-03-25: **New Release for v0.0.1!** Check out our exciting new features:
1. We now support creating your help documents for each Windows application to become an app expert. Check the [README](./learner/README.md) for more details!
2. UFO now supports RAG from offline documents and online Bing search.
@@ -84,7 +89,7 @@ pip install -r requirements.txt
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. You can create your own config file `ufo/config/config.yaml`, by copying the `ufo/config/config.yaml.template` and editing config for **APP_AGENT** and **ACTION_AGENT** as follows:
Before running UFO, you need to provide your LLM configurations **individually for HostAgent and AppAgent**. You can create your own config file `ufo/config/config.yaml`, by copying the `ufo/config/config.yaml.template` and editing config for **APP_AGENT** and **ACTION_AGENT** as follows:

#### OpenAI
```bash
@@ -244,7 +249,7 @@ Please consult the [WindowsBench](https://arxiv.org/pdf/2402.07939.pdf) provided


## 📚 Citation
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939).
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939). Note that previous AppAgent and ActAgent in the paper are renamed to HostAgent and AppAgent in the code base to better reflect their functions.
If you use UFO in your research, please cite our paper:
```
@article{ufo,
@@ -257,9 +262,9 @@ If you use UFO in your research, please cite our paper:

## 📝 Todo List
- [x] RAG enhanced UFO.
- [x] Support more control using Win32 API.
- [ ] Documentation.
- [ ] Support local host GUI interaction model.
- [ ] Support more control using Win32 API.
- [ ] Chatbox GUI for UFO.


Binary file added assets/framework_v2.png
13 changes: 3 additions & 10 deletions ufo/automator/app_apis/basic.py
@@ -23,15 +23,16 @@ def __init__(self, app_root_name: str, process_name: str, clsid: str) -> None:
:param process_name: The process name.
:param clsid: The CLSID of the COM object.
"""
super().__init__()


self.app_root_name = app_root_name
self.process_name = process_name
self.clsid = clsid

self.client = win32com.client.Dispatch(self.clsid)
self.com_object = self.get_object_from_process_name()

super().__init__()


@abstractmethod
def get_default_command_registry(self):
@@ -48,14 +49,6 @@ def get_object_from_process_name(self) -> None:
:param process_name: The process name.
"""
pass


@abstractmethod
def get_default_command_registry(self) -> Dict:
"""
Get the method registry of the COM object.
"""
pass
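
The reordering of `super().__init__()` in this file matters if the base constructor builds the command registry by calling `get_default_command_registry()`, which after this commit reads instance attributes such as `self.app_root_name`. The sketch below illustrates that ordering issue in isolation. It assumes the base class calls the hook from its constructor; `ReceiverSketch` and `ComReceiverSketch` are hypothetical stand-ins, not the classes in the repository.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ReceiverSketch(ABC):
    """Hypothetical base receiver whose constructor builds the registry."""

    def __init__(self) -> None:
        # Assumption: the base constructor calls an overridable hook.
        self.command_registry = self.get_default_command_registry()

    @abstractmethod
    def get_default_command_registry(self) -> Dict[str, str]:
        ...


class ComReceiverSketch(ReceiverSketch):
    """Hypothetical subclass whose hook depends on an instance attribute."""

    def __init__(self, app_root_name: str) -> None:
        # Assign attributes first so the hook can read them...
        self.app_root_name = app_root_name
        # ...and only then let the base class build the registry.
        super().__init__()

    def get_default_command_registry(self) -> Dict[str, str]:
        # Reads self.app_root_name; if super().__init__() ran before the
        # assignment above, this lookup would raise AttributeError.
        return {"insert_table": f"{self.app_root_name}:InsertTableCommand"}


receiver = ComReceiverSketch("WINWORD.EXE")
print(receiver.command_registry)
```

Calling `super().__init__()` last is one way to handle this; another would be to build the registry lazily on first access.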



32 changes: 19 additions & 13 deletions ufo/automator/app_apis/word/wordclient.py
@@ -1,9 +1,11 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

from typing import Dict
from typing import Dict, Type

from ..basic import WinCOMCommand, WinCOMReceiverBasic
from ...basic import CommandBasic, ReceiverBasic
from ....prompter.agent_prompter import APIPromptLoader


class WordWinCOMReceiver(WinCOMReceiverBasic):
@@ -26,18 +28,22 @@ def get_object_from_process_name(self) -> None:

return None

@staticmethod
def get_default_command_registry() -> Dict[str, WinCOMCommand]:
"""
Get the method registry of the COM object.
:return: The method registry.
"""
mappping = {
"insert_table": InsertTableCommand,
"select_text": SelectTextCommand,
"select_table": SelectTableCommand
}
return mappping

    def get_default_command_registry(self) -> Dict[str, Type[CommandBasic]]:
        """
        Get the default command registry.
        """

        api_prompt = APIPromptLoader(self.app_root_name).load_com_api_prompt()
        class_name_dict = self.filter_api_dict(api_prompt, "class_name")

        global_name_space = globals()
        command_registry = self.name_to_command_class(global_name_space, class_name_dict)

        return command_registry





def insert_table(self, rows: int, columns: int) -> object:
35 changes: 34 additions & 1 deletion ufo/automator/basic.py
@@ -4,7 +4,8 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Dict, List
from typing import Any, Dict, List, Type
from ..utils import print_with_color


class ReceiverBasic(ABC):
@@ -60,6 +61,38 @@ def self_command_mapping(self) -> Dict[str, CommandBasic]:
"""
return {command_name: self for command_name in self.supported_command_names}


    @staticmethod
    def filter_api_dict(api_dict: Dict[str, Any], key: str) -> Dict[str, str]:
        """
        Filter the API dictionary.
        :param api_dict: The API dictionary.
        :param key: The key to filter.
        :return: The filtered API dictionary.
        """
        return {k: v.get(key, None) for k, v in api_dict.items()}


    @staticmethod
    def name_to_command_class(global_namespace: Dict[str, Any], class_name_mapping: Dict[str, str]) -> Dict[str, Type[CommandBasic]]:
        """
        Convert the class name to the command class.
        :param global_namespace: The namespace in which the command classes are looked up.
        :param class_name_mapping: The class name mapping.
        :return: The command class mapping.
        """

        api_class_registry = {}

        for key, command_class_name in class_name_mapping.items():
            if command_class_name in global_namespace:
                api_class_registry[key] = global_namespace[command_class_name]
            else:
                # f-string, so the class name and API key are interpolated into the warning.
                print_with_color(f"Warning: The command class {command_class_name} with api key {key} is not found in the global namespace.", "yellow")

        return api_class_registry





@property
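
The two helpers added in this hunk resolve API names to command classes in two steps: pick the `class_name` field out of each API entry, then look that name up in a namespace. A minimal, self-contained sketch of that flow follows. The `api_prompt` dictionary and the placeholder command classes are invented for illustration, and an explicit namespace dictionary is used here where the commit passes `globals()`.

```python
from typing import Any, Dict, Optional


class InsertTableCommand:
    """Placeholder command class, defined only for this sketch."""


class SelectTextCommand:
    """Placeholder command class, defined only for this sketch."""


# Hypothetical API description, shaped like a parsed api.yaml file.
api_prompt: Dict[str, Dict[str, Any]] = {
    "insert_table": {"summary": "Insert a table.", "class_name": "InsertTableCommand"},
    "select_text": {"summary": "Select text.", "class_name": "SelectTextCommand"},
    "select_table": {"summary": "Select a table.", "class_name": "SelectTableCommand"},
}


def filter_api_dict(api_dict: Dict[str, Dict[str, Any]], key: str) -> Dict[str, Optional[str]]:
    # Keep one field per API entry; entries without the field map to None.
    return {name: spec.get(key) for name, spec in api_dict.items()}


def name_to_command_class(namespace: Dict[str, Any], class_names: Dict[str, Optional[str]]) -> Dict[str, type]:
    registry: Dict[str, type] = {}
    for api_name, class_name in class_names.items():
        if class_name in namespace:
            registry[api_name] = namespace[class_name]
        else:
            # Unknown names are reported and skipped, not treated as errors.
            print(f"Warning: class {class_name} for API {api_name} was not found.")
    return registry


# Explicit namespace for the sketch; the commit uses globals() of the module.
namespace = {cls.__name__: cls for cls in (InsertTableCommand, SelectTextCommand)}
registry = name_to_command_class(namespace, filter_api_dict(api_prompt, "class_name"))
print(sorted(registry))
```

In this sketch `select_table` is left out of the registry because no class with that name exists in the namespace, which mirrors the warning branch in `name_to_command_class`.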
36 changes: 20 additions & 16 deletions ufo/automator/ui_control/controller.py
@@ -4,9 +4,10 @@
import time
import warnings
from abc import abstractmethod
from typing import Dict, List
from typing import Dict, List, Type

from ...config.config import Config
from ...prompter.agent_prompter import APIPromptLoader
from ...utils import print_with_color
from ..basic import CommandBasic, ReceiverBasic, ReceiverFactory

@@ -15,6 +16,10 @@


class ControlReceiver(ReceiverBasic):
"""
The control receiver class.
"""

def __init__(self, control, application):
"""
Initialize the control receiver.
@@ -31,21 +36,22 @@ def __init__(self, control, application):
self.application = application


@staticmethod
def get_default_command_registry():

def get_default_command_registry(self) -> Dict[str, Type[CommandBasic]]:
"""
The default command registry.
Get the default command registry.
"""
return {
"click_input": ClickInputCommand,
"summary": SummaryCommand,
"set_edit_text": SetEditTextCommand,
"texts": GetTextsCommand,
"wheel_mouse_input": WheelMouseInputCommand,
"keyboard_input": keyboardInputCommand,
"annotation": AnnotationCommand,
"": NoActionCommand
}

api_prompt = APIPromptLoader("").load_ui_api_prompt()
class_name_dict = self.filter_api_dict(api_prompt, "class_name")

global_name_space = globals()
command_registry = self.name_to_command_class(global_name_space, class_name_dict)

command_registry[""] = NoActionCommand

return command_registry


@property
def type_name(self):
@@ -263,8 +269,6 @@ def execute(self) -> str:





class ClickInputCommand(ControlCommand):
"""
The click input command class.
16 changes: 8 additions & 8 deletions ufo/config/config_dev.yaml
@@ -25,26 +25,26 @@ INCLUDE_LAST_SCREENSHOT: True # Whether to include the last screenshot in the o
REQUEST_TIMEOUT: 250 # The call timeout for the GPT-V model


HOSTAGENT_PROMPT: "ufo/prompts/base/{mode}/host_agent.yaml" # The prompt for the app selection
# Because of input-size limits, a lite version of the prompt is available for a quick trial: "ufo/prompts/base/lite/{mode}/host_agent.yaml"
APPAGENT_PROMPT: "ufo/prompts/base/{mode}/app_agent.yaml" # The prompt for the action selection
# Lite version: "ufo/prompts/base/lite/{mode}/app_agent.yaml"
FOLLOWERAHENT_PROMPT: "ufo/prompts/base/{mode}/app_agent.yaml" # The prompt for the follower agent
HOSTAGENT_PROMPT: "ufo/prompts/share/base/host_agent.yaml" # The prompt for the app selection
# Because of input-size limits, a lite version of the prompt is available for a quick trial: "ufo/prompts/share/lite/host_agent.yaml"
APPAGENT_PROMPT: "ufo/prompts/share/base/app_agent.yaml" # The prompt for the action selection
# Lite version: "ufo/prompts/share/lite/app_agent.yaml"
FOLLOWERAHENT_PROMPT: "ufo/prompts/share/base/app_agent.yaml" # The prompt for the follower agent

HOSTAGENT_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/host_agent_example.yaml" # The prompt for the app selection
# Lite version: "ufo/prompts/examples/lite/{mode}/host_agent_example.yaml"
APPAGENT_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/app_agent_example.yaml" # The prompt for the action selection
# Lite version: "ufo/prompts/examples/lite/{mode}/app_agent_example.yaml"

## For experience learning
EXPERIENCE_PROMPT: "ufo/prompts/experience/{mode}/experience_summary.yaml"
EXPERIENCE_PROMPT: "ufo/prompts/experience/experience_summary.yaml"
EXPERIENCE_SAVED_PATH: "vectordb/experience/"

## For user demonstration learning
DEMONSTRATION_PROMPT: "ufo/prompts/demonstration/{mode}/demonstration_summary.yaml"
DEMONSTRATION_PROMPT: "ufo/prompts/demonstration/demonstration_summary.yaml"
DEMONSTRATION_SAVED_PATH: "vectordb/demonstration/"

API_PROMPT: "ufo/prompts/base/{mode}/api.yaml" # The prompt for the API
API_PROMPT: "ufo/prompts/share/base/api.yaml" # The prompt for the API
INPUT_TEXT_API: "type_keys" # The input text API
INPUT_TEXT_ENTER: True # whether to press enter after typing the text
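
After this change, some prompt paths are fixed strings while the example prompts still carry a `{mode}` placeholder. A loader can treat both kinds uniformly, since `str.format` ignores keyword arguments that the template does not use. The sketch below shows that idea only; `resolve_prompt_path` and the mode value `"visual"` are assumptions for illustration and are not taken from the UFO code base.

```python
def resolve_prompt_path(template: str, mode: str) -> str:
    """Fill in the {mode} placeholder when the template has one."""
    # str.format ignores unused keyword arguments, so fixed paths pass through unchanged.
    return template.format(mode=mode)


# Two entries copied from the config above: one fixed path, one templated path.
config = {
    "APPAGENT_PROMPT": "ufo/prompts/share/base/app_agent.yaml",
    "APPAGENT_EXAMPLE_PROMPT": "ufo/prompts/examples/{mode}/app_agent_example.yaml",
}

for key, template in config.items():
    print(key, "->", resolve_prompt_path(template, mode="visual"))
```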

6 changes: 6 additions & 0 deletions ufo/module/client.py
@@ -9,6 +9,9 @@


class UFOClient:
"""
A UFO client to run the UFO system for a single session.
"""

def __init__(self, session: BaseSession) -> None:
"""
@@ -35,6 +38,9 @@ def run(self) -> None:


class UFOClientManager:
"""
The manager for the UFO clients.
"""

def __init__(self, session_list: List[BaseSession]) -> None:
"""