Revolutionizing AI Agents: Unlocking Computer Control with OS World

Revolutionize AI agents with OS World, a new open-source project that provides a robust environment to benchmark and test AI agents in real computer environments. Learn how this breakthrough enables agents to execute complex tasks by grounding instructions into concrete actions.

February 19, 2025


OS World is an open-source project for benchmarking AI agents that control real computers across operating systems. This article covers how it helps agents ground instructions into concrete actions, what agents can observe and do inside it, and what the early benchmark results suggest about tackling complex, real-world tasks.

How OS World Enables AI Agents to Control Computers Across Operating Systems

OS World is a new project that aims to address the challenge of benchmarking and testing AI agents in real computer environments. The key features of OS World include:

  1. Unified Multimodal Environment: OS World provides a unified environment for AI agents to operate across different operating systems, applications, and interfaces, including both graphical user interfaces (GUIs) and command-line interfaces (CLIs).

  2. Observation and Action Spaces: OS World defines the observation space, which includes the current desktop environment, instructions, screenshots, and accessibility trees. It also defines the action space, which includes actions like mouse movements, clicks, keyboard input, and more.

  3. Evaluation Metrics: OS World includes carefully annotated real-world computer tasks, with initial state configurations and custom evaluation scripts to assess the performance of AI agents.

  4. Accessibility and Grounding: OS World provides accessibility information and grounding to enable AI agents to interpret and execute instructions, overcoming the limitations of approaches like Open Interpreter that rely on imprecise screenshot-based interactions.

  5. Open-Source and Reproducible: The OS World project, including the research paper, code, and data, is open-source, allowing for reproducibility and further development by the research community.

The key insight behind OS World is that to enable AI agents to perform real-world computer tasks, they need access to the underlying operating system and application interfaces, not just high-level screenshots. By providing this grounding, OS World aims to facilitate the development of more capable and versatile AI agents that can operate seamlessly across different computing environments.
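The observation and action spaces described above can be pictured as two small data types. The following is a minimal sketch, not OS World's actual API: the class names, field names, and the `validate` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Observation:
    """What the agent sees at one step (hypothetical structure)."""
    instruction: str                          # natural-language task description
    screenshot_png: Optional[bytes] = None    # raw screenshot, if provided
    accessibility_tree: Optional[str] = None  # serialized UI tree, if provided


@dataclass
class Action:
    """One low-level input event (hypothetical structure)."""
    kind: str                     # "move", "click", "type", or "hotkey"
    x: Optional[int] = None       # screen coordinates for mouse actions
    y: Optional[int] = None
    text: Optional[str] = None    # text for keyboard input
    keys: Tuple[str, ...] = ()    # key combination for hotkeys


ALLOWED_KINDS = {"move", "click", "type", "hotkey"}


def validate(action: Action) -> bool:
    """Check that an action carries the fields its kind requires."""
    if action.kind not in ALLOWED_KINDS:
        return False
    if action.kind in {"move", "click"}:
        return action.x is not None and action.y is not None
    if action.kind == "type":
        return bool(action.text)
    return len(action.keys) > 0  # hotkey needs at least one key
```

Keeping the action type this narrow means a malformed model output, such as a click without coordinates, can be rejected before it ever reaches the machine.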

Defining Intelligent Agents and Their Key Components

An intelligent agent is defined as a system that perceives its environment through sensors and acts upon that environment through effectors, in a rational manner to achieve its goals. The key components of an intelligent agent are:

  1. Sensors: The agent's means of perceiving its environment, such as cameras, microphones, or other input devices.

  2. Effectors: The agent's means of acting upon its environment, such as motors, speakers, or other output devices.

  3. Autonomy: The agent's ability to operate without direct human control.

  4. Reactivity: The agent's ability to perceive and respond to changes in its environment in a timely fashion.

  5. Proactivity: The agent's ability to exhibit goal-directed behavior by taking the initiative to achieve its objectives.

  6. Social Ability: The agent's capacity to interact with other agents or humans in its environment.

These components allow the agent to perceive its environment, plan and execute actions, and learn from its experiences to improve its performance over time. The goal of an intelligent agent is to maximize its performance in achieving its objectives, while operating within the constraints of its environment.
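The sensor and effector roles can be made concrete with a toy perceive-act loop. This is only an illustrative sketch: `TextFieldEnv`, `TypingAgent`, and `run` are hypothetical names, and the "environment" is a single text field rather than a real desktop.

```python
from typing import Optional


class TextFieldEnv:
    """Toy environment: a single text field the agent can append to."""

    def __init__(self) -> None:
        self.text = ""

    def observe(self) -> str:           # sensor: read the current field content
        return self.text

    def apply(self, char: str) -> None:  # effector: type one character
        self.text += char


class TypingAgent:
    """Goal-directed agent: wants the field to contain `goal`."""

    def __init__(self, goal: str) -> None:
        self.goal = goal

    def act(self, observed: str) -> Optional[str]:
        # Stop when the goal is reached or the field has diverged from it.
        if observed == self.goal or not self.goal.startswith(observed):
            return None
        return self.goal[len(observed)]  # next character to type


def run(agent: TypingAgent, env: TextFieldEnv, max_steps: int = 50) -> int:
    """Perceive, decide, act until the agent stops; return actions taken."""
    for step in range(max_steps):
        action = agent.act(env.observe())
        if action is None:
            return step
        env.apply(action)
    return max_steps
```

The loop runs without human intervention (autonomy), re-reads the environment every step (reactivity), and keeps acting until its goal is met (proactivity).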

The Challenges of Controlling Computers for AI Agents

Controlling computers and executing tasks in digital environments has been a significant challenge for AI agents. The presentation highlights the key issues:

  1. Grounding Instructions into Actions: Simply providing step-by-step instructions is not enough for an AI agent to execute a task successfully. The agent needs to be able to ground those instructions into actual actions that can control the computer interface, whether it's a mouse, keyboard, or other input methods.

  2. Closed and Proprietary Systems: Operating systems like macOS and Windows are closed and proprietary, making it difficult for AI agents to precisely control the computer environment. Existing approaches, like using accessibility features and screenshot grids, are imprecise and inefficient.

  3. Lack of Feedback and Iteration: Without the ability to perceive the environment and receive feedback, AI agents struggle to generate accurate, multi-step plans for executing tasks. The lack of interaction with the real environment limits their capacity to learn and improve.

  4. Complexity of Real-World Tasks: Many real-world computer tasks involve multiple applications, interfaces, and workflows. Translating high-level instructions into the necessary actions to complete these complex tasks is a significant challenge for current AI agents.

To address these challenges, the OS World project aims to provide a scalable, real computer environment that can serve as a unified, multimodal agent environment for evaluating open-ended computer tasks. By offering access to various operating systems, applications, and interfaces, along with detailed observations and feedback, OS World enables AI agents to ground their instructions into precise actions and iterate on their performance.
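Grounding can be illustrated by resolving the target of an instruction such as "click Save" to screen coordinates using an accessibility tree. The sketch below assumes a made-up XML layout with `name`, `x`, `y`, `width`, and `height` attributes; real accessibility trees differ by platform, and `ground_click` is a hypothetical helper, not part of OS World.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Toy accessibility tree; the schema is an assumption for this example.
TOY_TREE = """
<desktop>
  <window name="Editor">
    <button name="Open" x="10" y="5" width="60" height="20"/>
    <button name="Save" x="80" y="5" width="60" height="20"/>
  </window>
</desktop>
"""

GEOMETRY = ("x", "y", "width", "height")


def ground_click(tree_xml: str, target_name: str) -> Optional[dict]:
    """Return a click action at the center of the named element, if found."""
    root = ET.fromstring(tree_xml.strip())
    for node in root.iter():
        has_geometry = all(key in node.attrib for key in GEOMETRY)
        if node.get("name") == target_name and has_geometry:
            x = int(node.get("x")) + int(node.get("width")) // 2
            y = int(node.get("y")) + int(node.get("height")) // 2
            return {"kind": "click", "x": x, "y": y}
    return None  # nothing to ground the instruction to


print(ground_click(TOY_TREE, "Save"))
```

Because the coordinates come from element metadata rather than from guessing pixel positions in an image, the resulting click is exact for this toy layout.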

OS World: A Scalable Real-World Computer Environment for Benchmarking AI Agents

OS World is a new project that aims to address the challenge of consistently and thoroughly testing AI agents. It provides a robust environment spanning multiple operating systems, a way for agents to interact with that environment, and a way to measure their performance.

The key features of OS World include:

  1. Multimodal Agent Environment: OS World serves as a unified environment for evaluating open-ended computer tasks that involve arbitrary apps and interfaces across operating systems.

  2. Observation Modes: Agents can receive observations through various modes, including the accessibility tree, screenshot, and a set of marks (numbered bounding boxes overlaid on the screenshot).

  3. Action Space: Agents can perform a range of actions, such as mouse movements, clicks, keyboard input, and using hotkeys, to interact with the environment.

  4. Task Evaluation: OS World includes carefully annotated real-world computer tasks, with initial state setups and custom execution-based evaluation scripts to assess the agent's performance.

  5. Benchmarking: The project has been used to benchmark various agents, including CogAgent, GPT-4, Gemini Pro, and Claude 3, demonstrating the effectiveness of the accessibility tree and screenshot-based observation modes.

  6. Open-Source: The OS World project, including the code and data, is open-source, allowing researchers and developers to access and build upon the platform.

By providing a standardized and scalable environment for testing AI agents, OS World aims to advance the field of agent-based AI and enable more robust and reliable performance evaluation.
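Execution-based evaluation means the score comes from inspecting the final state of the machine, not from comparing the agent's actions to a reference trace. Here is a minimal sketch of that idea for a made-up task ("rename draft.txt to notes.txt"); the function names and the task itself are hypothetical and do not reflect OS World's actual task format.

```python
import tempfile
from pathlib import Path
from typing import Callable


def setup_task(workdir: Path) -> None:
    """Initial state: a draft file the agent is asked to rename."""
    (workdir / "draft.txt").write_text("hello")


def evaluate(workdir: Path) -> float:
    """Score 1.0 only if draft.txt became notes.txt with contents intact."""
    old, new = workdir / "draft.txt", workdir / "notes.txt"
    ok = (not old.exists()) and new.exists() and new.read_text() == "hello"
    return 1.0 if ok else 0.0


def run_task(agent_fn: Callable[[Path], None]) -> float:
    """Set up a fresh sandbox, let the agent act, then check the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        setup_task(workdir)
        agent_fn(workdir)
        return evaluate(workdir)


def good_agent(workdir: Path) -> None:
    (workdir / "draft.txt").rename(workdir / "notes.txt")


def lazy_agent(workdir: Path) -> None:
    pass  # does nothing, so the task should fail


print(run_task(good_agent), run_task(lazy_agent))
```

An evaluator of this kind accepts any sequence of actions that reaches the right end state, which matters when a task can be solved through either a GUI or a command line.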

Evaluating Agent Performance in OS World

The OS World project aims to provide a robust and scalable environment for evaluating the performance of AI agents in executing real-world computer tasks. The key aspects of this evaluation process are:

  1. Task Formalization: An agent task is formalized as a Partially Observable Markov Decision Process (POMDP), with a defined state space, observation space, action space, transition function, and reward function.

  2. Observation Modalities: Agents can receive observations through various modalities, including the accessibility tree, screenshot, and a set of marks (numbered bounding boxes overlaid on the screenshot). These provide different levels of information about the current state of the environment.

  3. Action Space: Agents can perform a range of actions to interact with the computer environment, such as mouse movements, clicks, keyboard input, scrolling, and using hotkeys.

  4. Task Execution Evaluation: Each task is carefully annotated with real-world instructions, an initial state setup, and a custom evaluation script that checks whether the task was completed successfully.

  5. Benchmark Tasks: The OS World project includes 369 real-world computer tasks involving web and desktop applications, file operations, and multi-app workflows, providing a comprehensive set of benchmarks for evaluating agent performance.

The results presented in the paper show that large language models like GPT-4 perform best when provided with the accessibility tree or a combination of the screenshot and accessibility tree, outperforming other input modalities like screenshot-only or set of marks. This suggests that the accessibility tree provides the most informative grounding for agents to execute tasks in the OS World environment.

The OS World project represents a significant step forward in the development of robust and scalable benchmarks for evaluating the capabilities of AI agents in real-world computer environments. By providing a standardized and open-source platform, it enables researchers and developers to systematically assess and improve the performance of their agents across a wide range of tasks and scenarios.
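The POMDP framing and the resulting success-rate metric can be sketched with a toy environment. Everything here is an assumption made for illustration: the integer state, the two-valued observation, the step limit of 15, and the names `ToyPOMDP`, `run_episode`, and `success_rate`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyPOMDP:
    state: int  # hidden state; the agent never reads this directly
    goal: int

    def observe(self) -> str:
        """Observation function: only a partial view of the state."""
        return "at_goal" if self.state == self.goal else "not_at_goal"

    def step(self, action: int) -> None:
        """Transition function: the action shifts the hidden state."""
        self.state += action

    def reward(self) -> float:
        """Reward function: 1.0 if the task ended in the goal state."""
        return 1.0 if self.state == self.goal else 0.0


def run_episode(env: ToyPOMDP, policy: Callable[[str], int],
                max_steps: int = 15) -> float:
    """Act until the goal is observed or the step budget runs out."""
    for _ in range(max_steps):
        obs = env.observe()
        if obs == "at_goal":
            break
        env.step(policy(obs))
    return env.reward()


def success_rate(envs: List[ToyPOMDP], policy: Callable[[str], int]) -> float:
    """Fraction of tasks solved, the headline number for a benchmark run."""
    return sum(run_episode(env, policy) for env in envs) / len(envs)
```

With a policy that always returns `+1`, tasks whose goal lies above the start and within the step budget succeed, while the rest score zero, so the success rate reflects both the policy's limits and the step cap.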

Conclusion

The OS World project is a significant step forward in the field of AI agent benchmarking. By providing a robust, open-source environment for agents to interact with real computer systems and applications, it addresses a critical gap in the current state of AI evaluation.

The key highlights of the OS World project are:

  1. Multimodal Interaction: The environment supports a variety of input modalities, including screenshots, accessibility trees, and set of marks, allowing agents to perceive and interact with the computer environment in a more natural and comprehensive way.

  2. Real-World Tasks: The project includes a diverse set of 369 real-world computer tasks, carefully curated from user instructions, that involve multi-step workflows across various applications and operating systems.

  3. Rigorous Evaluation: The tasks are accompanied by detailed initial state configurations and custom evaluation scripts, enabling a standardized and objective assessment of agent performance.

  4. Open-Source Availability: The entire project, including the code, data, and research paper, is openly available, promoting collaboration and further advancements in the field.

The results presented in the paper demonstrate the potential of large language models, such as GPT-4, to tackle these complex, real-world computer tasks, with the accessibility tree or a combination of screenshot and accessibility tree providing the most effective input modalities.

The OS World project represents a significant step towards building more capable and versatile AI agents that can seamlessly integrate with and assist humans in their daily digital tasks. As the field of AI continues to evolve, initiatives like this will be crucial in driving progress and ensuring the development of agents that can truly operate in the real world.
