Computer use tool for agents (Preview)

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Warning

The computer use tool comes with significant security and privacy risks, including prompt injection attacks. For more information about intended uses, capabilities, limitations, risks, and considerations when choosing a use case, see the Azure OpenAI transparency note.

This article explains how to work with the computer use tool in Foundry Agent Service. Computer use is a specialized AI tool that uses a dedicated model to perform tasks by interacting with computer systems and applications through their user interfaces. With computer use, you can create an agent that handles complex tasks and makes decisions by interpreting visual elements and taking action based on on-screen content.

Features

  • Autonomous navigation: For example, computer use can open applications, click buttons, fill out forms, and navigate multistep workflows.
  • Dynamic adaptation: Interprets UI changes and adjusts actions accordingly.
  • Cross-application task execution: Operates across web-based and desktop applications.
  • Natural language interface: Users can describe a task in plain language, and the Computer Use model determines which UI interactions to execute.

Prerequisites

Request access

To access the computer-use-preview model, you need to register. Microsoft grants access based on eligibility criteria. If you have access to other limited access models, you still need to request access for this model.

To request access, see the application form.

After Microsoft grants access, you need to create a deployment for the model.

Code samples

Warning

Use the computer use tool on virtual machines with no access to sensitive data or critical resources. For more information about the intended uses, capabilities, limitations, risks, and considerations when choosing a use case, see the Azure OpenAI transparency note.

To run this code, you need the latest prerelease package. See the quickstart for details.

Start with a screenshot that represents the initial state of the environment for the computer use tool

import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import AgentReference, PromptAgentDefinition, ComputerUsePreviewTool

# Import shared helper functions
from computer_use_util import (
    SearchState,
    load_screenshot_assets,
    handle_computer_action_and_take_screenshot,
    print_final_output,
)

load_dotenv()

"""Main function to demonstrate Computer Use Agent functionality."""
# Initialize state machine
current_state = SearchState.INITIAL

# Load screenshot assets
try:
    screenshots = load_screenshot_assets()
    print("Successfully loaded screenshot assets")
except FileNotFoundError:
    print("Failed to load required screenshot assets. Please ensure the asset files exist in ../assets/")
    exit(1)
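
The sample imports its helpers from a separate computer_use_util module that isn't shown in this article. The following is a minimal sketch of what such a module could look like, assuming the pre-captured screenshots are exposed as base64 data URLs and the flow moves through three states. Only the imported names, the SearchState.INITIAL member, and the cua_browser_search.png asset come from the sample; the other asset file names, state members, and transition logic are illustrative.

# Illustrative sketch of computer_use_util.py; the sample's actual helper module may differ.
import base64
from enum import Enum, auto
from pathlib import Path

ASSETS_DIR = Path(__file__).resolve().parent.parent / "assets"

class SearchState(Enum):
    INITIAL = auto()        # Search page is showing, nothing typed yet
    TYPED_QUERY = auto()    # Query text has been typed into the search box
    RESULTS_SHOWN = auto()  # Search results page is showing

def _to_data_url(filename: str) -> str:
    data = (ASSETS_DIR / filename).read_bytes()  # Raises FileNotFoundError if the asset is missing
    return f"data:image/png;base64,{base64.b64encode(data).decode()}"

def load_screenshot_assets() -> dict:
    # Map each UI state to a pre-captured screenshot encoded as a data URL.
    filenames = {
        "browser_search": "cua_browser_search.png",   # Used by the sample's first request
        "typed_query": "cua_typed_query.png",         # Assumed asset name
        "search_results": "cua_search_results.png",   # Assumed asset name
    }
    return {name: {"filename": f, "url": _to_data_url(f)} for name, f in filenames.items()}

def handle_computer_action_and_take_screenshot(action, state, screenshots):
    # Simulate executing the model's action by advancing a simple state machine
    # and returning the screenshot that matches the resulting state.
    if action.type == "type" and state == SearchState.INITIAL:
        state = SearchState.TYPED_QUERY
    elif action.type in ("keypress", "click") and state == SearchState.TYPED_QUERY:
        state = SearchState.RESULTS_SHOWN
    # For "screenshot" or unhandled actions, stay in the current state.
    key = {
        SearchState.INITIAL: "browser_search",
        SearchState.TYPED_QUERY: "typed_query",
        SearchState.RESULTS_SHOWN: "search_results",
    }[state]
    return screenshots[key], state

def print_final_output(response):
    # Print any text the model produced once it stops requesting computer actions.
    for item in response.output:
        if item.type == "message":
            for content in item.content:
                if getattr(content, "text", None):
                    print(content.text)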

Create an agent version with the tool

project_client = AIProjectClient(
    endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

computer_use_tool = ComputerUsePreviewTool(display_width=1026, display_height=769, environment="windows")

with project_client:
    agent = project_client.agents.create_version(
        agent_name="ComputerUseAgent",
        definition=PromptAgentDefinition(
            model=os.environ["AZURE_AI_MODEL_DEPLOYMENT_NAME"],
            instructions="""
            You are a computer automation assistant. 

            Be direct and efficient. When you reach the search results page, read and describe the actual search result titles and descriptions you can see.
            """,
            tools=[computer_use_tool],
        ),
        description="Computer automation agent with screen interaction capabilities.",
    )
    print(f"Agent created (id: {agent.id}, name: {agent.name}, version: {agent.version})")

One iteration for the tool to process the screenshot and take the next step

    openai_client = project_client.get_openai_client()

    # Initial request with screenshot - start with Bing search page
    print("Starting computer automation session (initial screenshot: cua_browser_search.png)...")
    response = openai_client.responses.create(
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "I need you to help me search for 'OpenAI news'. Please type 'OpenAI news' and submit the search. Once you see search results, the task is complete.",
                    },
                    {
                        "type": "input_image",
                        "image_url": screenshots["browser_search"]["url"],
                        "detail": "high",
                    },  # Start with Bing search page
                ],
            }
        ],
        extra_body={"agent": AgentReference(name=agent.name).as_dict()},
        truncation="auto",
    )

    print(f"Initial response received (ID: {response.id})")

Perform multiple iterations

Make sure that you review each iteration and action. The following code sample shows a basic API request. After you send the initial API request, run a loop in which your application code carries out the specified action. Send a screenshot with each turn so the model can evaluate the updated state of the environment. For an example integration with a similar API, see the Azure OpenAI documentation.


max_iterations = 10  # Allow enough iterations for completion
iteration = 0

while True:
        if iteration >= max_iterations:
            print(f"\nReached maximum iterations ({max_iterations}). Stopping.")
            break

        iteration += 1
        print(f"\n--- Iteration {iteration} ---")

        # Check for computer calls in the response
        computer_calls = [item for item in response.output if item.type == "computer_call"]

        if not computer_calls:
            print_final_output(response)
            break

        # Process the first computer call
        computer_call = computer_calls[0]
        action = computer_call.action
        call_id = computer_call.call_id

        print(f"Processing computer call (ID: {call_id})")

        # Handle the action and get the screenshot info
        screenshot_info, current_state = handle_computer_action_and_take_screenshot(action, current_state, screenshots)

        print(f"Sending action result back to agent (using {screenshot_info['filename']})...")

        # Regular response with just the screenshot
        response = openai_client.responses.create(
            previous_response_id=response.id,
            input=[
                {
                    "call_id": call_id,
                    "type": "computer_call_output",
                    "output": {
                        "type": "computer_screenshot",
                        "image_url": screenshot_info["url"],
                    },
                }
            ],
            extra_body={"agent": AgentReference(name=agent.name).as_dict()},
            truncation="auto",
        )

        print(f"Follow-up response received (ID: {response.id})")

Clean up

print("Agent deleted")

For C# usage, see the Sample for use of an Agent with Computer Use tool in Azure.AI.Projects.OpenAI example in the Azure SDK for .NET repository on GitHub.

Differences between browser automation and computer use

The following table lists some of the differences between the computer use tool and browser automation tool.

| Feature | Browser automation tool | Computer use tool |
|---|---|---|
| Model support | All GPT models | computer-use-preview model only |
| Can I visualize what's happening? | No | Yes |
| How it understands the screen | Parses the HTML or XML pages into DOM documents | Raw pixel data from screenshots |
| How it acts | A list of actions provided by the model | Virtual keyboard and mouse |
| Is it multistep? | Yes | Yes |
| Interfaces | Browser | Computer and browser |
| Do I need to bring my own resource? | Yes, your own Playwright resource with the keys stored as a connection | No additional resource is required, but we highly recommend running this tool in a sandboxed environment |

Regional support

To use the computer use tool, you need a computer use model deployment. The computer use model is available in the following regions:

  • eastus2
  • swedencentral
  • southindia

Understanding the computer use integration

When working with the computer use tool, integrate it into your application by performing the following steps:

  1. Send a request to the model that includes a call to the computer use tool, the display size, and the environment. You can also include a screenshot of the initial state of the environment in the first API request.

  2. Receive a response from the model. If the response has action items, those items contain suggested actions to make progress toward the specified goal. For example, an action might be screenshot so the model can assess the current state with an updated screenshot, or click with X/Y coordinates indicating where the mouse should be moved.

  3. Execute the action by using your application code on your computer or browser environment (see the sketch after this list).

  4. After executing the action, capture the updated state of the environment as a screenshot.

  5. Send a new request with the updated state as a computer_call_output, and repeat this loop until the model stops requesting actions or you decide to stop.

    Note

    Before using the tool, set up an environment that can capture screenshots and execute the recommended actions by the agent. Use a sandboxed environment, such as Playwright, for safety reasons.
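
For example, if you run against a live browser instead of the pre-captured screenshots used in the earlier sample, steps 3 and 4 might look like the following sketch. It assumes Playwright for Python and an already-open page object, maps only a few action types, and uses action field names (x, y, text, keys, scroll_x, scroll_y) that mirror the click example later in this article; treat it as a starting point, not a complete handler.

# Illustrative sketch (assumes Playwright for Python and an already-open `page`):
# execute the model's suggested action, then capture the updated state as a data URL.
import base64

def execute_action_and_screenshot(page, action) -> str:
    if action.type == "click":
        page.mouse.click(action.x, action.y)
    elif action.type == "type":
        page.keyboard.type(action.text)
    elif action.type == "keypress":
        for key in action.keys:
            page.keyboard.press(key)
    elif action.type == "scroll":
        page.mouse.move(action.x, action.y)
        page.mouse.wheel(action.scroll_x, action.scroll_y)
    # A "screenshot" action, or any unhandled type, falls through to the capture below.
    png_bytes = page.screenshot()
    return f"data:image/png;base64,{base64.b64encode(png_bytes).decode()}"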

Handling conversation history

Use the previous_response_id parameter to link the current request to the previous response. Use this parameter if you don't want to manage the conversation history yourself.

If you don't use this parameter, make sure to include all the items returned in the response output of the previous request in your inputs array. This requirement includes reasoning items if present.
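
As a sketch of managing the history yourself, the following shows how the loop in the earlier sample could replay prior output items instead of passing previous_response_id. The conversation list and initial_user_message variable are assumptions introduced for illustration; call_id and screenshot_info mirror the variables in that sample.

# Illustrative sketch: manage conversation history yourself instead of chaining
# requests with previous_response_id.
conversation = [initial_user_message]  # The input items from the first request (assumed variable)

# Inside the loop, after executing the action:
conversation += response.output        # Replay everything the model returned, including reasoning items
conversation.append(
    {
        "call_id": call_id,
        "type": "computer_call_output",
        "output": {"type": "computer_screenshot", "image_url": screenshot_info["url"]},
    }
)
response = openai_client.responses.create(
    input=conversation,
    extra_body={"agent": AgentReference(name=agent.name).as_dict()},
    truncation="auto",
)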

Safety checks

Warning

Computer use carries substantial security and privacy risks, and you're responsible for its use. Both errors in judgment by the AI and malicious or confusing instructions on web pages, desktops, or other operating environments that the AI encounters might cause it to execute commands that you or others don't intend. These risks could compromise the security of your or other users' browsers, computers, and any accounts to which the AI has access, including personal, financial, or enterprise systems.

Use the computer use tool on virtual machines with no access to sensitive data or critical resources. For more information about the intended uses, capabilities, limitations, risks, and considerations when choosing a use case, see the Azure OpenAI transparency note.

The API has safety checks to help protect against prompt injection and model mistakes. These checks include:

  • Malicious instruction detection: The system evaluates the screenshot image and checks whether it contains adversarial content that might change the model's behavior.
  • Irrelevant domain detection: The system evaluates the current_url parameter (if provided) and checks whether the current domain is relevant given the conversation history.
  • Sensitive domain detection: The system checks the current_url parameter (if provided) and raises a warning when it detects that the user is on a sensitive domain.

If one or more of the preceding checks is triggered, the model raises a safety check when it returns the next computer_call with the pending_safety_checks parameter.

"output": [ 
    { 
        "type": "reasoning", 
        "id": "rs_67cb...", 
        "summary": [ 
            { 
                "type": "summary_text", 
                "text": "Exploring 'File' menu option." 
            } 
        ] 
    }, 
    { 
        "type": "computer_call", 
        "id": "cu_67cb...", 
        "call_id": "call_nEJ...", 
        "action": { 
            "type": "click", 
            "button": "left", 
            "x": 135, 
            "y": 193 
        }, 
        "pending_safety_checks": [ 
            { 
                "id": "cu_sc_67cb...", 
                "code": "malicious_instructions", 
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed." 
            } 
        ], 
        "status": "completed" 
    } 
]

You need to pass the safety checks back as acknowledged_safety_checks in the next request to proceed.

"input":[ 
        { 
            "type": "computer_call_output", 
            "call_id": "<call_id>", 
            "acknowledged_safety_checks": [ 
                { 
                    "id": "<safety_check_id>", 
                    "code": "malicious_instructions", 
                    "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed." 
                } 
            ], 
            "output": { 
                "type": "computer_screenshot", 
                "image_url": "<image_url>" 
            } 
        } 
    ]

Safety check handling

In all cases where pending_safety_checks are returned, hand over actions to the end user to confirm proper model behavior and accuracy.

  • malicious_instructions and irrelevant_domain: End users should review model actions and confirm that the model behaves as intended (see the sketch after this list).
  • sensitive_domain: Ensure that an end user actively monitors the model's actions on these sites. The exact implementation of this "watch mode" can vary by application. For example, you could collect user impression data on the site to make sure there's active end-user engagement with the application.
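
As a minimal sketch of this hand-off, assuming the variables from the loop in the earlier sample (computer_call, screenshot_info, openai_client, agent, response), your application could pause for explicit user confirmation before acknowledging the checks:

# Illustrative sketch: surface pending safety checks to the end user and only
# acknowledge them after explicit confirmation.
pending = getattr(computer_call, "pending_safety_checks", None) or []
if pending:
    for check in pending:
        print(f"Safety check [{check.code}]: {check.message}")
    if input("Type 'yes' to acknowledge these checks and continue: ").strip().lower() != "yes":
        raise SystemExit("Stopping: the user didn't acknowledge the safety checks.")

response = openai_client.responses.create(
    previous_response_id=response.id,
    input=[
        {
            "type": "computer_call_output",
            "call_id": computer_call.call_id,
            "acknowledged_safety_checks": [
                {"id": check.id, "code": check.code, "message": check.message} for check in pending
            ],
            "output": {"type": "computer_screenshot", "image_url": screenshot_info["url"]},
        }
    ],
    extra_body={"agent": AgentReference(name=agent.name).as_dict()},
    truncation="auto",
)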