The AI Tech behind Web Agent

by Sheng Yi, 01/29/2024

www.simplegen.ai

In a previous post, we introduced the vision of an autonomous AI agent performing daily tasks on behalf of humans. This article will provide a brief introduction to the tech behind a specific type of AI agent: the web agent.

What is a Web Agent

Since "web agent" is not a commonly accepted term, it's worth defining upfront. By "web agent", I mean an AI agent that can perform browsing actions on the web, such as clicking, scrolling, or entering text.

List of Representative Papers

[1] Multimodal Web Navigation with Instruction-Finetuned Foundation Models

[2] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

[3] Mind2Web: Towards a Generalist Agent for the Web

[4] SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

[5] GPT-4V(ision) is a Generalist Web Agent, if Grounded

Tech Problems to Solve

To browse the web autonomously, an agent must solve three major technical problems:

1. Plan the next action given the history of actions and observations.

2. Identify the HTML element to act on (aka grounding).

3. Execute the action on the identified HTML element.
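The three problems above can be read as three stages of a single loop. The following is a minimal sketch of that loop, not the design of any of the cited papers. The names `Action`, `Step`, and `run_agent`, and the signatures of the `observe`, `plan`, `ground`, and `execute` callables, are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """A planned browsing action (hypothetical structure)."""
    kind: str                     # e.g. "click", "scroll", "type"
    target_description: str       # natural-language description of the element
    text: Optional[str] = None    # text to enter, for "type" actions


@dataclass
class Step:
    """One completed iteration: what was seen, what was done, and where."""
    observation: str              # e.g. simplified HTML or a screenshot reference
    action: Action
    element_id: str


def run_agent(
    task: str,
    observe: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[Action]],
    ground: Callable[[Action, str], Optional[str]],
    execute: Callable[[Action, str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Run the plan -> ground -> execute loop until the planner stops."""
    history: List[Step] = []
    for _ in range(max_steps):
        observation = observe()
        # Problem 1: plan the next action from the history and observation.
        action = plan(task, history, observation)
        if action is None:        # the planner signals that the task is done
            break
        # Problem 2: ground the action to a concrete element.
        element_id = ground(action, observation)
        if element_id is None:    # grounding failed; stop rather than guess
            break
        # Problem 3: execute the action on the grounded element.
        execute(action, element_id)
        history.append(Step(observation, action, element_id))
    return history
```

In a real agent, `plan` and `ground` would be backed by a model, and `execute` by a browser API or a UI test automation library. Keeping them as separate callables makes it possible to measure planning errors and grounding errors independently.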

Existing solutions

The last problem is primarily solved with first-party browser APIs or third-party UI test automation libraries, which are outside the scope of this article. The first two problems require AI. Large Language Models (LLMs), such as the GPT family, are increasingly used to navigate web pages.

For example, in [1] a transformer model built on top of T5 is trained with both visual and text tokens to plan the next action given the current context and history.

Image from “Multimodal Web Navigation with Instruction-Finetuned Foundation Models”

The performance of GPT-4V on planning and grounding has been studied in [5].

Image from “GPT-4V(ision) is a Generalist Web Agent, if Grounded”. Both SeeAct-Oracle and SeeAct-Choices use GPT-4V.

Challenges

As mentioned in [4] and [5], the main challenge for existing web agent solutions is the low success rate of grounding (the second problem above). The gap between SeeAct-Oracle and SeeAct-Choices shown in the image above is mainly due to grounding errors (SeeAct-Oracle assumes perfect grounding and thus serves as an upper bound).

In this article, we won't delve deeply into the various grounding methods. However, we will offer a concise overview to shed light on why mastering these methods presents significant challenges.

Vision-based grounding methods overlay annotations on the screenshot image to link regions of the image to HTML elements. These annotations can obscure key information on the page, which causes errors.
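To make the occlusion problem concrete, here is a small sketch of one way to estimate how much of an element is hidden by an annotation label drawn on the screenshot. This is an illustration only and is not taken from the cited papers. The `Box` type, the `occluded_fraction` function, and the pixel coordinate convention (origin at the top left) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screenshot pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        # Inverted or empty boxes count as zero area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes; the result has zero area if they are disjoint."""
    return Box(
        max(a.left, b.left),
        max(a.top, b.top),
        min(a.right, b.right),
        min(a.bottom, b.bottom),
    )


def occluded_fraction(element: Box, label: Box) -> float:
    """Fraction of the element's area that is covered by one label."""
    if element.area == 0:
        return 0.0
    return intersection(element, label).area / element.area


# A 100x20 button with a 25x20 label drawn over its left edge.
button = Box(0, 0, 100, 20)
label = Box(0, 0, 25, 20)
print(occluded_fraction(button, label))  # 0.25
```

A quarter of a button being covered may be harmless, but the same label drawn over a short text link or an icon can hide the only content that tells the model what the element does.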

Text-based grounding methods require HTML attributes as input for inference. However, these attributes are often missing, shared across multiple elements, or changed after inference (for example, by a page reload or by a new element being added to the page).
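The attribute problem can be illustrated with a small sketch that uses only the Python standard library. It is a toy matcher written for this article, not the grounding method of any cited paper. The names `ElementCollector`, `candidates`, and `ground`, and the sample `PAGE`, are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ElementCollector(HTMLParser):
    """Collects the tag name and attributes of every start tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag}
        # Valueless attributes (e.g. "disabled") arrive as None; store "".
        element.update({name: value or "" for name, value in attrs})
        self.elements.append(element)


def candidates(html: str, predicted: Dict[str, str]) -> List[Dict[str, str]]:
    """Return every element that matches all of the predicted attributes."""
    collector = ElementCollector()
    collector.feed(html)
    return [
        el for el in collector.elements
        if all(el.get(name) == value for name, value in predicted.items())
    ]


def ground(html: str, predicted: Dict[str, str]) -> Dict[str, str]:
    """Succeed only when the predicted attributes identify exactly one element."""
    matches = candidates(html, predicted)
    if len(matches) != 1:
        raise LookupError(f"expected 1 match, found {len(matches)}")
    return matches[0]


PAGE = """
<button class="btn">Add to cart</button>
<button class="btn">Buy now</button>
<input id="email" type="text">
"""

print(len(candidates(PAGE, {"class": "btn"})))  # 2: the attribute is shared
print(ground(PAGE, {"id": "email"})["tag"])     # input: a unique match
```

Here `{"class": "btn"}` matches two buttons, so `ground` raises `LookupError` instead of picking one arbitrarily. The same failure occurs with zero matches when an attribute is missing or has changed after a page reload.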