AI Interactivity (Part I): AI Agents and Multimodal Agents

Jackson Chen, Tensility Intern and MBA candidate at Northwestern University Kellogg School of Management
Armando Pauker, Managing Director at Tensility Venture Partners
Wayne Boulais, Managing Director at Tensility Venture Partners

At Tensility Venture Partners, we have reviewed hundreds of AI startup pitches in the past six months. Notably, AI startups are expanding beyond small-scale machine learning, which relies on proprietary data for training and testing, to utilizing pretrained large language models (LLMs) for complex business tasks. For instance, LLMs are now being applied to summarize and generate reports, emails, and customer communications, marking a significant shift toward more practical AI. Within this shift, we've identified two emerging frameworks: AI agents, which ingest data and perform tasks autonomously, and AI copilots, which guide users as they work.

In our previous blog series, “Small Models, Big Impact,” we explored the shift towards smaller language models, highlighting significant developments in AI language model infrastructure. This two-part exploration delves into AI interactivity, examining its role in executing dedicated tasks and how advanced applications like multimodal agents, multi-agent systems, and AI copilots are tackling increasingly complex challenges. This first post covers AI agents, with a special focus on multimodal agents capable of processing diverse data types (e.g., photos and text) to inform decision-making.

What is an AI Agent?

An AI agent is a computational entity designed to act independently[1][2]. It performs specific tasks autonomously by making decisions based on its environment, inputs, and a predefined goal. What separates an AI agent from an AI model is the ability to act. There are many kinds of agents, such as reactive agents and proactive agents, and agents can operate in fixed or dynamic environments[3]. More sophisticated applications involve agents that handle data in multiple formats, known as multimodal agents, and systems that deploy several agents to tackle complex problems (covered in Part II). We have seen many potential applications of multimodal agents and multi-agent systems, so we want to explore them more closely, focusing on how they work and how they show up in real-world use.

What is a Multimodal Agent?

Multimodal agents are AI constructs capable of understanding and analyzing data in multiple modalities, such as text, images, audio, and video, leading to more refined and accurate outputs than unimodal systems can produce[4][5]. These agents can also generate outputs that span these modalities, which makes them especially valuable in complex scenarios where integrating diverse data types is crucial to the accuracy of the application. For instance, you could train a model to associate specific pictures with certain sounds using both image and audio datasets, or, as demonstrated by OpenAI's recent introduction of Sora, create videos from text. Likewise, you could ask a model to combine a text description and an audio file to generate an image that represents both. Multimodal agents thus extend the reach of AI applications, allowing them to interact with and understand more of the physical world than text or images alone allow.

A multimodal agent might receive text inputs from human users, utilize various tools to analyze data in formats such as text, audio, or images, and then produce outputs in text. This ability allows the agent to provide detailed and accurate responses by drawing on a wide range of information sources[6].

How AI Agents and Multimodal Agents Work

Typically, AI agents use large language models (LLMs), such as GPT-4, to autonomously complete tasks through a five-step process:

1. Generate a Goal: Human users input a prompt as the objective or goal, prompting the AI agent to start its process. The agent submits the prompt to the core LLM (e.g., GPT-4) and interprets the first output as its internal monologue, demonstrating its understanding of the required task.

2. Create a Task List: The agent formulates a task list based on the goal, prioritizing tasks to efficiently achieve the objective.

3. Gather Information: The AI agent employs various "agent tools" to collect information from other AI models (e.g., see "Architecture Pattern: Multimodal Agent" from AWS below), the internet, or databases.

4. Store Data: The AI agent stores collected data in “agent memory”, which is crucial for processing and decision-making. This memory is divided into short-term and long-term[7]:

a. Short-term memory: records the agent's immediate inference process for answering user questions.

b. Long-term memory: records comprehensive interaction histories with users over extended periods.

5. Gather Feedback: As the AI agent progresses with tasks, it gathers feedback from multiple sources, such as other agents (in a multi-agent system), its internal monologue, or external databases, to evaluate its proximity to the goal.

Based on the feedback, the AI agent iterates the above steps until the task is completed.
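To make the loop concrete, here is a minimal Python sketch of the five steps. Every name in it (call_llm, run_tool, the memory lists) is a hypothetical placeholder rather than the API of a specific agent framework; real agent frameworks implement the same cycle with considerably more machinery around planning, tool selection, and memory.

```python
# Illustrative sketch of the five-step agent loop described above.
# All names here (call_llm, run_tool, etc.) are hypothetical placeholders,
# not the API of any particular agent framework.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the core LLM (e.g., GPT-4 via an API)."""
    raise NotImplementedError

def run_tool(task: str, context: list[str]) -> str:
    """Placeholder for an 'agent tool': web search, database query, or another model."""
    raise NotImplementedError

def run_agent(goal: str, max_iterations: int = 5) -> str:
    short_term_memory: list[str] = []   # immediate inference trace for this run
    long_term_memory: list[str] = []    # interaction history persisted across runs

    # 1. Generate a goal: interpret the user's prompt as an internal monologue.
    monologue = call_llm(f"Restate this goal and how you plan to approach it: {goal}")
    short_term_memory.append(monologue)

    for _ in range(max_iterations):
        # 2. Create a task list: plan and prioritize the next tasks.
        plan = call_llm(
            f"Goal: {goal}\nNotes so far: {short_term_memory}\n"
            "List the next tasks, one per line, highest priority first."
        )
        tasks = [line.strip() for line in plan.splitlines() if line.strip()]

        # 3. Gather information with agent tools, and
        # 4. store the results in agent memory.
        for task in tasks:
            result = run_tool(task, short_term_memory)
            short_term_memory.append(f"{task} -> {result}")

        # 5. Gather feedback: check how close the agent is to the goal.
        feedback = call_llm(
            f"Goal: {goal}\nProgress: {short_term_memory}\n"
            "Reply DONE if the goal is met, otherwise CONTINUE."
        )
        if feedback.strip().startswith("DONE"):
            break

    long_term_memory.extend(short_term_memory)  # persist the interaction history
    return call_llm(f"Give the final answer to: {goal}\nUsing these notes: {short_term_memory}")
```

The key point of the sketch is that planning, tool use, memory, and feedback are separate concerns the agent cycles through until the goal is met.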

Extending to multimodal agents

Multimodal agents work much like unimodal AI agents but stand out in two key ways. First, the single LLM at the core of a typical agent is generalized to a set of foundation models that process and produce outputs in different formats, such as GPT-3 for text, DALL-E for images, and WaveNet for audio. Each of these models is specialized in understanding and generating content in its own modality. Second, multimodal agents are equipped with agent tools designed to interpret and analyze data across these formats. This layer of software provides a task-focused framework for tackling complex problems, enabling more sophisticated and innovative applications.

The following graphic from "Architecture Pattern: Multimodal Agent" by AWS presents a generalized architecture for a multimodal agent, the more complex counterpart to the unimodal agent. The process begins with a user or event initiating a query (or objective). The agent then orchestrates task processing by selecting the appropriate foundation model for each data type and employing tools tailored to that type to gather and analyze the necessary information, whatever its format. Memory functions help the agent stay focused and improve its performance over time. The agent iterates this process until the goal is reached, delivering a response upon task completion.
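As a rough illustration of the orchestration step in this pattern, the sketch below routes each input to a foundation model chosen by modality and collects the findings for the agent's planning loop. The registry and analysis functions are invented placeholders for illustration, not the AWS reference implementation.

```python
# Hypothetical sketch of modality-based routing inside a multimodal agent.
# The analysis functions stand in for real foundation-model calls; they are
# placeholders for illustration, not the AWS reference architecture.

from dataclasses import dataclass

@dataclass
class InputItem:
    modality: str   # "text", "image", or "audio"
    payload: bytes  # raw content (text encoded as bytes, image bytes, audio bytes)

def analyze_text(payload: bytes) -> str:
    return "summary of the text"        # placeholder for a text LLM call

def analyze_image(payload: bytes) -> str:
    return "description of the image"   # placeholder for a vision model call

def analyze_audio(payload: bytes) -> str:
    return "transcript of the audio"    # placeholder for a speech model call

# Each modality maps to the foundation model (or tool) best suited to it.
MODEL_REGISTRY = {
    "text": analyze_text,
    "image": analyze_image,
    "audio": analyze_audio,
}

def route_by_modality(items: list[InputItem]) -> list[str]:
    """Send each input to the model for its modality and collect the findings."""
    findings = []
    for item in items:
        analyze = MODEL_REGISTRY.get(item.modality)
        if analyze is None:
            raise ValueError(f"No model registered for modality: {item.modality}")
        findings.append(analyze(item.payload))
    return findings  # the agent's planning loop reasons over these findings
```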

Microsoft’s “composable diffusion” (CoDi) model is a fun example of a model that can serve as the core of a multimodal agent: it can generate any combination of output modalities from any mixture of input modalities[8]. The graphic below shows CoDi generating video and audio from text, image, and audio inputs. The input modalities are listed vertically on the left: the text “Teddy bear on a skateboard, 4k”, a picture of Times Square, and the waveform of rain ambience. The output is a video with sound in which a teddy bear skateboards in the rain on a Times Square street, accompanied by the synchronized sounds of skateboarding and rain.
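To show what an any-to-any interface might look like in code, here is a hypothetical wrapper around a CoDi-style model that reproduces the example above. The generate_any_to_any function and its arguments are invented for illustration and do not reflect CoDi's actual API.

```python
# Hypothetical interface for a CoDi-style any-to-any generator.
# This is not CoDi's real API; it only illustrates mapping any mix of input
# modalities to any requested mix of output modalities.

def generate_any_to_any(inputs: dict[str, str], output_modalities: list[str]) -> dict[str, str]:
    """Placeholder: a composable-diffusion model would jointly condition on all inputs."""
    return {modality: f"<generated {modality}>" for modality in output_modalities}

# The blog's example: text + image + audio in, synchronized video + audio out.
outputs = generate_any_to_any(
    inputs={
        "text": "Teddy bear on a skateboard, 4k",
        "image": "times_square.png",      # path to the conditioning image
        "audio": "rain_ambience.wav",     # path to the conditioning audio
    },
    output_modalities=["video", "audio"],
)
# 'outputs' would hold the generated video clip and its matching soundtrack.
```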

Applications of AI Agents and Multimodal Agents

The characteristics of autonomy and independence make AI agents incredibly useful across various scenarios. For example:

1. Autonomous robots or vehicles equipped with AI can navigate and perform tasks with minimal human intervention, enhancing efficiency in environments like Amazon's warehouses.

2. In gaming, AI has progressed from simple computer adversaries to complex agents capable of challenging and even surpassing human skill, as demonstrated when Deep Blue defeated the world chess champion and AlphaGo defeated the world Go champion.

As for multimodal agents, their ability to comprehend data and generate outputs in multiple modalities broadens their applications:

1. Financial Industry: In the financial industry, multimodal data, including news, social media, earnings call recordings, and graphical charts, plays a vital role. Financial organizations can generate, collect, and use this data to gain insights into operations and make informed decisions. An example from the AWS Machine Learning Blog illustrates how a financial analyst might use multimodal agents for quantitative analysis, processing data in multiple modalities to produce more accurate financial analyses.

2. Healthcare Industry[9]: In the healthcare industry, the surge in accessible biomedical data and affordable genome sequencing has paved the way for multimodal AI solutions to enhance our understanding of human health and disease. In a joint paper developed by Yale, Harvard, and the Scripps Research Translational Institute, the authors discuss how combining data from different layers of biological information—such as genomics, proteomics, and metabolomics—can enable more accurate and individualized diagnoses, prognoses, and treatments of diseases.

Acosta, J.N., Falcone, G.J., Rajpurkar, P. et al. Multimodal biomedical AI. Nat Med 28, 1773–1784 (2022)

One of Tensility’s portfolio companies, Boston-based Pepper Bio, exemplifies the use of multimodal data in combating cancer, neurodegenerative diseases, and inflammatory diseases. Their machine learning technology gathers and combines insights from the entire cellular spectrum, including DNA, RNA, proteins, and phosphorylated proteins, to enhance the drug discovery and development process.

Coming up next: Multi-agent Systems and AI Copilots

We have discussed AI agents and multimodal agents, from their definitions to their architectures and applications. The upcoming post will focus on multi-agent systems (MAS) and the increasingly popular AI copilots. These topics promise to further expand our understanding of AI's capabilities and applications. Stay tuned for the next part of our series, where we'll review the complexities and innovations driving these cutting-edge AI interactivity technologies forward.

References

  1. What is an AI agent? | Zapier

  2. AI Agents: Types, Benefits and Examples | Yellow.ai

  3. AI-For-Beginners/lessons/6-Other/23-MultiagentSystems/README.md at main · microsoft/AI-For-Beginners | GitHub

  4. What Is Multimodal AI? | How-To Geek (howtogeek.com)

  5. Multimodal AI Explained | Splunk

  6. Generative AI and multi-modal agents in AWS: The key to unlocking new value in financial markets | AWS Machine Learning Blog (amazon.com)

  7. Introduction to LLM Agents | NVIDIA Technical Blog

  8. Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation | Microsoft Research

  9. Multimodal AI for medicine, simplified | Eric Topol (substack.com)
