ProMSA: Smart AI Search Agent For Visual Question Answering

Imagine you see a photo of a strange bird and ask, "Where is this bird a migratory species?" You would probably look at the bird, try to identify it, then search Wikipedia for its migration patterns. That is simple enough for humans but hard for AI.

Most AI models can look at images, and many can search the internet. Combining the two is a different challenge: the model has to know when to search, what to search for, and when to stop. A new paper called ProMSA tackles this problem head-on, reporting 52.2% accuracy on E-VQA and 53.4% on InfoSeek.

Accepted at ECCV 2026, ProMSA introduces a smart AI agent that progressively searches for knowledge — switching between image search and text search — until it finds enough evidence to answer a question about an image. Let’s break down how it works.

What Is the Problem?

Knowledge-Based Visual Question Answering (KB-VQA) is exactly what it sounds like: you show an AI an image, ask a question about it, and the AI needs to use external knowledge (like Wikipedia) to answer correctly.

Here is the catch. AI is great at recognizing objects in images — it can tell you that a picture contains a chair, a bird, or a building. But it does not automatically know that the bird in the photo is an Ixobrychus minutus (a little bittern), or that this particular chair style originated in mid-century Scandinavia. For that kind of information, the AI needs to search external sources.

Most existing systems follow a fixed pipeline: search once, then answer. The problem? If the first search returns the wrong result, the AI is stuck. It cannot recover, try a different search, or combine evidence from multiple pages. Think of it like a student who only visits one website for a research paper — if that site has bad information, the entire answer falls apart.

What KB-VQA really requires is both image understanding AND smart external knowledge retrieval. The AI needs to be a good researcher, not just a good viewer.

How ProMSA Solves It

The core idea behind ProMSA is surprisingly simple and elegant: let the AI agent decide for itself what to search and when to stop.

Instead of a fixed search-then-answer pipeline, ProMSA runs a progressive search loop. At every step, the agent looks at the evidence it has gathered so far and picks one of three actions:

Image search — reverse image lookup to identify unknown entities in the photo
Text search — query Wikipedia with a rewritten text question to find specific facts
Stop — the agent has enough evidence and is ready to answer

The agent can switch between these actions multiple times. It might start with an image search to identify the bird, then do a text search to find its migration range, then realize the information is incomplete, and search again with a different query — this time excluding the results it already checked.

This is like a smart researcher who knows when to keep digging and when to stop. If the first search does not help, they try again with a different angle, rather than giving up or repeating the same search.

How the Agent Thinks

Let me walk you through what happens when ProMSA processes a question, step by step.

Step 1: Look and read. The agent receives an image and a question. It first examines the image to understand what it shows, and reads the question to figure out what information is needed.

Step 2: Decide what to search. Does the agent know what the image shows? If not, it fires off an image search — a reverse image lookup against a Wikipedia knowledge base. If it already knows the entity but needs more facts, it uses a text search with a carefully rewritten query.

Step 3: Evaluate results. After each search, the agent checks the returned information. Are these results helpful? Do they answer the question?

Step 4: Search again if needed. If the evidence is not sufficient, the agent searches again — but this time with a twist. It keeps track of what it has already seen and excludes old results, so it surfaces new candidates instead of repeating the same information.

Step 5: Answer. Once the agent has gathered enough evidence, it stops searching and generates a confident answer.

The whole process is bounded by a budget: up to 3 image searches and 3 text searches per question, with a maximum of 7 interaction steps during training. This keeps things efficient while still giving the agent flexibility to recover from bad searches.
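
The five steps and the budget above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the `State` class, the `Policy` interface, the action names, and the stub search functions are all hypothetical, and only the three limits (3 image searches, 3 text searches, 7 steps) come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_IMAGE_SEARCHES = 3  # per-question budget stated in the article
MAX_TEXT_SEARCHES = 3   # per-question budget stated in the article
MAX_STEPS = 7           # interaction-step cap stated in the article


@dataclass
class State:
    """Everything the agent has gathered so far (hypothetical structure)."""
    question: str
    evidence: List[str] = field(default_factory=list)
    image_searches: int = 0
    text_searches: int = 0


# A policy maps the current state to (action, query). In ProMSA this role is
# played by the trained model; here it is just a function (assumption).
Policy = Callable[[State], Tuple[str, str]]


def run_agent(question: str, policy: Policy,
              image_search: Callable[[str], str],
              text_search: Callable[[str], str]) -> State:
    state = State(question)
    for _ in range(MAX_STEPS):
        action, query = policy(state)
        if action == "image_search" and state.image_searches < MAX_IMAGE_SEARCHES:
            state.evidence.append(image_search(query))
            state.image_searches += 1
        elif action == "text_search" and state.text_searches < MAX_TEXT_SEARCHES:
            state.evidence.append(text_search(query))
            state.text_searches += 1
        else:
            # Either the policy chose "stop" or the requested tool is out of budget.
            break
    return state


def toy_policy(state: State) -> Tuple[str, str]:
    """Rule-based stand-in: identify the entity, look up one fact, then stop."""
    if state.image_searches == 0:
        return "image_search", "input photo"
    if state.text_searches == 0:
        return "text_search", "little bittern migration range"
    return "stop", ""


final_state = run_agent(
    "Where is this bird a migratory species?",
    toy_policy,
    image_search=lambda q: "entity: little bittern",
    text_search=lambda q: f"stub passage for: {q}",
)
print(final_state.evidence)
```

Treating an over-budget request as a stop is one design choice among several; a real system could instead report the refusal to the model and let it pick another action.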

Why ProMSA Is Smarter

Four design choices set ProMSA apart from fixed-pipeline approaches:

Budget-Aware Searching

The agent has a limited number of searches it can perform. This forces it to search smartly rather than blindly throwing queries at the search engine. The training includes a penalty for using too many searches, so the model learns to stop as soon as it has enough evidence.

Deduplication

One of the most practical innovations is the exclusion list. When the agent searches and gets results, it remembers them. If it searches again, it tells the search engine to skip anything it has already seen. This prevents the frustrating loop of getting the same useless results over and over.
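
The exclusion list can be illustrated with a toy retriever. The three-entry corpus, the word-overlap scoring, and the `search` signature below are assumptions made for the example; the article does not describe how the real search service ranks or filters results.

```python
from typing import Dict, List, Set

# Toy corpus standing in for a Wikipedia index (hypothetical data).
CORPUS: Dict[str, str] = {
    "Little bittern": "The little bittern is a small migratory heron.",
    "Heron": "Herons are long-legged wading birds.",
    "Bird migration": "Bird migration is the seasonal movement of birds.",
}


def search(query: str, exclude: Set[str], top_k: int = 1) -> List[str]:
    """Rank page titles by word overlap with the query, skipping excluded titles."""
    query_words = set(query.lower().split())

    def score(title: str) -> int:
        text = (title + " " + CORPUS[title]).lower().replace(".", "")
        return len(query_words & set(text.split()))

    candidates = [title for title in CORPUS if title not in exclude]
    return sorted(candidates, key=score, reverse=True)[:top_k]


seen: Set[str] = set()

first = search("migratory heron bird", exclude=seen)
seen.update(first)  # remember what was already returned

# Same query, but the earlier hit is excluded, so a new candidate surfaces.
second = search("migratory heron bird", exclude=seen)
print(first, second)
```

Without the `exclude` set, the second call would return the same top result as the first, which is exactly the loop the deduplication mechanism is meant to avoid.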

TN-GSPO: A New RL Algorithm

This is where things get technically interesting. ProMSA uses a new reinforcement learning objective called TN-GSPO (Tool-horizon-Normalized GSPO). Sequence-level objectives such as GSPO normalize updates by the length of the generated text. But in a search agent, what matters is not only how long the text is, but also how many tool calls the agent makes.

TN-GSPO normalizes by both generation length AND tool-interaction depth. This leads to more stable training and better search policies. The reward combines answer correctness with format validity, minus a penalty for excessive tool usage.
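
The reward described above (correctness plus format validity, minus a tool-usage penalty) might look roughly like the sketch below. The weights, the number of penalty-free calls, and the function name are illustrative assumptions; the article does not give the actual coefficients, and the TN-GSPO normalization itself is not reproduced here.

```python
def trajectory_reward(answer_correct: bool,
                      format_valid: bool,
                      num_tool_calls: int,
                      format_weight: float = 0.1,      # assumed value
                      free_calls: int = 2,             # assumed value
                      penalty_per_call: float = 0.05,  # assumed value
                      ) -> float:
    """Correctness + format bonus - penalty for tool calls beyond `free_calls`."""
    reward = 1.0 if answer_correct else 0.0
    if format_valid:
        reward += format_weight
    excess_calls = max(0, num_tool_calls - free_calls)
    reward -= penalty_per_call * excess_calls
    return reward


# Two correct, well-formatted trajectories: the one with fewer tool calls scores higher.
frugal = trajectory_reward(answer_correct=True, format_valid=True, num_tool_calls=2)
wasteful = trajectory_reward(answer_correct=True, format_valid=True, num_tool_calls=6)
print(frugal > wasteful)
```

Under a reward of this shape, two trajectories that reach the same correct answer are ranked by how many searches they needed, which is what pushes the policy to stop early.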

Two-Stage Training

ProMSA does not jump straight into reinforcement learning. It first goes through a cold-start SFT stage (Supervised Fine-Tuning) using rejection sampling. This teaches the model the correct format: how to structure its thinking, when to issue tool calls, and how to format answers. Only after mastering the basics does it move to RL to learn the optimal search strategy.
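
Rejection sampling for the cold-start stage can be sketched as: sample several trajectories per question, and keep only those that are well formatted and end in the correct answer. The `<answer>` tag format, the exact-match check, and the toy generator below are assumptions for illustration; the article does not specify the real output format or filtering criteria.

```python
import itertools
import re
from typing import Callable, List, Tuple

# Hypothetical answer format; the real tags used by ProMSA may differ.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>")


def is_valid_format(trajectory: str) -> bool:
    return ANSWER_PATTERN.search(trajectory) is not None


def is_correct(trajectory: str, gold: str) -> bool:
    match = ANSWER_PATTERN.search(trajectory)
    return match is not None and match.group(1).strip().lower() == gold.lower()


def rejection_sample(dataset: List[Tuple[str, str]],
                     generate: Callable[[str], str],
                     samples_per_question: int = 4) -> List[Tuple[str, str]]:
    """Keep the first sampled trajectory per question that passes both checks."""
    kept = []
    for question, gold in dataset:
        for _ in range(samples_per_question):
            trajectory = generate(question)
            if is_valid_format(trajectory) and is_correct(trajectory, gold):
                kept.append((question, trajectory))
                break
    return kept


# Toy generator that cycles through fixed outputs instead of calling a model.
_outputs = itertools.cycle(["Paris", "<answer>Rome</answer>", "<answer>Paris</answer>"])


def toy_generate(question: str) -> str:
    return next(_outputs)


sft_data = rejection_sample([("Capital of France?", "Paris")], toy_generate)
print(sft_data)
```

The surviving pairs would then serve as supervised fine-tuning examples, so the model sees only trajectories that are both correctly structured and correct.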

Outstanding Results

The numbers speak for themselves:

E-VQA benchmark: 52.2% accuracy (up from a ~42% baseline, an improvement of roughly 10 percentage points)
InfoSeek benchmark: 53.4% accuracy (best result on this benchmark)
Training progression: Base model scored 35.1% → Cold-Start SFT brought it to 40.7% → RL with TN-GSPO pushed it to 53.0%
Generalization: Also shows strong performance on OK-VQA, demonstrating the approach transfers well to new datasets

What is particularly striking is how much the RL training helps. Going from the SFT checkpoint at 40.7% to the final 53.0% is a gain of 12.3 points, more than double the 5.6 points that cold-start SFT adds over the base model. Learning the search strategy matters at least as much as learning the search format.

The TN-GSPO objective also outperforms other RL approaches: compared to GRPO (44.2%) and standard GSPO (49.3%), TN-GSPO achieves 52.6% on E-VQA. That is 8.4 points above GRPO and 3.3 points above standard GSPO, a clear improvement from normalizing by tool depth.

Impact and Future Directions

ProMSA is released as open source under the Apache 2.0 license. The full code, including training scripts, evaluation tools, and service architectures, is available on GitHub:

GitHub: https://github.com/DingWu1021/Promsa

arXiv: https://arxiv.org/abs/2606.27974

The practical applications are exciting. Imagine AI assistants that can look at a photo you take and intelligently search for relevant information — not just once, but iteratively, building up knowledge until they can give you a thorough answer. Think of search engines that truly understand images and can retrieve the right facts, or educational tools that help students explore visual topics with deep, sourced knowledge.

Looking ahead, the progressive search framework opens doors to multi-hop reasoning — where the answer to one search becomes the query for the next — and real-time search capabilities. The budget-aware, modular design also makes it straightforward to plug in new tools beyond image and text search.

Key Takeaways

ProMSA is a progressive multimodal search agent for knowledge-based visual question answering
It lets the AI decide for itself whether to search by image, search by text, or stop and answer
The deduplication mechanism prevents redundant searches and helps recover from wrong initial searches
TN-GSPO is a novel RL objective that normalizes by tool-interaction depth, leading to better search policies
Two-stage training (SFT cold start + RL) teaches both format and strategy
Achieves 52.2% on E-VQA (roughly 10 points above the ~42% baseline) and 53.4% on InfoSeek (the best result on that benchmark)
Fully open source and built on top of excellent tools like veRL, Search-R1, and Qwen models

ProMSA shows that the future of visual AI is not just about seeing — it is about knowing when and how to search. And that makes all the difference.
