The process of using Attention Prompting on Image (API) in VQA involves two steps. First, an auxiliary LVLM is employed to generate a mask. Second, the mask is overlaid on the original image before inference. For instance, an auxiliary CLIP model can compute the similarity between each image patch and the query: patches with low similarity receive a heavier mask, while patches with high similarity remain unmasked. Such a mask serves as a visual cue that guides the VLM during inference, directing its attention to the regions of the image relevant to the question.
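To make the mask-generation step concrete, here is a minimal sketch that scores a grid of image crops against the query with Hugging Face's CLIP and darkens low-similarity patches. The grid size, similarity threshold, and blending strength (`grid`, `threshold`, `alpha`) are illustrative choices, not settings from the method itself, which may instead score CLIP patch tokens directly.

```python
# A minimal sketch of API-style mask generation, assuming crop-level CLIP
# similarity as a stand-in for patch-level scoring. Hyperparameters are
# illustrative, not the method's actual settings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def generate_attention_mask(image: Image.Image, query: str, grid: int = 6,
                            threshold: float = 0.5, alpha: float = 0.6) -> Image.Image:
    """Darken image patches whose CLIP similarity to the query falls
    below `threshold` (after min-max normalization over all patches)."""
    w, h = image.size
    pw, ph = w // grid, h // grid
    crops = [image.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
             for r in range(grid) for c in range(grid)]

    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Similarity between the query and every crop, normalized to [0, 1].
    sims = out.logits_per_text.squeeze(0)
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)

    masked = image.copy()
    dark = Image.new("RGB", (pw, ph), (0, 0, 0))
    for i, s in enumerate(sims.tolist()):
        if s < threshold:  # low relevance -> heavier mask
            r, c = divmod(i, grid)
            box = (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
            masked.paste(Image.blend(masked.crop(box), dark, alpha), box)
    return masked

# The masked image then replaces the original at inference time, e.g.:
#   answer = vlm.generate(image=generate_attention_mask(img, question),
#                         prompt=question)
# (`vlm.generate` is a hypothetical interface for the downstream VLM.)
```

Blending rather than fully blacking out low-similarity patches is a deliberate softening: it keeps global context visible while still steering the VLM's attention toward the relevant regions.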
Here is an example comparing our API method with naive VQA without prompting. The question in the example is particularly challenging, testing the VLM's abilities in both visual grounding and spatial attribute reasoning. The API-generated mask reduces the difficulty of the visual grounding task by highlighting the red bird mentioned in the query.