The process of using Attention Prompting on Image (API) in VQA involves two steps. First, an auxiliary LVLM is employed to generate a mask. Second, the mask is overlaid on the original image before inference. For instance, an auxiliary CLIP model can compute the similarity between each image patch and the query: patches with low similarity receive a heavier mask, while patches with high similarity remain unmasked. Such a mask serves as a visual cue that guides the VLM during inference, directing its attention to the regions of the image relevant to the question.
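To make the mask-generation step concrete, here is a minimal sketch that scores a grid of image crops against the query with Hugging Face's CLIP and darkens low-similarity patches. The grid size, similarity threshold, and blending strength (`grid`, `threshold`, `alpha`) are illustrative choices, not settings from the method itself, which may instead score CLIP patch tokens directly.

```python
# A minimal sketch of API-style mask generation, assuming crop-level CLIP
# similarity as a stand-in for patch-level scoring. Hyperparameters are
# illustrative, not the method's actual settings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def generate_attention_mask(image: Image.Image, query: str, grid: int = 6,
                            threshold: float = 0.5, alpha: float = 0.6) -> Image.Image:
    """Darken image patches whose CLIP similarity to the query falls
    below `threshold` (after min-max normalization over all patches)."""
    w, h = image.size
    pw, ph = w // grid, h // grid
    crops = [image.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
             for r in range(grid) for c in range(grid)]

    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Similarity between the query and every crop, normalized to [0, 1].
    sims = out.logits_per_text.squeeze(0)
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)

    masked = image.copy()
    dark = Image.new("RGB", (pw, ph), (0, 0, 0))
    for i, s in enumerate(sims.tolist()):
        if s < threshold:  # low relevance -> heavier mask
            r, c = divmod(i, grid)
            box = (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
            masked.paste(Image.blend(masked.crop(box), dark, alpha), box)
    return masked

# The masked image then replaces the original at inference time, e.g.:
#   answer = vlm.generate(image=generate_attention_mask(img, question),
#                         prompt=question)
# (`vlm.generate` is a hypothetical interface for the downstream VLM.)
```

Blending rather than fully blacking out low-similarity patches is a deliberate softening: it keeps global context visible while still steering the VLM's attention toward the relevant regions.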
Here is an example comparing our API method with naive VQA without prompting. The question in the example is particularly challenging, testing the VLM's abilities in both visual grounding and spatial attribute reasoning. The API-generated mask reduces the difficulty of the visual grounding task by highlighting the red bird mentioned in the query.