IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

Jiarui Xu¹, Yossi Gandelsman², Amir Bar²,³, Jianwei Yang⁴,
Jianfeng Gao⁴, Trevor Darrell², Xiaolong Wang¹

¹UC San Diego, ²UC Berkeley, ³Tel Aviv University, ⁴Microsoft Research

Our model learns in context to solve computer vision tasks: given a multimodal prompt, it inpaints the masked area with the task solution.

Abstract

In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. “Left: input image, Right: foreground segmentation”), a few input-output visual examples, or both, the model in-context learns to solve the task for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a large-scale captioned image-text dataset. At inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks: more than 10% higher AP on Foreground Segmentation, more than 5% higher AP on Single Object Detection, and almost 20% lower LPIPS on Colorization. Our empirical results suggest that vision and language prompts are complementary, and that using both gives better in-context learning performance.
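The inference setup described in the abstract (image examples plus a masked region that the model inpaints) can be illustrated with a minimal sketch. It only shows how a visual prompt canvas and its mask might be assembled. The 2x2 grid layout, the function name `build_prompt_canvas`, and the grayscale list-of-rows image format are assumptions made for illustration and are not taken from the paper. The inpainting model itself is not shown.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def build_prompt_canvas(
    example_input: Image,
    example_output: Image,
    query_input: Image,
    fill_value: float = 0.0,
) -> Tuple[Image, List[List[bool]]]:
    """Assemble a 2x2 visual prompt (assumed layout, for illustration only).

    Top row:    example input  | example output
    Bottom row: query input    | masked cell to be inpainted
    Returns the canvas and a boolean mask that is True where the model
    would be asked to inpaint.
    """
    images = (example_input, example_output, query_input)
    h, w = len(query_input), len(query_input[0])
    if any(len(img) != h or any(len(row) != w for row in img) for img in images):
        raise ValueError("all images must share the same height and width")

    # Top half: the solved example, input on the left and output on the right.
    canvas = [list(a) + list(b) for a, b in zip(example_input, example_output)]
    # Bottom half: the query on the left, a placeholder on the right.
    canvas += [list(q) + [fill_value] * w for q in query_input]

    # Only the bottom-right cell is marked for inpainting.
    mask = [[False] * (2 * w) for _ in range(h)]
    mask += [[False] * w + [True] * w for _ in range(h)]
    return canvas, mask
```

In this sketch, a text prompt such as the one quoted in the abstract would be passed to the model alongside the canvas and mask, and the pixels generated in the masked cell would be read out as the task solution.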

Qualitative Results

For each vision task X, we evaluate our model in two directions: image-to-X and X-to-image.

Image -> Segmentation

Text prompt of the first example: "Left - input image, right - Semantic segmentation of save your life Grace..."

Image -> Edge

Text prompt of the first example: "Left - input image, right - Edge map of Small house..."

Image -> Depth

Text prompt of the first example: "Left - input image, right - Depth map in grayscale of THE CURATOR Carl by artist..."

Image -> Normal

Text prompt of the first example: "Left - input image, right - Normal map of ArtStation - Izumi Tanaka, Space Fligh..."

Segmentation -> Image

Text prompt of the first example: "Right - output image of famous historic bell tower at the reschenpass - italy..."

Edge -> Image

Text prompt of the first example: "Right - output image of THE CURATOR Carl by artist Phil Paradise was painted in..."

Depth -> Image

Text prompt of the first example: "Right - output image of left image, left - Depth map of righ..."

Normal -> Image

Text prompt of the first example: "Right - output image of sandstorm in desert and hiking man,illustration,digital..."
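The text prompts listed above appear to follow two patterns, one per direction. The sketch below formats prompts in that style. The templates are inferred from the displayed examples and are not a confirmed specification of how the prompts were generated. The names `TASK_PHRASES`, `image_to_x_prompt`, and `x_to_image_prompt` are hypothetical. The Depth -> Image example shown above deviates from the simple pattern, and this sketch does not try to reproduce it.

```python
# Task phrases as they appear in the example prompts above.
TASK_PHRASES = {
    "segmentation": "Semantic segmentation",
    "edge": "Edge map",
    "depth": "Depth map in grayscale",
    "normal": "Normal map",
}


def image_to_x_prompt(task: str, caption: str) -> str:
    """Format a prompt for the image-to-X direction (inferred template)."""
    return f"Left - input image, right - {TASK_PHRASES[task]} of {caption}"


def x_to_image_prompt(caption: str) -> str:
    """Format a prompt for the X-to-image direction (inferred template)."""
    return f"Right - output image of {caption}"


if __name__ == "__main__":
    print(image_to_x_prompt("edge", "Small house"))
    print(x_to_image_prompt("sandstorm in desert and hiking man"))
```

The caption argument stands in for the image caption that follows the task phrase in each example; the captions on this page are shown truncated, so the full strings are not reproduced here.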