Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, and perform complex reasoning about images. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed with large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word produced by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captioning derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome.
We propose the Pixel-Aligned Language Model (PixelLLM) to equip large language models with localization capability. The model is pre-trained on localized image captioning data, where each word is labeled with a pixel location, to learn the alignment between words and image pixels. PixelLLM can be applied to various localization tasks, for example location-conditioned captioning when taking a location as input, and referring localization when generating locations as output.
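To make the pixel-word alignment concrete, below is a toy Python sketch of what a localized caption could look like during pre-training. The field names, file name, words, and coordinates are illustrative assumptions, not the actual dataset schema.

# A toy sketch of pixel-word-aligned captioning data: each caption word carries
# an (x, y) pixel location. Field names and values are illustrative only.
localized_caption = {
    "image": "example.jpg",  # hypothetical file name
    "caption": ["a", "dog", "running", "on", "the", "beach"],
    "points": [(12, 40), (55, 62), (80, 70), (110, 75), (130, 78), (160, 90)],  # one (x, y) per word
}

# The model is trained to predict both the next word and its pixel location.
for word, (x, y) in zip(localized_caption["caption"], localized_caption["points"]):
    print(f"{word:>8s} -> pixel ({x}, {y})")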
We first encode the input location prompt (a global box prompt in this case) and the input image with the prompt encoder \(\mathcal{P}\) and the image encoder \(\mathcal{V}\), respectively. We then feed the prompt feature \(\mathbf{l}\) and the image feature \(\mathbf{f}\) into the prompt feature extractor to obtain location-specific visual features \(\mathbf{f_l}\). The large language model \(\mathcal{L}\) then auto-regressively predicts the next text token conditioned on the previous text tokens and the visual features. We apply a simple MLP on the token features, before the vocabulary mapping layer of the LLM, to predict the coordinates of each text token. The alignment between the caption and the trace is indicated by the color gradient.
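To make this data flow concrete, here is a minimal PyTorch sketch of the forward pass described above. Every module is a small stand-in (linear layers for the image encoder \(\mathcal{V}\) and prompt encoder \(\mathcal{P}\), a single decoder layer for the LLM \(\mathcal{L}\)), and the feature width, patch layout, and vocabulary size are assumptions for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn

D = 256        # shared feature width (assumed)
VOCAB = 32000  # LLM vocabulary size (assumed)

class PixelLLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 16 * 16, D)    # stand-in for V (e.g. a ViT over 16x16 patches)
        self.prompt_encoder = nn.Linear(4, D)             # stand-in for P, encodes a box (x1, y1, x2, y2)
        self.prompt_feature_extractor = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.llm = nn.TransformerDecoderLayer(D, nhead=4, batch_first=True)  # stand-in for L
        self.to_vocab = nn.Linear(D, VOCAB)               # vocabulary mapping layer: next-token logits
        self.to_xy = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, 2))  # MLP regressing (x, y) per token

    def forward(self, image_patches, box, text_embeddings):
        f = self.image_encoder(image_patches)             # image features f: (B, N, D)
        l = self.prompt_encoder(box).unsqueeze(1)         # prompt feature l: (B, 1, D)
        f_l, _ = self.prompt_feature_extractor(l, f, f)   # location-specific visual features f_l
        h = self.llm(text_embeddings, f_l)                # token features (causal mask omitted for brevity)
        return self.to_vocab(h), self.to_xy(h)            # next-token logits and per-token pixel coordinates

model = PixelLLMSketch()
logits, coords = model(
    torch.randn(1, 196, 3 * 16 * 16),       # a 14x14 grid of flattened 16x16 RGB patches
    torch.tensor([[0.0, 0.0, 1.0, 1.0]]),   # global box prompt covering the whole image
    torch.randn(1, 12, D),                  # embeddings of 12 caption tokens
)
print(logits.shape, coords.shape)           # torch.Size([1, 12, 32000]) torch.Size([1, 12, 2])

In such a setup, the logits would be supervised with the caption words and the coordinate head with the per-word pixel locations from the localized captioning data, which is the sense in which the model learns pixel-word alignment.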
@article{xu2023pixel,
author = {Xu, Jiarui and Zhou, Xingyi and Yan, Shen and Gu, Xiuye and Arnab, Anurag and Sun, Chen and Wang, Xiaolong and Schmid, Cordelia},
title = {{Pixel Aligned Language Models}},
journal = {arXiv preprint arXiv:2312.09237},
year = {2023},
}