Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, and perform complex reasoning about images. However, it is still unclear how localization tasks, such as word grounding or referring localization, can be performed with large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word produced by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captioning derived from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome.
We propose the Pixel-Aligned Language Model (PixelLLM) to equip large language models with localization capability. The model is pre-trained on localized image captioning data, where each word is labeled with a pixel location, to learn the alignment between words and image pixels. PixelLLM can be applied to various localization tasks, for example location-conditioned captioning when taking a location as input, and referring localization when generating locations as output.
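To make the pixel-word alignment concrete, below is a toy Python sketch of what a localized caption could look like during pre-training. The field names, file name, words, and coordinates are illustrative assumptions, not the actual dataset schema.

# A toy sketch of pixel-word-aligned captioning data: each caption word carries
# an (x, y) pixel location. Field names and values are illustrative only.
localized_caption = {
    "image": "example.jpg",  # hypothetical file name
    "caption": ["a", "dog", "running", "on", "the", "beach"],
    "points": [(12, 40), (55, 62), (80, 70), (110, 75), (130, 78), (160, 90)],  # one (x, y) per word
}

# The model is trained to predict both the next word and its pixel location.
for word, (x, y) in zip(localized_caption["caption"], localized_caption["points"]):
    print(f"{word:>8s} -> pixel ({x}, {y})")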
We first encode the input location prompt (a global box prompt in this case) and the input image with the prompt encoder \(\mathcal{P}\) and the image encoder \(\mathcal{V}\), respectively. We then feed the prompt feature \(\mathbf{l}\) and the image feature \(\mathbf{f}\) into the prompt feature extractor to obtain location-specific visual features \(\mathbf{f_l}\). The large language model \(\mathcal{L}\) then auto-regressively predicts the next text token conditioned on the previous text tokens and the visual features. We apply a simple MLP on the token features, before the vocabulary mapping layer of the LLM, to predict the coordinates of each text token. The alignment between the caption and the trace is indicated by the color gradient.
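To make this data flow concrete, here is a minimal PyTorch sketch of the forward pass described above. Every module is a small stand-in (linear layers for the image encoder \(\mathcal{V}\) and prompt encoder \(\mathcal{P}\), a single decoder layer for the LLM \(\mathcal{L}\)), and the feature width, patch layout, and vocabulary size are assumptions for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn

D = 256        # shared feature width (assumed)
VOCAB = 32000  # LLM vocabulary size (assumed)

class PixelLLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 16 * 16, D)    # stand-in for V (e.g. a ViT over 16x16 patches)
        self.prompt_encoder = nn.Linear(4, D)             # stand-in for P, encodes a box (x1, y1, x2, y2)
        self.prompt_feature_extractor = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.llm = nn.TransformerDecoderLayer(D, nhead=4, batch_first=True)  # stand-in for L
        self.to_vocab = nn.Linear(D, VOCAB)               # vocabulary mapping layer: next-token logits
        self.to_xy = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, 2))  # MLP regressing (x, y) per token

    def forward(self, image_patches, box, text_embeddings):
        f = self.image_encoder(image_patches)             # image features f: (B, N, D)
        l = self.prompt_encoder(box).unsqueeze(1)         # prompt feature l: (B, 1, D)
        f_l, _ = self.prompt_feature_extractor(l, f, f)   # location-specific visual features f_l
        h = self.llm(text_embeddings, f_l)                # token features (causal mask omitted for brevity)
        return self.to_vocab(h), self.to_xy(h)            # next-token logits and per-token pixel coordinates

model = PixelLLMSketch()
logits, coords = model(
    torch.randn(1, 196, 3 * 16 * 16),       # a 14x14 grid of flattened 16x16 RGB patches
    torch.tensor([[0.0, 0.0, 1.0, 1.0]]),   # global box prompt covering the whole image
    torch.randn(1, 12, D),                  # embeddings of 12 caption tokens
)
print(logits.shape, coords.shape)           # torch.Size([1, 12, 32000]) torch.Size([1, 12, 2])

In such a setup, the logits would be supervised with the caption words and the coordinate head with the per-word pixel locations from the localized captioning data, which is the sense in which the model learns pixel-word alignment.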
@article{xu2023pixel,
author = {Xu, Jiarui and Zhou, Xingyi and Yan, Shen and Gu, Xiuye and Arnab, Anurag and Sun, Chen and Wang, Xiaolong and Schmid, Cordelia},
title = {{Pixel Aligned Language Models}},
journal = {arXiv preprint arXiv:2312.09237},
year = {2023},
}