Thoughts on padding images of different sizes for VisionTransformer? #1609

priyamtejaswin · 2022-12-31T12:51:49Z

Hello,

I have a problem where I am trying to preserve the image aspect ratio (as best as possible).

The shorter side of each image is resampled to 384 (so I can use vit_base_patch32_384), and the longer side has an upper limit of 640.

Consider a batch with two images,

batch = [
    img_0,  # (384 x 384)
    img_1,  # (384 x 640)
]

I know I can pad img_0 to match the dimensions of img_1.

Questions

Is there some way for the VisionTransformer to ignore the padded pixels?
Ignoring all padding might be impossible at times, since the patches have a fixed size. But could ViT ignore patches which only contain padding?

samils7 · 2023-01-07T05:50:05Z

Hi, I had the same questions for a while and here are my thoughts.

We know where we are padding in every image, so we also know how many padded patches will be created and where they are.
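To make this concrete, here is a minimal sketch in plain Python of how the padding-only patches could be located. It assumes the padding is added only on the bottom and right, a 32-pixel patch (as in vit_base_patch32_384), and row-major patch order. The function name `padding_only_patches` is made up for illustration and is not part of timm or transformers.

```python
import math


def padding_only_patches(content_h, content_w, padded_h, padded_w, patch=32):
    """Return a row-major list of bools, True where a patch holds only padding.

    Assumes the original image sits in the top-left corner and the padding
    was added on the bottom and right edges.
    """
    if padded_h % patch or padded_w % patch:
        raise ValueError("padded size must be a multiple of the patch size")
    rows, cols = padded_h // patch, padded_w // patch
    # Patches with index >= ceil(content / patch) contain no image pixels.
    first_pad_row = math.ceil(content_h / patch)
    first_pad_col = math.ceil(content_w / patch)
    return [
        r >= first_pad_row or c >= first_pad_col
        for r in range(rows)
        for c in range(cols)
    ]


# img_0 from the question: 384 x 384 content padded to 384 x 640.
mask = padding_only_patches(384, 384, 384, 640)
print(len(mask), sum(mask))  # 240 patches in total, 96 of them padding-only
```

A patch that straddles the image edge (for example when the content width is 500) is reported as False, since it still contains real pixels.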

Hugging Face transformers has a bool_masked_pos parameter in the forward function of the ViT embeddings. I think we can use this masking on the padded patches.

if bool_masked_pos is not None:
    seq_length = embeddings.shape[1]
    mask_tokens = self.mask_token.expand(batch_size, seq_length, -1)
    # replace the masked visual tokens by mask_tokens
    mask = bool_masked_pos.unsqueeze(-1).type_as(mask_tokens)
    embeddings = embeddings * (1.0 - mask) + mask_tokens * mask
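One caveat: the snippet above swaps the selected patch embeddings for a learned mask token, so those positions still take part in attention rather than being skipped. A common way to make attention actually ignore padded patches is to mask their scores before the softmax. The sketch below shows only that idea in plain Python. `masked_softmax` is a hypothetical helper, not an API of timm or transformers, and whether a given ViT implementation accepts such a mask is something to check separately.

```python
import math


def masked_softmax(scores, ignore):
    """Softmax over attention scores, giving zero weight to ignored keys.

    scores: list of floats, one raw attention score per key (patch).
    ignore: list of bools, True for keys that should be skipped (padding).
    """
    kept = [s for s, skip in zip(scores, ignore) if not skip]
    if not kept:
        raise ValueError("at least one key must be kept")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [0.0 if skip else math.exp(s - peak) for s, skip in zip(scores, ignore)]
    total = sum(exps)
    return [e / total for e in exps]


# One query attending over four patches; the last two are padding-only.
weights = masked_softmax([0.5, 1.5, 3.0, 3.0], [False, False, True, True])
print(weights)
```

The padded keys end up with a weight of exactly zero, and the remaining weights still sum to one, so the real patches are normalized only among themselves.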

rwightman · 2023-01-09T17:07:14Z

Maintainer

While you could use padding/masking here, as you say, it's not likely it'd line up exactly on patch boundaries. Especially for the 384x384 models, and the larger sizes for the few that have them, squash crop (distorting the aspect ratio) actually works quite well and is often better.
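As a rough illustration of the alignment point, the sketch below counts how many possible longer-side lengths fall exactly on a patch boundary. It assumes the numbers from the original question (32-pixel patches, longer side anywhere from 384 to 640 pixels); the variable names are illustrative only.

```python
PATCH = 32
MIN_SIDE, MAX_SIDE = 384, 640  # range of the longer side, from the question

lengths = range(MIN_SIDE, MAX_SIDE + 1)
aligned = [n for n in lengths if n % PATCH == 0]

print(len(aligned), "of", len(lengths), "lengths are multiples of", PATCH)

# For any other length, the last patch that contains image pixels also
# contains some padding, e.g. a 500-pixel side:
leftover = 500 % PATCH             # image pixels in the partly filled patch
print(leftover, PATCH - leftover)  # image pixels vs. padded pixels
```

Under these assumptions only a small fraction of lengths align, so most padded images would have a column of patches that mixes real pixels and padding.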
