What is the specific design of the few-shot evaluation of the base model?

#6
by zwq2018 - opened

Could you tell me how the few-shot examples are selected?

Also, what is the prompt format when testing few-shot examples?

I tested the base model on TextVQA, also with 8 shots, and my score seems to be more than 20 points lower than the one reported in the paper.

HuggingFaceM4 org

The few-shot examples are randomly selected.

Which version of transformers are you using?

Version 4.40 works, but there is currently a bug in the newest versions that makes the generations unrelated to the images.
Could you try with this version?
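If you want to check this programmatically, here is a minimal sanity-check sketch (the 4.40.2 patch version is just an example; any 4.40.x release should behave the same according to the above):

import transformers
from packaging import version

# According to the maintainers, Idefics2 generations become unrelated to the
# images with transformers releases newer than 4.40 (caching bug), so pin the
# version before running the evaluation.
v = version.parse(transformers.__version__)
assert version.parse("4.40.0") <= v < version.parse("4.41.0"), (
    f"transformers=={transformers.__version__} may ignore the images for Idefics2; "
    "consider `pip install transformers==4.40.2`."
)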

I checked: my transformers version is 4.42.4. Is this version okay?
I randomly selected 8 examples from the test set before the evaluation and placed them in the prompt as 8-shot examples; every test sample used these same 8 shots.
My result is only 26.5 on TextVQA, which is much lower than the 57.9 reported in the paper.
The following is the format of my prompt:

prompt = f'\nBased on the picture: <image>, {question} Short answer: {labeled_answer}###'

I do not know why the score is so low.
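For reference, the full 8-shot prompt was assembled roughly like this (a sketch; the test-set loading and field names are placeholders):

import random

# Placeholder for the real TextVQA test set: a list of dicts with
# "question" and "answer" fields.
test_set = [{"question": "what is written on the sign?", "answer": "stop"}] * 100

# Sample the 8 shots once, then reuse them for every test sample.
random.seed(0)
shots = random.sample(test_set, 8)

few_shot_block = "".join(
    f"\nBased on the picture: <image>, {s['question']} Short answer: {s['answer']}###"
    for s in shots
)

def make_prompt(question):
    # The query follows the same template, with the answer left blank for generation.
    return few_shot_block + f"\nBased on the picture: <image>, {question} Short answer:"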

HuggingFaceM4 org

No, this version is unfortunately not OK.
A new caching strategy was implemented in transformers, and it broke the modeling of Idefics2.
It makes the generations unrelated to the images (which is why you have a low score), only related to the text.
Please try version 4.40, which should be OK; you should see a big boost in your scores.

HuggingFaceM4 org

Your prompt looks correct too. The prompt will influence the scores, but by just a few points.

OK, I will try version 4.40 again. I hope it reaches the expected score.

Also, does this mean that the model I trained with transformers 4.42 is completely useless?

HuggingFaceM4 org

This I'm not sure about...

But the good news is that there is now a fix: https://github.com/huggingface/transformers/pull/32275.
So you should first try with that branch of transformers to see whether your model works well or not.

Now I am very confused.
I downgraded transformers back to 4.40. A manual inspection showed that the model was working and that its generations were related to the input image.
However, my results on TextVQA, VQAv2 and OK-VQA still show a large gap compared to the paper.
I also found that the format of the prompt has a significant impact: in the 8-shot setting, a slight change in the prompt leads to a significant score difference.

May I ask if you can provide your evaluation script?

Thank you very much.

HuggingFaceM4 org
edited Jul 30

We used this for the prompt

{
    "prefix": "{bos_token}Instruction: provide an answer to the question. Use the image to answer.\n",
    "example": "Image:<image>Question: {question} Answer: {answer}\n",
},

We didn't optimize the prompt; there might be better or worse ones.

The prefix is only added before the first shot, not before every shot. The template for each shot is given by "example".
Instead of a single <image>, we insert the sequence of <fake_token_around_image> and <image> tokens:
it should be <fake_token_around_image><image>...<image>(64 times)<fake_token_around_image>.
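Putting those rules together, an N-shot prompt could be assembled from the two templates roughly like this (a sketch; the helper below is illustrative, not the actual evaluation code):

# <image> in the "example" template is expanded into the full token pattern:
# <fake_token_around_image> + 64 x <image> + <fake_token_around_image>.
IMAGE_SEQ = "<fake_token_around_image>" + "<image>" * 64 + "<fake_token_around_image>"

PREFIX = "{bos_token}Instruction: provide an answer to the question. Use the image to answer.\n"
EXAMPLE = "Image:{image}Question: {question} Answer: {answer}\n"

def build_prompt(bos_token, shots, query_question):
    # `shots` is a list of (question, answer) pairs, one image per shot.
    prompt = PREFIX.format(bos_token=bos_token)  # prefix only before the first shot
    for question, answer in shots:
        prompt += EXAMPLE.format(image=IMAGE_SEQ, question=question, answer=answer)
    # The query repeats the shot template, leaving the answer open for generation.
    prompt += f"Image:{IMAGE_SEQ}Question: {query_question} Answer:"
    return prompt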

You can provide your script

Wow, it works well.
Thank you very much, sincerely.

Furthermore, I also want to ask: if I use the base model (idefics2-8b-base) for continued pre-training, does my training prompt also need to be set up like this? I mean, when an image appears in the pre-training corpus, do I also need to add "Image: " before it like this?

For example, say one of my pre-training samples consists of image1, image2, image3 followed by the text "I like this model because it ....".

It turns into:

<fake_token_around_image >< image >..(64 times)..<image><fake_token_around_image>< image >..(64 times)..<image><fake_token_around_image>< image >..(64 times)..<image><fake_token_around_image> I like this model because .....

Do I need to add the identifier "Image:"?

Image: <fake_token_around_image >< image >..(64 times)..<image><fake_token_around_image>< image >..(64 times)..<image><fake_token_around_image>< image >..(64 times)..<image><fake_token_around_image> I like this model because .....

Or is there a better format?

Thank you for your help, sincerely

HuggingFaceM4 org
edited Aug 1

So you obtained a good score on TextVQA?

"image" in the pre-training corpus, do I also need to add "Image: " like this?

No, you can use them however you want. Wherever an image appears, or wherever you want to position one, you put the pattern <fake_token_around_image><image>...<image>(64 times)<fake_token_around_image> in your prompt.

Note that this pattern is slightly adapted when two or more images are consecutive: the consecutive <fake_token_around_image> tokens are not duplicated in that case (as in the example you sent, which is correct apart from the two spaces in < image >, which should be <image>).

If you want to continue the training, you should then follow the same pattern, but you can interleave the images with the text in any way.
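In code, that pattern, including the rule for consecutive images, could look like this (a sketch; the helper name is made up):

# Token pattern for n consecutive images: the <fake_token_around_image>
# tokens between adjacent images are shared rather than duplicated.
def image_sequence(n_images, tokens_per_image=64):
    fake = "<fake_token_around_image>"
    image_tokens = "<image>" * tokens_per_image
    return fake + (image_tokens + fake) * n_images

# image_sequence(1) -> <fake_token_around_image> + 64 x <image> + <fake_token_around_image>
# image_sequence(3) -> four <fake_token_around_image> tokens in total:
# one at each end and one between each pair of images, as in the example above.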

HuggingFaceM4 org

Don't hesitate to ask if you have more questions.

Yeah, we achieved the same performance as the paper on TextVQA and VQAv2.
Thank you for being so helpful.

Do you mean that it is not necessary to add "Image:" deliberately?
That is, just use
<fake_token_around_image><image>..(64 times)<fake_token_around_image> directly,
and there is no need to add Image: before the visual tokens, as in
Image: <fake_token_around_image><image>..(64 times)<fake_token_around_image>

Is that right?

Besides, I am rather puzzled: if I want to continue the pre-training, what ratio of text tokens to image tokens would be most appropriate?

HuggingFaceM4 org
edited Aug 5

Yes, exactly: it's not needed to use Image:.
You can simply put the pattern <fake_token_around_image><image>..(64 times)<fake_token_around_image> wherever you want in your text prompt.
I imagine some prompts are better than others, but for image-text pairs, for example, the model was trained on data formulated as <fake_token_around_image><image>..(64 times)<fake_token_around_image>The image caption here, so there is no real reason why adding Image: would help, if it helps at all.
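As an illustration, a caption-style sample would then be formatted like this (a sketch; the caption text is made up):

# Image-caption pair without any "Image:" prefix, following the pattern above.
IMAGE_SEQ = "<fake_token_around_image>" + "<image>" * 64 + "<fake_token_around_image>"
caption = "A small wooden boat on a calm lake at sunrise."  # made-up caption
sample = IMAGE_SEQ + caption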

Besides, I am rather puzzled: if I want to continue the pre-training, what ratio of text tokens to image tokens would be most appropriate?

I think it really depends on your task. It's OK to have an imbalanced distribution between the number of text tokens and image tokens.
In some scenarios (long captions, plus few visual tokens per image, like the 64 we chose), you can have many more text tokens than image tokens.
Conversely, with short captions or many tokens per image (for example, with the image-splitting strategy; see the paper for an explanation), you could have many more visual tokens than text tokens. Since the loss is only computed on text tokens, make sure the batch size is large enough in that case.

OK. Thank you for your detailed and kind help.

zwq2018 changed discussion status to closed
