
Imagenet1k benchmark #57


Closed · wants to merge 1 commit

Conversation
Conversation

ellonde
Contributor

@ellonde ellonde commented Sep 13, 2023

A small script to set up ImageNet1k for the benchmark.

Minor changes are made to the downloaded dataset: there are duplicated classes, and some class names contain "/", which is replaced with "or".
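For illustration, a minimal sketch of that kind of cleanup (the exact replacement string and dedup rule are assumptions, not necessarily the script's actual logic):

```python
# Hypothetical sketch of the class-name cleanup described above: replace "/"
# (which would otherwise break per-class paths) and drop duplicated classes.
def clean_classnames(raw_names):
    seen, cleaned = set(), []
    for name in raw_names:
        name = name.replace("/", "or")  # exact replacement string is an assumption
        if name not in seen:            # keep only the first occurrence of duplicates
            seen.add(name)
            cleaned.append(name)
    return cleaned
```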

I looked at the clip-benchmark package; they average the vectors of a set of template strings into which the class name is inserted. I wrote some C++ code for that as well but did not include it here. Let me know if you want to have a look at it, perhaps in another PR.
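For reference, a minimal, implementation-agnostic sketch of that template-averaging scheme; `encode_text` is a placeholder for whichever text encoder is used, and the two templates are just examples (OpenAI's ImageNet prompt set has 80 of them):

```python
import numpy as np

# Placeholder for the real text encoder (open_clip, clip.cpp bindings, ...).
def encode_text(prompt: str) -> np.ndarray:
    raise NotImplementedError

TEMPLATES = ["a photo of a {}.", "a bad photo of a {}."]  # example templates only

def class_embedding(classname: str) -> np.ndarray:
    # Embed each templated prompt, L2-normalize, average, then re-normalize.
    vecs = [encode_text(t.format(classname)) for t in TEMPLATES]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)
```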

I have run this on the default setup described in the README, and the performance is not looking as expected.
As mentioned in another issue, the performance difference could be due to a different tokenization strategy. As a quick test I changed the benchmark script to call the Python tokenizer instead, and I only get slightly better results.
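One way to isolate the tokenizer as a variable is to dump token IDs from the Python side and diff them against what the C++ tokenizer produces; a small sketch using open_clip (the prompt is arbitrary):

```python
import open_clip

# Dump token IDs from the reference Python tokenizer so they can be diffed
# against the IDs the clip.cpp tokenizer produces for the same prompt.
tokenizer = open_clip.get_tokenizer("ViT-B-32")
ids = tokenizer(["a photo of a goldfish."])  # (1, context_length) tensor, zero-padded
print(ids[0].tolist())
```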

Here are the results (top-1 / top-5 accuracy):

With the Python tokenizer:
| total | 0.1102 | 0.1600 |

With the C++ tokenizer:
| total | 0.0800 | 0.1182 |

As mentioned, the model I ran this with was the CLIP-ViT-B-32-laion2B-s34B-b79K fp16 model.

Regardless, I think the benchmark dataset is useful to have for validating the performance hit taken by different quantization strategies.

@monatis
Owner

monatis commented Sep 14, 2023

Wow! Do you have any info about those scores for the original PyTorch implementation? And how long does it take to run inference over the entire ImageNet1k dataset with clip.cpp?

Regardless of that, I was not aware that tokenization makes such a big difference. I should definitely have a look at it.

@ellonde
Contributor Author

ellonde commented Sep 14, 2023

I have not run the original implementation myself, but this model is reported to have 66.6% top-1 performance in the open-clip repo. That is, I'm assuming, an fp32 version of the model.

I only have the timings from the hacky Python-tokenizer version, which is very slow at tokenization. Processing the images is of course most of the work if you use the C++ tokenizer.

- 998 texts encoded in 3211832.00 ms ( 3218.27 ms per text)
- 47904 images encoded in 5594410.00 ms (  116.78 ms per image)

@monatis
Owner

monatis commented Sep 14, 2023

> this model is reported to have 66.6% top-1 performance in the open-clip repo.

I think this is a big margin that cannot be attributed to fp16 vs. fp32 models. Could you post the code for calculating the metrics you reported?
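For comparison, a minimal sketch of how top-1/top-5 accuracy is usually computed from image-to-class similarity scores (plain numpy, independent of either implementation):

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    # logits: (n_images, n_classes) similarity scores; labels: (n_images,) class indices.
    topk = np.argsort(-logits, axis=1)[:, :k]     # indices of the k highest-scoring classes
    hits = (topk == labels[:, None]).any(axis=1)  # True if the true label is among them
    return float(hits.mean())

# acc1 = topk_accuracy(logits, labels, 1)
# acc5 = topk_accuracy(logits, labels, 5)
```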

@ellonde
Contributor Author

ellonde commented Sep 14, 2023

I completely agree!

I used the benchmark script that you provide, which I trusted blindly 😅 I haven't looked through it to verify that it's computing the metrics correctly.

I also don't think averaging the class templates as text targets can account for this big a difference either.

@monatis
Owner

monatis commented Sep 14, 2023

There should be a difference between the test protocols, because vectors from this implementation are comparable to those from the original implementation: I verified it during development, and it's also confirmed by community feedback. I'd like to hold this PR until I better understand the difference between the metrics.
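A quick, protocol-independent sanity check here is to compare embeddings from the two implementations directly; a sketch assuming two hypothetical wrappers `embed_cpp` and `embed_torch` that return a vector for the same input:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# embed_cpp / embed_torch are hypothetical wrappers around clip.cpp and the
# original PyTorch model; a cosine similarity close to 1.0 means the two
# implementations produce effectively the same vector for the same input.
# sim = cosine(embed_cpp(image), embed_torch(image))
```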

@monatis monatis marked this pull request as draft September 14, 2023 18:47
@skirodev

I found issues with the batch inference implementation in ggml_conv_2d, so I submitted PR #528 to fix it.

Here are the benchmark results using the fixed code and the ggml_laion_clip-vit-b-32-laion2b-s34b-b79k f16 model on the ImageNet1k dataset.

Original code:
| total | 0.0805 | 0.1185 |

Fixed code:
| total | 0.3137 | 0.4582 |

@monatis
Owner

monatis commented Sep 18, 2023

@skirodev I'll track your branch until it's merged into master. I was not aware that there was a regression there. Thanks.

@ellonde Now that we know the reason for that large margin, I think we can merge this. There's still a margin between clip.cpp and the scores reported by OpenCLIP, but this can be attributed to their different test protocol plus tokenization and image preprocessing divergences in this repo. I will fix the second and third items soon, but the first one is just for research-grade purposes, and I believe we can skip it for now.
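On the preprocessing point, the reference OpenCLIP pipeline for this model resizes with bicubic interpolation, center-crops to 224, and normalizes with the CLIP mean/std; a torchvision sketch of that reference pipeline, useful as a baseline when checking for divergences (exact parameters assumed from the standard CLIP preprocessing):

```python
from torchvision import transforms

# Reference CLIP-style preprocessing for a ViT-B/32 model: bicubic resize,
# 224x224 center crop, and the standard CLIP normalization constants.
preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```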

I'll handle this PR in #72 because I couldn't push to your branch.

@monatis monatis closed this Sep 18, 2023