Welcome! This repository accompanies the paper On the Perception Bottleneck of VLMs for Chart Understanding.
This repository provides implementations for training and evaluating CLIP and LLaVA models on chart understanding tasks. Specifically, it includes:
- CLIP Training: Training scripts for CLIP with and without hard negative captions.
- CLIP Evaluation: Code for evaluating CLIP on various chart-related datasets.
- LLaVA Training: Training scripts for LLaVA-v1.5-13B and LLaVA-Phi.
- LLaVA Evaluation: Evaluation scripts for LLaVA on multiple chart benchmarks.
- CLIP Learning Data: Data for contrastive learning of CLIP on chart tasks.
Detailed instructions for setting up the environment are provided in `config_env.md`.
We use the open_clip repository for CLIP training. The source code is available in the `open_clip` directory.

Example training script: `example_scripts/train_openclip.sh`.
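As a reference for the data format, open_clip's CSV loader reads one image path and one caption per row (its default column names are `filepath` and `title`, tab-separated). The sketch below shows one way to convert chart image–caption pairs into that format; the input file `data/chart_captions.json` and its schema are hypothetical placeholders, not files shipped with this repository.

```python
# Minimal sketch: write a tab-separated (image path, caption) file in the
# column layout expected by open_clip's CSV dataset loader.
# "data/chart_captions.json" and its record schema are hypothetical.
import csv
import json

with open("data/chart_captions.json") as f:        # hypothetical input file
    records = json.load(f)                          # e.g. [{"image": ..., "caption": ...}, ...]

with open("data/chart_train.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filepath", "title"])          # open_clip's default CSV column names
    for r in records:
        writer.writerow([r["image"], r["caption"]])
```

The resulting file can then be passed to the training entry point via `--train-data`; see `example_scripts/train_openclip.sh` for the full set of arguments we use.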
For NegCLIP training, we build upon the neg_clip repository, modifying it to support multi-GPU training. The modified code is in the `neg_clip` directory.

Example NegCLIP training script: `example_scripts/train_negclip.sh`.
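For intuition, the NegCLIP recipe augments the standard CLIP contrastive loss with hard-negative captions that are appended to the text batch as extra, never-matching candidates. The snippet below is an illustrative sketch of that idea in plain PyTorch, not necessarily the exact loss implemented in `neg_clip`; tensor names and shapes are ours.

```python
# Illustrative NegCLIP-style loss: each image has one positive caption plus
# hard-negative captions that enlarge the image->text candidate set.
import torch
import torch.nn.functional as F

def negclip_loss(image_feats, text_feats, hard_neg_feats, logit_scale):
    """image_feats: (B, D); text_feats: (B, D) positive captions;
    hard_neg_feats: (B*K, D) hard-negative captions; all L2-normalized."""
    all_text = torch.cat([text_feats, hard_neg_feats], dim=0)     # (B + B*K, D)
    logits_per_image = logit_scale * image_feats @ all_text.t()   # (B, B + B*K)
    logits_per_text = logit_scale * text_feats @ image_feats.t()  # (B, B)
    labels = torch.arange(image_feats.size(0), device=image_feats.device)
    # Image->text is computed over the enlarged candidate set (positives +
    # hard negatives); text->image uses only the positive captions.
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return 0.5 * (loss_i2t + loss_t2i)
```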
The evaluation code for CLIP is located in the `eval_clip` directory.

Example evaluation script: `example_scripts/eval_clip.sh`.
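The actual benchmarks and metrics live in `eval_clip`; the snippet below is only a minimal sketch of how a (fine-tuned) CLIP checkpoint can be scored against candidate chart captions with the open_clip API. The model name, checkpoint, image path, and captions are placeholders.

```python
# Minimal sketch: rank candidate captions for a chart image with open_clip.
import torch
import open_clip
from PIL import Image

# "ViT-L-14" / "openai" are placeholders; a fine-tuned checkpoint path can be
# passed to `pretrained` instead.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

image = preprocess(Image.open("example_chart.png")).unsqueeze(0)   # placeholder image
texts = tokenizer([
    "The blue bar is the tallest.",                                # placeholder captions
    "The red bar is the tallest.",
])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)  # probability assigned to each candidate caption
```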
We train two types of LLaVA models:
- LLaVA-v1.5-13B: Uses Vicuna-13B as the language model.
- LLaVA-Phi: Uses Phi-3-mini-4k-instruct as the language model.
LLaVA-v1.5-13B training is based on the LLaVA repository, while LLaVA-Phi training is based on the LLaVA-pp repository. In addition, we modify the training code so that the vision encoder can be unfrozen and tuned.
Example script for full LLaVA training: `example_scripts/train_full_llava.sh`.
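The vision-encoder unfreezing itself happens inside the modified training code; the sketch below only illustrates the general idea in plain PyTorch, with `model` standing in for whichever LLaVA model object the trainer builds, and the `"vision_tower"` parameter-name filter being an assumption based on LLaVA's usual naming.

```python
# Conceptual sketch of "unfreezing the vision encoder": mark the vision-tower
# parameters as trainable so the optimizer updates them together with the
# projector and the language model.
def unfreeze_vision_tower(model):
    n_unfrozen = 0
    for name, param in model.named_parameters():
        if "vision_tower" in name:      # CLIP vision-encoder weights (assumed naming)
            param.requires_grad = True
            n_unfrozen += 1
    print(f"Unfroze {n_unfrozen} vision-tower parameter tensors")
```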
LLaVA is evaluated on multiple chart-related benchmarks.
For FigureQA, DVQA, PlotQA, ChartQA, ChartBench, and ChartX, evaluation scripts are provided in `example_scripts/eval_llava.sh`.
For MathVista, evaluation scripts are provided in `example_scripts/eval_mathvista.sh`.
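For reference, chart-QA benchmarks such as ChartQA and PlotQA are commonly scored with a "relaxed accuracy" metric (numeric answers count as correct within a 5% relative tolerance, other answers require an exact match). The sketch below is a generic implementation of that metric; the repository's evaluation scripts may apply additional answer normalization.

```python
# Generic sketch of the relaxed-accuracy metric commonly used for
# ChartQA/PlotQA-style answers; not necessarily identical to the scoring
# implemented in the evaluation scripts of this repository.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    prediction, target = prediction.strip(), target.strip()
    try:
        pred_f = float(prediction.rstrip("%"))
        tgt_f = float(target.rstrip("%"))
        if tgt_f == 0.0:
            return pred_f == 0.0
        return abs(pred_f - tgt_f) / abs(tgt_f) <= tolerance
    except ValueError:
        # Non-numeric answers fall back to a case-insensitive exact match.
        return prediction.lower() == target.lower()

def relaxed_accuracy(predictions, targets):
    assert len(predictions) == len(targets)
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / max(len(targets), 1)
```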
| Model | Link |
|---|---|
| ChartCLIP | 🤗 |

| Dataset | Link |
|---|---|
| Vision4Chart | 🤗 |
If you find this work helpful, please cite it as:
@misc{liu2025perceptionbottleneckvlmschart,
      title={On the Perception Bottleneck of VLMs for Chart Understanding},
      author={Junteng Liu and Weihao Zeng and Xiwen Zhang and Yijun Wang and Zifei Shan and Junxian He},
      year={2025},
      eprint={2503.18435},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18435},
}