Take the Essence and Discard the Dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models
- [02/23/2025]: 🔨 The latest website and ranking algorithms fix some minor errors in the feasibility rank plot (Figure 5) and feasibility rank table (Appendix Table 5). Check out the updated arXiv version here.
- [02/08/2025]: 🎉🎉🎉 Our paper has been accepted at NAACL 2025! The full paper is available here.
- [01/26/2025]: 👋 Our TEDD-Ranker visualization is now available! Try it out with your own data selection methods at TEDD-Ranker!
***Quality matters more than quantity!***
Data selection for fine-tuning large language models has been a hot topic, with various methods proposed over the last few years. For anyone interested in the field or wishing to develop new methods, two natural questions arise: what are the existing methods, and how good are they?
Our work takes a retrospective look at a dozen key data selection techniques for fine-tuning LLMs, and introduces the following:
- A novel three-stage scheme, comprising feature extraction, criteria design, and selector evaluation, which systematically categorizes and evaluates these methods.
- A unified comparison approach that incorporates ratio-based efficiency and ranking-based feasibility metrics to address inconsistencies across evaluation settings.
TL;DR:
***Methods emphasizing more targeted quality measurement achieve higher efficiency, but at the cost of feasibility.*** -- *the authors*
The frameworks introduced above allow us to obtain a quantitative Efficiency Rank Plot and a qualitative Feasibility Rank Plot.
Integrating the detailed method analysis with these ranking results, we discuss the main trends from the perspectives of Candidate Dataset, Quality Measurement, and Selected Features, as well as future challenges and directions.
We provide an interactive visualization of how we compute and visualize the efficiency and feasibility of data selection methods for easy comparison at: 🔗 TEDD-Ranker Visualization
- Efficiency Rank: Performance Improvement Ratio (PIR) vs. Selected Dataset Fraction (SDF).
- Feasibility Rank: Simplicity and flexibility of each method.
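To make the efficiency ranking concrete, here is a minimal sketch of how PIR and SDF might be computed and used to rank methods. Note that the exact metric definitions and ranking procedure are specified in the paper; the formulas, tie-breaking rule, and numbers below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: the precise PIR/SDF definitions live in the paper;
# the formulas and tie-breaking below are assumptions for demonstration.

def performance_improvement_ratio(perf_selected: float, perf_full: float) -> float:
    """PIR (assumed form): performance after fine-tuning on the selected
    subset, relative to fine-tuning on the full candidate dataset."""
    return perf_selected / perf_full

def selected_dataset_fraction(n_selected: int, n_total: int) -> float:
    """SDF: fraction of the candidate dataset kept by the selector."""
    return n_selected / n_total

def efficiency_rank(methods: dict[str, tuple[float, float]]) -> list[str]:
    """Rank methods by PIR (higher is better), breaking ties by smaller SDF.
    `methods` maps a method name to its (PIR, SDF) pair."""
    return sorted(methods, key=lambda name: (-methods[name][0], methods[name][1]))

# Hypothetical benchmark scores for two selectors, each keeping 10% of the data:
methods = {
    "random":        (performance_improvement_ratio(0.95, 1.00), 0.10),
    "quality-score": (performance_improvement_ratio(1.02, 1.00), 0.10),
}
print(efficiency_rank(methods))  # → ['quality-score', 'random']
```

A method with PIR above 1 at a small SDF (beating full-data fine-tuning with a fraction of the data) would sit at the top of the efficiency rank; the interactive TEDD-Ranker page visualizes this trade-off directly.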
Note: The feasibility rank table and feasibility rank plot contained minor errors in the original version; these are corrected in the latest arXiv update and on the TEDD-Ranker website.
- Error Corrections: Our feasibility rank plot (Appendix Figure 5) had minor ranking errors in early versions. The website and the latest arXiv version are now correct [02/23/2025].
- Ongoing Updates: TEDD-Ranker is evolving. We welcome feedback and will update rankings with new datasets/methods.
- Contact for Fixes: If you spot any inconsistencies, feel free to email [email protected] or [email protected]. Confirmed errors will be corrected and updated.
We are grateful to all the researchers who have been studying data selection methods and striving to build ever better algorithms!
@article{liu2024take,
  title={Take the Essence and Discard the Dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models},
  author={Liu, Ziche and Ke, Rui and Jiang, Feng and Li, Haizhou},
  journal={arXiv preprint arXiv:2406.14115},
  year={2024}
}