
PlausibleQA: A Large-Scale QA Dataset with Answer Plausibility Scores


PlausibleQA is a large-scale QA dataset designed to evaluate and enhance the ability of Large Language Models (LLMs) to distinguish between correct and highly plausible incorrect answers. Unlike traditional QA datasets that focus primarily on correctness, PlausibleQA provides candidate answers annotated with plausibility scores and justifications.

🗂 Overview

📌 Key Statistics

  • 10,000 questions sourced from TriviaQA, Natural Questions (NQ), and WebQuestions (WebQ).
  • 100,000 candidate answers, each with plausibility scores (0–100).
  • 1,000,000 justifications explaining plausibility rankings.

🌟 What Makes PlausibleQA Unique?

  • Plausibility-Aware MCQA: Enables adaptive distractor selection based on difficulty (see the sketch after this list).
  • LLM Robustness Evaluation: Measures a model’s ability to reject misleading but plausible answers.
  • Pairwise Answer Comparisons: Provides structured ranking of incorrect answers to refine plausibility assessments.
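As an illustration of plausibility-aware distractor selection, the minimal sketch below picks distractors for a question by plausibility score. The field names used here (candidate_answers, answer, plausibility_score, is_correct) are assumptions for illustration and may differ from the released schema.

Python

# Minimal sketch of plausibility-aware distractor selection.
# ASSUMPTION: field names (candidate_answers, answer, plausibility_score, is_correct)
# are illustrative and may differ from the released JSON schema.
def select_distractors(entry, k=3, difficulty="hard"):
    """Pick k incorrect answers; plausibility scores (0-100) control difficulty."""
    incorrect = [c for c in entry["candidate_answers"] if not c["is_correct"]]
    # Most plausible first for hard items, least plausible first for easy ones.
    ranked = sorted(incorrect, key=lambda c: c["plausibility_score"],
                    reverse=(difficulty == "hard"))
    return [c["answer"] for c in ranked[:k]]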

🔑 Research Contributions

  1. Introduction of PlausibleQA:

    • First large-scale QA dataset with explicit plausibility scores for incorrect answers.
    • Comprises 10,000 questions, 100,000 candidate answers, and 1,000,000 justifications.
  2. New QA Benchmark for MCQA & QARA:

    • Multiple-Choice Question Answering (MCQA): Facilitates plausibility-aware distractor generation.
    • QA Robustness Assessment (QARA): Evaluates LLM resilience against plausible distractors.
  3. Plausibility Score Annotations:

    • Each answer is assigned a plausibility score ranging from 0 to 100.
    • Scores are derived from listwise ranking (direct plausibility assignment) refined by pairwise comparisons (see the illustrative sketch after this list).
    • Human evaluation confirms the reliability of the plausibility scores.
  4. Dataset Generation Pipeline:

    • Questions are sourced from TriviaQA, Natural Questions (NQ), and WebQuestions (WebQ).
    • LLaMA-3.3-70B generates 10 candidate answers per question.
    • Pairwise answer comparison is used to refine plausibility rankings.
    • Question & answer difficulty estimation is incorporated.
  5. Comprehensive Human Evaluation:

    • Conducted pairwise comparisons for candidate answers.
    • Showed high agreement with plausibility rankings.
    • Confirmed that plausibility-aware distractors are more effective than traditional random distractors.
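Contribution 3 notes that plausibility scores come from listwise ranking refined by pairwise comparisons. The sketch below shows one simple way to turn pairwise outcomes into 0–100 scores via win rates; it is an illustrative aggregation over a hypothetical input format, not necessarily the exact procedure used to build PlausibleQA.

Python

from collections import defaultdict

# Illustrative aggregation: convert pairwise comparison outcomes into 0-100 scores
# using win rates. NOT necessarily the exact method used to build PlausibleQA.
# `comparisons` is a hypothetical list of (winner, loser) answer pairs for one question.
def win_rate_scores(comparisons):
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {ans: round(100 * wins[ans] / games[ans]) for ans in games}

print(win_rate_scores([("Paris", "Lyon"), ("Paris", "Lyon"), ("Lyon", "Berlin")]))
# {'Paris': 100, 'Lyon': 33, 'Berlin': 0}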

📥 Dataset Download

The dataset is available on HuggingFace:

wget "https://huggingface.co/datasets/JamshidJDMY/PlausibleQA/resolve/main/test.json?download=true"
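Once downloaded, the split can be inspected with standard Python tooling. A minimal sketch, assuming the file is saved locally as test.json (wget's -O flag can be used to control the output filename) and that it parses into a list of question entries; check the printed output for the actual field names.

Python

import json

# Minimal sketch: load the downloaded split and peek at the first entry.
# ASSUMPTION: the file is saved as "test.json" and parses into a list of entries;
# inspect the printed output for the actual schema (question, candidate answers,
# plausibility scores, justifications, ...).
with open("test.json", encoding="utf-8") as f:
    data = json.load(f)

first = data[0] if isinstance(data, list) else next(iter(data.values()))
print(len(data))
print(json.dumps(first, indent=2)[:500])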

📂 Use Cases of PlausibleQA

  • Improving MCQA models:

    • Helps generate more realistic and challenging multiple-choice options.
    • Enables adaptive distractor selection based on difficulty.
  • Enhancing QA Robustness Assessment:

    • Provides structured evaluation of how well LLMs handle plausible distractors.
    • Can be used for adversarial QA evaluation.
  • Fine-tuning LLMs for Better Answer Differentiation:

    • Models can be trained to better distinguish between correct and plausible answers.
    • Useful for reducing hallucinations in generative AI.
  • Contrastive Learning & Negative Example Selection:

    • Helps contrastive learning tasks by using plausibility scores for better negative sample selection.
  • Automatic Hint Generation & Evaluation:

    • The entropy of plausibility scores can be used for question difficulty estimation (see the sketch after this list).
    • Can be integrated into educational tools for intelligent tutoring.
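For the last use case above, a minimal sketch of entropy-based difficulty estimation: normalize a question's plausibility scores into a distribution and take its Shannon entropy, so questions whose plausibility mass is spread over many candidates come out as harder. The score values below are invented for illustration.

Python

import math

# Minimal sketch: entropy of a question's plausibility scores as a difficulty signal.
# Higher entropy = plausibility spread across many candidates = harder question.
def plausibility_entropy(scores):
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)

easy = [95, 2, 1, 1, 1]      # one dominant answer -> low entropy (~0.38)
hard = [30, 25, 20, 15, 10]  # several plausible answers -> high entropy (~2.23)
print(round(plausibility_entropy(easy), 2), round(plausibility_entropy(hard), 2))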

📜 License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to use, share, and adapt the dataset with proper attribution.

📑 Citation

If you find this work useful, please cite 📜 our paper:

Plain

Mozafari, J., Abdallah, A., Piryani, B., & Jatowt, A. (2025). Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores. arXiv [cs.CL]. doi:10.48550/arXiv.2502.16358

BibTeX

@article{mozafari2025wronganswersusefulplausibleqa,
      title={Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores}, 
      author={Jamshid Mozafari and Abdelrahman Abdallah and Bhawna Piryani and Adam Jatowt},
      year={2025},
      eprint={2502.16358},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.16358},
      doi={10.48550/arXiv.2502.16358} 
}

🙏 Acknowledgments

Thanks to our contributors and the University of Innsbruck for supporting this project.
