
PlausibleQA: A Large-Scale QA Dataset with Answer Plausibility Scores


PlausibleQA is a large-scale QA dataset designed to evaluate and enhance the ability of Large Language Models (LLMs) to distinguish between correct and highly plausible incorrect answers. Unlike traditional QA datasets that focus primarily on correctness, PlausibleQA provides candidate answers annotated with plausibility scores and justifications.

🗂 Overview

📌 Key Statistics

  • 10,000 questions sourced from TriviaQA, Natural Questions (NQ), and WebQuestions (WebQ).
  • 100,000 candidate answers, each with plausibility scores (0–100).
  • 1,000,000 justifications explaining plausibility rankings.

🌟 What Makes PlausibleQA Unique?

  • Plausibility-Aware MCQA: Enables adaptive distractor selection based on difficulty (see the sketch after this list).
  • LLM Robustness Evaluation: Measures a model’s ability to reject misleading but plausible answers.
  • Pairwise Answer Comparisons: Provides structured ranking of incorrect answers to refine plausibility assessments.
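As an illustration of plausibility-aware distractor selection, the minimal sketch below picks distractors for a question by plausibility score. The field names used here (candidate_answers, answer, plausibility_score, is_correct) are assumptions for illustration and may differ from the released schema.

Python

# Minimal sketch of plausibility-aware distractor selection.
# ASSUMPTION: field names (candidate_answers, answer, plausibility_score, is_correct)
# are illustrative and may differ from the released JSON schema.
def select_distractors(entry, k=3, difficulty="hard"):
    """Pick k incorrect answers; plausibility scores (0-100) control difficulty."""
    incorrect = [c for c in entry["candidate_answers"] if not c["is_correct"]]
    # Most plausible first for hard items, least plausible first for easy ones.
    ranked = sorted(incorrect, key=lambda c: c["plausibility_score"],
                    reverse=(difficulty == "hard"))
    return [c["answer"] for c in ranked[:k]]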

🔑 Research Contributions

  1. Introduction of PlausibleQA:

    • First large-scale QA dataset with explicit plausibility scores for incorrect answers.
    • Comprises 10,000 questions, 100,000 candidate answers, and 1,000,000 justifications.
  2. New QA Benchmark for MCQA & QARA:

    • Multiple-Choice Question Answering (MCQA): Facilitates plausibility-aware distractor generation.
    • QA Robustness Assessment (QARA): Evaluates LLM resilience against plausible distractors.
  3. Plausibility Score Annotations:

    • Each answer is assigned a plausibility score ranging from 0 to 100.
    • Scores are derived from listwise ranking (direct plausibility assignment) refined by pairwise comparisons (see the illustrative sketch after this list).
    • Human evaluation confirms the reliability of the plausibility scores.
  4. Dataset Generation Pipeline:

    • Questions are sourced from TriviaQA, Natural Questions (NQ), and WebQuestions (WebQ).
    • LLaMA-3.3-70B generates 10 candidate answers per question.
    • Pairwise answer comparison is used to refine plausibility rankings.
    • Question & answer difficulty estimation is incorporated.
  5. Comprehensive Human Evaluation:

    • Conducted pairwise comparisons for candidate answers.
    • Showed high agreement with plausibility rankings.
    • Confirmed that plausibility-aware distractors are more effective than traditional random distractors.
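Contribution 3 notes that plausibility scores come from listwise ranking refined by pairwise comparisons. The sketch below shows one simple way to turn pairwise outcomes into 0–100 scores via win rates; it is an illustrative aggregation over a hypothetical input format, not necessarily the exact procedure used to build PlausibleQA.

Python

from collections import defaultdict

# Illustrative aggregation: convert pairwise comparison outcomes into 0-100 scores
# using win rates. NOT necessarily the exact method used to build PlausibleQA.
# `comparisons` is a hypothetical list of (winner, loser) answer pairs for one question.
def win_rate_scores(comparisons):
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {ans: round(100 * wins[ans] / games[ans]) for ans in games}

print(win_rate_scores([("Paris", "Lyon"), ("Paris", "Lyon"), ("Lyon", "Berlin")]))
# {'Paris': 100, 'Lyon': 33, 'Berlin': 0}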

📥 Dataset Download

The dataset is available on HuggingFace:

wget "https://huggingface.co/datasets/JamshidJDMY/PlausibleQA/resolve/main/test.json?download=true"
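Once downloaded, the split can be inspected with standard Python tooling. A minimal sketch, assuming the file is saved locally as test.json (wget's -O flag can be used to control the output filename) and that it parses into a list of question entries; check the printed output for the actual field names.

Python

import json

# Minimal sketch: load the downloaded split and peek at the first entry.
# ASSUMPTION: the file is saved as "test.json" and parses into a list of entries;
# inspect the printed output for the actual schema (question, candidate answers,
# plausibility scores, justifications, ...).
with open("test.json", encoding="utf-8") as f:
    data = json.load(f)

first = data[0] if isinstance(data, list) else next(iter(data.values()))
print(len(data))
print(json.dumps(first, indent=2)[:500])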

📂 Use Cases of PlausibleQA

  • Improving MCQA models:

    • Helps generate more realistic and challenging multiple-choice options.
    • Enables adaptive distractor selection based on difficulty.
  • Enhancing QA Robustness Assessment:

    • Provides structured evaluation of how well LLMs handle plausible distractors.
    • Can be used for adversarial QA evaluation.
  • Fine-tuning LLMs for Better Answer Differentiation:

    • Models can be trained to better distinguish between correct and plausible answers.
    • Useful for reducing hallucinations in generative AI.
  • Contrastive Learning & Negative Example Selection:

    • Helps contrastive learning tasks by using plausibility scores for better negative sample selection.
  • Automatic Hint Generation & Evaluation:

    • The entropy of plausibility scores can be used for question difficulty estimation (see the sketch after this list).
    • Can be integrated into educational tools for intelligent tutoring.
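For the last use case above, a minimal sketch of entropy-based difficulty estimation: normalize a question's plausibility scores into a distribution and take its Shannon entropy, so questions whose plausibility mass is spread over many candidates come out as harder. The score values below are invented for illustration.

Python

import math

# Minimal sketch: entropy of a question's plausibility scores as a difficulty signal.
# Higher entropy = plausibility spread across many candidates = harder question.
def plausibility_entropy(scores):
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)

easy = [95, 2, 1, 1, 1]      # one dominant answer -> low entropy (~0.38)
hard = [30, 25, 20, 15, 10]  # several plausible answers -> high entropy (~2.23)
print(round(plausibility_entropy(easy), 2), round(plausibility_entropy(hard), 2))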

📜 License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to use, share, and adapt the dataset with proper attribution.

📑 Citation

If you find this work useful, please cite 📜 our paper:

Plain

Mozafari, J., Abdallah, A., Piryani, B., & Jatowt, A. (2025). Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores. arXiv [cs.CL]. doi:10.48550/arXiv.2502.16358

BibTeX

@article{mozafari2025wronganswersusefulplausibleqa,
      title={Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores}, 
      author={Jamshid Mozafari and Abdelrahman Abdallah and Bhawna Piryani and Adam Jatowt},
      year={2025},
      eprint={2502.16358},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.16358},
      doi={10.48550/arXiv.2502.16358} 
}

🙏 Acknowledgments

Thanks to our contributors and the University of Innsbruck for supporting this project.
