RealDevQA is a multilingual Development Knowledge QA benchmark derived from large-scale real user dialogues via a rigorous three-phase pipeline. The dataset and pipeline are described below.
## 🔍 Dataset Overview
- 📊 2,740 Dev Knowledge QA pairs
- 🌍 Trilingual support (EN/CN/RU)
- 💻 Diverse development topics
- 🔗 Verifiable answers with web references
## Pipeline Implementation
- The `code/generate` directory contains code for collecting web documents based on real conversations and jointly feeding them, together with the actual dialogues, into LLMs to regenerate Q&A pairs (a minimal sketch follows this list).
- The `code/filter` directory implements a series of rigorous filtering processes to ensure data quality.
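For intuition, here is a minimal sketch of the regeneration step; the prompt wording, model name, and OpenAI-style client are illustrative assumptions, not the repository's actual implementation:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # any OpenAI-compatible endpoint

def regenerate_qa(dialogue: str, web_doc: str) -> str:
    """Jointly feed a real user dialogue and a collected web document to an
    LLM and ask it to produce one grounded Q&A pair (hypothetical sketch)."""
    prompt = (
        "Given the following user dialogue and supporting web document, "
        "write one self-contained development question and its answer, "
        "using only facts supported by the document.\n\n"
        f"Dialogue:\n{dialogue}\n\nDocument:\n{web_doc}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the actual generator model is not specified here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```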
## Reference Support
- All generated Q&A pairs come with corresponding reference URLs stored in `data/reference`, enabling verification of answer accuracy.
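As a quick illustration, the stored references can be used to spot-check an answer. The snippet below assumes a hypothetical JSON layout for files under `data/reference`; the actual schema may differ:

```python
import json
import urllib.request

# Hypothetical record layout: {"question": ..., "answer": ..., "references": [url, ...]}
with open("data/reference/sample.json", encoding="utf-8") as f:
    record = json.load(f)

print("Q:", record["question"])
print("A:", record["answer"])

# Confirm each cited reference URL is still reachable.
for url in record["references"]:
    try:
        status = urllib.request.urlopen(url, timeout=10).status
    except OSError as err:
        status = f"unreachable ({err})"
    print(f"{url} -> {status}")
```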
## Efficient Evaluation Framework
- The `code/eval` directory provides ready-to-use code for conveniently and efficiently evaluating LLM performance on the RealDevQA benchmark.
Create the environment and install the required packages:

```bash
conda create -n RealDevQA python=3.11
conda activate RealDevQA
```
You can evaluate LLM performance on RealDevQA by following these steps:
(1) Get started:

```bash
git clone https://anonymous.4open.science/r/RealDevQA-25E7/
cd RealDevQA
```
(2) In the `code/eval/eval_demo.py` file, add the grader model you wish to use, along with its API key and base URL:

```python
grading_sampler = ChatCompletionSampler(
    model="",        # name of the grader model
    system_message=OPENAI_SYSTEM_MESSAGE_API,
    api_key="",      # API key for the grader endpoint
    base_url=""      # base URL of the (OpenAI-compatible) endpoint
)
```
(3) In the `code/eval/eval_demo.py` file, add the evaluation models you wish to test, along with their corresponding API keys and base URLs:

```python
samplers = {
    "model_name": ChatCompletionSampler(
        model="",        # model under evaluation
        system_message=OPENAI_SYSTEM_MESSAGE_API,
        api_key="",
        base_url=""
    ),
    # ... one entry per model you want to evaluate
}
```
(4) Run the eval script to obtain the evaluation results:

```bash
python code/eval/eval_demo.py
```
| Metric | Name | Description | Ideal |
|---|---|---|---|
| CO✅ | Correct | Full match with reference | ↑ |
| NA | Not Attempted | Partial match | ↓ |
| IN❌ | Incorrect | Contradicts reference | ↓ |
| CGA🎯 | Correct Given Attempted | CO/(CO+IN) ratio | ↑ |
| F-score⚖️ | F-score | Harmonic mean of CO & CGA | ↑ |
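Based on the definitions above, here is a minimal sketch of how the aggregate metrics could be computed from per-question grades; the function and grade labels are illustrative, not the benchmark's actual evaluation code:

```python
from collections import Counter

def aggregate(grades: list[str]) -> dict[str, float]:
    """Compute benchmark metrics from per-question grades ("CO", "NA", "IN")."""
    counts = Counter(grades)
    n = len(grades)
    co = counts["CO"] / n                                  # fraction fully correct
    attempted = counts["CO"] + counts["IN"]
    cga = counts["CO"] / attempted if attempted else 0.0   # CO / (CO + IN)
    f = 2 * co * cga / (co + cga) if co + cga else 0.0     # harmonic mean of CO & CGA
    return {"CO": co, "NA": counts["NA"] / n, "IN": counts["IN"] / n,
            "CGA": cga, "F-score": f}

# Example: 2 correct, 1 incorrect, 1 not attempted
print(aggregate(["CO", "CO", "IN", "NA"]))  # CGA = 2/3, F-score ≈ 0.571
```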