Add dataset based on Attention Is All You Need paper #49

Open · wants to merge 1 commit into `main`
64 changes: 64 additions & 0 deletions llama_datasets/attention_is_all_you_need_paper/README.md
@@ -0,0 +1,64 @@
# Attention Is All You Need Paper Dataset

## CLI Usage

You can download `llamadatasets` directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamadataset AttentionIsAllYouNeedPaperDataset --download-dir ./data
```

You can then inspect the files at `./data`. When you're ready to load the data into
Python, you can use the following snippet:

```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(
    input_dir="./data/source_files"
).load_data()
```
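
If you want a quick look at the labelled examples before running anything else, the snippet below is a minimal sketch. It assumes the `rag_dataset` object created above and relies on `LabelledRagDataset` exposing its examples as a list (each with `query` and `reference_answer` fields); the exact attributes may vary slightly across llama-index versions.

```python
# sketch: peek at a few labelled examples from the dataset loaded above
print(f"number of examples: {len(rag_dataset.examples)}")

first_example = rag_dataset.examples[0]
print(first_example.query)             # the question posed about the paper
print(first_example.reference_answer)  # the labelled reference answer
```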

## Code Usage

You can also download the dataset to a directory, say `./data`, directly in Python.
From there, you can use the convenient `RagEvaluatorPack` llamapack to
run your own LlamaIndex RAG pipeline with the `llamadataset`.

```python
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
    "AttentionIsAllYouNeedPaperDataset", "./data"
)

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=query_engine,
)

############################################################################
# NOTE: If you have a lower tier OpenAI API subscription, such as Usage    #
# Tier 1, then you'll need to use a different batch_size and               #
# sleep_time_in_seconds. For Usage Tier 1, batch_size=5 and                #
# sleep_time_in_seconds=15 seemed to work well (as of December 2023).      #
############################################################################

benchmark_df = await rag_evaluator_pack.arun(
    batch_size=20,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)
```
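
Note that the top-level `await` above assumes an environment with a running event loop, such as a Jupyter notebook. If you are running this as a plain Python script instead, one option (a sketch, not part of the original example) is to drive the same call with `asyncio`:

```python
import asyncio


async def main() -> None:
    # same call as above, but driven by an explicit event loop
    benchmark_df = await rag_evaluator_pack.arun(
        batch_size=20,
        sleep_time_in_seconds=1,
    )
    print(benchmark_df)


asyncio.run(main())
```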
1 change: 1 addition & 0 deletions llama_datasets/attention_is_all_you_need_paper/card.json
@@ -0,0 +1 @@
{"name": "Attention is All You Need Paper Dataset", "className": "LabelledRagDataset", "description": "A labelled RAG dataset based off Attention is All You Need paper, consisting of queries, reference answers, and reference contexts.", "numberObservations": 20, "containsExamplesByHumans": false, "containsExamplesByAi": true, "sourceUrls": ["https://arxiv.org/pdf/1706.03762.pdf"], "baselines": [{"name": "llamaindex", "config": {"chunkSize": 1024, "llm": "gpt-3.5-turbo", "similarityTopK": 2, "embedModel": "text-embedding-ada-002"}, "metrics": {"contextSimilarity": 0.967, "correctness": 4.65, "faithfulness": 0.95, "relevancy": 0.95}, "codeUrl": ""}]}
5 changes: 5 additions & 0 deletions llama_datasets/library.json
@@ -0,0 +1,5 @@
"AttentionIsAllYouNeedPaperDataset": {
"id": "llama_datasets/attention_is_all_you_need",
"author": "ravi.theja",
"keywords": ["rag"]
}