Commit cee8261

green
Signed-off-by: Chris Abraham <[email protected]>
1 parent 9a0b378 commit cee8261

_posts/2024-12-19-improve-rag-performance.md

Lines changed: 32 additions & 17 deletions
@@ -4,22 +4,6 @@ title: "Improve RAG performance with torch.compile on AWS Graviton Processors"
 author: Sunita Nadampalli(AWS), Ankith Gunapal(Meta), Hamid Shojanazeri(Meta)
 ---
 
-```html
-<pre><code class="language-python">
-<span style="color: green;">print("This line is green")</span>
-print("This line is normal")
-<span style="color: green;">x = 10</span>
-</code></pre>
-```
-
-<div class="code-block">
-<pre>
-<span style="color: green;">let x = 10;</span>
-console.log(x);
-</pre>
-</div>
-
-
 Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to support tasks like answering questions, translating languages, and completing sentences. There are a few challenges when working with LLMs such as domain knowledge gaps, factuality issues, and hallucination, which affect their reliability especially for the fields that require high levels of accuracy, such as healthcare, law, or engineering. Retrieval Augmented Generation (RAG) provides a solution to mitigate some of these issues by augmenting LLMs with a specific domain or an organization's internal knowledge base, without the need to retrain the model.
 
 The RAG knowledge source is generally business specific databases which are typically deployed on general-purpose CPU infrastructure. So, deploying RAG on general-purpose CPU infrastructure alongside related business services is both efficient and cost-effective. With this motivation, we evaluated RAG deployment on [AWS Graviton](https://aws.amazon.com/ec2/graviton/) based Amazon EC2 instances which have been delivering up to [40% price-performance advantage](https://aws.amazon.com/ec2/graviton/getting-started/) compared to comparable instances for the majority of the workloads including databases, in-memory caches, big data analytics, media codecs, gaming servers, and machine learning inference.
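Editor's note: the paragraph above describes the RAG pattern only at a high level. Below is a minimal sketch of that flow, assuming the same embedding model the committed script uses (`sentence-transformers/all-mpnet-base-v2`); the two-entry `knowledge_base`, the mean-pooling choice, and the final prompt template are the editor's illustrative assumptions, not the post's implementation.

```python
# Minimal RAG sketch (editor's illustration): embed a query, retrieve the
# closest document by cosine similarity, and prepend it to the LLM prompt.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Hypothetical in-memory knowledge base standing in for a business database.
knowledge_base = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support tickets are answered within one business day.",
]

def embed(texts):
    # Mean-pool token embeddings into one unit-norm vector per text.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.inference_mode():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return F.normalize((hidden * mask).sum(1) / mask.sum(1), dim=1)

doc_vecs = embed(knowledge_base)

query = "How long do customers have to return an item?"
scores = embed([query]) @ doc_vecs.T  # cosine similarity; vectors are unit-norm
context = knowledge_base[int(scores.argmax())]

# The retrieved context augments the prompt; the model itself is not retrained.
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(prompt)
```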
@@ -250,9 +234,40 @@ The following table shows the incremental performance improvements achieved for
 The following script is an updated example for the embedding model inference with the previously discussed optimizations included. The optimizations are highlighted in **BOLD**.
 
 
-![code optimizations](/assets/images/improve-rag-performance2.jpg){:style="width:100%"}
+<div class="code-block">
+<pre>
+import torch
+from torch.profiler import profile, record_function, ProfilerActivity
+from transformers import AutoTokenizer, AutoModel
+<span style="color: green;">import torch._inductor.config as config</span>
+<span style="color: green;">config.cpp.weight_prepack=True</span>
+<span style="color: green;">config.freezing=True</span>
+
+model_name = "sentence-transformers/all-mpnet-base-v2"
+input_text = ['This is an example sentence', 'Each sentence is converted']
+
+model = AutoModel.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
 
+encoded_input = tokenizer(input_text, padding=True, truncation=True, return_tensors='pt')
 
+warmup, actual = 100, 100
+model.eval()
+<span style="color: green;">model = torch.compile(model)</span>
+
+<span style="color: green;">with torch.inference_mode():</span>
+    # instead of with torch.no_grad()
+    # warmup
+    for i in range(warmup):
+        embeddings = model(**encoded_input)
+
+    with profile(activities=[ProfilerActivity.CPU]) as prof:
+        with record_function("model_inference"):
+            for i in range(actual):
+                embeddings = model(**encoded_input)
+    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
+</pre>
+</div>
 
 ### End-to-End RAG scenario on CPU
 
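Editor's note: because the committed snippet is wrapped in HTML `<div>`/`<pre>`/`<span>` markup for the blog's green highlighting, it cannot be copied and run as-is. A plain-Python rendering of the same script follows, with the green-highlighted lines tagged in comments; the one-line explanations of the two `torch._inductor` flags are the editor's reading, not text from the post.

```python
# The committed example with the HTML highlighting stripped so it runs directly.
import torch
from torch.profiler import profile, record_function, ProfilerActivity
from transformers import AutoTokenizer, AutoModel
import torch._inductor.config as config  # highlighted in the post
config.cpp.weight_prepack = True         # highlighted: prepack weights for the CPU backend
config.freezing = True                   # highlighted: fold weights/constants at compile time

model_name = "sentence-transformers/all-mpnet-base-v2"
input_text = ['This is an example sentence', 'Each sentence is converted']

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoded_input = tokenizer(input_text, padding=True, truncation=True, return_tensors='pt')

warmup, actual = 100, 100
model.eval()
model = torch.compile(model)             # highlighted in the post

with torch.inference_mode():             # highlighted: instead of torch.no_grad()
    # Warmup iterations give torch.compile time to trace and compile the model.
    for i in range(warmup):
        embeddings = model(**encoded_input)

    # Profile only the steady-state iterations.
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for i in range(actual):
                embeddings = model(**encoded_input)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```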