Commit f0de27a

typos

Signed-off-by: Chris Abraham <[email protected]>
1 parent 9efed19

File tree: 1 file changed (+6, -6 lines)


_posts/2024-12-19-improve-rag-performance.md

Lines changed: 6 additions & 6 deletions
@@ -4,9 +4,9 @@ title: "Improve RAG performance with torch.compile on AWS Graviton Processors"
 author: Sunita Nadampalli(AWS), Ankith Gunapal(Meta), Hamid Shojanazeri(Meta)
 ---
 
-Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to support tasks like answering questions, translating languages, and completing sentences. There are few challenges when working with LLMs such as domain knowledge gaps, factuality issues, and hallucination, which affect their reliability especially for the fields that require high levels of accuracy, such as healthcare, law, or engineering. Retrieval Augmented Generation (RAG) provides a solution to mitigate some of these issues by augmenting LLMs with a specific domain or an organization's internal knowledge base, without the need to retrain the model.
+Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to support tasks like answering questions, translating languages, and completing sentences. There are a few challenges when working with LLMs such as domain knowledge gaps, factuality issues, and hallucination, which affect their reliability especially for the fields that require high levels of accuracy, such as healthcare, law, or engineering. Retrieval Augmented Generation (RAG) provides a solution to mitigate some of these issues by augmenting LLMs with a specific domain or an organization's internal knowledge base, without the need to retrain the model.
 
-The RAG knowledge source is generally business specific databases which are typically deployed on general-purpose CPU infrastructure. So, deploying RAG on general-purpose CPU infrastructure alongside related business services is both efficient and cost-effective. With this motivation, we evaluated RAG deployment on [AWS Graviton](https://aws.amazon.com/ec2/graviton/) based Amazon EC2 instances which have been delivering up to [40% price-performance advantage](https://aws.amazon.com/ec2/graviton/getting-started/) compared to comparable instances for majority of the workloads including databases, in-memory caches, big data analytics, media codecs, gaming servers, and machine learning inference.
+The RAG knowledge source is generally business specific databases which are typically deployed on general-purpose CPU infrastructure. So, deploying RAG on general-purpose CPU infrastructure alongside related business services is both efficient and cost-effective. With this motivation, we evaluated RAG deployment on [AWS Graviton](https://aws.amazon.com/ec2/graviton/) based Amazon EC2 instances which have been delivering up to [40% price-performance advantage](https://aws.amazon.com/ec2/graviton/getting-started/) compared to comparable instances for the majority of the workloads including databases, in-memory caches, big data analytics, media codecs, gaming servers, and machine learning inference.
 
 In the past we published a few blog posts on how PyTorch was optimized for AWS Graviton processors to accelerate ML Inference performance for both eager mode ([blog](https://pytorch.org/blog/optimized-pytorch-w-graviton/)) and `torch.compile` mode ([blog](https://pytorch.org/blog/accelerated-pytorch-inference/)). In this blog we cover how to deploy a typical RAG workload using PyTorch and `torch.compile`, how we improved its performance up to **1.7x** for embedding model and **1.3x** for RAG query on AWS Graviton3-based m7g.xlarge instance compared to the default PyTorch “eager mode”, and finally a few recommendations that you can apply for your RAG use cases.

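Editor's note: the retrieve-then-generate flow the paragraph above describes can be sketched in a few lines. This is a hypothetical illustration, not code from the post; the model name, the toy document list, and the prompt format are all assumptions.

```
# Hypothetical retrieve-then-generate sketch; the model name, the toy
# document list, and the prompt format are assumptions, not the post's code.
import torch
from sentence_transformers import SentenceTransformer, util

docs = [
    "Our support portal is available 24/7 at the company intranet.",
    "Refunds are processed within 5 business days of approval.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with torch.no_grad():
    doc_emb = embedder.encode(docs, convert_to_tensor=True)
    query = "How long do refunds take?"
    query_emb = embedder.encode(query, convert_to_tensor=True)

# Retrieve the closest document and prepend it to the prompt, so the LLM
# answers from the organization's knowledge base instead of guessing.
best = util.cos_sim(query_emb, doc_emb).argmax().item()
prompt = f"Context: {docs[best]}\n\nQuestion: {query}"
```

The retrieved passage grounds the LLM's answer in the knowledge base, which is the mitigation the paragraph describes, without any retraining of the model itself.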
@@ -64,7 +64,7 @@ with torch.no_grad():
 
 #### Eager mode
 
-Since PyTorch eager mode was already optimized on AWS Graviton processors with the following runtime environment settings, we included them in the baseline, and measured the following performance. Please refer to [Optimized PyTorch 2.0 Inference with AWS Graviton processors](https://pytorch.org/blog/optimized-pytorch-w-graviton/) for more details on how we optimized the PyTorch eager mode on AWS Graviton processors.
+Since PyTorch eager mode was already optimized on AWS Graviton processors with the following runtime environment settings, we included them in the baseline and measured the following performance. Please refer to [Optimized PyTorch 2.0 Inference with AWS Graviton processors](https://pytorch.org/blog/optimized-pytorch-w-graviton/) for more details on how we optimized the PyTorch eager mode on AWS Graviton processors.
 
 
 ```
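
Editor's note: the fenced block that opens at the end of this hunk is truncated by the diff context, so its contents are not shown here. The linked Graviton post describes runtime settings along the following lines; treat this as a sketch, since the exact flags and values are taken from that post rather than from this commit.

```
# Sketch of eager-mode runtime settings in the spirit of the linked Graviton
# post; values here are assumptions from that post, not from this diff.
import os

os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"  # oneDNN bfloat16 fast-math kernels
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"         # transparent huge pages for tensor memory
os.environ["LRU_CACHE_CAPACITY"] = "1024"        # cache freed blocks to cut allocation cost

import torch  # import after the environment is set so the settings take effect
```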
@@ -363,7 +363,7 @@ model.client = torch.compile(model.client)
 
 For a typical inference scenario where the graph is frozen and gradient calculations are disabled, Torch inductor (the compiler backend we used for CPUs) invokes hardware specific optimizations like graph rewrite into more performant operators, operator fusion, and weights pre-packing. Though Torch dynamo was able to see the model and trigger generic compilation, it failed to trigger these additional Fx passes in the Torch inductor.
 
-There were two main reasons for Torch inductor not triggering the optimization passes: (1) The application didn’t set `no_grad()` or `inference_mode()` for torch inductor to detect that the graph was frozen; and (2) We hit a limitation with the torch.compile framework, where, if the `no_grad` is set just at the beginning of the compiled region, `torch.compile` wouldn’t be able to detect it while invoking the inductor `Fx` passes, because it would not have hit the `no_grad` region by then. Please refer to[ this GitHub issue](https://github.com/pytorch/pytorch/issues/125474) for more details.
+There were two main reasons for Torch inductor not triggering the optimization passes: (1) The application didn’t set `no_grad()` or `inference_mode()` for torch inductor to detect that the graph was frozen; and (2) We hit a limitation with the torch.compile framework, where, if the `no_grad` is set just at the beginning of the compiled region, `torch.compile` wouldn’t be able to detect it while invoking the inductor `Fx` passes because it would not have hit the `no_grad` region by then. Please refer to[ this GitHub issue](https://github.com/pytorch/pytorch/issues/125474) for more details.
 
 
 #### Solution
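
Editor's note: the post's solution section falls outside this hunk and is not shown in the diff. As a hedged sketch of the pattern the paragraph and the linked GitHub issue point to (hypothetical model and inputs; not necessarily the fix the post itself describes), an application can enable inductor freezing and enter `no_grad` before the compiled region runs, so the Fx passes see a frozen graph:

```
import torch

# Hypothetical arrangement per the limitation described above: enable inductor
# weight freezing explicitly, and be inside no_grad *before* the compiled
# region runs, so the inductor Fx passes can detect the frozen graph.
torch._inductor.config.freezing = True

model = torch.nn.Linear(768, 768).eval()  # stand-in for the embedding model
compiled = torch.compile(model)

with torch.no_grad():  # outer no_grad: already active when compilation triggers
    out = compiled(torch.randn(1, 768))
```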
@@ -420,6 +420,6 @@ We would like to express our gratitude to Eli Uriegas for the support in making
 
 **Sunita Nadampalli** is a Principal Engineer and AI/ML expert at AWS. She leads AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions for SoCs based on the Arm ISA.
 
-**Ankith Gunapal** is an AI Partner Engineer at Meta (PyTorch). He leads customer support, evangelizing & release engineering of TorchServe. He is passionate about solving production problems in model inference and model serving. He also enjoys distilling technically complex material in a user friendly format
+**Ankith Gunapal** is an AI Partner Engineer at Meta (PyTorch). He leads customer support, evangelizing & release engineering of TorchServe. He is passionate about solving production problems in model inference and model serving. He also enjoys distilling technically complex material in a user friendly format.
 
-**Hamid Shojanazeri** leads the AI Frameworks Partner Engineering team at Meta. He is passionate about building scalable AI solutions and specializes in working with PyTorch to tackle the challenges of large-scale distributed training, inference, model serving, and optimization
+**Hamid Shojanazeri** leads the AI Frameworks Partner Engineering team at Meta. He is passionate about building scalable AI solutions and specializes in working with PyTorch to tackle the challenges of large-scale distributed training, inference, model serving, and optimization.
