# Vulkan Backend

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to offer developers more explicit control over GPUs compared to
previous specifications, in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) on the market support Vulkan. Vulkan has also been included
in Android since Android 7.0.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say, it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations for PyTorch operators via GLSL compute shaders.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an AOT graph-mode style of model inference (as opposed to
PyTorch's eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only the supported subgraphs.
* **Support for upper-bound dynamic shapes**
  * Tensors can change shape between inferences, as long as the current shape is smaller than the bounds specified during lowering.
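
The memory planning idea above can be sketched in plain Python. This is a hypothetical illustration of lifetime-based allocation sharing, not the actual ExecuTorch planner; the function and tensor names are invented for the example.

```python
# Hypothetical sketch: tensors whose lifetimes do not overlap can reuse
# the same allocation, lowering peak memory. Not the real planner.

def plan_memory(lifetimes):
    """lifetimes: dict of tensor name -> (first_use, last_use) node indices.
    Returns a dict mapping each tensor to a shared allocation id."""
    allocations = []  # allocation id -> last node index that still uses it
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # Reuse an allocation whose previous tenant's lifetime has ended
        for alloc_id, busy_until in enumerate(allocations):
            if busy_until < start:
                allocations[alloc_id] = end
                assignment[name] = alloc_id
                break
        else:
            allocations.append(end)
            assignment[name] = len(allocations) - 1
    return assignment

# t0 dies before t2 is created, so they can share one allocation:
plan = plan_memory({"t0": (0, 1), "t1": (1, 3), "t2": (2, 4)})
# Three intermediates fit in two allocations instead of three.
```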

In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to add graph passes that insert memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end to end example with a simple single operator model.

### Compile and lower a model to the Vulkan Delegate

Once ExecuTorch has been set up and installed, the following script can be used
to generate a simple model and lower it to the Vulkan delegate, producing
`vk_add.pte`.

```python
# Note: this script is the same as the script from the "Setting up ExecuTorch"
# page, with one minor addition to lower to the Vulkan backend.
import torch
from torch.export import export
from executorch.exir import to_edge

from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

# Start with a PyTorch model that adds two input tensors (matrices)
class Add(torch.nn.Module):
    def __init__(self):
        super(Add, self).__init__()

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# 1. torch.export: Defines the program with the ATen operator set.
aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))

# 2. to_edge: Make optimizations for Edge devices
edge_program = to_edge(aten_dialect)
# 2.1 Lower to the Vulkan backend
edge_program = edge_program.to_backend(VulkanPartitioner())

# 3. to_executorch: Convert the graph to an ExecuTorch program
executorch_program = edge_program.to_executorch()

# 4. Save the compiled .pte program
with open("vk_add.pte", "wb") as file:
    file.write(executorch_program.buffer)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class, which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it
contains some unsupported operators; only the supported parts of the graph will
be executed on the GPU.
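
To build intuition for what the partitioner does, here is a hypothetical, much-simplified sketch in plain Python. The real `VulkanPartitioner` operates on an exported FX graph, but the idea of grouping contiguous runs of supported operators into delegated subgraphs is the same; the operator names and supported set below are invented for the example.

```python
# Illustrative only: a toy capability-based partitioner over a flat op list.
SUPPORTED_OPS = {"add", "sub", "mul", "div"}  # hypothetical supported set

def partition(ops):
    """Split a linear op sequence into (backend, [ops]) segments."""
    segments = []
    for op in ops:
        backend = "vulkan" if op in SUPPORTED_OPS else "portable"
        if segments and segments[-1][0] == backend:
            segments[-1][1].append(op)  # extend the current run
        else:
            segments.append((backend, [op]))  # start a new segment
    return segments

# "softmax" is unsupported here, so it stays on the portable backend while
# the surrounding arithmetic runs are delegated to Vulkan.
segments = partition(["add", "mul", "softmax", "add"])
```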

::::{note}
The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/supported_ops.py)
in the Vulkan partitioner code can be inspected to examine which ops are
currently implemented in the Vulkan delegate.
::::

### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler, which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed; any NDK version past
NDK r19c should work. Note that the examples in this doc have been validated with
NDK r27b. The Android SDK should also be installed so that you have access to `adb`.

The instructions on this page assume that the following environment variables
are set.

```shell
export ANDROID_NDK=<path_to_ndk>
# Select the appropriate Android ABI for your device
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
  cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out && \
  cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow, as the majority of operators
are executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target vulkan_executor_runner -j32

# Push model to device
adb push vk_add.pte /data/local/tmp/vk_add.pte
# Push binary to device
adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
```