[llm] Use new API to register custom ops for llama model #2840
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2840
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures as of commit d1612c8 with merge base 081c849. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
backends/xnnpack/CMakeLists.txt
@@ -72,6 +72,7 @@ target_include_directories(
  xnnpack_schema INTERFACE ${_xnnpack_schema__include_dir}
  ${EXECUTORCH_ROOT}/third-party/flatbuffers/include)

target_compile_options(pthreadpool PUBLIC ${_common_compile_options})
Why is this needed? Also, pthreadpool-related stuff has moved to the root CMakeLists.
My local build failed, complaining about missing -fPIC.
if(ANDROID)
  list(APPEND link_libraries log)
endif()

target_compile_options(llama_main PUBLIC ${_common_compile_options}
    -DET_USE_THREADPOOL)
    -DET_USE_THREADPOOL)
typo?
Reformat cmakelists.txt
@@ -22,6 +21,7 @@
 #include <executorch/backends/xnnpack/threadpool/threadpool.h>
 #include <executorch/extension/parallel/thread_parallel.h>
 #endif
 #include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
We should rename this header to something else
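For readers following along: here is a minimal sketch of what this header provides, using a hypothetical `my_copy.out` op rather than the PR's actual code. The `EXECUTORCH_LIBRARY` macro registers an out-variant kernel with the ExecuTorch runtime directly from C++, which is what removes the need for a yaml entry.

```cpp
// Minimal sketch: hypothetical op and simplified kernel, not the PR's code.
#include <cstring>

#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

// Hypothetical kernel: copies `in` into `out`. Real kernels validate shapes
// and dtypes and report failures through the runtime context.
exec_aten::Tensor& my_copy_out(
    RuntimeContext& ctx,
    const exec_aten::Tensor& in,
    exec_aten::Tensor& out) {
  (void)ctx;
  if (in.nbytes() != out.nbytes()) {
    return out; // a real kernel would flag an error here
  }
  std::memcpy(out.mutable_data_ptr(), in.const_data_ptr(), in.nbytes());
  return out;
}

} // namespace native
} // namespace executor
} // namespace torch

// Makes "llama::my_copy.out" resolvable through the runtime's kernel registry,
// with no entry in a yaml file.
EXECUTORCH_LIBRARY(llama, "my_copy.out", torch::executor::native::my_copy_out);
```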
    // @lint-ignore CLANGTIDY facebook-hte-ParameterMightThrowOnCopy
    const c10::optional<double> scale) {
  auto output = at::empty_like(q_projected);
  WRAP_TO_ATEN(sdpa_with_kv_cache_out_no_context, 11)
What's this 11?
The 11th argument is `out`.
Number of args? I think there is some template magic that allows you to count the number of args, right?
Yeah, this is telling the template to return the 11th object, since it is `out`. See this code: https://github.com/pytorch/executorch/blob/main/extension/aten_util/make_aten_functor_from_et_functor.h#L268-L277
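To illustrate, a minimal sketch with a hypothetical three-argument kernel (`my_add_out_no_context` is made up, not from the PR): `WRAP_TO_ATEN(kernel, N)` produces an ATen-callable wrapper around an ExecuTorch out-variant kernel, and `N` is the index of the `out` argument that the wrapper hands back, which is why the sdpa call above passes 11.

```cpp
// Minimal sketch, assuming a hypothetical ExecuTorch kernel with 3 arguments.
#include <ATen/ATen.h>
#include <executorch/extension/aten_util/make_aten_functor_from_et_functor.h>

namespace torch {
namespace executor {
namespace native {

// Hypothetical "no context" kernel: `out` sits at argument index 2, so we
// pass 2 to WRAP_TO_ATEN below (the sdpa kernel's `out` sits at index 11).
exec_aten::Tensor& my_add_out_no_context(
    const exec_aten::Tensor& a,
    const exec_aten::Tensor& b,
    exec_aten::Tensor& out) {
  (void)a;
  (void)b;
  return out; // a real kernel would compute into `out`
}

at::Tensor my_add_aten(const at::Tensor& a, const at::Tensor& b) {
  auto out = at::empty_like(a);
  // Converts the at::Tensor arguments to ET tensors, runs the kernel, and
  // writes back through the argument at index 2, i.e. `out`.
  WRAP_TO_ATEN(my_add_out_no_context, 2)(a, b, out);
  return out;
}

} // namespace native
} // namespace executor
} // namespace torch
```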
"Tensor(b!) value_cache, SymInt start_pos, SymInt seq_len, Tensor? attn_mask=None, " | ||
"float drpout_p=0.0, bool is_causal=False, float? scale=None) -> Tensor", | ||
&torch::executor::native::sdpa_with_kv_cache_aten); | ||
m.def( |
Why is this one needed? So that we can generate the out-variant one?
We need both `sdpa_with_kv_cache` and `sdpa_with_kv_cache.out` in ATen so that exir is happy.
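A minimal sketch of that point, with a hypothetical `my_ns::my_add` op and schema rather than the PR's: both the functional op and its `.out` variant get registered in the ATen op registry, since (roughly) exir traces the functional form and then needs a matching out-variant schema when it generates out-variant calls.

```cpp
// Minimal sketch: hypothetical namespace, op, and schemas.
#include <torch/library.h>

TORCH_LIBRARY(my_ns, m) {
  // Functional variant, used while exporting the eager model.
  m.def("my_add(Tensor a, Tensor b) -> Tensor");
  // Out variant, matching what the ExecuTorch kernel implements. The PR binds
  // implementations directly as a second argument to m.def; they can also be
  // bound separately via m.impl.
  m.def("my_add.out(Tensor a, Tensor b, *, Tensor(a!) out) -> Tensor(a!)");
}
```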
@@ -32,7 +32,7 @@ exec_aten::Tensor op_sdpa_with_kv_cache(
    exec_aten::optional<double> scale,
    exec_aten::Tensor& out) {
  exec_aten::RuntimeContext context{};
  return torch::executor::llama::sdpa_with_kv_cache_outf(
  return torch::executor::native::sdpa_with_kv_cache_out(
Why change the namespace?
The `llama` namespace was generated by FunctionHeaderWrapper.h. Here I added a header, op_sdpa.h, and that uses the same namespace as op_sdpa.cpp.
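A simplified sketch of the kind of declaration such a header adds (assumed and heavily abbreviated; the real header declares the full `sdpa_with_kv_cache_out` argument list). The point is that the hand-written header keeps the kernel in `torch::executor::native`, matching op_sdpa.cpp, instead of the generated `torch::executor::llama` wrapper namespace.

```cpp
// Simplified, assumed op_sdpa.h-style declaration; not the actual header.
#pragma once

#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

// Hand-written declaration in the same namespace as op_sdpa.cpp, so callers
// no longer go through the generated torch::executor::llama wrapper
// (sdpa_with_kv_cache_outf).
exec_aten::Tensor& sdpa_with_kv_cache_out(
    RuntimeContext& ctx,
    const exec_aten::Tensor& q_projected,
    /* ...remaining kv-cache arguments elided in this sketch... */
    exec_aten::Tensor& out);

} // namespace native
} // namespace executor
} // namespace torch
```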
# assuming we only hit this in OSS, find the default install path
prefix = os.environ.get("CMAKE_INSTALL_PREFIX", "../../../../cmake-out")
lib_path = os.path.join(prefix, "lib/libcustom_ops_aot_lib.so")
torch.ops.load_library(lib_path)
op = torch.ops.llama.sdpa_with_kv_cache.default
assert op is not None
This is a little bit clunky, but I don't know how to improve it.
runtime.cxx_library(
    name = "sdpa",
    name = "custom_ops",
Why rename this?
A lot of places refer to the examples/models/llama2/custom_ops:custom_ops library name. I'm just too lazy to change all of them to `sdpa`.
Left some comments
Summary: Using the following 2 APIs:
* `EXECUTORCH_LIBRARY` to replace the need for a yaml file. With this macro we can directly register a custom kernel into the ExecuTorch runtime.
* `WRAP_TO_ATEN` allows custom op authors to use the same kernel for ExecuTorch and PyTorch. This can be helpful during debugging.

Test Plan: Rely on the new CI job `test_llama` with the `xnnpack+kv+custom` option.

Reviewed By: kimishpatel
Differential Revision: D55713944
Pulled By: larryliu0820
This pull request was exported from Phabricator. Differential Revision: D55713944
@larryliu0820 merged this pull request in 020d8be.
This reverts commit 020d8be.