[ET-VK][LlaMa] Split SDPA + KV cache operator into SDPA operator and KV cache update operator + Add RemoveAsserts pass and apply it during LlaMa export
#8074
Stack from ghstack (oldest at bottom):
Note: This diff is a combination of D68919676 (#8068) and D68919678 (no pull request). I combined the two because `ghexport` was having problems exporting the second diff, and because both diffs are needed for `export_llama` to work, so it makes more sense to have a single diff.
Context
#7413 and #7412 split the `sdpa_with_kv_cache` operator into two separate operators, `update_cache` and `custom_sdpa`, to decouple the cache update step from the actual SDPA computation.
As a result, SDPA is no longer being delegated on Vulkan because of this interface change. To rectify this, Vulkan must also split `sdpa_with_kv_cache` into two operators.
Note that during this diff the new operators are not partitioned yet because of complications caused by assertion ops in the graph. The next diff adds a pass to remove such assertion ops, which allows the new operators to be partitioned.
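To make the decoupling concrete, here is a minimal pure-Python sketch of the idea for a single attention head. The function signatures, argument order, and list-based tensors are assumptions made for illustration; they are not the actual `update_cache` / `custom_sdpa` schemas or the Vulkan implementation.

```python
import math


def update_cache(cache, new_rows, start_pos):
    """Write new_rows into cache in place, beginning at row start_pos."""
    for i, row in enumerate(new_rows):
        cache[start_pos + i] = list(row)


def sdpa(q, k_cache, v_cache, start_pos):
    """Causal attention over an already-updated cache (hypothetical signature).

    Row i of q is treated as the token at position start_pos + i, so it
    attends to cache positions 0..start_pos + i inclusive.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        context_len = start_pos + i + 1
        scores = [
            scale * sum(a * b for a, b in zip(q_row, k_cache[j]))
            for j in range(context_len)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_cache[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v_cache[0]))
        ])
    return out


def sdpa_with_kv_cache(q, k, v, k_cache, v_cache, start_pos):
    """The fused behaviour, expressed as the two separate steps."""
    update_cache(k_cache, k, start_pos)
    update_cache(v_cache, v, start_pos)
    return sdpa(q, k_cache, v_cache, start_pos)
```

In this sketch the cache write and the attention computation share only the cache buffers and `start_pos`, which is what lets them be exposed as two operators.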
Context
Recently, some assertion ops were added to the Llama source code.
Unfortunately, this causes issues for the Vulkan delegate because runtime assertions are not yet supported in Vulkan and the assertion ops cause graph breaks due to not being supported.
To prevent graph breaks when delegating to Vulkan, apply a pass to remove assertion ops during the llama export.
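The effect of such a pass can be illustrated with a toy graph model in plain Python. This is only a sketch under stated assumptions: the `Node` class, the op-name strings in `ASSERT_OPS`, and the `count_partitions` helper are hypothetical stand-ins, not the ExecuTorch pass or the Vulkan partitioner.

```python
from dataclasses import dataclass, field

# Illustrative op names only; the real pass may target a different set.
ASSERT_OPS = {"aten._assert_scalar", "aten.sym_constrain_range_for_size"}


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)


def remove_asserts(nodes):
    """Return a new node list with assertion nodes dropped.

    Assertion nodes are expected to have no consumers; if a kept node
    still reads from a removed one, raise instead of corrupting the graph.
    """
    kept = [n for n in nodes if n.op not in ASSERT_OPS]
    kept_names = {n.name for n in kept}
    for n in kept:
        dangling = [i for i in n.inputs if i not in kept_names]
        if dangling:
            raise ValueError(f"{n.name} depends on removed nodes: {dangling}")
    return kept


def count_partitions(nodes, supported):
    """Count maximal runs of consecutive nodes whose op is in supported."""
    runs, in_run = 0, False
    for n in nodes:
        ok = n.op in supported
        if ok and not in_run:
            runs += 1
        in_run = ok
    return runs


graph = [
    Node("x", "placeholder"),
    Node("cache", "update_cache", ["x"]),
    Node("check", "aten._assert_scalar", ["x"]),
    Node("attn", "custom_sdpa", ["cache"]),
]
supported_ops = {"update_cache", "custom_sdpa"}
```

With this toy graph, the unsupported assertion node sits between the two supported operators and splits them into two runs; after `remove_asserts` they form a single contiguous run.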
Differential Revision: D68922404