Kernel Library Overview #629

**docs/source/ir-ops-set-definition.md** (45 additions, 2 deletions)
# Definition of the Core ATen Operator Set

This page describes the background and motivation of the Core ATen Operator Set (opset). It is recommended reading for anyone developing a new kernel library or delegate for ExecuTorch. Familiarity with [`torch.export`](https://pytorch.org/docs/main/export.html) is a recommended prerequisite; in particular, the concepts of torch FX graphs, operator decomposition, and functionalization.

The list of operators that have been identified as Core ATen operators can be found on the [IRs page of the PyTorch documentation website](https://pytorch.org/docs/main/torch.compiler_ir.html).
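
As a small, hedged illustration (assuming a recent PyTorch build, where core operators carry the `core` tag on their overloads), one can also check programmatically whether an operator overload belongs to the core ATen opset:

```python
import torch

# Core ATen operators are tagged with torch.Tag.core on their overloads
print(torch.Tag.core in torch.ops.aten.add.Tensor.tags)  # expected: True
# hardsigmoid is used later on this page as a decomposition example, so it
# should not be a core operator (expected: False, assuming it is decomposed
# by default)
print(torch.Tag.core in torch.ops.aten.hardsigmoid.default.tags)
```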

## What is an Operator Set?

`torch.export` performs a full graph capture on a given PyTorch program, producing a graph IR that describes the computation performed by the program. An operator (i.e. an operation performed on a Tensor) is the basic unit of computation in the graph, with each operator often corresponding to a unique node in the graph IR. The primary source of operators is the [ATen library](https://pytorch.org/cppdocs/#aten); in addition to ATen operators, developers can also define their own operators (i.e. custom operators).
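
For illustration, here is a minimal sketch of capturing a small program and inspecting the operators in the resulting graph IR (the module name and input shapes are arbitrary):

```python
import torch

class MulAdd(torch.nn.Module):
    def forward(self, x, y):
        # Each tensor operation below becomes an ATen operator node in the graph
        return x * y + y

ep = torch.export.export(MulAdd(), (torch.randn(2), torch.randn(2)))
# The printed graph should contain nodes such as aten.mul.Tensor and aten.add.Tensor
print(ep.graph)
```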

An “ATen operator set” or “ATen opset” is the set of ATen operators that can be used to represent a PyTorch program once it has been captured into a graph IR.

## The Functional ATen Operator Set

The program capture mechanism of `torch.export` produces a functionalized graph, which allows only functional operators (i.e. operators that do not mutate or alias their inputs). Therefore, the graphs produced by `torch.export` contain only operators from the functional ATen opset, which consists solely of functional ATen operators.
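
As a sketch of what functionalization looks like in practice (the module here is hypothetical), a forward pass that uses an in-place, mutating op should appear in the exported graph only as its functional counterpart:

```python
import torch

class InPlace(torch.nn.Module):
    def forward(self, x):
        y = x.clone()
        y.add_(1)  # aten.add_ mutates its input ...
        return y

ep = torch.export.export(InPlace(), (torch.randn(2),))
# ... but the exported graph should contain only the functional
# aten.add.Tensor, not the mutating aten.add_.Tensor
print(ep.graph)
```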

## The Core ATen Operator Set

An exported graph can be further transformed by applying operator decompositions. This process will replace specified ATen operators with equivalent sequences of other ATen operators. For instance, `aten.hardsigmoid` can be replaced with `aten.clamp(aten.clamp(self + 3, min=0), max=6) / 6`.

If a PyTorch program is decomposed with the default decomposition settings, the resulting graph IR will contain only operators from the “core ATen” opset. This opset is a subset of the functional ATen opset, since some operators are decomposed. ATen operators that belong to the core ATen opset (i.e. core ATen operators) are not decomposed under the default settings. Generally, core ATen operators cannot easily be re-expressed in terms of other ATen operators through decomposition.
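
As a sketch of this process, using the `aten.hardsigmoid` example above (the module is hypothetical, and the exact output ops may vary by PyTorch version), `ExportedProgram.run_decompositions()` with its default table lowers a graph to the core ATen opset:

```python
import torch

class HardSigmoid(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.hardsigmoid(x)

ep = torch.export.export(HardSigmoid(), (torch.randn(4),))
print(ep.graph)  # should contain aten.hardsigmoid before decomposition

# Applying the default decomposition table produces core ATen IR;
# hardsigmoid should be re-expressed in terms of ops like clamp/add/div
core_ep = ep.run_decompositions()
print(core_ep.graph)
```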

The key motivation behind the core ATen opset is to reduce the number of operators that need to be handled by PyTorch backends and compilers once a model is exported. Not only does the ATen library define a great number of operators, but new operators may be added and the schemas of existing operators may change over time. Without operator decomposition, backends built on top of the IR produced by `torch.export` would have to handle both a large operator surface and an opset that is constantly in flux. The core ATen opset addresses this by defining a much smaller, more manageable set of operators that was developed with stability in mind.

## Development of the Core ATen Operator Set

Although ExecuTorch uses the core ATen opset, the opset is not specific to ExecuTorch. One of its primary design goals is to be as generic as possible: the vast majority of use-cases will not want to decompose the operators it contains, and by extension, the decompositions implied by the core ATen opset should be useful to the vast majority of use-cases.

Another key consideration was to keep the opset as minimal as possible, but not at the expense of imposing decompositions that would have a profound negative impact on performance or developer experience.

The core ATen opset was developed by reviewing a list of ATen operators compiled by surveying models in public GitHub repositories, in addition to well-known open source models. The purpose of this survey was to obtain a reduced list of ATen operators that serves as a proxy for which operators are used most often, so that the most commonly used operators could be reviewed first.

Whether each operator should be a core operator, or instead be decomposed by the Core ATen Decomposition Table, was determined by:

1. Examining potential decompositions of the operator; the decomposition should be a relatively straightforward re-expression of the operator using other ATen operators.
   * The decomposition shouldn’t look like an outright implementation of the operator.
   * The decomposition shouldn’t vary based on run-time characteristics of the input.
   * We also consider whether decomposing the operator would impact the precision, numerical validity, or memory layout of the output.
2. Considering whether developers would want to preserve the operator in the graph for performance or other reasons.
   * For instance, an operator that can be decomposed but maps to a single hardware instruction on most platforms would be a good candidate for promotion to a core operator.

## Future Work

Until every ATen operator has been reviewed and designated either “core” or “decomposed by default”, the core ATen opset cannot be considered fully complete. However, this is a monumental task, and there is a long tail of operators that are rarely used. This is why models were surveyed to determine the most commonly used ops, allowing “higher impact” operators to be prioritized.

Nonetheless, there are still many operators that have not been evaluated. The plan is to continue evaluating additional operators as the need arises; the PyTorch community may propose additional core operators or additional core decompositions by opening a GitHub issue or by [commenting on this post on the PyTorch Forums](https://dev-discuss.pytorch.org/t/defining-the-core-aten-opset/1464).
**docs/source/kernel-library-overview.md** (36 additions, 2 deletions)

# Overview of ExecuTorch’s Kernel Libraries

This page describes the Portable Kernel Library and the Optimized Kernel Library, the default kernel libraries shipped with ExecuTorch. It is recommended reading for those interested in executing ExecuTorch programs with these kernel libraries, or in implementing their own kernels and kernel libraries.

An ExecuTorch program encodes instructions that describe the computation to be performed by the program. Many of these instructions correspond to calling a specific ATen operator, for example `aten.convolution`. However, one of the core design principles of ExecuTorch is that the signature of an operator should be separate from its implementation. This means that the ExecuTorch runtime does not ship with any standard implementation of ATen operators; instead, users must link against kernel libraries that contain implementations of the operators required by their ExecuTorch program, and configure [operator registration](https://github.com/pytorch/executorch/blob/main/docs/website/docs/tutorials/aten_ops_and_aten_mode.md) to map an operator signature to the desired implementation. This makes it easy to adjust which implementation of an operator such as `aten.convolution` is called when executing an ExecuTorch program, allowing users to select the exact operator implementations that meet the performance, memory usage, and battery usage constraints of their use-case.

**In essence, a kernel library is simply a collection of ATen operator implementations that follow a common theme or design principle**. Note that due to ExecuTorch’s selective build process (discussed in the following section), operator implementations are linked individually. This means that users can easily mix different kernel libraries in their build without sacrificing build size.

ExecuTorch ships with two kernel libraries by default: the **Portable Kernel Library** and the **Optimized Kernel Library**, both of which provide CPU operator implementations.

## Portable Kernel Library

The Portable Kernel Library is, in a sense, the “reference” kernel library used by ExecuTorch. It was developed with the following goals in mind:

* Correctness
  * Provide straightforward implementations of ATen operators that are strictly consistent with the original implementation of the operator in PyTorch’s ATen library.
* Readability / Simplicity
  * Provide clear, readable source code so that those who want to develop custom implementations of an operator can easily understand its desired behavior.
* Portability
  * Portable kernels should be just as portable as the ExecuTorch runtime; operator implementations should not use any external dependencies or any unsanctioned features of C++.
* Operator Coverage
  * As the “reference” kernel library for ExecuTorch, the Portable Kernel Library aims for a high degree of operator coverage. The goal is to provide an implementation for every operator listed as a Core ATen operator, though note that coverage is still a work in progress.

The Portable Kernel Library primarily aims to provide easily accessible operator implementations that “just work” on most platforms and are guaranteed to produce correct output. Performance is a non-goal; in fact, many bottleneck operators, such as convolution and matrix multiplication, are implemented in the most straightforward way possible in the interest of simplicity and readability. One should therefore not expect fast inference times when exclusively using the Portable Kernel Library. However, outside of specific bottleneck operators, most operators are simple enough that the straightforward implementations should still provide adequate performance. Binary size is also a non-goal for the Portable Kernel Library.

## Optimized Kernel Library

The Optimized Kernel Library is a supplemental kernel library shipped with ExecuTorch that, in contrast to the Portable Kernel Library, provides performance-focused implementations of operators at the cost of portability and readability. Many operator implementations in the Optimized Kernel Library are inspired by or based on the corresponding implementations in PyTorch’s ATen library, so in many cases one can expect a comparable degree of performance.

Generally speaking, operators in the Optimized Kernel Library are optimized in one of two ways:

1. Using CPU vector intrinsics
2. Using optimized math libraries, such as `sleef` and OpenBLAS

Although portability is not a design goal of the Optimized Kernel Library, its implementations are not fine-tuned for a specific CPU architecture. Rather, the library seeks to provide performant implementations that can be applied across a variety of platforms, instead of relying on optimizations specific to a single platform.

Another important note is that operator coverage is also a non-goal for the Optimized Kernel Library. There are no plans to add optimized kernels for every Core ATen operator; rather, optimized kernels are added on an as-needed basis to improve performance on specific models. Thus, operator coverage in the Optimized Kernel Library will be much more limited than that of the Portable Kernel Library.