update doc with partitioner and quantizer

yifan_shen3 · yifan_shen3 · commit 70e6a9907203 · 2024-04-20T14:52:02.000-07:00
diff --git a/backends/apple/coreml/README.md b/backends/apple/coreml/README.md
@@ -6,6 +6,8 @@ Core ML is an optimized framework for running machine learning models on Apple d
 
 ## Layout
 - `compiler/` : Lowers a module to Core ML backend.
+- `partition/`: Partitions a module fully or partially to Core ML backend.
+- `quantizer/`: Quantizes a module using a Core ML-favored scheme.
 - `scripts/` : Scripts for installing dependencies and running tests.
 - `runtime/`: Core ML delegate runtime implementation.
     - `inmemoryfs`: InMemory filesystem implementation used to serialize/de-serialize AOT blob.
@@ -20,41 +22,104 @@ Core ML is an optimized framework for running machine learning models on Apple d
 If you have problems or questions or have suggestions for ways to make
 implementation and testing better, please create an issue on [github](https://www.github.com/pytorch/executorch/issues).
 
-## Delegation
+## Partition and Delegation
 
-For delegating the Program to the **Core ML** backend, the client must be responsible for calling `to_backend` with the **CoreMLBackend** tag.
+To delegate a Program to the **Core ML** backend, the client must call `to_backend` with the **CoreMLPartitioner**.
 
 ```python
-import executorch.exir as exir
 import torch
-
-from torch.export import export
-
-from executorch.exir import to_edge
-
-from executorch.exir.backend.backend_api import to_backend
+import executorch.exir
 
 from executorch.backends.apple.coreml.compiler import CoreMLBackend
+from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
 
-class LowerableSubModel(torch.nn.Module):
+class Model(torch.nn.Module):
     def __init__(self):
         super().__init__()
 
     def forward(self, x):
         return torch.sin(x)
 
-# Convert the lowerable module to Edge IR Representation
-to_be_lowered = LowerableSubModel()
-example_input = (torch.ones(1), )
-to_be_lowered_exir_submodule = to_edge(export(to_be_lowered, example_input))
+source_model = Model()
+example_inputs = (torch.ones(1), )
+
+# Export the source model to Edge IR representation
+aten_program = torch.export.export(source_model, example_inputs)
+edge_program_manager = executorch.exir.to_edge(aten_program)
 
-# Lower to Core ML backend
-lowered_module = to_backend('CoreMLBackend', to_be_lowered_exir_submodule.exported_program, [])
+# Delegate to Core ML backend
+delegated_program_manager = edge_program_manager.to_backend(CoreMLPartitioner())
+
+# Serialize delegated program
+executorch_program = delegated_program_manager.to_executorch()
+with open("model.pte", "wb") as f:
+    f.write(executorch_program.buffer)
 ```
 
-Currently, the **Core ML** backend delegates the whole module to **Core ML**. If a specific op is not supported by the **Core ML** backend then the `to_backend` call would throw an exception. We will be adding a **Core ML Partitioner** to resolve the issue.
+The module will be fully or partially delegated to **Core ML**, depending on whether all or only some of its ops are supported by the **Core ML** backend. Users may force certain ops to be skipped with `CoreMLPartitioner(skip_ops_for_coreml_delegation=...)`.
+
+The `to_backend` implementation is a thin wrapper over [coremltools](https://apple.github.io/coremltools/docs-guides/); `coremltools` is responsible for converting an **ExportedProgram** to an **MLModel**. The converted **MLModel** data is saved, flattened, and returned as bytes to **ExecuTorch**.
+
+## Quantization
+
+To quantize a Program in a Core ML-favored way, the client may use **CoreMLQuantizer**.
+
+```python
+import torch
+import executorch.exir
+
+from torch._export import capture_pre_autograd_graph
+from torch.ao.quantization.quantize_pt2e import (
+    convert_pt2e,
+    prepare_pt2e,
+    prepare_qat_pt2e,
+)
+
+from executorch.backends.apple.coreml.quantizer.coreml_quantizer import CoreMLQuantizer
+from coremltools.optimize.torch.quantization.quantization_config import (
+    LinearQuantizerConfig,
+    QuantizationScheme,
+)
+
+class Model(torch.nn.Module):
+    def __init__(self) -> None:
+        super().__init__()
+        self.conv = torch.nn.Conv2d(
+            in_channels=3, out_channels=16, kernel_size=3, padding=1
+        )
+        self.relu = torch.nn.ReLU()
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        a = self.conv(x)
+        return self.relu(a)
+
+source_model = Model()
+example_inputs = (torch.randn((1, 3, 256, 256)), )
+
+pre_autograd_aten_dialect = capture_pre_autograd_graph(source_model, example_inputs)
+
+quantization_config = LinearQuantizerConfig.from_dict(
+    {
+        "global_config": {
+            "quantization_scheme": QuantizationScheme.symmetric,
+            "milestones": [0, 0, 10, 10],
+            "activation_dtype": torch.uint8,
+            "weight_dtype": torch.int8,
+            "weight_per_channel": True,
+        }
+    }
+)
+quantizer = CoreMLQuantizer(quantization_config)
+
+# For post-training quantization, use `prepare_pt2e`
+# For quantization-aware training, use `prepare_qat_pt2e`
+prepared_graph = prepare_pt2e(pre_autograd_aten_dialect, quantizer)
+
+prepared_graph(*example_inputs)
+converted_graph = convert_pt2e(prepared_graph)
+```
 
-The `to_backend` implementation is a thin wrapper over `coremltools`, `coremltools` is responsible for converting an **ExportedProgram** to a **MLModel**. The converted **MLModel** data is saved, flattened, and returned as bytes to **ExecuTorch**.
+The `converted_graph` is the quantized torch model and can be delegated to **Core ML** in the same way, through **CoreMLPartitioner**.
 
 ## Runtime
 
diff --git a/backends/apple/coreml/setup.md b/backends/apple/coreml/setup.md
@@ -29,8 +29,8 @@ python3 -m examples.apple.coreml.scripts.export --model_name add
 4. You can now integrate the **Core ML** backend in code.
 
 ```python
-# Lower to Core ML backend
-lowered_module = to_backend('CoreMLBackend', to_be_lowered_exir_submodule, [])
+# Delegate to Core ML backend
+delegated_program_manager = edge_program_manager.to_backend(CoreMLPartitioner())
 ```