Summary:
This PR adds a direct `memcpy` fast path for the portable `copy` and `copy_` ops. This significantly speeds up copies in cases where no broadcasting is needed.
This is most noticeable when copying buffer mutations back, such as writing a transformer KV cache back when the cache is managed as a mutable buffer. Prior to this change, an encoder/decoder model spent ~25% of its total runtime copying the KV cache back after permuting. With this change, the copy becomes significantly cheaper.
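For intuition, the fast path can only apply when source and destination are byte-for-byte compatible. Below is a minimal sketch of that eligibility check expressed in Python terms; `can_use_memcpy` is a hypothetical name, and the exact conditions in the C++ kernel may differ:

```python
import torch

# Hypothetical sketch of the fast-path eligibility check; the actual
# conditions tested by the portable kernel may differ.
def can_use_memcpy(dst: torch.Tensor, src: torch.Tensor) -> bool:
    return (
        dst.dtype == src.dtype          # no element-type conversion needed
        and dst.shape == src.shape      # no broadcasting needed
        and dst.is_contiguous()
        and src.is_contiguous()         # both buffers are flat, dense memory
    )
```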
I benchmarked a simple model on a Samsung Galaxy S23 and a Pixel 5:
```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class TestModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Mutable buffer; its mutation is copied back after each forward.
        self.register_buffer("buffer", torch.zeros((2, 10, 1024, 1024)))

    def forward(self, x):
        self.buffer.add_(x)
        return self.buffer


model = TestModel()
inputs = (torch.randn(2, 10, 1024, 1024),)
lowered = to_edge_transform_and_lower(
    torch.export.export(model, inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```
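To reproduce on device, the lowered program can be serialized to a `.pte` file and run with an on-device benchmark runner; a minimal sketch (the output file name here is arbitrary):

```python
# Serialize the lowered program for on-device benchmarking.
with open("test_model.pte", "wb") as f:
    f.write(lowered.buffer)
```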
Time in `copy_`, average of 50 runs:

| Device  | With this change | Before  |
|---------|------------------|---------|
| S23     | 4.1 ms           | 22.3 ms |
| Pixel 5 | 12.1 ms          | 66.6 ms |
This is a ~5.5x speedup of the copy operator.
Differential Revision: D73656456