
Add direct copy fast path for portable copy op #10487


Merged: 1 commit merged into pytorch:main on Apr 27, 2025

Conversation

GregoryComer (Member) commented on Apr 25, 2025:

Summary:
This PR adds a direct memcpy fast path to the portable copy and copy_ ops, which speeds up the copy significantly when no broadcasting is needed. Because the copy op always checks that dim order and dtype match, the fast path is sound in every case where the shapes match exactly (i.e., no broadcasting).
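
To make the mechanism concrete, here is a minimal, self-contained C++ sketch of the fast-path idea. It is illustrative only and not the actual ExecuTorch portable kernel: `FakeTensor` and the `copy_` helper below are hypothetical stand-ins, and the real op additionally validates dtype and dim order before choosing a path.

```
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Stand-in for a dense, contiguous float tensor (hypothetical, for illustration only).
struct FakeTensor {
  std::vector<long> sizes;  // shape
  std::vector<float> data;  // contiguous storage
  std::size_t nbytes() const { return data.size() * sizeof(float); }
};

// Hypothetical copy helper. Assumes dtype and dim order already match,
// mirroring the checks the portable copy op performs before picking a path.
void copy_(const FakeTensor& src, FakeTensor& out) {
  if (src.sizes == out.sizes) {
    // Fast path: identical shapes mean no broadcasting, so the whole
    // contiguous buffer can be copied with a single memcpy.
    std::memcpy(out.data.data(), src.data.data(), src.nbytes());
    return;
  }
  // Slow path: a broadcast-aware element-wise copy would go here (omitted).
}

int main() {
  FakeTensor a{{2, 3}, {1, 2, 3, 4, 5, 6}};
  FakeTensor b{{2, 3}, std::vector<float>(6, 0.0f)};
  copy_(a, b);
  std::printf("b[5] = %g\n", b.data[5]);  // prints 6
  return 0;
}
```

The point of the fast path is that when shapes match exactly, the whole buffer is copied in one shot instead of walking a broadcast-aware element-wise loop, which is where the speedup reported below comes from.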

This is most noticeable when copying buffer mutations back, such as writing a transformer KV cache back when the cache is managed as a mutable buffer. Prior to this change, an encoder/decoder model spent roughly 25% of its total runtime copying the KV cache back after permuting it; with this change, that copy becomes significantly cheaper.

I benchmarked a simple model on S23 and Pixel 5:

```
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class TestModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.zeros((2, 10, 1024, 1024)))

    def forward(self, x):
        self.buffer.add_(x)
        return self.buffer


model = TestModel()
inputs = (torch.randn(2, 10, 1024, 1024),)

lowered = to_edge_transform_and_lower(
    torch.export.export(model, inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```

S23, average of 50 runs, time in copy_:
4.1 ms with the fast path vs. 22.3 ms without

Pixel 5, average of 50 runs, time in copy_:
12.1 ms with the fast path vs. 66.6 ms without

This is approximately a 5.5x speedup of the copy operator (22.3 / 4.1 ≈ 5.4 on the S23, 66.6 / 12.1 ≈ 5.5 on the Pixel 5).

Differential Revision: D73656456

pytorch-bot commented Apr 25, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10487

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 60bb4ef with merge base 12079fe:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Apr 25, 2025
facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D73656456

GregoryComer added the release notes: ops & kernels label on Apr 25, 2025
facebook-github-bot (Contributor) commented:

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

GregoryComer added a commit to GregoryComer/executorch that referenced this pull request Apr 26, 2025
Summary:
The PR adds a direct memcpy fast-path for portable copy and copy_ ops. This speeds up copy significantly in cases where no broadcasting is needed.

This is most noticeable when copying buffer mutations back, such as transformer KV cache when managing the cache as a mutable buffer. Prior to this change, an encoder/decoder model was taking ~25% of the total runtime copying KV cache back after permuting. With this change, the copy becomes significantly cheaper.

I benchmarked a simple model on S23 and Pixel 5:
```
class TestModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.zeros((2, 10, 1024, 1024)))

    def forward(self, x):
        self.buffer.add_(x)
        return self.buffer


model = TestModel()
inputs = (torch.randn(2, 10, 1024, 1024),)

lowered = to_edge_transform_and_lower(
    torch.export.export(model, inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```

S23, average of 50 runs, time in copy_:
4.1ms vs 22.3ms

Pixel 5, average of 50 runs, time in copy_:
12.1ms vs 66.6ms

This is approximately a 5.5x speedup of the copy operator.


Reviewed By: swolchok

Differential Revision: D73656456

Pulled By: GregoryComer

GregoryComer (Member, Author) commented:

Overriding the lint failure to land; the failure is due to a broken trunk.

facebook-github-bot merged commit 9ea9313 into pytorch:main on Apr 27, 2025
84 of 86 checks passed
Labels: CLA Signed, fb-exported, release notes: ops & kernels