Add direct copy fast path for portable copy op #10487
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10487. Note: links to docs will display an error until the docs builds have completed.
As of commit 60bb4ef with merge base 12079fe: ❌ 1 new failure.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D73656456
Force-pushed from 99c6237 to 552eb1e (compare).
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 552eb1e to fb4fe1d (compare).
Summary: This PR adds a direct memcpy fast path for the portable copy and copy_ ops. This speeds up copy significantly in cases where no broadcasting is needed. It is most noticeable when copying buffer mutations back, such as a transformer KV cache managed as a mutable buffer. Prior to this change, an encoder/decoder model was spending ~25% of its total runtime copying the KV cache back after permuting. With this change, the copy becomes significantly cheaper.

I benchmarked a simple model on S23 and Pixel 5:

```
class TestModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.zeros((2, 10, 1024, 1024)))

    def forward(self, x):
        self.buffer.add_(x)
        return self.buffer

model = TestModel()
inputs = (torch.randn(2, 10, 1024, 1024),)
lowered = to_edge_transform_and_lower(
    torch.export.export(model, inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
```

S23, average of 50 runs, time in copy_: 4.1 ms (with fast path) vs 22.3 ms (without)
Pixel 5, average of 50 runs, time in copy_: 12.1 ms (with fast path) vs 66.6 ms (without)

This is approximately a 5.5x speedup of the copy operator.

Reviewed By: swolchok

Differential Revision: D73656456

Pulled By: GregoryComer
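For context, the mechanism described above is a cheap eligibility check followed by one flat memcpy. A minimal self-contained sketch of the idea, using a stand-in tensor type; the actual portable kernel's types, checks, and signatures differ:

```
// Illustrative sketch only; not the actual ExecuTorch portable kernel.
// When src and out agree on shape (dtype and dim order are assumed to be
// validated by the op's argument checks), the copy degenerates to a single
// flat memcpy instead of a per-element broadcast loop.
#include <cstdint>
#include <cstring>
#include <vector>

struct FakeTensor {            // stand-in for the runtime tensor type
  std::vector<int64_t> sizes;  // shape
  std::vector<float> data;     // payload (float-only for this sketch)
};

void copy_(const FakeTensor& src, FakeTensor& out) {
  if (src.sizes == out.sizes) {
    // Fast path: no broadcasting needed and layouts are byte-identical.
    std::memcpy(out.data.data(), src.data.data(),
                src.data.size() * sizeof(float));
    return;
  }
  // Slow path (elided): broadcast src across out element by element.
}
```

The eligibility check is O(rank), so the fast path adds negligible overhead in the cases where broadcasting is actually required.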
This pull request was exported from Phabricator. Differential Revision: D73656456
Force-pushed from fb4fe1d to 60bb4ef (compare).
This pull request was exported from Phabricator. Differential Revision: D73656456
Overriding lint failure for land - broken trunk.
Summary:
This PR adds a direct memcpy fast path for the portable copy and copy_ ops, which speeds up copy significantly in cases where no broadcasting is needed. Note that the copy op always checks that dim order and dtype match, so the fast path should be sound in all cases where the shapes match (no broadcasting).
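For intuition about the cost being avoided: a broadcast-capable copy has to derive a source index for every output element. A rough sketch of that index math, assuming src's sizes are already right-padded to out's rank; this is not the portable kernel's actual indexing utility:

```
// Illustrative only: map a flat output index to the corresponding source
// index under broadcasting. Dimensions of size 1 in src repeat, so they
// contribute nothing to the source offset.
#include <cstdint>
#include <vector>

int64_t broadcast_index(int64_t out_idx,
                        const std::vector<int64_t>& out_sizes,
                        const std::vector<int64_t>& src_sizes,
                        const std::vector<int64_t>& src_strides) {
  int64_t src_idx = 0;
  for (int64_t d = static_cast<int64_t>(out_sizes.size()) - 1; d >= 0; --d) {
    const int64_t coord = out_idx % out_sizes[d];  // coordinate in dim d
    out_idx /= out_sizes[d];
    if (src_sizes[d] != 1) {
      src_idx += coord * src_strides[d];
    }
  }
  return src_idx;
}
```

Running a divide and a modulo per dimension for every element is what makes the general path so much slower than a single memcpy when the shapes already match.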
This is most noticeable when copying buffer mutations back, such as a transformer KV cache managed as a mutable buffer. Prior to this change, an encoder/decoder model was spending ~25% of its total runtime copying the KV cache back after permuting. With this change, the copy becomes significantly cheaper.
I benchmarked a simple model on S23 and Pixel 5 (benchmark code above):
S23, average of 50 runs, time in copy_: 4.1 ms (with fast path) vs 22.3 ms (without)
Pixel 5, average of 50 runs, time in copy_: 12.1 ms (with fast path) vs 66.6 ms (without)
This is approximately a 5.5x speedup of the copy operator (22.3 / 4.1 ≈ 5.4 on S23; 66.6 / 12.1 ≈ 5.5 on Pixel 5).
Differential Revision: D73656456