
[NeoML] Remove excess CUDA syncs in layers #1070


Open: favorart wants to merge 26 commits into master from golikovCudaSynks


favorart (Contributor) commented May 23, 2024

Please, merge before:


The idea behind eliminating the unnecessary CUDA synchronizations is that scalar constants can be passed to GPU computation kernels by value, directly from host memory.

One option would be to replace the arguments of math engine methods that represent scalar constants with plain float or int types. But whenever such an operation consumes the result of a previous operation (for example, you often need to multiply by the result of a scalar product), an additional synchronization would be needed to copy that result from device memory to host memory.

To avoid both kinds of synchronization, the wrapper class CScalarPararmeter<T> is used. It has two fields: a scalar constant and a handle to device memory. The value of a scalar parameter is stored in exactly one of these fields (never both), depending on which constructor was called.
CScalarPararmeter<T> is instantiated for two types: float and int.

All constructors of the CScalarPararmeter<T> wrapper are implicit, so it can be constructed both from a scalar constant in host memory and from a handle to device memory. This design avoids compilation errors when merging into the OCRT.
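
To make the wrapper idea concrete, here is a minimal sketch in plain C++. It is not the code from this PR: `DeviceHandleSketch`, `ScalarParameterSketch`, and `VectorMultiplySketch` are hypothetical names, and "device memory" is modelled by an ordinary host pointer so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a handle to device memory. In this sketch it
// just wraps a host pointer, so no GPU is required.
template <class T>
struct DeviceHandleSketch {
    const T* ptr;
};

// Sketch of a scalar parameter stored either on the host (by value) or on
// the device (by handle), never both. The name is an assumption.
template <class T>
class ScalarParameterSketch {
public:
    // Both constructors are deliberately implicit (no `explicit`), so a
    // call site can pass either a plain value or a handle.
    ScalarParameterSketch(T value) : hostValue_(value), onHost_(true) {}
    ScalarParameterSketch(DeviceHandleSketch<T> handle) : handle_(handle), onHost_(false) {}

    bool IsHostValue() const { return onHost_; }
    T HostValue() const { assert(onHost_); return hostValue_; }
    DeviceHandleSketch<T> Handle() const { assert(!onHost_); return handle_; }

private:
    T hostValue_{};                   // meaningful only when onHost_ is true
    DeviceHandleSketch<T> handle_{};  // meaningful only when onHost_ is false
    bool onHost_;
};

// Hypothetical math-engine-style routine with a single signature that
// accepts both kinds of scalar argument.
inline void VectorMultiplySketch(const float* src, float* dst, std::size_t size,
                                 ScalarParameterSketch<float> multiplier)
{
    // A real CUDA backend would pass the host value to the kernel by value,
    // or let the kernel read through the handle; here we read it directly.
    const float m = multiplier.IsHostValue() ? multiplier.HostValue()
                                             : *multiplier.Handle().ptr;
    for (std::size_t i = 0; i < size; ++i) {
        dst[i] = src[i] * m;
    }
}
```

With this shape, `VectorMultiplySketch(src, dst, n, 2.0f)` and `VectorMultiplySketch(src, dst, n, someHandle)` both compile against the same signature, which is the property the implicit constructors are meant to provide.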

To eliminate the unnecessary CUDA synchronizations in the OCRT module as well, you will have to manually pass scalar constants directly to the math engine methods in all places where they are used, instead of first constructing them through a CTypedHandleStackVar<T>. Passing them directly is also more natural and readable.
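
The call-site change can be illustrated with a small mock, again as a sketch under stated assumptions rather than the actual NeoML or OCRT code. All names here (`MockDeviceSketch`, `DeviceScalarSketch`, `ScalarArgSketch`, `ScaleSketch`) are hypothetical, and the mock simply counts each staging copy of a scalar into "device memory" as one blocking host-to-device transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mock device: "device memory" is a host vector, and every
// staging copy of a scalar is counted as one blocking transfer.
struct MockDeviceSketch {
    std::vector<float> memory;
    int blockingCopies = 0;

    // Models allocating a one-element device buffer and copying a host
    // value into it (the old way of preparing a scalar constant).
    std::size_t StoreScalar(float value) {
        memory.push_back(value);
        ++blockingCopies;
        return memory.size() - 1;
    }
};

// Hypothetical handle to a scalar that already lives in device memory.
struct DeviceScalarSketch {
    std::size_t index;
};

// Hypothetical scalar argument: either a host value or a device handle.
struct ScalarArgSketch {
    ScalarArgSketch(float value) : hostValue(value), device{0}, onHost(true) {}
    ScalarArgSketch(DeviceScalarSketch handle) : hostValue(0.0f), device(handle), onHost(false) {}

    float hostValue;
    DeviceScalarSketch device;
    bool onHost;
};

// Hypothetical math-engine-style method that scales a vector in place.
inline void ScaleSketch(const MockDeviceSketch& dev, std::vector<float>& data, ScalarArgSketch factor) {
    const float f = factor.onHost ? factor.hostValue : dev.memory[factor.device.index];
    for (float& x : data) {
        x *= f;
    }
}

// Old call style: stage the constant in device memory, then pass the handle.
inline int BlockingCopiesOldStyle() {
    MockDeviceSketch dev;
    std::vector<float> data = {1.0f, 2.0f};
    DeviceScalarSketch half{dev.StoreScalar(0.5f)};
    ScaleSketch(dev, data, half);
    return dev.blockingCopies;
}

// New call style: pass the constant directly, by value.
inline int BlockingCopiesNewStyle() {
    MockDeviceSketch dev;
    std::vector<float> data = {1.0f, 2.0f};
    ScaleSketch(dev, data, 0.5f);
    return dev.blockingCopies;
}
```

In this mock, the old style performs one staging copy per constant and the new style performs none. Results that are already on the device can still be passed by handle without copying them back to the host.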

favorart force-pushed the golikovCudaSynks branch from 348b510 to 5a0a420 May 27, 2024 18:24
favorart force-pushed the golikovCudaSynks branch from 5a0a420 to b4a4d87 June 6, 2024 18:32
favorart force-pushed the golikovCudaSynks branch 5 times, most recently from 10a5d73 to 45e7272 July 1, 2024 21:09
favorart requested a review from AndrewAndrianov July 1, 2024 21:11
favorart force-pushed the golikovCudaSynks branch from 45e7272 to 29e6080 July 1, 2024 21:13
favorart force-pushed the golikovCudaSynks branch 7 times, most recently from 0243fef to d6a59b3 July 5, 2024 15:10
favorart marked this pull request as ready for review July 5, 2024 15:12
favorart force-pushed the golikovCudaSynks branch 2 times, most recently from c034568 to a5b3b57 July 30, 2024 21:03
favorart force-pushed the golikovCudaSynks branch 4 times, most recently from 9003d6c to 8f9b250 August 15, 2024 08:47
favorart added the "performance" label (Changes of performance improvements only) Aug 16, 2024
favorart added a commit to favorart/neoml that referenced this pull request Aug 29, 2024
favorart added a commit to favorart/neoml that referenced this pull request Sep 2, 2024
favorart force-pushed the golikovCudaSynks branch 2 times, most recently from a095afb to ff8788f September 12, 2024 21:40
Labels
performance Changes of performance improvements only
2 participants