[NeoML] Remove excess CUDA syncs in layers #1070

favorart · 2024-05-23T20:16:59Z

Please, merge before:

The idea behind eliminating unnecessary synchronizations for CUDA is that scalar constants can be passed to GPU computation kernels from host memory by value.

It would be possible to replace the math engine method arguments that carry scalar constants with plain float or int types. But then, whenever such an operation uses the result of a previous operation (for example, you often need to multiply by the result of a scalar product), an additional synchronization would be needed to copy that result from device memory to host memory.
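The trade-off described above can be illustrated with a small CPU-only sketch. Every name in it (`DeviceFloatSketch`, `DotSketch`, `MultiplyByValueSketch`, `ScaleByDotSketch`, the counter) is hypothetical and is not part of the NeoML API; the "device" is simulated with ordinary memory, and the counter only marks where a real GPU backend would presumably have to wait for the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counter: how many times a value was read back from the
// simulated device to the host (the step that would force a sync on a GPU).
static int g_deviceToHostReads = 0;

// Hypothetical stand-in for a single float living in device memory.
struct DeviceFloatSketch {
    float storage = 0.0f;
    float ReadToHost() const { ++g_deviceToHostReads; return storage; }
};

// A previous operation that leaves its scalar result in device memory.
void DotSketch(const float* a, const float* b, std::size_t n, DeviceFloatSketch& result)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    result.storage = sum;
}

// An operation whose scalar argument is a plain host value.
void MultiplyByValueSketch(const float* in, float* out, std::size_t n, float mult)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * mult;
    }
}

// Chaining the two: the by-value argument forces one read-back per call.
void ScaleByDotSketch(const float* a, const float* b, float* out, std::size_t n)
{
    DeviceFloatSketch dot;
    DotSketch(a, b, n, dot);
    MultiplyByValueSketch(a, out, n, dot.ReadToHost());
}
```

With float-only arguments, every chained call of this kind pays for one read-back, which is the cost the description wants to avoid.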

To avoid both kinds of synchronization, the wrapper class CScalarParameter<T> is used. It has two fields: a scalar constant and a handle (pointer) to device memory. The value of a scalar parameter is stored in exactly one of these fields, never in both, depending on which constructor was called.
CScalarParameter<T> is instantiated for two types: float and int.

All constructors of the CScalarParameter<T> wrapper are implicit, so it can be constructed directly either from a scalar constant in host memory or from a handle to device memory. This design avoids compilation errors when merging into the OCRT.
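A minimal sketch of such a wrapper is given below. It is not the code from this PR: `DeviceHandleSketch`, `ScalarParameterSketch`, `VectorMultiplySketch` and all member names are hypothetical, and the "device" pointer is an ordinary host pointer so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a typed handle to device memory.
template <class T>
struct DeviceHandleSketch {
    explicit DeviceHandleSketch(const T* p = nullptr) : ptr(p) {}
    bool IsNull() const { return ptr == nullptr; }
    const T* ptr;
};

// Hypothetical wrapper: holds either a host value or a device handle.
template <class T>
class ScalarParameterSketch {
public:
    // Both constructors are intentionally implicit, so call sites can pass
    // a plain value or a handle without naming the wrapper.
    ScalarParameterSketch(T value) : value_(value) {}
    ScalarParameterSketch(DeviceHandleSketch<T> handle) : handle_(handle) {}

    bool IsOnHost() const { return handle_.IsNull(); }
    T HostValue() const { assert(IsOnHost()); return value_; }
    DeviceHandleSketch<T> Handle() const { assert(!IsOnHost()); return handle_; }

private:
    T value_{};
    DeviceHandleSketch<T> handle_;
};

// Mirrors the two instantiations mentioned in the description.
template class ScalarParameterSketch<float>;
template class ScalarParameterSketch<int>;

// Hypothetical engine method taking the wrapper. A GPU backend would pass
// the host value as a kernel argument, or hand the pointer to the kernel;
// here both branches are plain CPU code.
void VectorMultiplySketch(const float* in, float* out, std::size_t n,
    ScalarParameterSketch<float> mult)
{
    const float m = mult.IsOnHost() ? mult.HostValue() : *mult.Handle().ptr;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * m;
    }
}
```

A caller could then write `VectorMultiplySketch(in, out, n, 2.0f)` for a host constant, or pass a `DeviceHandleSketch<float>` when the multiplier was produced by an earlier operation; neither form requires copying the scalar between host and device at the call site.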

To eliminate the unnecessary CUDA synchronizations in the OCRT module as well, scalar constants that were previously wrapped in a CTypedHandleStackVar<T> will have to be passed manually, directly to the math engine method, at every place where they are used. This is also more natural and readable.
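The call-site change can be sketched as follows. This is an illustration only: `FloatStackVarSketch`, `AddValueOldSketch`, `AddValueNewSketch` and the counter are hypothetical names, the device is simulated on the CPU, and the counter marks the host-to-device copy that the old pattern performs for every constant.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counter of simulated host-to-device copies.
static int g_hostToDeviceCopies = 0;

// Hypothetical stand-in for a stack-allocated device variable.
struct FloatStackVarSketch {
    float storage = 0.0f;
    void SetValue(float v) { storage = v; ++g_hostToDeviceCopies; }
    const float* GetHandle() const { return &storage; }
};

// Old-style method: the scalar arrives through a device pointer.
void AddValueOldSketch(const float* in, float* out, std::size_t n, const float* value)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] + *value;
    }
}

// New-style method: the scalar arrives by value from host memory.
void AddValueNewSketch(const float* in, float* out, std::size_t n, float value)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] + value;
    }
}

// Before: the constant is first copied into a device-side stack variable.
void CallSiteBeforeSketch(const float* in, float* out, std::size_t n)
{
    FloatStackVarSketch bias;
    bias.SetValue(1.5f);
    AddValueOldSketch(in, out, n, bias.GetHandle());
}

// After: the constant is passed directly to the method.
void CallSiteAfterSketch(const float* in, float* out, std::size_t n)
{
    AddValueNewSketch(in, out, n, 1.5f);
}
```

Both call sites produce the same output; only the first one goes through the extra copy into device memory.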

Signed-off-by: Kirill Golikov <[email protected]>

favorart force-pushed the golikovCudaSynks branch from 348b510 to 5a0a420 on May 27, 2024 18:24

favorart force-pushed the golikovCudaSynks branch from 5a0a420 to b4a4d87 on June 6, 2024 18:32

favorart force-pushed the golikovCudaSynks branch 5 times, most recently from 10a5d73 to 45e7272 on July 1, 2024 21:09

favorart requested a review from AndrewAndrianov July 1, 2024 21:11

favorart force-pushed the golikovCudaSynks branch from 45e7272 to 29e6080 on July 1, 2024 21:13

favorart mentioned this pull request Jul 3, 2024

[NeoML] Optimize CUDA syncs in CDnnSolver #1047

Closed

favorart force-pushed the golikovCudaSynks branch 7 times, most recently from 0243fef to d6a59b3 on July 5, 2024 15:10

favorart marked this pull request as ready for review July 5, 2024 15:12

favorart force-pushed the golikovCudaSynks branch 2 times, most recently from c034568 to a5b3b57 on July 30, 2024 21:03

favorart force-pushed the golikovCudaSynks branch 4 times, most recently from 9003d6c to 8f9b250 on August 15, 2024 08:47

favorart force-pushed the golikovCudaSynks branch from 8f9b250 to e0e255b on August 15, 2024 11:19

favorart added the performance label (changes of performance improvements only) Aug 16, 2024

favorart force-pushed the golikovCudaSynks branch from e0e255b to 8eb3272 on August 29, 2024 07:53

favorart added a commit to favorart/neoml that referenced this pull request Aug 29, 2024

[NeoML] Remove excess CUDA syncs in layers (neoml-lib#1070)

264eb42

Signed-off-by: Kirill Golikov <[email protected]>

favorart force-pushed the golikovCudaSynks branch from 8eb3272 to fc60504 on August 30, 2024 18:54

favorart added a commit to favorart/neoml that referenced this pull request Sep 2, 2024

[NeoML] Remove excess CUDA syncs in layers (neoml-lib#1070)

5e9b01e

Signed-off-by: Kirill Golikov <[email protected]>

favorart force-pushed the golikovCudaSynks branch 2 times, most recently from a095afb to ff8788f on September 12, 2024 21:40

favorart added 12 commits September 13, 2024 12:31

[NeoML] Layers mem-optimize

6759eb9

Signed-off-by: Kirill Golikov <[email protected]>

[NeoMathEngine] Vector operations with float and int arguments

1a3c3d8

Signed-off-by: Kirill Golikov <[email protected]>

[VulkanMathEngine] const CMemoryHandle arrays

b673324

Signed-off-by: Kirill Golikov <[email protected]>

[VulkanMathEngine] Unite CFloatHandleStackVar

65043aa

Signed-off-by: Kirill Golikov <[email protected]>

[VulkanMathEngine] Get handles for stack vars

7558ae0

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: RowwiseCh, MobileNetV2, MobileNetV3

7eade33

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: CPrecisionRecallLayer

6ed2903

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: FocalLossLayer

29de7f9

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: BinaryFocalLossLayer

cf969a5

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: CrossEntropyLossLayer

3e3b26f

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: BinaryCrossEntropyLayer

4383822

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: CenterLossLayer

40f5d9b

Signed-off-by: Kirill Golikov <[email protected]>

favorart force-pushed the golikovCudaSynks branch from ff8788f to 41e9ca4 on September 13, 2024 10:56

favorart added 14 commits September 13, 2024 15:33

[NeoML] remove excess CUDA syncs: CCtcLossLayer

bbe319e

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: CLossLayer

bb8c1cd

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: AutoDiffFunctions

62aba8c

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: LoraFullyConnectedLayer

fbf13c4

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: MultichannelLookupLayer

a87f094

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: ActivationLayers

3d02c0f

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: BatchNormalizationLayer

dbce51e

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] Express old vector operations with operations of new arguments

5811de4

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: other layers

8b5596b

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] remove excess CUDA syncs: DnnSolver

1c9828a

Signed-off-by: Kirill Golikov <[email protected]>

[NeoML] CUDA sync in DnnSolver::clipGradients

bb78241

Signed-off-by: Kirill Golikov <[email protected]>

[NeoMathEngine] CPU arm64 fix compilation

847fbc7

Signed-off-by: Kirill Golikov <[email protected]>

[CudaMathEngine] CUBLAS_POINTER_MODE_DEVICE allows device pointers only

06e4d2e

Signed-off-by: Kirill Golikov <[email protected]>

[MetalMathEngine] Add CScalarParameter

c2fd9cc

Signed-off-by: Kirill Golikov <[email protected]>

favorart force-pushed the golikovCudaSynks branch from 41e9ca4 to c2fd9cc on September 13, 2024 13:34
