Add blocked convolution #768

Open
wants to merge 79 commits into base: master
79 commits
89574a9
Fix current progress
Aug 26, 2022
f1e664b
Fix a couple more bugs
Aug 26, 2022
75bccb1
Organize tests in a better way
Aug 26, 2022
05ac4d9
Fix Trivial/PaddingHeight test
Aug 26, 2022
c52427d
HELLL YEAHHHH It passes all the given test!!!
Aug 26, 2022
5f68ebe
Add tests from yolox
Aug 26, 2022
93550c2
Add simple performance check
Aug 29, 2022
affcfe1
Use COMPUTE_BLOCK macro
Aug 29, 2022
96b189c
Move ClearBlock to macro
Aug 29, 2022
497175b
Move ProcessFilterCountN to macro
Aug 29, 2022
40601bd
Move PROCESS_OUTPUT_COUNT_N to macro
Aug 29, 2022
4bc7630
Optimize PackData and UnpackData functions
Aug 31, 2022
5d43103
Fix convDesc leak
Aug 31, 2022
6b019b2
Refactor measurements
Aug 31, 2022
ecf9efc
Add temp code
Sep 1, 2022
21eab6e
DEBUG (expand ISimdMathEngine interface)
Sep 24, 2022
f76fbf0
DEBUG add AvxTestDesktop
Sep 24, 2022
8d3cda2
Test commit
Sep 25, 2022
3c4b03d
Add blocked convolution with macro to NeoMathEngineAvx
Sep 27, 2022
6ae35b5
Fix compilation warning in NeoMathEngine test
Sep 27, 2022
aa50778
Add blocked convolution test for NeoMathEngineAvx
Sep 27, 2022
6af3f0d
EXPERIMENTAL: start moving to JIT
Oct 9, 2022
9aac8e6
Use single generator for all of the broadcast values
Oct 17, 2022
d20c990
Remove unused code
Oct 17, 2022
69e8d29
Pass arguments to JIT via struct
Oct 18, 2022
0082a87
Move kernelHeight and kernelWidth loops to JIT
Oct 19, 2022
2299071
Move CLEAR_BLOCK macro to JIT
Oct 19, 2022
76a7519
Move postProcessing to JIT
Oct 19, 2022
b730e7d
Use loops during JIT
Oct 21, 2022
f56f27c
Even less lines
Oct 21, 2022
f303189
Manually assign registers
Oct 22, 2022
0983e8f
Use RSP instead of Param1
Oct 23, 2022
e973333
Shorten the macro (a little bit)
Oct 23, 2022
c7cdfae
Wrap generator into C++ class (will be useful in future)
Oct 23, 2022
610d32a
Move filterCount and outputCount to generation parameters
Oct 23, 2022
e920418
Move PROCESS_FILTER_COUNT_N to JIT
Oct 23, 2022
d92b687
Move all we need to JIT!!!
Oct 23, 2022
f7b74e0
Merge branch 'master' into BlockedConvExperiments
Oct 24, 2022
7497091
Disable NeoAvxTestDesktop for 32-bit and for FineObjects
Oct 24, 2022
0456104
Disable BlockedConv tests for FineObjects
Oct 24, 2022
74e15b2
Use blocked conv when possible
Oct 24, 2022
874ecfc
DEBUG: temporary disable blocked conv
Oct 25, 2022
f5be1d8
Extend test set
Oct 25, 2022
f97f47f
Enable blocked convolution when heuristics are OK
Oct 25, 2022
fc7df2e
Add more statistics to BlockedConv Real tests
Oct 26, 2022
93a5ac6
Overflow protection
Oct 26, 2022
98dfc10
Reduce the number of ZEROUPPER calls
Oct 26, 2022
b5b0c3a
Better tuning
Oct 26, 2022
1726218
Reduce memory consumption
Oct 26, 2022
85c40b5
Add even more tests to NeoAvxTestDesktop
Oct 26, 2022
067e1c8
Start cleaning up this mess...
Oct 28, 2022
011332d
Use calcOutputPad for width
Oct 28, 2022
64bcd1f
Don't depend on the order of fields in CBlockedConvGen::CParams
Oct 28, 2022
1d3d647
More parameter renamings and optimizations
Oct 28, 2022
247521c
Fix comment
Oct 28, 2022
a8d9b79
Use more clear indexes for Ymm accumulators
Oct 28, 2022
9621ba2
Clarify bias Ymms
Oct 28, 2022
9fa5a15
Clarify a bit more
Oct 29, 2022
572e277
Improve naming
Oct 29, 2022
3963962
Optimize passing arguments runConv
Oct 29, 2022
cf1917d
Add multithreading
Oct 29, 2022
2385013
Remove fixed TODO comment
Oct 29, 2022
008a9fa
Turn on all the heuristics
Oct 31, 2022
f75dffb
Use NeoML terms
Oct 31, 2022
7090124
Reorder fields
Oct 31, 2022
6c5700f
More renamings
Oct 31, 2022
63fe462
Add more comments
Oct 31, 2022
9a9272e
Fix some naming and add some comments
Nov 14, 2022
d0b54ca
Minor optimization and even more comments
Nov 14, 2022
857d06c
Fix test skip condition
Nov 14, 2022
ede4132
Fix non-MSVC compilation
Nov 14, 2022
25fc832
Reduce diff size
Nov 14, 2022
af96c8b
Refactor tests
Nov 15, 2022
591a047
Add more heuristics for blocked convolution
Nov 15, 2022
75aa7ac
Grammar fix
Nov 15, 2022
ab0638c
Disable AvxTestDesktop everywhere but Windows
Nov 15, 2022
f48de7c
Merge branch 'master' into BlockedConvExperiments
Dec 10, 2022
69dc1f1
Fix compilation
Dec 10, 2022
1db000d
Disable blocked conv for AVX512
Dec 19, 2022
3 changes: 3 additions & 0 deletions NeoMathEngine/CMakeLists.txt
@@ -47,4 +47,7 @@ add_subdirectory(src)
if(NeoMathEngine_BUILD_TESTS AND NOT IOS AND NOT ANDROID)
enable_testing()
add_subdirectory(test/FullTestDesktop)
if(WIN32 AND NOT USE_FINE_OBJECTS AND CMAKE_SIZEOF_VOID_P EQUAL 8)
add_subdirectory(test/AvxTestDesktop)
endif()
endif()
9 changes: 9 additions & 0 deletions NeoMathEngine/include/NeoMathEngine/SimdMathEngine.h
@@ -41,6 +41,15 @@ class ISimdMathEngine : public CCrtAllocatedObject {
virtual void BlobConvolution( const CConvolutionDesc& convDesc, const float* source,
const float* filter, const float* freeTerm, float* result ) const = 0;

virtual CConvolutionDesc* InitBlockedConvolution( const CBlobDesc& source, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth, const CBlobDesc& filter,
const CBlobDesc& result ) const = 0;
virtual void PackBlockedData(const CBlobDesc& desc, const float* source, float* result) const = 0;
virtual void UnpackBlockedData( const CBlobDesc& desc, const float* source, float* result ) const = 0;
virtual void PackBlockedFilter( const CBlobDesc& desc, const float* source, float* result ) const = 0;
virtual void BlockedConvolution( const CConvolutionDesc& convDesc, const float* packedSource,
const float* packedFilter, const float* freeTerm, float* packedResult ) const = 0;

virtual SgemmFunc GetSgemmFunction() const = 0;

virtual void Tanh( float* dst, const float* src, size_t dataSize, bool isMultithread = true ) = 0;
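The interface additions above suggest a channel-blocked memory layout, where channels are grouped into fixed-size blocks sized to one AVX ymm register (8 floats). As a rough illustration of what `PackBlockedData` might do — the actual block size and layout are internal to the AVX engine and assumed here, and the function below is a hypothetical stand-in, not NeoML's implementation — a repacking from an HWC tensor into `[ceil(C/8)][H][W][8]` blocks looks like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: repack an HWC tensor into a channel-blocked layout
// [ceil(C/8)][H][W][8], so that the 8 channels of one block load into a
// single AVX ymm register. Tail channels are zero-padded.
constexpr int kBlock = 8;

std::vector<float> PackBlocked( const float* src, int h, int w, int c )
{
	const int blocks = ( c + kBlock - 1 ) / kBlock;
	std::vector<float> dst( static_cast<size_t>( blocks ) * h * w * kBlock, 0.f );
	for( int y = 0; y < h; ++y ) {
		for( int x = 0; x < w; ++x ) {
			for( int ch = 0; ch < c; ++ch ) {
				const int b = ch / kBlock; // channel block index
				const int r = ch % kBlock; // position inside the block
				dst[( ( static_cast<size_t>( b ) * h + y ) * w + x ) * kBlock + r] =
					src[( static_cast<size_t>( y ) * w + x ) * c + ch];
			}
		}
	}
	return dst;
}
```

`UnpackBlockedData` would be the inverse transform, and `PackBlockedFilter` an analogous repacking of filter weights.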
36 changes: 30 additions & 6 deletions NeoMathEngine/src/CPU/CpuMathEngineDnnConv.cpp
@@ -45,13 +45,16 @@ struct CCpuConvolutionDesc : public CCommonConvolutionDesc {
TConvAlgo ForwardAlgo;
TConvAlgo BackwardAlgo;
std::unique_ptr<CConvolutionDesc> SimdConvolutionDesc;
std::unique_ptr<CConvolutionDesc> BlockedConvolutionDesc;

CCpuConvolutionDesc( std::unique_ptr<CConvolutionDesc>& simdConvolutionDesc, const CBlobDesc& source, const CBlobDesc& result, const CBlobDesc& filter,
int paddingHeight, int paddingWidth, int strideHeight, int strideWidth, int dilationHeight, int dilationWidth ) :
CCpuConvolutionDesc( std::unique_ptr<CConvolutionDesc>& simdConvolutionDesc, std::unique_ptr<CConvolutionDesc>& blockedConvolutionDesc,
const CBlobDesc& source, const CBlobDesc& result, const CBlobDesc& filter, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth ) :
CCommonConvolutionDesc( source, result, filter, paddingHeight, paddingWidth, strideHeight, strideWidth, dilationHeight, dilationWidth ),
ForwardAlgo( getActualForwardAlgo() ),
BackwardAlgo( getActualBackwardAlgo() ),
SimdConvolutionDesc( std::move( simdConvolutionDesc ) )
SimdConvolutionDesc( std::move( simdConvolutionDesc ) ),
BlockedConvolutionDesc( std::move( blockedConvolutionDesc ) )
{
}

@@ -131,14 +134,20 @@ CConvolutionDesc* CCpuMathEngine::InitBlobConvolution( const CBlobDesc& source,
ASSERT_EXPR( result.Channels() == filter.BatchWidth() );
ASSERT_EXPR( result.Depth() == 1 );

std::unique_ptr<CConvolutionDesc> simdConvolutionDesc;
std::unique_ptr<CConvolutionDesc> blockedConvolutionDesc;
if( simdMathEngine != nullptr ) {
blockedConvolutionDesc.reset( simdMathEngine->InitBlockedConvolution( source, paddingHeight, paddingWidth,
strideHeight, strideWidth, dilationHeight, dilationWidth, filter, result ) );
}

std::unique_ptr<CConvolutionDesc> simdConvolutionDesc;
if( simdMathEngine != nullptr && blockedConvolutionDesc == nullptr ) {
simdConvolutionDesc = std::unique_ptr<CConvolutionDesc>( simdMathEngine->InitBlobConvolution( source, paddingHeight, paddingWidth,
strideHeight, strideWidth, dilationHeight, dilationWidth, filter, result ) );
}

CCpuConvolutionDesc* desc = new CCpuConvolutionDesc( simdConvolutionDesc, source, result, filter,
paddingHeight, paddingWidth, strideHeight, strideWidth, dilationHeight, dilationWidth );
CCpuConvolutionDesc* desc = new CCpuConvolutionDesc( simdConvolutionDesc, blockedConvolutionDesc, source, result,
filter, paddingHeight, paddingWidth, strideHeight, strideWidth, dilationHeight, dilationWidth );
return desc;
}

@@ -517,6 +526,21 @@ void CCpuMathEngine::BlobConvolution( const CConvolutionDesc& convDesc, const CC

const CCpuConvolutionDesc& desc = static_cast<const CCpuConvolutionDesc&>( convDesc );

if( desc.BlockedConvolutionDesc != nullptr ) {
CFloatHandleStackVar packBuff( *this,
std::max<int>( desc.Source.BlobSize(), desc.Result.BlobSize() ) + desc.Filter.BlobSize() );
float* packedFilter = GetRaw( packBuff.GetHandle() );
float* packedIO = packedFilter + desc.Filter.BlobSize();
float* rawResult = GetRaw( result );
simdMathEngine->PackBlockedData( desc.Source, GetRaw( source ), packedIO );
simdMathEngine->PackBlockedFilter( desc.Filter, GetRaw( filter ), packedFilter );
simdMathEngine->BlockedConvolution( *desc.BlockedConvolutionDesc, packedIO, packedFilter,
freeTerm != nullptr ? GetRaw( *freeTerm ) : nullptr, rawResult );
simdMathEngine->UnpackBlockedData( desc.Result, rawResult, packedIO );
dataCopy( rawResult, packedIO, desc.Result.BlobSize() );
return;
}

if( desc.SimdConvolutionDesc != nullptr ) {
simdMathEngine->BlobConvolution( *desc.SimdConvolutionDesc, sourceRaw, filterRaw, freeTermRaw, resultRaw );
return;
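The blocked branch above allocates one stack buffer holding the packed filter followed by a shared area that is reused first for the packed input and then for the packed output — hence `std::max` over the source and result blob sizes. A small sketch of that sizing arithmetic (struct and names hypothetical, for illustration only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Sketch of the single-buffer carve-up used by the blocked-convolution
// branch: [ packed filter | shared packed input/output area ].
struct CPackBufferLayout {
	size_t Total;        // total elements to allocate
	size_t FilterOffset; // packed filter starts here (always 0)
	size_t IOOffset;     // shared packed input/output area starts here
};

CPackBufferLayout PlanPackBuffer( size_t sourceSize, size_t filterSize, size_t resultSize )
{
	CPackBufferLayout layout;
	layout.FilterOffset = 0;
	layout.IOOffset = filterSize;
	// The I/O area must fit whichever of the packed input and output is larger.
	layout.Total = filterSize + std::max( sourceSize, resultSize );
	return layout;
}
```

This is why the branch can unpack the result into `packedIO` and then copy it back over `rawResult`: once the convolution has run, the packed input is no longer needed and its area is free to hold the unpacked output.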
2 changes: 2 additions & 0 deletions NeoMathEngine/src/CPU/x86/avx/CMakeLists.txt
@@ -25,8 +25,10 @@ target_sources(${PROJECT_NAME}
./src/BlobConvolution_jit_FltCnt_18.inl
./src/BlobConvolution_jit_FltCnt_24.inl
./src/BlobConvolution_jit_FltCnt_32.inl
./src/BlobBlockedConvolution.cpp
./src/PrimitivesJit.h
./src/AvxCommon.h
./src/AvxMathEngine.h
./src/JitCommon.h
./src/MatrixMultiplyingInterleaved/Interleavers/Interleavers.h
./src/MatrixMultiplyingInterleaved/MicroKernels/Kernel_AVX_6x16.h
27 changes: 1 addition & 26 deletions NeoMathEngine/src/CPU/x86/avx/src/AvxMathEngine.cpp
@@ -19,6 +19,7 @@ limitations under the License.
#include <NeoMathEngine/SimdMathEngine.h>
#include <BlobConvolution.h>
#include <PrimitivesJit.h>
#include <AvxMathEngine.h>
#include <CPUInfo.h>

namespace NeoML {
@@ -48,32 +49,6 @@ CAvxConvolutionDesc::CAvxConvolutionDesc( IMathEngine* mathEngine, const CBlobDe
{
}

class CAvxMathEngine : public ISimdMathEngine {
public:
CAvxMathEngine( IMathEngine* _mathEngine, int _threadCount ) :
mathEngine( _mathEngine ), threadCount( _threadCount ), primitives( _mathEngine, _threadCount ) {}

CConvolutionDesc* InitBlobConvolution( const CBlobDesc& source, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth, const CBlobDesc& filter,
const CBlobDesc& result ) const override;

void BlobConvolution( const CConvolutionDesc& convDesc, const float* source,
const float* filter, const float* freeTerm, float* result ) const override;

SgemmFunc GetSgemmFunction() const override;

void Tanh( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void Sigmoid( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void Exp( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void RunOnceRestOfLstm( CMathEngineLstmDesc* desc, const CConstFloatHandle& inputStateBackLink,
const CFloatHandle& outputStateBackLink, const CFloatHandle& outputMainBackLink, bool isMultithread ) override;

private:
IMathEngine* mathEngine;
int threadCount;
CPrimitivesJit primitives;
};

CConvolutionDesc* CAvxMathEngine::InitBlobConvolution( const CBlobDesc& source, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth, const CBlobDesc& filter,
const CBlobDesc& result ) const
59 changes: 59 additions & 0 deletions NeoMathEngine/src/CPU/x86/avx/src/AvxMathEngine.h
@@ -0,0 +1,59 @@
/* Copyright © 2017-2022 ABBYY Production LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------------------------------------*/

#pragma once

#include <NeoMathEngine/SimdMathEngine.h>
#include <PrimitivesJit.h>

namespace NeoML {

class CAvxMathEngine : public ISimdMathEngine {
public:
CAvxMathEngine( IMathEngine* _mathEngine, int _threadCount ) :
mathEngine( _mathEngine ), threadCount( _threadCount ), primitives( _mathEngine, _threadCount ) {}

CConvolutionDesc* InitBlobConvolution( const CBlobDesc& source, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth, const CBlobDesc& filter,
const CBlobDesc& result ) const override;

void BlobConvolution( const CConvolutionDesc& convDesc, const float* source,
const float* filter, const float* freeTerm, float* result ) const override;

virtual CConvolutionDesc* InitBlockedConvolution( const CBlobDesc& source, int paddingHeight, int paddingWidth,
int strideHeight, int strideWidth, int dilationHeight, int dilationWidth, const CBlobDesc& filter,
const CBlobDesc& result ) const override;
void PackBlockedData( const CBlobDesc& desc, const float* source, float* result ) const override;
void UnpackBlockedData( const CBlobDesc& desc, const float* source, float* result ) const override;
void PackBlockedFilter( const CBlobDesc& desc, const float* source, float* result ) const override;
void BlockedConvolution( const CConvolutionDesc& convDesc, const float* packedSource,
const float* packedFilter, const float* freeTerm, float* packedResult ) const override;

SgemmFunc GetSgemmFunction() const override;

void Tanh( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void Sigmoid( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void Exp( float* dst, const float* src, size_t dataSize, bool isMultithread ) override;
void RunOnceRestOfLstm( CMathEngineLstmDesc* desc, const CConstFloatHandle& inputStateBackLink,
const CFloatHandle& outputStateBackLink, const CFloatHandle& outputMainBackLink, bool isMultithread ) override;

private:
IMathEngine* mathEngine;
int threadCount;
CPrimitivesJit primitives;
};

}