
Commit 07e71a1

Added image filtering to documentation.
1 parent 134ee6f commit 07e71a1

8 files changed: +41 −6 lines changed

benchmark/looptests.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -244,7 +244,7 @@ function randomaccess(P, basis, coeffs::Vector{T}) where {T}
         end
         p += pc
     end
-    return p
+    return p
 end
 function randomaccessavx(P, basis, coeffs::Vector{T}) where {T}
     C = length(coeffs)
```

docs/make.jl

Lines changed: 2 additions & 1 deletion

```diff
@@ -10,7 +10,8 @@ makedocs(;
             "examples/matrix_multiplication.md",
             "examples/matrix_vector_ops.md",
             "examples/dot_product.md",
-            "examples/sum_of_squared_error.md"
+            "examples/sum_of_squared_error.md",
+            "examples/filtering.md"
         ],
         "Vectorized Convenience Functions" => "vectorized_convenience_functions.md",
         "Future Work" => "future_work.md",
```

docs/src/assets/bench_filter2d_3x3_v1.svg

Lines changed: 1 addition & 1 deletion

docs/src/examples/filtering.md

Lines changed: 32 additions & 0 deletions (new file)

# Image Filtering

Here, we convolve a small matrix `kern` with a larger matrix `A`, storing the results in `out`:

```julia
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    rng1k, rng2k = axes(kern)
    rng1, rng2 = axes(out)
    @avx for j in rng2, i in rng1
        tmp = zero(eltype(out))
        for jk in rng2k, ik in rng1k
            tmp += A[i+ik,j+jk]*kern[ik,jk]
        end
        out[i,j] = tmp
    end
    out
end
```
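
The indexing `A[i+ik,j+jk]` stays in bounds only when the axes of `kern`, `out`, and `A` line up. The setup code isn't shown in this commit; one way to construct compatible inputs, a sketch using `OffsetArrays` with illustrative sizes, is:

```julia
using LoopVectorization, OffsetArrays

A    = rand(100, 100);
kern = OffsetArray(rand(3, 3), -1:1, -1:1);          # kernel axes are -1:1, so i+ik spans i-1:i+1
out  = OffsetArray(similar(A, size(A) .- 2), 1, 1);  # out axes are 2:99, keeping A[i+ik,j+jk] in bounds

filter2davx!(out, A, kern)
```
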
The function contains four nested loops. For all the benchmarks, `kern` was only 3 by 3, making it too small for vectorizing the inner (kernel) loops to be particularly profitable. By vectorizing the `i` loop instead, the code can benefit from SIMD and also avoid a reduction (horizontal addition) of a vector before storing into `out`, since the vectors can then be stored directly.
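
For comparison, the autovectorizing compilers in these benchmarks start from a plain loop nest, essentially the same code without the macro. A minimal sketch follows; the name `filter2d!` and the `@inbounds @fastmath` annotations are assumptions, not part of the commit:

```julia
# Scalar reference with the same axis assumptions as filter2davx! above.
# Compiler heuristics tend to vectorize the short inner ik loop of this version.
function filter2d!(out::AbstractMatrix, A::AbstractMatrix, kern)
    rng1k, rng2k = axes(kern)
    rng1, rng2 = axes(out)
    @inbounds @fastmath for j in rng2, i in rng1
        tmp = zero(eltype(out))
        for jk in rng2k, ik in rng1k
            tmp += A[i+ik,j+jk]*kern[ik,jk]
        end
        out[i,j] = tmp
    end
    out
end
```
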
![dynamicfilter](../assets/bench_filter2d_dynamic_v1.svg)

LoopVectorization achieved much better performance than all the alternatives, which tended to prefer vectorizing the inner loops.

By making the compilers aware that the `ik` loop is too short to be worth vectorizing, we can get them to vectorize something else instead. Defining the size of `kern` as a compile-time constant in C and Fortran, and using size parameters in Julia, informs the compilers:

![staticsizefilter](../assets/bench_filter2d_3x3_v1.svg)
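
The statically sized Julia variants themselves aren't included in this commit. As a rough illustration of the idea, and not the same mechanism as the size-parameter types just mentioned, hard-coding the kernel's ranges gives the compiler fixed trip counts:

```julia
# Sketch only: the 3×3 extent is hard-coded, and kern is assumed to have
# axes -1:1 × -1:1 as in the OffsetArrays setup above.
function filter2d_3x3!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @inbounds @fastmath for j in axes(out, 2), i in axes(out, 1)
        tmp = zero(eltype(out))
        for jk in -1:1, ik in -1:1   # fixed, known trip counts
            tmp += A[i+ik,j+jk]*kern[ik,jk]
        end
        out[i,j] = tmp
    end
    out
end
```
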

Now all are doing much better than they were before, although still well shy of the 131.2 GFLOPS theoretical limit for the host CPU cores. While they all improved, three lag behind the main group:

- `ifort` lags behind all the others except base Julia. I'd need to do more investigating to find out why.
- Providing static size information was enough for all the compilers to realize that vectorizing the inner loops was not worth it. However, all but base Julia then vectorized a different loop instead, while the base Julia version I tested simply didn't vectorize at all.
- LoopVectorization currently only unrolls up to 2 loops. To get optimal performance on this problem, if you know the size of the inner loops, you should completely unroll them and then also partially unroll the outer loops. I'll have to lift that restriction ([tracking issue](https://github.com/chriselrod/LoopVectorization.jl/issues/73)), and also make it aware that unrolling the outer loops is cheap, thanks to the ability to reuse neighboring `A` entries.

Trying to provide hints by manually unrolling produces:

![unrolledfilter](../assets/bench_filter2d_unrolled_v1.svg)

This manual unrolling helped both Julia versions, while there was no change in any of the others.
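
The manually unrolled variants themselves (such as the `avx2dunrolled3x3!` exercised in `test/offsetarrays.jl` below) aren't reproduced in this documentation page; a sketch of what fully unrolling the 3×3 kernel can look like, under the same axis assumptions as above and with a hypothetical name, is:

```julia
# Illustrative only: kern is assumed to have axes -1:1 × -1:1.
function filter2dunrolled!(out::AbstractMatrix, A::AbstractMatrix, kern)
    # Hoist the nine kernel coefficients into locals.
    kmm, kzm, kpm = kern[-1,-1], kern[0,-1], kern[1,-1]
    kmz, kzz, kpz = kern[-1, 0], kern[0, 0], kern[1, 0]
    kmp, kzp, kpp = kern[-1, 1], kern[0, 1], kern[1, 1]
    @inbounds @fastmath for j in axes(out, 2), i in axes(out, 1)
        tmp  = A[i-1,j-1]*kmm + A[i,j-1]*kzm + A[i+1,j-1]*kpm
        tmp += A[i-1,j  ]*kmz + A[i,j  ]*kzz + A[i+1,j  ]*kpz
        tmp += A[i-1,j+1]*kmp + A[i,j+1]*kzp + A[i+1,j+1]*kpp
        out[i,j] = tmp
    end
    out
end
```
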

docs/src/examples/matrix_multiplication.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -5,7 +5,7 @@ One of the friendliest problems for vectorization is matrix multiplication. Give
 LoopVectorization currently doesn't do any memory-modeling or memory-based optimizations, so it will still run into problems as the size of matrices increases. But at smaller sizes, it's capable of achieving a healthy percent of potential GFLOPS.
 We can write a single function:
 ```julia
-@inline function A_mul_B!(𝐂, 𝐀, 𝐁)
+function A_mul_B!(𝐂, 𝐀, 𝐁)
     @avx for m ∈ 1:size(𝐀,1), n ∈ 1:size(𝐁,2)
         𝐂ₘₙ = zero(eltype(𝐂))
         for k ∈ 1:size(𝐀,2)
````

docs/src/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -11,6 +11,7 @@ Pages = [
     "examples/matrix_multiplication.md",
     "examples/matrix_vector_ops.md",
     "examples/dot_product.md",
+    "examples/filtering.md",
     "examples/sum_of_squared_error.md",
     "vectorized_convenience_functions.md",
     "future_work.md",
```

src/LoopVectorization.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,7 +7,7 @@ using VectorizationBase: REGISTER_SIZE, REGISTER_COUNT, extract_data, num_vector
     Static, StaticUnitRange, StaticLowerUnitRange, StaticUpperUnitRange, unwrap, maybestaticrange,
     PackedStridedPointer, SparseStridedPointer, RowMajorStridedPointer, StaticStridedPointer, StaticStridedStruct,
     maybestaticfirst, maybestaticlast
-using SIMDPirates: VECTOR_SYMBOLS, evadd, evmul, vrange, reduced_add, reduced_prod, reduce_to_add, reduce_to_prod,
+using SIMDPirates: VECTOR_SYMBOLS, evadd, evsub, evmul, evfdiv, vrange, reduced_add, reduced_prod, reduce_to_add, reduce_to_prod,
     sizeequivalentfloat, sizeequivalentint, vadd!, vsub!, vmul!, vfdiv!, vfmadd!, vfnmadd!, vfmsub!, vfnmsub!,
     vfmadd231, vfmsub231, vfnmadd231, vfnmsub231, sizeequivalentfloat, sizeequivalentint, #prefetch,
     vmullog2, vmullog10, vdivlog2, vdivlog10, vmullog2add!, vmullog10add!, vdivlog2add!, vdivlog10add!, vfmaddaddone
```

test/offsetarrays.jl

Lines changed: 2 additions & 1 deletion

```diff
@@ -171,7 +171,8 @@ using Test
 
     fill!(out3, NaN); avx2dunrolled3x3!(out3, A, skern);
     @test out1 ≈ out3
-end
+
+end
 
 
 end
```
