Commit 1bda510

Update docs on image filtering example.
1 parent 5cedee2 commit 1bda510

1 file changed: +12 -14 lines

docs/src/examples/filtering.md (12 additions & 14 deletions)
@@ -1,31 +1,29 @@
 # Image Filtering
 
-Here, we convolve a small matrix `kern` with a larger matrix `A`, storing the results in `out`:
+Here, we convolve a small matrix `kern` with a larger matrix `A`, storing the results in `out`, using Julia's generic [Cartesian Indexing](https://julialang.org/blog/2016/02/iteration/):
 ```julia
 function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
-    rng1k, rng2k = axes(kern)
-    rng1, rng2 = axes(out)
-    @avx for j in rng2, i in rng1
+    @avx for J in CartesianIndices(out)
         tmp = zero(eltype(out))
-        for jk in rng2k, ik in rng1k
-            tmp += A[i+ik,j+jk]*kern[ik,jk]
+        for I ∈ CartesianIndices(kern)
+            tmp += A[I + J] * kern[I]
         end
-        out[i,j] = tmp
+        out[J] = tmp
     end
     out
 end
 ```
-These are four nested loops. For all the benchmarks, `kern` was only 3 by 3, making it too small for vectorizing these loops to be particularly profitable. By vectorizing the `i` loop instead, it can benefit from SIMD and also avoid having to do a reduction (horizontal addition) of a vector before storing in `out`, as the vectors can then be stored directly.
+These are effectively four nested loops. For all the benchmarks, `kern` was 3 by 3, making it too small for vectorizing these loops to be particularly profitable. By vectorizing an outer loop instead, it can benefit from SIMD and also avoid having to do a reduction (horizontal addition) of a vector before storing in `out`, as the vectors can then be stored directly.
 ![dynamicfilter](../assets/bench_filter2d_dynamic_v1.svg)
 
-LoopVectorization achieved much better performance than all the alternatives, which tended to prefer vectorizing the inner loops.
-By making the compilers aware that the `ik` loop is too short to be worth vectorizing, we can get them to vectorize something else instead. By defining the size of `kern` as constant in `C` and `Fortran`, and using size parameters in Julia, we can inform the compilers:
+LoopVectorization achieved much better performance than all the alternatives, which tried vectorizing the inner loops.
+By making the compilers aware that the inner loops are too short to be worth vectorizing, we can get them to vectorize an outer loop instead. By defining the size of `kern` as constant in `C` and `Fortran`, and using size parameters in Julia, we can inform the compilers:
 ![staticsizefilter](../assets/bench_filter2d_3x3_v1.svg)
 Now all are doing much better than they were before, although still well shy of the 131.2 GFLOPS theoretical limit for the host CPU cores. While they all improved, two are lagging behind the main group:
-- `ifort` lags behind all the others except base Julia. I'd need to do more investigating to find out why.
-- Providing static size information was enough for all to realize vectorizing the inner loops was not worth it. However, all but base Julia decided to vectorize a different loop instead, while the base Julia version I tested just didn't vectorize at all.
+- `ifort` lags behind all the others except base Julia. I'll need to do more investigating to find out why.
+- Base Julia. While providing static size information was enough for it to realize vectorizing the inner loops was not worth it, base Julia was seemingly the only one that didn't decide to vectorize an outer loop instead.
 
-Helping Base Julia out by manually unrolling the inner loops:
+Manually unrolling the inner loops allows base Julia to vectorize, while the performance of all non-Julia variants was unchanged:
 ![unrolledfilter](../assets/bench_filter2d_unrolled_v1.svg)
-This manual unrolling helped Julia, but had no real impact on any of the others.
+LoopVectorization is currently limited to only unrolling two loops (but a third may be vectorized, effectively unrolling it by the length of the vectors). Manually unrolling two of the loops lets up to four loops be unrolled.
 
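For readers trying the updated example, here is a minimal usage sketch of `filter2davx!` as defined in the new version of the file. The sizes and the use of `OffsetArrays` to center `kern` are illustrative assumptions, not taken from this commit; the only real requirement is that `A[I + J]` stays in bounds for every `J` in `CartesianIndices(out)` and `I` in `CartesianIndices(kern)`.

```julia
using LoopVectorization, OffsetArrays

# Illustrative setup: a 3×3 kernel with axes -1:1 so indexing is centered,
# and an output shrunk by one pixel on each side so A[I + J] never leaves A.
A    = rand(Float32, 130, 130)                        # axes 1:130 × 1:130
kern = OffsetArray(rand(Float32, 3, 3), -1:1, -1:1)   # axes -1:1 × -1:1
out  = OffsetArray(similar(A, size(A) .- 2), 1, 1)    # axes 2:129 × 2:129

filter2davx!(out, A, kern)   # filter2davx! as defined in the diff above
```

Because `out` is traversed in column-major order over `J`, the results can be stored contiguously, which is the point the new text makes about avoiding a horizontal reduction before the store.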

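The updated text keeps the claim that "using size parameters in Julia" lets the compiler see that the inner loops are short. As a rough sketch of that idea (not the benchmark's actual code; the function name and the choice of `StaticArrays.SMatrix` here are assumptions), encoding the 3×3 size in the kernel's type gives the inner loops compile-time-known trip counts:

```julia
using StaticArrays

# Sketch only: a kernel whose 3×3 size lives in its type, so the compiler knows
# the inner loops are too short to be worth vectorizing on their own.
# Assumes size(out) == size(A) .- 2, with everything 1-based.
function filter2d_static!(out::AbstractMatrix, A::AbstractMatrix, kern::SMatrix{3,3})
    @inbounds for j in axes(out, 2), i in axes(out, 1)
        tmp = zero(eltype(out))
        for jk in 1:3, ik in 1:3
            tmp += A[i+ik-1, j+jk-1] * kern[ik, jk]
        end
        out[i, j] = tmp
    end
    out
end

kern_static = @SMatrix rand(3, 3)
```

Whether the compiler then chooses to vectorize one of the loops over `out` is exactly what the static-size benchmark plot compares.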
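For the "manually unrolling the inner loops" variant the new text describes, a sketch (again an assumption about the shape of that code, not a copy of the benchmark sources) of a fully unrolled 3×3 version, leaving only the two loops over `out` for the compiler:

```julia
# Sketch of manual unrolling for a plain 3×3 (1-based) kernel, assuming
# size(out) == size(A) .- 2. The kernel loops are written out explicitly,
# so only the loops over `out` remain to be vectorized.
function filter2d_unrolled!(out::AbstractMatrix, A::AbstractMatrix, kern::AbstractMatrix)
    @inbounds for j in axes(out, 2)
        @simd for i in axes(out, 1)
            out[i, j] =
                A[i,   j  ] * kern[1, 1] + A[i+1, j  ] * kern[2, 1] + A[i+2, j  ] * kern[3, 1] +
                A[i,   j+1] * kern[1, 2] + A[i+1, j+1] * kern[2, 2] + A[i+2, j+1] * kern[3, 2] +
                A[i,   j+2] * kern[1, 3] + A[i+1, j+2] * kern[2, 3] + A[i+2, j+2] * kern[3, 3]
        end
    end
    out
end
```

With the kernel loops gone, the contiguous `i` loop is a straightforward vectorization target, which matches the improvement for base Julia shown in the last plot.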