# Image Filtering

Here, we convolve a small matrix `kern` with a larger matrix `A`, storing the results in `out`, using Julia's generic [Cartesian Indexing](https://julialang.org/blog/2016/02/iteration/):
```julia
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @avx for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end
```

These are effectively four nested loops. For all the benchmarks, `kern` was 3 by 3, making it too small for vectorizing these loops to be particularly profitable. By vectorizing an outer loop instead, the code can benefit from SIMD and also avoid having to do a reduction (horizontal addition) of a vector before storing into `out`, as the vectors can then be stored directly.
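For comparison, the four nested loops can be written out explicitly. Below is a minimal C sketch of the same dynamically sized convolution; the function name, row-major layout, and stride handling are illustrative assumptions, not the benchmark source:

```c
#include <stddef.h>
#include <assert.h>

/* Illustrative C analogue of the Julia kernel above (not the benchmark
 * source): all four loops are explicit. Arrays are row-major; `A` has
 * (outr + kernr - 1) x (outc + kernc - 1) elements so every window fits. */
void filter2d(double *out, const double *A, const double *kern,
              size_t outr, size_t outc, size_t kernr, size_t kernc) {
    size_t Ac = outc + kernc - 1; /* row stride of A */
    for (size_t i = 0; i < outr; i++) {
        for (size_t j = 0; j < outc; j++) {
            double tmp = 0.0;
            for (size_t ik = 0; ik < kernr; ik++)
                for (size_t jk = 0; jk < kernc; jk++)
                    tmp += A[(i + ik) * Ac + (j + jk)] * kern[ik * kernc + jk];
            out[i * outc + j] = tmp;
        }
    }
}
```

With fully dynamic extents like these, the compiler cannot tell that the `ik`/`jk` trip counts are tiny, which is why the alternatives discussed below tend to vectorize the innermost loop.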

LoopVectorization achieved much better performance than all the alternatives, which tried vectorizing the inner loops.
By making the compilers aware that the inner loops are too short to be worth vectorizing, we can get them to vectorize an outer loop instead. By defining the size of `kern` as constant in `C` and `Fortran`, and using size parameters in Julia, we can inform the compilers:
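As a sketch of what "defining the size of `kern` as constant" might look like in C (the macro names and row-major layout are assumptions for illustration, not the benchmark source):

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical sketch: the kernel extents are compile-time constants, so
 * the compiler can see that the inner trip counts are too short to be
 * worth vectorizing and can target an outer loop instead. */
#define KERNR 3
#define KERNC 3

void filter2d_static(double *out, const double *A, const double *kern,
                     size_t outr, size_t outc) {
    size_t Ac = outc + KERNC - 1; /* row stride of A */
    for (size_t i = 0; i < outr; i++) {
        for (size_t j = 0; j < outc; j++) {
            double tmp = 0.0;
            for (size_t ik = 0; ik < KERNR; ik++)
                for (size_t jk = 0; jk < KERNC; jk++)
                    tmp += A[(i + ik) * Ac + (j + jk)] * kern[ik * KERNC + jk];
            out[i * outc + j] = tmp;
        }
    }
}
```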

Now all are doing much better than they were before, although still well shy of the 131.2 GFLOPS theoretical limit for the host CPU cores. While they all improved, two are lagging behind the main group:

- `ifort` lags behind all the others except base Julia. I'll need to do more investigation to find out why.
- Base Julia. While providing static size information was enough for it to realize that vectorizing the inner loops was not worth it, base Julia was seemingly the only one that did not then vectorize an outer loop instead.

Manually unrolling the inner loops allows base Julia to vectorize, while the performance of all the non-Julia variants was unchanged:

LoopVectorization is currently limited to unrolling only two loops (though a third may be vectorized, effectively unrolling it by the vector length). Manually unrolling two of the loops lets up to four loops be unrolled.
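A sketch of what manually unrolling the 3x3 inner loops looks like, again in C for concreteness (the function name and layout are illustrative assumptions): the nine multiply-adds are written out by hand, so no short inner loop remains and only the outer `i`/`j` loops are left for the compiler to vectorize.

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical sketch: 3x3 inner loops fully unrolled. `A` is row-major
 * with row stride Ac = outc + 2; `a` points at the top-left corner of the
 * current 3x3 window, and the nine terms are summed explicitly. */
void filter2d_unrolled3x3(double *out, const double *A, const double *k,
                          size_t outr, size_t outc) {
    size_t Ac = outc + 2;
    for (size_t i = 0; i < outr; i++) {
        for (size_t j = 0; j < outc; j++) {
            const double *a = A + i * Ac + j;
            out[i * outc + j] =
                a[0]          * k[0] + a[1]          * k[1] + a[2]          * k[2]
              + a[Ac]         * k[3] + a[Ac + 1]     * k[4] + a[Ac + 2]     * k[5]
              + a[2 * Ac]     * k[6] + a[2 * Ac + 1] * k[7] + a[2 * Ac + 2] * k[8];
        }
    }
}
```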