|
| 1 | +# Explicit Pipeline Register Insertion with `fpga_reg` |
| 2 | + |
| 3 | +This FPGA tutorial demonstrates how a power user can apply the DPC++ extension `intel::fpga_reg` to tweak the hardware generated by the compiler. |
| 4 | + |
| 5 | +***Documentation***: The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) provides comprehensive instructions for targeting FPGAs through DPC++. The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming. |
| 6 | + |
| 7 | +| Optimized for | Description |
| 8 | +--- |--- |
| 9 | +| OS | Linux* Ubuntu* 18.04 |
| 10 | +| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA; <br> Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA |
| 11 | +| Software | Intel® oneAPI DPC++ Compiler (Beta) <br> Intel® FPGA Add-On for oneAPI Base Toolkit |
| 12 | +| What you will learn | How to use the `intel::fpga_reg` extension <br> How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware <br> Situations in which applying `intel::fpga_reg` might be beneficial |
| 13 | +| Time to complete | 20 minutes |
| 14 | + |
| 15 | +_Notice: This code sample is not yet supported in Windows*_ |
| 16 | + |
| 17 | +## Purpose |
| 18 | + |
| 19 | +This FPGA tutorial demonstrates an example of using the `intel::fpga_reg` extension to: |
| 20 | + |
| 21 | +* Help reduce the fanout of specific signals in the DPC++ design |
| 22 | +* Improve the overall f<sub>MAX</sub> of the generated hardware |
| 23 | + |
| 24 | +Note that this is an advanced tutorial for FPGA power users. |
| 25 | + |
| 26 | +### Simple Code Example |
| 27 | + |
| 28 | +The signature of `intel::fpga_reg` is as follows: |
| 29 | + |
| 30 | +```cpp |
| 31 | +template <typename T> |
| 32 | +T intel::fpga_reg(T input) |
| 33 | +``` |
| 34 | +
|
| 35 | +To use this function in your code, you must include the following header: |
| 36 | +
|
| 37 | +```cpp |
| 38 | +#include <CL/sycl/intel/fpga_extensions.hpp> |
| 39 | +``` |
| 40 | + |
| 41 | +When you apply this function to any value in your code, the compiler inserts at least one register stage between the input and output of the `intel::fpga_reg` function. For example: |
| 42 | + |
| 43 | +```cpp |
| 44 | +int func (int input) { |
| 45 | + int output = intel::fpga_reg(input); |
| 46 | + return output; |
| 47 | +} |
| 48 | +``` |
| 49 | +
|
| 50 | +This forces the compiler to insert a register between the input and output. You can observe this in the optimization report's System Viewer. |
| 51 | +
|
| 52 | +### Understanding the Tutorial Design |
| 53 | +
|
| 54 | +The basic function performed by the tutorial kernel is a vector dot product with a pre-adder. The loop is unrolled so that the core part of the algorithm is a feed-forward datapath. The coefficient array is implemented as a circular shift register and rotates by one for each iteration of the outer loop. |
| 55 | +
|
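As a rough host-only illustration of the computation described above, the sketch below models one dot product with a pre-adder per outer-loop iteration, followed by a rotation of the coefficient array. It is not the tutorial's kernel source: the names `kSize`, `RunKernel`, and `offset`, the array size of 4, and the rotation direction are all assumptions made for this example.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical array size; the tutorial's real unroll factor may differ.
constexpr std::size_t kSize = 4;

// Host-only model of the computation (assumed names, not the kernel source).
// `coeff` is taken by value so the caller's copy is not rotated.
std::vector<int> RunKernel(const std::vector<int>& input,
                           std::array<int, kSize> coeff,
                           const std::array<int, kSize>& offset) {
  std::vector<int> out;
  out.reserve(input.size());
  for (int val : input) {  // outer loop: one result per input value
    int acc = 0;
    // Inner loop: this is the part that would be fully unrolled on the FPGA.
    for (std::size_t j = 0; j < kSize; ++j) {
      acc += coeff[j] * (val + offset[j]);  // pre-adder, then multiply-add
    }
    out.push_back(acc);
    // Circular shift register: rotate the coefficients by one position.
    // Rotating left is an assumption; the description only says "by one".
    std::rotate(coeff.begin(), coeff.begin() + 1, coeff.end());
  }
  return out;
}
```

Note that `val` is read by every term of the inner loop, which is the source of the fanout discussed in Part 1.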
| 56 | +The optimization applied in this tutorial impacts the system f<sub>MAX</sub>, the maximum frequency at which the design can run. Since the compiler implements all kernels in a common clock domain, f<sub>MAX</sub> is a global system parameter. To see the impact of the `intel::fpga_reg` optimization in this tutorial, you will need to compile the design twice.
| 57 | +
|
| 58 | +Part 1 compiles the kernel code without setting the `USE_FPGA_REG` macro, whereas Part 2 compiles the kernel with this macro set. The macro selects between two code segments that are functionally equivalent, but only the latter makes use of `intel::fpga_reg`. In the `USE_FPGA_REG` version of the code, the compiler is guaranteed to insert at least one register stage between the input and output of each call to the `intel::fpga_reg` function.
| 59 | +
|
| 60 | +#### Part 1: Without `USE_FPGA_REG` |
| 61 | +
|
| 62 | +The compiler will generate the following hardware for Part 1. The diagram below has been simplified for illustration. |
| 63 | +
|
| 64 | +<img src="no_fpga_reg.png" alt="Part 1" title="Part 1" width="400" /> |
| 65 | +
|
| 66 | +Note the following: |
| 67 | +
|
| 68 | +* The compiler automatically infers a tree structure for the series of adders. |
| 69 | +* There is a large fanout (of up to 4 in this simplified example) from `val` to each of the adders. |
| 70 | +
|
| 71 | +The fanout grows linearly with the unroll factor in this tutorial. In FPGA designs, signals with large fanout can sometimes degrade system f<sub>MAX</sub>. This happens because the FPGA placement algorithm cannot place *all* of the fanout logic elements physically close to the fanout source, leading to longer wires. In this situation, it can be helpful to add explicit fanout control in your DPC++ code via `intel::fpga_reg`. This is an advanced optimization for FPGA power users.
| 72 | +
|
| 73 | +#### Part 2: With `USE_FPGA_REG`
| 74 | +
|
| 75 | +In this part, we added two sets of `intel::fpga_reg` calls within the unrolled loop. The first pipelines `val` once per iteration. This reduces the fanout of `val` from 4 in the example in Part 1 to just 2. The second `intel::fpga_reg` is inserted between successive accumulations into the `acc` value. This generates the following structure in hardware.
| 76 | +
|
| 77 | +<img src="fpga_reg.png" alt="Part 2" title="Part 2" width="400" /> |
| 78 | +
|
| 79 | +In this version, the adder tree has been transformed into a vine-like structure. This increases latency, but it helps us achieve our goal of reducing the fanout and improving f<sub>MAX</sub>. |
| 80 | +Since the outer loop in this tutorial is pipelined and has a high trip count, the increased latency of the inner loop has negligible impact on throughput. The tradeoff pays off, as the f<sub>MAX</sub> improvement yields a higher-performing design.
| 81 | +
|
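The two code shapes described in Part 1 and Part 2 can be sketched on the host as follows. This is a hedged illustration, not the tutorial's kernel source: `sketch::fpga_reg` is a stand-in that simply returns its input, so that the example compiles without the DPC++ headers, and `DotPlain`, `DotRegistered`, `kUnroll`, and `offset` are hypothetical names. In a real kernel the calls would be to `intel::fpga_reg`, which additionally inserts a register stage in hardware.

```cpp
#include <array>
#include <cstddef>

namespace sketch {
// Stand-in for intel::fpga_reg (assumption for host-only illustration).
// It returns the input unchanged and does not model the register stage.
template <typename T>
T fpga_reg(T input) {
  return input;
}
}  // namespace sketch

// Hypothetical unroll factor, matching the simplified diagrams.
constexpr std::size_t kUnroll = 4;

// Part 1 shape: every term reads the same `val`, so `val` fans out to all
// adders, and the compiler is free to arrange the additions as a tree.
int DotPlain(int val, const std::array<int, kUnroll>& coeff,
             const std::array<int, kUnroll>& offset) {
  int acc = 0;
  for (std::size_t j = 0; j < kUnroll; ++j) {
    acc += coeff[j] * (val + offset[j]);
  }
  return acc;
}

// Part 2 shape: `val` is re-registered once per iteration, and `acc` is
// registered between accumulations, which chains the adders one after
// another (the vine-like structure).
int DotRegistered(int val, const std::array<int, kUnroll>& coeff,
                  const std::array<int, kUnroll>& offset) {
  int acc = 0;
  for (std::size_t j = 0; j < kUnroll; ++j) {
    val = sketch::fpga_reg(val);
    acc = sketch::fpga_reg(acc) + coeff[j] * (val + offset[j]);
  }
  return acc;
}
```

Both functions return the same value for the same arguments, which reflects the point that `fpga_reg` changes the structure of the generated hardware and not the result of the computation.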
| 82 | +## Key Concepts |
| 83 | +
|
| 84 | +* How to use the `intel::fpga_reg` extension |
| 85 | +* How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware |
| 86 | +* Situations in which applying `intel::fpga_reg` might be beneficial |
| 87 | +
|
| 88 | +## License |
| 89 | +
|
| 90 | +This code sample is licensed under the MIT license.
| 91 | +
|
| 92 | +## Building the `fpga_reg` Design |
| 93 | +
|
| 94 | +### Include Files |
| 95 | +
|
| 96 | +The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system. |
| 97 | +
|
| 98 | +### Running Samples in DevCloud |
| 99 | +
|
| 100 | +If running a sample in the Intel DevCloud, remember that you must specify the compute node (fpga_compile or fpga_runtime) as well as whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide ([https://devcloud.intel.com/oneapi/get-started/base-toolkit/](https://devcloud.intel.com/oneapi/get-started/base-toolkit/)). |
| 101 | +
|
| 102 | +When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h. |
| 103 | +
|
| 104 | +### On a Linux* System |
| 105 | +
|
| 106 | +1. Generate the `Makefile` by running `cmake` from a `build` directory created inside the design directory:
| 107 | +
|
| 108 | + ```bash |
| 109 | + mkdir build |
| 110 | + cd build |
| 111 | + ``` |
| 112 | + |
| 113 | + If you are compiling for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command: |
| 114 | + |
| 115 | + ```bash |
| 116 | + cmake .. |
| 117 | + ``` |
| 118 | + |
| 119 | + Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command: |
| 120 | + |
| 121 | + ```bash |
| 122 | + cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10 |
| 123 | + ``` |
| 124 | + |
| 125 | +2. Compile the design using the generated `Makefile`. The following three build targets are provided to match the recommended development flow: |
| 126 | + |
| 127 | + * Compile and run for emulation (fast compile time, targets an emulated FPGA device) using: |
| 128 | + |
| 129 | + ```bash |
| 130 | + make fpga_emu |
| 131 | + ``` |
| 132 | + |
| 133 | + * Generate HTML optimization reports using: |
| 134 | + |
| 135 | + ```bash |
| 136 | + make report |
| 137 | + ``` |
| 138 | + |
| 139 | + * Compile and run on FPGA hardware (longer compile time, targets an FPGA device) using: |
| 140 | + |
| 141 | + ```bash |
| 142 | + make fpga |
| 143 | + ``` |
| 144 | + |
| 145 | +3. (Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA pre-compiled binary can be downloaded <a href="https://software.intel.com/content/dam/develop/external/us/en/documents/fpga_reg.fpga.tar.gz" download>here</a>. |
| 146 | + |
| 147 | + |
| 148 | +### In Third-Party Integrated Development Environments (IDEs) |
| 149 | + |
| 150 | +You can compile and run this tutorial in the Eclipse* IDE (in Linux*). |
| 151 | +For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide) |
| 152 | + |
| 153 | +## Examining the Reports |
| 154 | + |
| 155 | +Locate the pair of `report.html` files in either: |
| 156 | + |
| 157 | +* **Report-only compile**: `fpga_reg_report.prj` and `fpga_reg_registered_report.prj` |
| 158 | +* **FPGA hardware compile**: `fpga_reg.prj` and `fpga_reg_registered.prj` |
| 159 | + |
| 160 | +Open the reports in Chrome*, Firefox*, Edge*, or Internet Explorer*. Observe the structure of the design in the optimization report's System Viewer and notice the changes within `Cluster 2` of the `SimpleMath.B1` block. In the report for Part 1, the viewer shows a much shallower graph than in the report for Part 2, because the operations in Part 1 are performed much closer to one another. With the additional register stages introduced by the code transformation in Part 2, the compiler was able to achieve a higher f<sub>MAX</sub>.
| 161 | +
|
| 162 | +>**NOTE**: Only the report generated after the FPGA hardware compile will reflect the performance benefit of using the `fpga_reg` extension. The difference is *not* apparent in the reports generated by `make report` because a design's f<sub>MAX</sub> cannot be predicted. The final achieved f<sub>MAX</sub> can be found in `fpga_reg.prj/reports/report.html` and `fpga_reg_registered.prj/reports/report.html` (after `make fpga` completes). |
| 163 | + |
| 164 | +## Running the Sample |
| 165 | + |
| 166 | +1. Run the sample on the FPGA emulator (the kernel executes on the CPU): |
| 167 | + |
| 168 | + ```bash |
| 169 | + ./fpga_reg.fpga_emu # Linux |
| 170 | + ``` |
| 171 | + |
| 172 | +2. Run the sample on the FPGA device: |
| 173 | + |
| 174 | + ```bash |
| 175 | + ./fpga_reg.fpga # Linux |
| 176 | + ./fpga_reg_registered.fpga # Linux |
| 177 | + ``` |
| 178 | + |
| 179 | +### Example of Output |
| 180 | + |
| 181 | +```txt |
| 182 | +Throughput for kernel with input size 1000000 and coefficient array size 64: 2.819272 GFlops |
| 183 | +PASSED: Results are correct. |
| 184 | +``` |
| 185 | + |
| 186 | +### Discussion of Results |
| 187 | + |
| 188 | +You should observe an improvement in throughput going from Part 1 to Part 2. You should also see that the f<sub>MAX</sub> of Part 2 is significantly higher than that of Part 1. |