Commit fc69f8e

Add fpga_reg and loop_unroll tutorials, with Linux support only (#99)
Signed-off-by: Audrey Kertesz <[email protected]>
1 parent 6bb3678 commit fc69f8e

File tree

14 files changed

+1034
-0
lines changed
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
set(CMAKE_CXX_COMPILER "dpcpp")

cmake_minimum_required (VERSION 2.8)

project(FPGARegister)

set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})

add_subdirectory (src)
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
Copyright Intel Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
# Explicit Pipeline Register Insertion with `fpga_reg`

This FPGA tutorial demonstrates how a power user can apply the DPC++ extension `intel::fpga_reg` to tweak the hardware generated by the compiler.

***Documentation***: The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) provides comprehensive instructions for targeting FPGAs through DPC++. The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming.

| Optimized for | Description |
|--- |--- |
| OS | Linux* Ubuntu* 18.04 |
| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA; <br> Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA |
| Software | Intel® oneAPI DPC++ Compiler (Beta) <br> Intel® FPGA Add-On for oneAPI Base Toolkit |
| What you will learn | How to use the `intel::fpga_reg` extension <br> How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware <br> Situations in which applying `intel::fpga_reg` might be beneficial |
| Time to complete | 20 minutes |

_Notice: This code sample is not yet supported in Windows*_

## Purpose

This FPGA tutorial demonstrates an example of using the `intel::fpga_reg` extension to:

* Help reduce the fanout of specific signals in the DPC++ design
* Improve the overall f<sub>MAX</sub> of the generated hardware

Note that this is an advanced tutorial for FPGA power users.

### Simple Code Example

The signature of `intel::fpga_reg` is as follows:

```cpp
template <typename T>
T intel::fpga_reg(T input)
```

To use this function in your code, you must include the following header:

```cpp
#include <CL/sycl/intel/fpga_extensions.hpp>
```

When you use this function on any value in your code, the compiler will insert at least one register stage between the input and output of the `intel::fpga_reg` function. For example:

```cpp
int func (int input) {
  int output = intel::fpga_reg(input);
  return output;
}
```

This forces the compiler to insert a register between the input and output. You can observe this in the optimization report's System Viewer.

### Understanding the Tutorial Design

The basic function performed by the tutorial kernel is a vector dot product with a pre-adder. The loop is unrolled so that the core part of the algorithm is a feed-forward datapath. The coefficient array is implemented as a circular shift register and rotates by one for each iteration of the outer loop.

The optimization applied in this tutorial impacts the system f<sub>MAX</sub>, the maximum frequency at which the design can run. Since the compiler implements all kernels in a common clock domain, f<sub>MAX</sub> is a global system parameter. To see the impact of the `intel::fpga_reg` optimization in this tutorial, you will need to compile the design twice.

Part 1 compiles the kernel code without setting the `USE_FPGA_REG` macro, whereas Part 2 compiles the kernel with this macro set. The macro chooses between two code segments that are functionally equivalent, but the latter version makes use of `intel::fpga_reg`. In the `USE_FPGA_REG` version of the code, the compiler is guaranteed to insert at least one register stage between the input and output of each call to the `intel::fpga_reg` function.

#### Part 1: Without `USE_FPGA_REG`

The compiler will generate the following hardware for Part 1. The diagram below has been simplified for illustration.

<img src="no_fpga_reg.png" alt="Part 1" title="Part 1" width="400" />

Note the following:

* The compiler automatically infers a tree structure for the series of adders.
* There is a large fanout (of up to 4 in this simplified example) from `val` to each of the adders.

The fanout grows linearly with the unroll factor in this tutorial. In FPGA designs, signals with large fanout can sometimes degrade system f<sub>MAX</sub>. This happens because the FPGA placement algorithm cannot place *all* of the fanout logic elements physically close to the fanout source, leading to longer wires. In this situation, it can be helpful to add explicit fanout control in your DPC++ code via `intel::fpga_reg`. This is an advanced optimization for FPGA power users.

#### Part 2: With `USE_FPGA_REG`

In this part, two sets of `intel::fpga_reg` calls are added within the unrolled loop. The first pipelines `val` once per iteration. This reduces the fanout of `val` from 4 in the Part 1 example to just 2. The second is inserted between accumulations into the `acc` value. This generates the following structure in hardware.

<img src="fpga_reg.png" alt="Part 2" title="Part 2" width="400" />

In this version, the adder tree has been transformed into a vine-like structure. This increases latency, but it helps us achieve our goal of reducing the fanout and improving f<sub>MAX</sub>.
Since the outer loop in this tutorial is pipelined and has a high trip count, the increased latency of the inner loop has negligible impact on throughput. The tradeoff pays off, as the f<sub>MAX</sub> improvement yields a higher performing design.
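
The functional equivalence of the two variants can be checked with a small software model. The sketch below is not the tutorial's kernel: `FpgaRegModel`, `kSize`, `DotProductModel`, and the exact form of the pre-adder are assumptions made for illustration, and `FpgaRegModel` is a plain identity function standing in for `intel::fpga_reg` so that the code builds with any C++ compiler. It only illustrates that the register calls leave the computed values unchanged; the effect on fanout and f<sub>MAX</sub> is visible only in the generated hardware.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical unroll factor / coefficient count (4, as in the simplified diagrams).
constexpr std::size_t kSize = 4;

// ASSUMPTION: software stand-in for intel::fpga_reg. It returns its argument
// unchanged; the real extension additionally requests a register stage.
template <typename T>
T FpgaRegModel(T input) {
  return input;
}

// Software model of a dot product with a pre-adder. `use_fpga_reg` plays the
// role of the USE_FPGA_REG macro. The pre-adder form (val + j) is hypothetical.
std::vector<int> DotProductModel(const std::vector<int>& input,
                                 std::array<int, kSize> coeff,
                                 bool use_fpga_reg) {
  std::vector<int> result;
  result.reserve(input.size());
  for (int sample : input) {  // outer loop (pipelined in hardware)
    int val = sample;
    int acc = 0;
    for (std::size_t j = 0; j < kSize; ++j) {  // inner loop (unrolled in hardware)
      if (use_fpga_reg) val = FpgaRegModel(val);  // pipeline val once per iteration
      acc += coeff[j] * (val + static_cast<int>(j));
      if (use_fpga_reg) acc = FpgaRegModel(acc);  // register between accumulations
    }
    result.push_back(acc);
    // Circular shift register: rotate the coefficients left by one per outer iteration.
    std::rotate(coeff.begin(), coeff.begin() + 1, coeff.end());
  }
  return result;
}
```

Calling `DotProductModel` with the same input and coefficients, once with `use_fpga_reg` set to `false` and once with `true`, returns identical vectors in this model, mirroring the statement that the two code segments are functionally equivalent.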

## Key Concepts

* How to use the `intel::fpga_reg` extension
* How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware
* Situations in which applying `intel::fpga_reg` might be beneficial

## License

This code sample is licensed under the MIT license.

## Building the `fpga_reg` Design

### Include Files

The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system.

### Running Samples in DevCloud

If running a sample in the Intel DevCloud, remember that you must specify the compute node (fpga_compile or fpga_runtime) as well as whether to run in batch or interactive mode. For more information, see the Intel® oneAPI Base Toolkit Get Started Guide ([https://devcloud.intel.com/oneapi/get-started/base-toolkit/](https://devcloud.intel.com/oneapi/get-started/base-toolkit/)).

When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h.

### On a Linux* System

1. Create a `build` directory inside the design directory and generate the `Makefile` there by running `cmake`:

   ```bash
   mkdir build
   cd build
   ```

   If you are compiling for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command:

   ```bash
   cmake ..
   ```

   Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command:

   ```bash
   cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
   ```

2. Compile the design using the generated `Makefile`. The following three build targets are provided, matching the recommended development flow:

   * Compile for emulation (fast compile time, targets an emulated FPGA device) using:

     ```bash
     make fpga_emu
     ```

   * Generate HTML optimization reports using:

     ```bash
     make report
     ```

   * Compile for FPGA hardware (longer compile time, targets an FPGA device) using:

     ```bash
     make fpga
     ```

3. (Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA pre-compiled binary can be downloaded <a href="https://software.intel.com/content/dam/develop/external/us/en/documents/fpga_reg.fpga.tar.gz" download>here</a>.

### In Third-Party Integrated Development Environments (IDEs)

You can compile and run this tutorial in the Eclipse* IDE (in Linux*).
For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide)

## Examining the Reports

Locate the pair of `report.html` files in either:

* **Report-only compile**: `fpga_reg_report.prj` and `fpga_reg_registered_report.prj`
* **FPGA hardware compile**: `fpga_reg.prj` and `fpga_reg_registered.prj`

Open the reports in any of Chrome*, Firefox*, Edge*, or Internet Explorer*. Observe the structure of the design in the optimization report's System Viewer and notice the changes within `Cluster 2` of the `SimpleMath.B1` block. In the report for Part 1, the viewer shows a much shallower graph than in the report for Part 2, because the operations in Part 1 are performed much closer to one another. By transforming the code in Part 2 to use more register stages, the compiler was able to achieve a higher f<sub>MAX</sub>.

>**NOTE**: Only the report generated after the FPGA hardware compile will reflect the performance benefit of using the `fpga_reg` extension. The difference is *not* apparent in the reports generated by `make report` because a design's f<sub>MAX</sub> cannot be predicted. The final achieved f<sub>MAX</sub> can be found in `fpga_reg.prj/reports/report.html` and `fpga_reg_registered.prj/reports/report.html` (after `make fpga` completes).

## Running the Sample

1. Run the sample on the FPGA emulator (the kernel executes on the CPU):

   ```bash
   ./fpga_reg.fpga_emu     # Linux
   ```

2. Run the sample on the FPGA device:

   ```bash
   ./fpga_reg.fpga             # Linux
   ./fpga_reg_registered.fpga  # Linux
   ```

### Example of Output

```txt
Throughput for kernel with input size 1000000 and coefficient array size 64: 2.819272 GFlops
PASSED: Results are correct.
```

### Discussion of Results

You will be able to observe the improvement in throughput going from Part 1 to Part 2. You will also note that the f<sub>MAX</sub> of Part 2 is significantly higher than that of Part 1.
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
{
  "guid": "D661A5C2-5FE0-40F2-BFE7-70E3BA60F088",
  "name": "Explicit Pipeline Register Insertion with fpga_reg",
  "categories": ["Toolkit/Intel® oneAPI Base Toolkit/FPGA/Tutorials"],
  "description": "FPGA advanced tutorial demonstrating how to apply the DPC++ extension intel::fpga_reg",
  "toolchain": ["dpcpp"],
  "os": ["linux"],
  "targetDevice": ["FPGA"],
  "builder": ["cmake"],
  "languages": [{"cpp":{}}],
  "ciTests": {
    "linux": [
      {
        "id": "fpga_emu",
        "steps": [
          "mkdir build",
          "cd build",
          "cmake ..",
          "make fpga_emu",
          "./fpga_reg.fpga_emu"
        ]
      },
      {
        "id": "report",
        "steps": [
          "mkdir build",
          "cd build",
          "cmake ..",
          "make report"
        ]
      }
    ]
  }
}
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
set(SOURCE_FILE fpga_reg.cpp)
set(TARGET_NAME fpga_reg)
set(TARGET_NAME_REG fpga_reg_registered)
set(EMULATOR_TARGET ${TARGET_NAME}.fpga_emu)
set(FPGA_TARGET ${TARGET_NAME}.fpga)
set(FPGA_TARGET_REG ${TARGET_NAME_REG}.fpga)

# Intel supported FPGA Boards and their names
set(A10_PAC_BOARD_NAME "intel_a10gx_pac:pac_a10")
set(S10_PAC_BOARD_NAME "intel_s10sx_pac:pac_s10")

# Assume target is the Intel(R) PAC with Intel Arria(R) 10 GX FPGA
SET(_FPGA_BOARD ${A10_PAC_BOARD_NAME})

# Check if target is the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA
IF (NOT DEFINED FPGA_BOARD)
    MESSAGE(STATUS "\tFPGA_BOARD was not specified. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for more information on how to run the design on the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA.")

ELSEIF(FPGA_BOARD STREQUAL ${A10_PAC_BOARD_NAME})
    MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA.")

ELSEIF(FPGA_BOARD STREQUAL ${S10_PAC_BOARD_NAME})
    MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Stratix(R) 10 SX FPGA.")
    SET(_FPGA_BOARD ${S10_PAC_BOARD_NAME})

ELSE()
    MESSAGE(STATUS "\tAn invalid board name was passed in using the FPGA_BOARD flag. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for the list of valid board names.")
ENDIF()

set(HARDWARE_COMPILE_FLAGS "-fintelfpga")

# use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation
set(HARDWARE_LINK_FLAGS "-fintelfpga -Xshardware -Xsboard=${_FPGA_BOARD} ${USER_HARDWARE_FLAGS}")

set(EMULATOR_COMPILE_FLAGS "-fintelfpga -DFPGA_EMULATOR")
set(EMULATOR_LINK_FLAGS "-fintelfpga")

# fpga emulator
if(WIN32)
    set(WIN_EMULATOR_TARGET ${EMULATOR_TARGET}.exe)
    add_custom_target(fpga_emu DEPENDS ${WIN_EMULATOR_TARGET})
    separate_arguments(WIN_EMULATOR_COMPILE_FLAGS WINDOWS_COMMAND "${EMULATOR_COMPILE_FLAGS}")
    add_custom_command(OUTPUT ${WIN_EMULATOR_TARGET}
        COMMAND ${CMAKE_CXX_COMPILER} ${WIN_EMULATOR_COMPILE_FLAGS} /GX ${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${WIN_EMULATOR_TARGET}
        DEPENDS ${SOURCE_FILE})

else()
    add_executable(${EMULATOR_TARGET} ${SOURCE_FILE})
    add_custom_target(fpga_emu DEPENDS ${EMULATOR_TARGET})
    set_target_properties(${EMULATOR_TARGET} PROPERTIES COMPILE_FLAGS ${EMULATOR_COMPILE_FLAGS})
    set_target_properties(${EMULATOR_TARGET} PROPERTIES LINK_FLAGS ${EMULATOR_LINK_FLAGS})
endif()

# fpga
if(WIN32)
    add_custom_target(fpga
        COMMAND echo "FPGA hardware flow is not supported in Windows")
else()
    add_executable(${FPGA_TARGET} EXCLUDE_FROM_ALL ${SOURCE_FILE})
    add_executable(${FPGA_TARGET_REG} EXCLUDE_FROM_ALL ${SOURCE_FILE})
    add_custom_target(fpga DEPENDS ${FPGA_TARGET} ${FPGA_TARGET_REG})

    set_target_properties(${FPGA_TARGET} PROPERTIES COMPILE_FLAGS ${HARDWARE_COMPILE_FLAGS})
    set_target_properties(${FPGA_TARGET} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS})

    set_target_properties(${FPGA_TARGET_REG} PROPERTIES COMPILE_FLAGS "${HARDWARE_COMPILE_FLAGS} -DUSE_FPGA_REG")
    set_target_properties(${FPGA_TARGET_REG} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS})
endif()

# report
if(WIN32)
    set(REPORT ${TARGET_NAME}_report.a)
    set(REPORT_REG ${TARGET_NAME_REG}_report.a)

    add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG})

    separate_arguments(HARDWARE_LINK_FLAGS_LIST WINDOWS_COMMAND "${HARDWARE_LINK_FLAGS}")

    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} COPYONLY)
    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} COPYONLY)

    add_custom_command(OUTPUT ${REPORT}
        COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT}
        DEPENDS ${SOURCE_FILE})

    add_custom_command(OUTPUT ${REPORT_REG}
        COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG}
        DEPENDS ${SOURCE_FILE})

else()
    set(REPORT ${TARGET_NAME}_report.a)
    set(REPORT_REG ${TARGET_NAME_REG}_report.a)

    add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG})

    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${SOURCE_FILE} COPYONLY)

    separate_arguments(HARDWARE_LINK_FLAGS_LIST UNIX_COMMAND "${HARDWARE_LINK_FLAGS}")
    add_custom_command(OUTPUT ${REPORT}
        COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT}
        DEPENDS ${SOURCE_FILE})

    add_custom_command(OUTPUT ${REPORT_REG}
        COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG}
        DEPENDS ${SOURCE_FILE})
endif()

# run
add_custom_target(run
    COMMAND ../${TARGET_NAME}.fpga_emu
    DEPENDS ${TARGET_NAME}.fpga_emu)
