# Merge Sort
This DPC++ reference design demonstrates a highly parameterizable merge sort algorithm on an FPGA.

***Documentation***:
* [DPC++ FPGA Code Samples Guide](https://software.intel.com/content/www/us/en/develop/articles/explore-dpcpp-through-intel-fpga-code-samples.html) helps you to navigate the samples and build your knowledge of DPC++ for FPGA. <br>
* [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) is the reference manual for targeting FPGAs through DPC++. <br>
* [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming.

| Optimized for | Description
|--- |---
| OS | Linux* Ubuntu* 18.04/20.04, RHEL*/CentOS* 8, SUSE* 15; Windows* 10
| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA <br> Intel® FPGA Programmable Acceleration Card (PAC) D5005 (with Intel Stratix® 10 SX) <br> Intel Xeon® CPU E5-1650 v2 @ 3.50GHz (host machine)
| Software | Intel® oneAPI DPC++ Compiler <br> Intel® FPGA Add-On for oneAPI Base Toolkit
| What you will learn | How to use the spatial compute of the FPGA to create a merge sort design that takes advantage of thread- and SIMD-level parallelism.
| Time to complete | 1 hour

<br>

**Performance**
The performance data below was gathered using the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) sorting `2^24=16777216` elements using 1-16 merge units; the best throughput across 5 seeds is reported.
| Merge Units | Execution time (ms) | Throughput (Melements/s) |
| :---------- | :-----------------: | :----------------------: |
| 1           | 1476                | 11                       |
| 2           | 569.8               | 28                       |
| 4           | 195.2               | 82                       |
| 8           | 99.9                | 160                      |
| 16          | 69.9                | 228                      |

## Purpose
This FPGA reference design demonstrates a highly parameterizable merge sort design that utilizes the spatial compute of the FPGA. The basic merge sort algorithm is described [here](https://en.wikipedia.org/wiki/Merge_sort). See the [Additional Design Information Section](#additional-design-information) for more information on how the merge sort algorithm was implemented on the FPGA.

## License
Code samples are licensed under the MIT license. See
[License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details.

## Building the Reference Design

### Include Files
The include folder is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system.

### Running Code Samples in DevCloud
If running a sample in the Intel DevCloud, remember that you must specify the type of compute node and whether to run in batch or interactive mode. Compiles to FPGA are only supported on fpga_compile nodes. Executing programs on FPGA hardware is only supported on fpga_runtime nodes of the appropriate type, such as fpga_runtime:arria10 or fpga_runtime:stratix10. Neither compiling nor executing programs on FPGA hardware is supported on the login nodes. For more information, see the [Intel® oneAPI Base Toolkit Get Started Guide](https://devcloud.intel.com/oneapi/documentation/base-toolkit/).

When compiling for FPGA hardware, it is recommended to increase the job timeout to 24 hours.

### On a Linux* System
1. Install the design into a `build` directory from the design directory by running `cmake`:

   ```
   mkdir build
   cd build
   ```

   If you are compiling for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command:

   ```
   cmake ..
   ```

   If instead you are compiling for the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX), run `cmake` using the command:

   ```
   cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
   ```

2. Compile the design through the generated `Makefile`. The following targets are provided, and they match the recommended development flow:

   * Compile for emulation (fast compile time, targets emulated FPGA device):

     ```
     make fpga_emu
     ```

|
   * Generate the HTML performance report. Find the report at `merge_sort_report.prj/reports/report.html`:

     ```
     make report
     ```

   * Compile for FPGA hardware (longer compile time, targets FPGA device):

     ```
     make fpga
     ```

3. (Optional) As the above hardware compile may take several hours to complete, FPGA precompiled binaries (compatible with Linux* Ubuntu* 18.04) can be downloaded <a href="https://iotdk.intel.com/fpga-precompiled-binaries/latest/merge_sort.fpga.tar.gz" download>here</a>.

### On a Windows* System
1. Generate the `Makefile` by running `cmake`:
   ```
   mkdir build
   cd build
   ```
   To compile for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command:
   ```
   cmake -G "NMake Makefiles" ..
   ```
   Alternatively, to compile for the Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX), run `cmake` using the command:
   ```
   cmake -G "NMake Makefiles" .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
   ```

2. Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow:
   * Compile for emulation (fast compile time, targets emulated FPGA device):
     ```
     nmake fpga_emu
     ```
   * Generate the optimization report:
     ```
     nmake report
     ```
   * An FPGA hardware target is not provided on Windows*.

*Note:* The Intel® PAC with Intel Arria® 10 GX FPGA and Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) do not yet support Windows*. Compiling to FPGA hardware on Windows* requires a third-party or custom Board Support Package (BSP) with Windows* support.

|
### In Third-Party Integrated Development Environments (IDEs)

You can compile and run this reference design in the Eclipse* IDE (in Linux*) and the Visual Studio* IDE (in Windows*). For instructions, see [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide).

## Running the Reference Design

|
1. Run the sample on the FPGA emulator (the kernel executes on the CPU):
   ```
   ./merge_sort.fpga_emu     (Linux)
   merge_sort.fpga_emu.exe   (Windows)
   ```

2. Run the sample on the FPGA device:
   ```
   ./merge_sort.fpga         (Linux)
   ```

### Example of Output
You should see output similar to the following in the console:
```
Running sort 17 times for an input size of 16777216 using 8 4-way merge units
Streaming data from device memory
Execution time: 69.9848 ms
Throughput: 228.621 Melements/s
PASSED
```
NOTE: When running on the FPGA emulator, the *Execution time* and *Throughput* do not reflect the design's actual hardware performance.

|
## Additional Design Information
### Source Code Breakdown
The following source files can be found in the `src/` sub-directory.

| File | Description
|:--- |:---
|`main.cpp` | Contains the `main()` function and the top-level interfaces.
|`merge_sort.hpp` | The function to submit all of the merge sort kernels (`SortingNetwork`, `Produce`, `Merge`, and `Consume`).
|`consume.hpp` | The `Consume` kernel for the merge unit. This kernel reads from an input pipe and writes out to either a different output pipe or to device memory.
|`impu_math.hpp` | Metaprogramming math helper functions (*impu* = Intel Metaprogramming Utilities).
|`merge.hpp` | The `Merge` kernel for the merge unit and the merge tree. This kernel streams in two sorted lists, merges them into a single sorted list of double the size, and streams the data out a pipe.
|`pipe_array.hpp` | Header file containing the definition of an array of pipes.
|`pipe_array_internal.hpp` | Helper for `pipe_array.hpp`.
|`produce.hpp` | The `Produce` kernel for the merge unit. This kernel reads from input pipes or performs strided reads from device memory and writes the data to an output pipe.
|`sorting_networks.hpp` | Contains all of the code relevant to sorting networks, including the `SortingNetwork` kernel, as well as the `BitonicSortingNetwork` and `MergeSortNetwork` helper functions.
|`unrolled_loop.hpp` | A template-based loop unroller that unrolls loops in the compiler front end.

|
### Merge Sort Details
This section describes how the merge sort design is structured and how it takes advantage of the spatial compute of the FPGA.

The figure below shows the conceptual view of the merge sort design to the user. The user streams data into a SYCL pipe (`InPipe`) and, after some delay, the elements are streamed out of a SYCL pipe (`OutPipe`) in sorted order. The number of elements the design is capable of sorting is a runtime parameter, but it must be a power of 2. However, this restriction can be worked around by padding the input stream with min/max elements, depending on the direction of the sort (smallest-to-largest vs. largest-to-smallest). This technique is demonstrated in this design (see the `fpga_sort` function in *main.cpp*).

<img src="sort_api.png" alt="sort_api" width="500"/>
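
Since the element count must be a power of 2, arbitrary input sizes can be handled by padding on the host before streaming data in. The following is a minimal sketch of that idea, assuming a smallest-to-largest sort; `PadToPowerOfTwo` is a hypothetical helper, not the design's actual API (see `fpga_sort` in *main.cpp* for the real implementation):

```c++
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: grow `data` to the next power-of-2 size. For a
// smallest-to-largest sort, the pad value is the largest representable
// value, so the pad elements land at the end of the sorted output and
// can simply be dropped.
template <typename T>
void PadToPowerOfTwo(std::vector<T>& data) {
  std::size_t n = 1;
  while (n < data.size()) n *= 2;  // smallest power of 2 >= data.size()
  data.resize(n, std::numeric_limits<T>::max());
}
```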

The basis of the merge sort design is what we call a *merge unit*, which is shown in the figure below. A single merge unit streams in two sorted lists of size `count` in parallel and merges them into a single sorted list of size `2*count`. The lists are streamed in from device memory (e.g., DDR or HBM) by two `Produce` kernels. The `Consume` kernel can stream data out to either a SYCL pipe or to device memory.

<img src="merge_unit.png" alt="merge_unit" width="600"/>
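
To make the topology concrete, the sketch below wires up a merge unit with SYCL pipes and a simplified one-element-per-cycle merge loop. The pipe and kernel names are hypothetical; the real kernels live in `produce.hpp`, `merge.hpp`, and `consume.hpp` and use the pipe utilities in `pipe_array.hpp`:

```c++
#include <cstddef>
#include <CL/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

using namespace sycl;

// Hypothetical inter-kernel pipes for one merge unit: each Produce kernel
// streams one sorted sublist into the merge loop, which streams the
// merged result on to Consume.
using APipe = ext::intel::pipe<class APipeID, int, 8>;
using BPipe = ext::intel::pipe<class BPipeID, int, 8>;
using MergedPipe = ext::intel::pipe<class MergedPipeID, int, 8>;

// Simplified merge kernel for k=1: merge two sorted streams of `count`
// elements (count >= 1) into one sorted stream of 2*count elements.
void SubmitSimpleMerge(queue& q, std::size_t count) {
  q.single_task<class SimpleMerge>([=] {
    std::size_t taken_a = 1, taken_b = 1;  // elements consumed per stream
    int a = APipe::read();
    int b = BPipe::read();
    for (std::size_t i = 0; i < 2 * count; i++) {
      // Take from A if B is exhausted, or if A still holds a valid
      // element and its head is the smaller one.
      bool take_a = (taken_b > count) || (taken_a <= count && a <= b);
      if (take_a) {
        MergedPipe::write(a);
        if (taken_a < count) a = APipe::read();
        taken_a++;
      } else {
        MergedPipe::write(b);
        if (taken_b < count) b = BPipe::read();
        taken_b++;
      }
    }
  });
}
```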

A single merge unit requires `lg(N)` iterations to sort `N` elements. This requires the host to enqueue `lg(N)` iterations of the merge unit kernels that merge sublists of size {`1`, `2`, `4`, ...} into larger lists of size {`2`, `4`, `8`, ...}, respectively. This results in a timeline that looks like the figure below.

<img src="basic_runtime_graph.png" alt="basic_runtime_graph" width="800"/>
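
The host-side iteration pattern can be sketched as follows; `SubmitMergeIteration` is a hypothetical stand-in for enqueuing one iteration's `Produce`/`Merge`/`Consume` kernels (the real logic lives in `merge_sort.hpp`):

```c++
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in: merges sorted sublists of `in_size` elements
// into sorted sublists of 2*in_size elements.
void SubmitMergeIteration(std::size_t in_size) {
  std::printf("merge sublists of size %zu -> %zu\n", in_size, 2 * in_size);
}

int main() {
  const std::size_t n = 16;  // element count, must be a power of 2
  // A single merge unit needs lg(n) iterations: sizes 1, 2, 4, ..., n/2.
  for (std::size_t size = 1; size < n; size *= 2) {
    SubmitMergeIteration(size);
  }
  return 0;
}
```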

To achieve SIMD-level (**S**ingle **I**nstruction **M**ultiple **D**ata) parallelism, we enhance the merge unit to merge `k` elements per cycle. The figure below illustrates how this is done. In the following discussion, we will assume that we are sorting from smallest-to-largest, but the logic is very similar for sorting largest-to-smallest and is easily configurable at compile time in this design.

The merge unit looks at the two inputs of size `k` coming from the `ProduceA` and `ProduceB` kernels (in the figure below, `k=4`) and compares the first elements of each set; remember, these sets of `k` elements are already sorted, so we are comparing the smallest elements of each set. Whichever set has the *smaller of the smallest elements* is chosen and combined with `k` other elements from the `feedback` path. These `2*k` elements go through a merge sort network that sorts them in a single cycle. After the `2*k` elements are sorted, the smallest `k` elements are sent to the output (to the `Consume` kernel) and the largest `k` elements are fed back into the sorting network (the `feedback` path in the figure below), and the process repeats. This allows the merge unit to process `k` elements per cycle in the steady state. Note that `k` must be a power of 2.

More information on this design can be found in this paper by [R. Kobayashi and K. Kise](https://www.researchgate.net/publication/316604001_A_High_Performance_FPGA-Based_Sorting_Accelerator_with_a_Data_Compression_Mechanism).

<img src="k-way_merge_unit.png" alt="way_merge_unit" width="900"/>
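
The steady-state step can be sketched behaviorally in plain host C++; this is illustrative only, with `std::sort` standing in for the merge sort network (which sorts the `2*k` elements in a single cycle in hardware):

```c++
#include <algorithm>
#include <array>

constexpr int kK = 4;  // k: elements processed per cycle (power of 2)

// One steady-state step of the k-way merge: take the k-element set with
// the smaller head, sort it against the k feedback elements, emit the
// smallest k, and keep the largest k as the next feedback.
template <typename T>
std::array<T, kK> MergeStep(const std::array<T, kK>& a,
                            const std::array<T, kK>& b,
                            std::array<T, 2 * kK>& buf,  // feedback in upper half
                            bool& took_a) {
  took_a = a[0] <= b[0];  // compare the smallest element of each sorted set
  const std::array<T, kK>& chosen = took_a ? a : b;
  std::copy(chosen.begin(), chosen.end(), buf.begin());
  std::sort(buf.begin(), buf.end());  // stand-in for the sorting network
  std::array<T, kK> out;
  std::copy(buf.begin(), buf.begin() + kK, out.begin());  // smallest k out
  // buf's upper half now holds the largest k elements: the next feedback.
  return out;
}
```

The `took_a` flag tells the caller which producer should supply the next set of `k` elements.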

To achieve thread-level parallelism, the merge sort design accepts a template parameter, `units`, which allows one to instantiate multiple instances of the merge unit, as shown in the figure below. Before the merge units start processing data, the incoming data from the input pipe is sent through a bitonic sorting network and written to the temporary buffer partitions in device memory. This sorting network sorts `k` elements per cycle in the steady state. Choosing the number of merge units is an area-performance tradeoff (note: the number of instantiated merge units must be a power of 2). Each merge unit sorts an `N/units`-sized partition of the input data in parallel.

<img src="parallel_tree_bitonic_k-way.png" alt="parallel_tree_bitonic_k-way" width="800"/>
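
A sketch of how these compile-time knobs might appear at the top level follows; the signature is hypothetical (the actual entry point is in `merge_sort.hpp`):

```c++
#include <cstddef>
#include <CL/sycl.hpp>

// Hypothetical top-level signature: `units` (merge units) and `k`
// (elements per cycle) are compile-time area/performance knobs, and both
// must be powers of 2.
template <typename T, int units, int k>
void SubmitMergeSort(sycl::queue& q, std::size_t count) {
  static_assert(units > 0 && (units & (units - 1)) == 0,
                "units must be a power of 2");
  static_assert(k > 0 && (k & (k - 1)) == 0, "k must be a power of 2");
  // ... submit the bitonic SortingNetwork kernel, then the Produce, Merge,
  // and Consume kernels of each of the `units` merge units, each sorting a
  // count/units-sized partition, followed by the merge tree ...
}
```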

After the merge units sort their `N/units`-sized partitions, the partitions must be reduced into a single sorted list. There are two options to do this: (1) reuse the merge units to perform `lg(units)` more iterations to sort the partitions, or (2) create a merge tree to reduce the partitions into a single sorted list. Option (1) saves area at the expense of performance, since it has to perform additional sorting iterations. Option (2), which we choose for this design, improves performance by creating a merge tree to reduce the final partitions into a single sorted list. The `Merge` kernels in the merge tree (shown in the figure above) use the same kernel code that is used in the `Merge` kernel of the merge unit, which means they too can merge `k` elements per cycle. Once the merge units perform their last iteration, they output to a pipe (instead of writing to device memory) that feeds the merge tree.
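
To make the tradeoff concrete: option (1) would run `lg(units)` extra whole-array iterations, while option (2) adds a pipelined tree of `units - 1` `Merge` kernels arranged in `lg(units)` levels. The following host-side sketch of the tree shape is illustrative only:

```c++
#include <cstdio>

int main() {
  const int units = 8;  // number of merge units (power of 2)
  // A merge tree reducing `units` sorted partitions to one list has
  // lg(units) levels; the level fed by the merge units has units/2
  // Merge kernels, and each following level has half as many.
  int total = 0;
  for (int level = 0, width = units / 2; width >= 1; ++level, width /= 2) {
    std::printf("level %d: %d Merge kernel(s)\n", level, width);
    total += width;
  }
  std::printf("total Merge kernels in the tree: %d\n", total);  // units - 1
  return 0;
}
```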

### Performance Disclaimers
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit [www.intel.com/benchmarks](https://www.intel.com/benchmarks).

Performance results are based on testing as of May 2021 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at [intel.com](https://www.intel.com).