Commit bb0e972

Merge pull request #1 from oneapi-src/firstexamples
First examples and README
2 parents 6567713 + 7024faf commit bb0e972

7 files changed

+373
-0
lines changed

CMakeLists.txt

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
# SPDX-FileCopyrightText: Intel Corporation
2+
#
3+
# SPDX-License-Identifier: BSD-3-Clause
4+
5+
cmake_minimum_required(VERSION 3.20)
6+
set(CMAKE_CXX_STANDARD 20)
7+
set(CMAKE_CXX_STANDARD_REQUIRED True)
8+
9+
project(
10+
distributed_ranges_tutorial
11+
VERSION 0.1
12+
DESCRIPTION "Distributed ranges tutorial")
13+
14+
15+
16+
find_package(MPI REQUIRED)
17+
18+
add_subdirectory(src)
19+
20+
option(ENABLE_CUDA "Build for cuda" OFF)
21+
# required by distributed-ranges
22+
option(ENABLE_FORMAT "Build with format library" ON)
23+
24+
include(FetchContent)
25+
26+
FetchContent_Declare(
27+
distributed-ranges
28+
GIT_REPOSITORY https://github.com/oneapi-src/distributed-ranges.git
29+
GIT_TAG c618154a1bceda33e5d61e1536179aeaa11b68f4)
30+
FetchContent_MakeAvailable(distributed-ranges)
31+
32+
FetchContent_Declare(
33+
cpp-format
34+
GIT_REPOSITORY https://github.com/fmtlib/fmt.git
35+
GIT_TAG 0b0f7cf)
36+
FetchContent_MakeAvailable(cpp-format)

LICENSE

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
BSD 3-Clause License
2+
3+
Copyright (c) 2023, Mateusz P. Nowak
4+
5+
Redistribution and use in source and binary forms, with or without
6+
modification, are permitted provided that the following conditions are met:
7+
8+
1. Redistributions of source code must retain the above copyright notice, this
9+
list of conditions and the following disclaimer.
10+
11+
2. Redistributions in binary form must reproduce the above copyright notice,
12+
this list of conditions and the following disclaimer in the documentation
13+
and/or other materials provided with the distribution.
14+
15+
3. Neither the name of the copyright holder nor the names of its
16+
contributors may be used to endorse or promote products derived from
17+
this software without specific prior written permission.
18+
19+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
20+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
21+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
22+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
23+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
24+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
25+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
26+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
27+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
28+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
# Distributed-ranges tutorial
2+
3+
## Introduction
4+
5+
The distributed-ranges (dr) library is a C++20 library for multi-CPU and multi-GPU computing environments. It provides algorithms, data structures and views tailored for multi-node HPC systems and servers with many CPUs and/or GPUs. It takes advantage of parallel processing and MPI communication in the distributed memory model, as well as parallel processing in the shared memory model with many GPUs.
6+
The library is designed as a replacement for selected data structures, containers, and algorithms of the C++20 Standard Template Library. If you are familiar with the C++ template libraries, and in particular std::ranges (C++20) or range-v3 (C++11 -- C++17), switching to dr will be straightforward, but this tutorial will help you get started even if you have never used them. However, we assume that you are familiar with C++, at least at the level of the C++11 standard (C++20 is recommended).
7+
8+
## Getting started
9+
10+
### Prerequisites
11+
12+
The distributed-ranges library can be used on any system with a working SYCL or g++ compiler. _Intel's DPC++ is recommended, and it is required by this tutorial_. g++ versions 10, 11 and 12 are also supported, but without GPU support.
13+
Distributed-ranges depends on MPI and oneDPL libraries. DPC++, oneDPL and oneMPI are part of [oneAPI](https://www.oneapi.io/), an open-standards-based industry initiative. oneAPI and the associated [Intel® oneAPI Toolkits and products](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) help to provide a unified approach to mixed-architecture offload computing. This approach also ensures interoperability with existing distributed computing standards. We recommend installing the oneAPI components before downloading distributed-ranges.
14+
15+
### First steps
16+
17+
Currently, there are two ways to start work with distributed-ranges.
18+
19+
#### Users
20+
21+
If you want to use dr in your application, and your development environment is connected to the Internet, we encourage you to clone the [distributed-ranges-tutorial repository](https://github.com/intel/distributed-ranges-tutorial) and modify the provided examples. The CMake files in this skeleton repository download the dr library as source code and build the examples, so no separate installation is needed.
22+
23+
On a Linux system (bash shell), download distributed-ranges-tutorial from GitHub and build it with the following commands:
24+
25+
```shell
26+
git clone https://github.com/mateuszpn/distributed-ranges-tutorial
27+
cd distributed-ranges-tutorial
28+
CXX=icpx CC=icx cmake -B build
29+
cmake --build build
30+
mpirun -n N ./build/src/example_name
31+
```
32+
33+
If you use a compiler other than DPC++, change the CXX and CC values accordingly.
34+
Modify the `mpirun` call, replacing N with the number of MPI processes you want to start and _example_name_ with an actual example name.
35+
36+
Now you can:
37+
38+
- modify provided examples
39+
- add new source files, modifying src/CMakeLists.txt accordingly
40+
- start a new project, using the tutorial as a template
41+
42+
If your environment is not configured properly, or you just prefer hassle-free code exploration, you can use Docker.
43+
44+
```shell
45+
git clone https://github.com/mateuszpn/distributed-ranges-tutorial
46+
cd distributed-ranges-tutorial
47+
docker run -it -v $(pwd):/custom-directory-name -u root docker.io/intel/oneapi:latest /bin/bash
48+
cd /custom-directory-name
49+
CXX=icpx CC=icx cmake -B build -DENABLE_SYCL=ON
50+
cmake --build build -j
51+
```
52+
53+
where 'custom-directory-name' stands for the name of the directory inside the container where the local repository is mounted as a Docker volume.
54+
55+
#### Contributors
56+
57+
If you want to contribute to distributed-ranges or go through more advanced examples, please go to the original [distributed-ranges GitHub repository](https://github.com/oneapi-src/distributed-ranges/):
58+
59+
```shell
60+
git clone https://github.com/oneapi-src/distributed-ranges
61+
cd distributed-ranges
62+
CXX=icpx CC=icx cmake -B build -DENABLE_SYCL=ON
63+
cmake --build build -j
64+
```
65+
66+
## Distributed-ranges library
67+
68+
The distributed-ranges library provides data structures, algorithms and views designed for two memory models: distributed memory and shared (common) memory. In the distributed memory model, MPI is used for communication between processes. Both models can use SYCL devices (GPUs and multi-core CPUs) for calculations.
69+
70+
Algorithms and data structures are designed to relieve the user of the need to worry about the technical details of their parallelism. An example is the definition of a distributed vector stored in the memory of multiple nodes connected using MPI.
71+
72+
```cpp
73+
dr::mhp::distributed_vector<double> dv(N);
74+
```
75+
76+
Such a vector, containing N elements, is automatically distributed among all the nodes involved in the calculation, with individual nodes storing an equal (if possible) amount of data.
77+
Likewise, functions such as `for_each()` or `transform()` let you perform operations in parallel on each element of a dr-conforming data structure.
78+
79+
In this way, many of the technical details related to the parallel execution of calculations can remain hidden from the user. On the other hand, a programmer aware of the capabilities of the environment in which the application is run has access to the necessary information.
80+
81+
### Namespaces
82+
83+
The general namespace used in the library is `dr::`.
84+
For programs using a single node with shared memory available to multiple CPUs and one or more GPUs, data structures and algorithms from the `dr::shp::` namespace are provided.
85+
For distributed memory model, use the `dr::mhp::` namespace.
86+
87+
### Data structures
88+
89+
The content of distributed-ranges data structures is distributed over the available nodes. For example, segments of `dr::mhp::distributed_vector` are located in the memory of different nodes (MPI processes). Still, the global view of the `distributed_vector` is uniform, with contiguous indices.
90+
<!-- TODO: some pictures here -->
91+
92+
#### Halo concept
93+
94+
When implementing an algorithm using a distributed data structure such as `distributed_vector`, its segmented internal structure must be kept in mind. The issue comes up when the algorithm references cells adjacent to the current one, and the local loop reaches the beginning or end of the segment. At this point, the neighboring cells are in the physical memory of another node!
95+
To support this situation, the concept of halo was introduced. A halo is an area into which the contents of the edge elements of a neighboring segment are copied. Also, changes in the halo are copied to cells in the corresponding segment to maintain the consistency of the entire vector.
96+
<!-- TODO: picture here -->
97+
98+
### Algorithms
99+
100+
The following algorithms are included in distributed-ranges, in both mhp and shp versions:
101+
102+
```cpp
103+
copy()
104+
exclusive_scan()
105+
fill()
106+
for_each()
107+
inclusive_scan()
108+
iota()
109+
reduce()
110+
sort()
111+
transform()
112+
```
113+
114+
Refer to the C++20 documentation for a detailed description of how the above functions work.
115+
116+
## Examples
117+
118+
The examples should be compiled with a SYCL compiler and run with:
119+
120+
```shell
121+
mpirun -n N ./build/src/example_name
122+
```
123+
124+
where `N` is the number of MPI processes. Replace _example_name_ with the name of the example you want to run.
125+
126+
### Example 1
127+
128+
[./src/example1.cpp](src/example1.cpp)
129+
130+
The example performs a very simple decoding of an encoded string. It shows how to copy data between local and distributed data structures, and a `for_each()` loop that applies a lambda to each element of the `distributed_vector<>`. Please note that the copy operation affects only the local vector on node 0 (the _root_ argument of the `copy()` function is 0), and only that node prints the decoded message.
131+
132+
### Example 2
133+
134+
[./src/example2.cpp](src/example2.cpp)
135+
136+
The example shows the distributed nature of dr data structures. The `distributed_vector` has segments located on each of the nodes running the example. The nodes introduce themselves at the beginning. You can try different numbers of MPI processes when calling `mpirun`.
137+
The `iota()` function is aware of the `distributed_vector` layout and fills the segments accordingly. Then node 0 prints general information about the vector, and every node prints the size and content of its local part.
138+
139+
### Example 3
140+
141+
[./src/example3.cpp](src/example3.cpp)
142+
143+
The example simulates the elementary 1-d cellular automaton (ECA). A description of what the automaton is and how it works can be found on [Wikipedia](https://en.wikipedia.org/wiki/Elementary_cellular_automaton). A visualisation of how the automaton works is available on the [ASU team webpage](https://elife-asu.github.io/wss-modules/modules/1-1d-cellular-automata).
144+
145+
The ECA calculates the new value of a cell from the old value of the cell and the old values of its neighbors. Therefore a halo 1 cell wide is used to access neighboring cells' values when the loop reaches the end of the local segment of a vector.
146+
Additionally, the example shows the use of a subrange and of the `transform()` function, which writes transformed values of the input structure to the output structure, element by element. The transforming function is given as the lambda `newvalue`.
147+
_Please note: after each loop iteration the vector content is printed with `fmt::print()`. The formatter function for `distributed_vector` is rather slow, as it fetches the vector element by element, from both the local node and remote nodes. You may want to implement a custom, more efficient way of presenting the results._
148+
149+
<!--
150+
Consider adding one more example:
151+
*Simple 2-D operation - Find a pattern in the randomly filled array*
152+
-->

src/CMakeLists.txt

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
# SPDX-FileCopyrightText: Intel Corporation
2+
#
3+
# SPDX-License-Identifier: BSD-3-Clause
4+
5+
add_compile_options(-fsycl)
6+
add_link_options(-fsycl)
7+
8+
if(ENABLE_CUDA)
9+
add_compile_options(-fsycl-targets=nvptx64-nvidia-cuda
10+
-Wno-error=unknown-cuda-version)
11+
add_link_options(-fsycl-targets=nvptx64-nvidia-cuda
12+
-Wno-error=unknown-cuda-version)
13+
endif()
14+
15+
add_executable(example1 example1.cpp)
16+
17+
target_compile_definitions(example1 PRIVATE DR_FORMAT)
18+
target_link_libraries(example1 DR::mpi fmt::fmt)
19+
20+
add_executable(example2 example2.cpp)
21+
22+
target_compile_definitions(example2 PRIVATE DR_FORMAT)
23+
target_link_libraries(example2 DR::mpi fmt::fmt)
24+
25+
add_executable(example3 example3.cpp)
26+
27+
target_compile_definitions(example3 PRIVATE DR_FORMAT)
28+
target_link_libraries(example3 DR::mpi fmt::fmt)

src/example1.cpp

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
// SPDX-FileCopyrightText: Intel Corporation
2+
//
3+
// SPDX-License-Identifier: BSD-3-Clause
4+
5+
#include <dr/mhp.hpp>
6+
#include <fmt/core.h>
7+
8+
namespace mhp = dr::mhp;
9+
10+
int main(int argc, char **argv) {
11+
12+
mhp::init(sycl::default_selector_v);
13+
14+
mhp::distributed_vector<char> dv(80); // the encoded message is 80 characters long
15+
std::string decoded_string(80, 0);
16+
17+
mhp::copy(
18+
0,
19+
std::string("Mjqqt%|twqi&%Ymnx%nx%ywfsxrnxnts%kwtr%ymj%tsj%fsi%tsq~%"
20+
"Inxywngzyji%Wfsljx%wjfqr&"),
21+
dv.begin());
22+
23+
mhp::for_each(dv, [](char &val) { val -= 5; });
24+
mhp::copy(0, dv, decoded_string.begin());
25+
26+
if (mhp::rank() == 0)
27+
fmt::print("{}\n", decoded_string);
28+
29+
mhp::finalize();
30+
31+
return 0;
32+
}

src/example2.cpp

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
// SPDX-FileCopyrightText: Intel Corporation
2+
//
3+
// SPDX-License-Identifier: BSD-3-Clause
4+
5+
#include <dr/mhp.hpp>
6+
#include <fmt/core.h>
7+
8+
namespace mhp = dr::mhp;
9+
10+
int main(int argc, char **argv) {
11+
12+
mhp::init(sycl::default_selector_v);
13+
14+
fmt::print(
15+
"Hello, World! Distributed ranges process is running on rank {} / {} on "
16+
"host {}\n",
17+
mhp::rank(), mhp::nprocs(), mhp::hostname());
18+
19+
std::size_t n = 100;
20+
21+
mhp::distributed_vector<int> v(n);
22+
mhp::iota(v, 1);
23+
24+
if (mhp::rank() == 0) {
25+
auto &&segments = v.segments();
26+
fmt::print("Created distributed vector of size {} with {} segments.\n",
27+
v.size(), segments.size());
28+
}
29+
30+
fmt::print("Rank {} owns segment of size {} and content {}\n", mhp::rank(),
31+
mhp::local_segment(v).size(), mhp::local_segment(v));
32+
33+
mhp::finalize();
34+
35+
return 0;
36+
}

src/example3.cpp

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
// SPDX-FileCopyrightText: Intel Corporation
2+
//
3+
// SPDX-License-Identifier: BSD-3-Clause
4+
5+
#include <dr/mhp.hpp>
6+
#include <fmt/core.h>
7+
8+
namespace mhp = dr::mhp;
9+
10+
/* The example simulates the elementary 1-d cellular automaton. Description of
11+
* what the automaton is and how it works can be found at
12+
* https://en.wikipedia.org/wiki/Elementary_cellular_automaton
13+
* A visualisation of how the automaton works is available at
14+
* https://elife-asu.github.io/wss-modules/modules/1-1d-cellular-automata
15+
* (credit: Emergence team @ Arizona State University)*/
16+
17+
constexpr std::size_t asize = 60;
18+
constexpr std::size_t steps = 60;
19+
20+
constexpr uint8_t ca_rule = 28;
21+
22+
auto newvalue = [](auto &&p) {
23+
auto v = &p;
24+
uint8_t pattern = 4 * v[-1] + 2 * v[0] + v[1];
25+
return (ca_rule >> pattern) % 2;
26+
};
27+
28+
int main(int argc, char **argv) {
29+
30+
mhp::init(sycl::default_selector_v);
31+
32+
auto dist = dr::mhp::distribution().halo(1);
33+
mhp::distributed_vector<uint8_t> a1(asize + 2, 0, dist),
34+
a2(asize + 2, 0, dist);
35+
36+
auto in = rng::subrange(a1.begin() + 1, a1.end() - 1);
37+
auto out = rng::subrange(a2.begin() + 1, a2.end() - 1);
38+
39+
/* initial value of the automaton - customize it if you want to */
40+
in[0] = 1;
41+
42+
if (mhp::rank() == 0)
43+
fmt::print("{}\n", in);
44+
45+
for (std::size_t s = 0; s < steps; s++) {
46+
dr::mhp::halo(in).exchange();
47+
48+
mhp::transform(in, out.begin(), newvalue);
49+
50+
std::swap(in, out);
51+
52+
/* fmt::print() is rather slow here, as it gets element by element from
53+
* remote nodes. Use with care. */
54+
if (mhp::rank() == 0)
55+
fmt::print("{}\n", in);
56+
}
57+
58+
mhp::finalize();
59+
60+
return 0;
61+
}
