The whole GPU MPI approach starts with the following idea: What if we can take an existing MPI program and run it on GPU? We would be able to have thousands of MPI processes if we map each GPU thread to an MPI process. This would allow us to harness the compute power of GPUs with small or no code changes, since virtually all HPC codes already use MPI parallelization.
At the first glance, it should not be difficult to provide a GPU-specific MPI implementation. For example, communication between GPU threads could be done by reading and writing into GPU global memory.
Many problems should be solved in the process of GPU MPI implementation. The first of them is the programming language. As a starting point, we limited ourselves to the C/C++ source programs and CUDA target programs. The reason for such a choice is that most of the existing MPI programs are written in C/C++ or Fortran. CUDA syntax is almost compatible with C++ and is the native language for GPU programming.
The next question is how the build toolchain should look like. The most apparent solution is providing our own mpicc compiler wrapper that hides the details of the GPU-specific implementation. Unfortunately, this will not work when mpicc is used only for generating object files that are linked with the default linker (for example, ld) later. The problem is that CUDA may require to invoke its own linker (nvlink) when multiple object files contain GPU code as part of the separate compilation of relocatable device code in CUDA. This implies that any possible solution will be platform-dependent (we would need to replace ld calls with something provided by us), so support for any possible source project build toolchain is problematic.
For now, we decided to stick to the design principle of supporting 90% of use cases without requiring the user to change its build system at all. For this, we rely on build dependencies that we can extract from the compiler_commands.json file. It can be properly generated for most Makefile- or CMake-based projects. The high-level overview of the toolchain is provided in Fig. 1.
Since we have no link-time dependencies available, we just link all object files without the main function in a large object file that is linked to each translation unit with the main function present. While such an approach may fail due to duplicate symbols, we assume it rarely occurs in practice.
After gathering information about the compilation dependencies, we still can’t run the CUDA nvcc compiler as is on the source files. If extracted dependencies contain compiler-specific flags (e.g. gcc-specific flags) that are not understood by nvcc, the build will just fail. We decided to drop all flags with the exception of include dirs (-I) and global definitions (-D).
If we want to map MPI processes to GPU threads, then we need to do it as early as possible, before the entry point in the main function. This implies that all the functions in the source program should be annotated with __device__ to force nvcc to generate assembly for GPU. For this, we picked clang LibTooling API, which allows implementing source-to-source converters based on accurate program AST.
In addition to these annotations, we also add explicit casts instead of implicit ones, because after the conversion, if C code is converted to CUDA (which is almost C++) then the code will be compiled with only a subset of allowed implicit casts.
The next semantical difference between the C++ MPI process and CUDA thread is the behavior of global variables and static variables in the function scope. They are annotated with __device__ by source converter as well. However, they are unique per the MPI process, so the user can write the code with the assumption that they do not interfere. Unfortunately, global variables are shared among CUDA threads, so there should be some mechanism to replace accesses to them with something private to the thread.
The straightforward solution is to replace the definition of the global variable (int x;) with the definition of an array of such variables (int x[nthreads];), with the size of the array equal to the number of threads. Then each access (x = 10;) is replaced by access to a thread-local copy of the variable (x[thread_rank] = 10;). Such a solution will not work as-is, because the number of threads is not fixed at compile-time. These global arrays should be recorded and allocated at the program start. The implementation became even more difficult with the presence of default initializers and the fact that the converter operates on the level of the translation unit, but multiple extern symbols referring to the same variable became known only at the linking stage.
This is why we picked a less performant solution: we establish a mapping from global variable pointers (&x) to the per-thread allocated variables and on each access, we make a lookup by enclosing global variables into function call (__gpu_global(x) = 10;). This solution can drastically decrease performance, but we assume that global variables are not used often in existing programs.
The next issue is that host library functions can’t be called from GPU, so any external library dependencies should undergo source conversion as well. Unfortunately, this means we need to convert entire C and C++ standard libraries and replace calls to existing host implementations in the source program by ours. There already exists a partial standard library implementation from NVIDIA called libcudacxx. Unfortunately, it is still in the early stages of development, so most of the API is not supported. More than that, there are principal limitations of some API: file I/O and POSIX-specific OS functions can’t be implemented on GPU, they should be redirected to the host CPU.
To avoid detection of standard library function names in source converter, we heavily utilize C++ preprocessor. First, the converter detects all standard includes enclosed into < and > and adds the suffix “.cuh” to them. For example, <string.h> becomes <string.h.cuh>. File <string.h.cuh> is provided by GPU MPI and contains name substitutions “#define strlen __gpu_strlen” that redirect all uses of functions in user code by our implementations. This file with macro substitutions first includes standard library headers to avoid default headers break due to our replacements. It also includes our implementation header (e.g. string.cuh) that has a corresponding .cu file (string.cu) with implementation of functions with “__gpu_” prefix.
Functions that require host CPU intervention (for example writing to file) post messages in the pinned host memory, that are extracted by the host and processed with the host-side function call. After that, the host posts the completion message into pinned memory, which is checked by the GPU thread to obtain the function return value. While such implementation is expected to be inefficient, we assume that existing programs don’t use them in computationally-heavy parts of code.
We decided to not make any mpirun-like launcher because it is not required for running the converted program in a single-host case. We provide all launch-specific parameters from the command line right after application-specific function args. To distinguish between parameters, GPU MPI parameters specified after ---gpumpi flag. For example, if the initial program was executed as “mpirun -np 4 ./a.out param1 param2”, then to specify the same launch parameters, a converted version of this program with GPU MPI will be launched as “./a.out param1 param2 ---gpumpi -g 2 -b 2”, where g and b parameters specify CUDA grid size and block size. In this example, 4 threads are launched in total, meaning that there will be 4 MPI processes.
The CUDA kernel that represents the application's main function can’t be called on its own. We provide host-side code that actually runs this kernel. In addition, this launcher code initializes the state that is used by MPI implementation on GPU.
The MPI state stores information required to implement MPI API. There are two kinds of state: the shared state which is accessible by all threads and the private state, which is unique per thread. For example, point-to-point operations utilize message queues that exist inside the shared state. All threads can post and get messages from there. In contrast, a private state contains information about pending operations that are needed to be accessible only by a particular thread that owns the copy of the private state.
There are two types of MPI implementations: synchronous and asynchronous. We decided to make our implementation synchronous due to the fact that any threads that only do MPI management are a potential loss of performance. First, because in the presence of warps the entire warp will be blocked by a single thread. Second, because the GPU scheduler can allocate time for processing management threads even if there is no useful work for them to do.
In the initial version, we structure our MPI implementation including collectives around point-to-point API. Such an approach is good for fast prototyping for its simplicity, but sometimes it is a lost optimization opportunity as some collectives (for example MPI_Reduce) can be implemented more efficiently on GPU. For example, CUDA library CUB uses warp-level intrinsics and block-level data exchange through shared memory that can make some collective operations almost as fast as GPU memory bandwidth.
Message matching and progress
Our MPI implementation tries not to block if it is not asked explicitly by the user, Pending sends and receives stored in “local” memory. “Local” memory actually is an array in GPU global memory which is accessed by the index of current process rank. Pending sends and receives contain the list of unfinished operations for which an attempt to make a progress is made each time a call to MPI_Test or another function of MPI API is done.
The memory layout of GPU MPI implementation data structures is shown in Fig. 2. The matching of sends and receives is done through the use of unexpected and receive queues following the approach of existing MPI implementations. To allocate temporary buffers for point to point data exchange “Pool of memory fragments'' is used. Threads participating in the data exchange also need to agree on the address of the temporary buffer in use. This is done through “Fragment pointers'' lists.
To measure the possible performance of inter-thread data transfers in CUDA we implemented ping-pong benchmark results of which are shown on Fig. 3. The numbers are gathered on NVIDIA V100 GPU. It is clearly visible that even the largest bandwidth (~37 MB/s) is far from the theoretical peak (900 GB/s) for this GPU. The reason for such poor performance is that only a pair of threads are used to initiate data transfer. It shows that MPI_Sendrecv-like functions are expected to be so slow in GPU MPI implementation.
In contrast, peak measured bandwidth of GPU memory copy in which all GPU threads participate in a coalesced way, achieves 380 GB/s with a theoretical peak of 450 GB/s since it involves both read and write in global memory for each byte of data. This leaves hope that collective MPI operations can be quite fast if MPI_COMM_WORLD is used.
To check that the issue is not in the small amount of send/recv operations that fail to saturate memory bandwidth, we made a cumulative ping-pong benchmark that involves all CUDA threads, but each pair of threads does send/recv unrelated to other threads. The peak performance that it shows is 6 GB/s which corresponds to the peak bandwidth from Fig. 3 multiplied by the number of available thread pairs.
The first memory issue is related to its amount. Users of existing MPI implementations can rely on the fact that each MPI process has access to a large amount of cluster node main memory, that is shared with a small amount of other processes on the same node or available to the process entirely. In contrast, the proposed GPU MPI design assumes that all MPI processes reside on a single GPU that has very limited amount of memory, usually even smaller than single node main memory. This immediately restricts the amount of existing MPI applications that will work with GPU MPI as is.
The second memory issue is related to the performance of data transfers. For example, it is not possible to trigger full bandwidth memory transfers just from a single thread, which can be important for point-to-point operations. An illustration of that is cudaMemcpy implementation for copies between device global memory locations. It launches CUDA kernel internally just to iterate over the memory region and access it in a coalesced way from all the threads. A single threaded loop can’t achieve the same performance even if other threads in the warp are not active at the moment.
When threads of the same CUDA warp (a group of 32 threads) are not executing the same instruction, GPU can’t run them in parallel. It causes a performance penalty if MPI application or GPU MPI implementation contains a lot of branch instructions that diverge these threads. The assumption is made that existing MPI programs contain small amounts of thread divergence due to extensive use of collective operations on communicators almost the same as MPI_COMM_WORLD.
It is possible to check how viable this assumption is by running the GPU MPI program that already underwent conversion with CUDA nvprof/nsight profilers and collecting numbers of useful instructions executed and all instructions that include no-ops due to waiting for other threads in the a to finish their diverged branch.
The GPU MPI idea can be broadened to the extent of multiple GPUs on the same node and even to the multi-node multi-GPU clusters. The former can be done through CUDA cooperative multi device kernel launch feature that is available in the latest CUDA runtimes and GPUs. The latter can be implemented through the cooperation with existing CPU MPI implementations. Launch command for the example in the beginning could look like “mpirun -np 200 ./a.out param1 param2 ---gpumpi -g 2 -b 2 -d 10”, that will use 200 nodes with 1 CPU MPI process per node with 10 gpus per node and and 4 = 2*2 GPU threads per GPU. From the user perspective, all these processes will look like uniform MPI processes in default communicator that spans across all nodes and GPUs. This idea raises new questions to solve. The most important is how to adapt a message matching interface that takes multiple levels of data transfers. Even seemingly efficient transfers between threads of the same warp can become much slower in the presence of MPI wildcards because message matching should notify all threads that such a message exists. Efficient GPU MPI implementation will have to take this into account and probably incorporate heuristics that will amortize performance penalties in such cases.