Evaluating GPU Programming Models for the LUMI Supercomputer
Posted by Julia Werner
LUMI: "A European powerful green supercomputer to solve current and future scientific challenges"
June 14, 2022 - by CSCS
LUMI is the first pre-exascale supercomputer of the EuroHPC Joint Undertaking and is Europe’s most powerful supercomputer. LUMI stands for "Large Unified Modern Infrastructure" and will help scientists to solve global challenges and to promote a green transformation. It is run by a consortium of 10 countries with long traditions and knowledge of scientific computing.
The Swiss Confederation with the Swiss National Supercomputing Centre (CSCS) is a member of the LUMI consortium. On Monday 13 June 2022, CSCS was delighted to celebrate the inauguration of LUMI with its partners in Kajaani.
“For CSCS and our users, LUMI is a valuable addition to our Alps research infrastructure,” says CSCS Director Thomas Schulthess. “Powerful systems like LUMI are needed by researchers to find answers to the challenges humanity is facing today. Finland, which produces cheap and CO2-free electricity, is an optimal location for installing such a large leadership-class system.”
CSCS users already have the opportunity to test LUMI and use it for their simulations. “LUMI is available to the user community through various European and CSCS calls for proposals. Being part of this consortium is of special importance for our country. It represents the channel that allows Switzerland to be part of the European landscape and to have a voice in the future of the European computing and data infrastructure,” says Maria Grazia Giuffreda, Associate Director at CSCS and responsible for the User Program. “Now it is up to our scientists to take full advantage of this modern and powerful infrastructure.”
Official press release: LUMI, Europe's most powerful supercomputer, is solving global challenges and promoting a green transformation
Picture above: LUMI supercomputer (Image: Pekka Agarth)
AMD's next-gen CPU and GPUs power LUMI supercomputer in 2021
AMD's next-gen Zen 3-based EPYC and Radeon Instinct GPUs will power the new LUMI supercomputer in Kajaani, Finland in 2021.
Hewlett Packard Enterprise (HPE) has just unveiled its next-gen LUMI supercomputer, which is powered by AMD's next-gen Zen 3-based EPYC processors and Radeon Instinct GPUs.
The new LUMI supercomputer will find its new home in Kajaani, Finland in 2021 -- and will be using the HPE Cray EX architecture to spin up 550 Petaflops of peak horsepower. The new LUMI supercomputer will be a part of EuroHPC's GPU-accelerated supercomputing platform powered by next-gen AMD CPUs and GPUs.
Forrest Norrod, senior vice president and general manager of AMD's data center and embedded systems group, explains:
"AMD is proud to join with HPE to power the upcoming LUMI supercomputer to advance scientific research in artificial intelligence, weather forecasting, pharmaceutical discovery, and more. Our next-generation AMD EPYC CPUs and AMD Instinct GPUs, coupled with HPE's unique supercomputing technologies, are fueling new capabilities in high performance computing, and we are excited to strengthen the European research community through our support".
Evaluating GPU Programming Models for the LUMI Supercomputer
In this section, we present a few programming models that we plan to use on LUMI; later, we describe in which situations to use them.
3.1 HIP
The Radeon Open Compute (ROCm) platform [12, 13] includes programming models to develop codes for AMD GPUs. Among those is the Heterogeneous-compute Interface for Portability (HIP) [14]. HIP is a C++ API and kernel language to create portable applications for the AMD ROCm platform as well as NVIDIA GPUs using the same source code. It is open source, it provides an API to port your code, and the syntax is very similar to CUDA. It supports a large subset of the CUDA runtime functionality and has almost no negative performance impact over coding directly in CUDA. HIP includes features such as C++11 lambdas, namespaces, classes, templates, etc. The HIPify [15] tools convert CUDA code to HIP. Of course, tuning will be required for each specific GPU.
Table 1 exemplifies some similarities between CUDA and HIP. In most cases, replacing the "cuda" prefix with "hip" in the function name, and likewise in the argument types, is enough to translate an API call. However, not all of the CUDA API is supported in HIP yet. Launching a GPU kernel is similar, as the table shows, but HIP also provides a dedicated launch API called hipLaunchKernelGGL.
Table 1. Convert CUDA code to HIP
With the HIP translation tools, a common approach is to semi-automatically port a CUDA code to HIP. For example, running the command hipify-perl -inplace on a CUDA source file translates its CUDA API calls to HIP API calls in place (a backup of the original file is kept with a .prehip suffix). There is also hipconvertinplace-perl.sh to translate all source files of a directory, as well as a version of the HIPify tool that is based on the clang compiler. For more details about porting codes, see, for example, [16, 17].
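As a rough illustration of the kind of correspondence listed in Table 1, and of what hipify-perl typically produces, consider the following sketch of a vector-add kernel and its host code; the kernel, variable names, and launch configuration are ours and purely illustrative:

```c++
#include <hip/hip_runtime.h>

// Illustrative vector-add kernel: the kernel source is identical in CUDA and HIP.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void run(const float* h_a, const float* h_b, float* h_c, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);

    // CUDA: cudaMalloc(&d_a, bytes);   HIP: hipMalloc(&d_a, bytes);
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);

    // CUDA: cudaMemcpy(..., cudaMemcpyHostToDevice);
    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    // CUDA launch: vec_add<<<blocks, 256>>>(d_a, d_b, d_c, n);
    // HIP supports the same triple-chevron syntax, or the hipLaunchKernelGGL API:
    int blocks = (n + 255) / 256;
    hipLaunchKernelGGL(vec_add, dim3(blocks), dim3(256), 0, 0, d_a, d_b, d_c, n);

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);
    hipFree(d_a); hipFree(d_b); hipFree(d_c);
}
```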
For CUDA Fortran codes, the following steps are required (a minimal sketch follows the list; further details are available at [17]):
Port CUDA Fortran code to HIP kernels in C++. The hipfort API helps to invoke the HIP API from Fortran.
Wrap the kernel launch in a function with C calling convention.
Call the launch function from Fortran through the Fortran 2003 C bindings.
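A minimal sketch of the C++ side of steps 1 and 2 (the kernel and wrapper names are illustrative, not from the paper); the Fortran side would then declare launch_saxpy in an interface block using iso_c_binding and call it directly:

```c++
#include <hip/hip_runtime.h>

// Step 1: the CUDA Fortran kernel rewritten as a HIP kernel in C++.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Step 2: wrap the kernel launch in a function with C calling convention,
// so that Fortran can call it through the Fortran 2003 C bindings.
extern "C" void launch_saxpy(int n, float a, const float* x, float* y) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    hipLaunchKernelGGL(saxpy_kernel, grid, block, 0, 0, n, a, x, y);
    hipDeviceSynchronize();
}
```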
3.2 The OpenMP Application Programming Interface
The OpenMP API has supported offloading computation to accelerator devices since version 4.0 and has continuously refined and extended these features since then [18]. It provides a variety of target directives that control the transfer of data (if needed), the transfer of control flow, and parallelism on the target device. OpenMP also offers low-level API interfaces for memory allocation and data transfers, similar to the interfaces of the CUDA and HIP programming models.
This is a very basic example of an OpenMP offload region, running code on a GPU.
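A minimal sketch, assuming a simple vector-add loop (the function and array names are illustrative):

```c++
// Basic OpenMP offload region: the loop is executed on the default target device.
void vec_add(const double* a, const double* b, double* c, int n) {
    // map(to: ...) copies the inputs to the device, map(from: ...) copies the result back.
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```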
In the example above, the target construct transfers the control flow from the host device to the default target device (the host thread will await completion of the offload region). The map clauses are used to specify the data that is needed for execution as well as the direction of the data flow. If the host and accelerator have distinct memories, the OpenMP implementation will perform an actual transfer. If host and device have a shared memory (emulation), the map clauses do not issue an actual data transfer.
Since the OpenMP API supports not only GPU-like architectures as target devices, the OpenMP Language Committee made a design decision to separate offload directives from parallelism directives. Through this decision, programmers can use the best-matching OpenMP directives to create parallelism for a specific target architecture. Also, the OpenMP API supports a more descriptive approach via the loop construct instead of the teams distribute parallel for construct, as illustrated below.
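A sketch of the two variants applied to the same loop (the daxpy loop and names are illustrative):

```c++
// Prescriptive form: explicitly request teams, work distribution, threads, and SIMD lanes.
void daxpy_prescriptive(int n, double a, const double* x, double* y) {
    #pragma omp target teams distribute parallel for simd map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Descriptive form: the loop construct lets the implementation choose the parallelization.
void daxpy_descriptive(int n, double a, const double* x, double* y) {
    #pragma omp target teams loop map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```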
The teams distribute directive then partitions the loop iteration space across the available warps or wavefronts, while the parallel for simd constructs can parallelize the partitioned loop for the available GPU threads. Another approach is to map parallel for to a single GPU thread and use simd to create parallelism within the warp/wavefront. OpenMP explicitly allows for this flexibility in laying out the execution on the GPU, such that implementations can pick the best possible strategy.
Many compilers now have (partial) support for version 5.0 and version 5.1 of the OpenMP API. In this work, we use only OpenMP offloading as we benchmark GPU accelerators. For AMD GPUs, we rely on the AMD OpenMP compiler (AOMP).
3.3 SYCL
SYCL [19] is an open standard for heterogeneous programming. It is developed and maintained by the Khronos Group. Unlike other heterogeneous programming models, SYCL does not introduce any custom syntax extensions or pragmas; instead, it expresses heterogeneous data parallelism with pure C++. The latest SYCL version is SYCL 2020, which relies on C++17. Originally, SYCL was intended as a higher-level single-source model for OpenCL. This means that, in contrast to OpenCL, host and device code reside in the same source file in SYCL and are processed together by the SYCL compiler. Starting with SYCL 2020, a generalized backend architecture was introduced that allows for backends other than OpenCL. Backends used by current SYCL implementations include OpenCL, Level Zero, CUDA, HIP and others.
While a more task-oriented model is available as well, SYCL currently strongly focuses on data parallel kernels. The execution of these kernels is organized by a task graph that is maintained by the SYCL runtime. There are two memory management models in SYCL: the buffer-accessor model and the unified shared memory (USM) model.
In the buffer-accessor model, the SYCL runtime handles data transfers automatically according to data access specifications given by the programmer. These are also used by the SYCL runtime to automatically construct a task graph for the execution of kernels. In the pointer-based USM model, the programmer is responsible for correctly inserting dependencies between kernels and making sure that data is available on the device when necessary. While the buffer-accessor model may introduce overheads due to the evaluation of the access specifications and the calculation of kernel dependencies, the scheduler receives detailed information that can be used to optimize the task graph execution.
The execution model in SYCL is largely inherited from OpenCL. Parallel work items are grouped into work groups, and synchronization is only possible within a work group. Starting with SYCL 2020, work groups are additionally subdivided into subgroups that are typically mapped to SIMD units. On GPUs, a SYCL work group usually corresponds to a thread block from HIP or a team in the OpenMP model. As such, the SYCL work-group size is a tuning parameter as in those other models. In SYCL, multiple methods exist to invoke kernels. In the simplest method, parallel_for, the work groups are not exposed and, on GPUs, a SYCL implementation automatically selects an appropriate work group size. In the more complex nd_range model, the user is responsible for choosing an appropriate work group size.
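A sketch of a simple kernel using the buffer-accessor model and the basic parallel_for method (the function and buffer names are illustrative; with the nd_range variant, the work-group size would instead be given explicitly):

```c++
#include <sycl/sycl.hpp>
#include <vector>

// Simple vector add with the buffer-accessor model; the runtime derives data
// transfers and kernel dependencies from the accessors declared below.
void vec_add(std::vector<float>& a, std::vector<float>& b, std::vector<float>& c) {
    sycl::queue q;  // default device selection

    sycl::buffer<float, 1> buf_a(a.data(), sycl::range<1>(a.size()));
    sycl::buffer<float, 1> buf_b(b.data(), sycl::range<1>(b.size()));
    sycl::buffer<float, 1> buf_c(c.data(), sycl::range<1>(c.size()));

    q.submit([&](sycl::handler& h) {
        sycl::accessor acc_a(buf_a, h, sycl::read_only);
        sycl::accessor acc_b(buf_b, h, sycl::read_only);
        sycl::accessor acc_c(buf_c, h, sycl::write_only, sycl::no_init);

        // Basic parallel_for: work groups are not exposed; on a GPU the
        // implementation picks an appropriate work-group size.
        h.parallel_for(sycl::range<1>(a.size()), [=](sycl::id<1> i) {
            acc_c[i] = acc_a[i] + acc_b[i];
        });
    });
}   // destruction of the buffers writes the result back to the host vectors
```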
There are multiple implementations of SYCL. The most well-known include ComputeCpp [20], DPC++ [21], hipSYCL [22] and triSYCL [23]. In this work, we use hipSYCL as it has mature support for both of the GPU platforms investigated here. hipSYCL consists of a multi-backend runtime with support for CPUs and GPUs from AMD, NVIDIA and Intel, the SYCL kernel and runtime header library, as well as a compiler component with a unified compiler driver called syclcc. This compiler component is designed to integrate with existing compiler toolchains. For example, when compiling for NVIDIA and AMD GPUs, hipSYCL acts as an additional layer on top of CUDA and HIP. During compilation, hipSYCL loads an additional clang plugin that extends clang's native HIP and CUDA support with support for SYCL-specific constructs, such as automatic kernel detection and outlining. This design not only allows a user to mix and match CUDA or HIP kernel code with SYCL code even within one kernel; it also allows using vendor-supported toolchains with hipSYCL, since e.g. AMD's official ROCm HIP compiler uses the same clang HIP toolchain. Consequently, hipSYCL can be deployed on top of the AMD HIP compiler.
3.4 OpenACC
OpenACC is a directive-based programming model for GPUs that has evolved significantly since its beginning. Initially, there were two options for OpenACC support on LUMI. First, the HPE/Cray compiler, which supports only Fortran and OpenACC version 2.7, with potential support for up to v3.1 by the end of 2022. Second, the GNU compiler [24], whose support is not covered by a contractual agreement. Thus, we do not recommend OpenACC without also mentioning these caveats.
For illustration, the following OpenACC directive uses a few clauses: the gang clause corresponds to the thread blocks, the worker clause to the warp or wavefront, and the vector clause to the threads.
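A minimal sketch, assuming a simple vector-add loop (the function and data clauses are illustrative):

```c++
// The gang/worker/vector clauses map the loop onto thread blocks,
// warps/wavefronts, and threads, respectively.
void vec_add(int n, const float* a, const float* b, float* c) {
    #pragma acc parallel loop gang worker vector copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```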
As GCC offloading to AMD MI100 GPUs currently focuses on functionality rather than performance, we do not report OpenACC results. We note, though, that GCC v10.3, v11.1, and later fix an issue where GPU memory was cleared too often; as a result, performance on NVIDIA GPUs improved by almost 30% for all BabelStream kernels except the dot kernel, whose performance remained similar. Moreover, in the future we plan to explore a research project called clacc [25, 26] that provides OpenACC support for Clang and LLVM. This will allow for simplified porting of OpenACC codes to the OpenMP API (amongst other benefits).
3.5 Alpaka
The Abstraction Library for Parallel Kernel Acceleration (alpaka) [27] is a header-only C++14 abstraction library for accelerator development and portability. Originally developed to support large-scale scientific applications like PIConGPU [28], alpaka enables an accelerator-agnostic implementation of hierarchical redundant parallelism; that is, the API allows a user to specify data and task parallelism at multiple levels of compute and memory for a particular platform. Currently, alpaka provides backends for OpenMP, (C++) threads, Intel Threading Building Blocks, CUDA, HIP, and SYCL (for FPGAs), with new directive-based backends in development.
Alpaka code can be used to express hierarchical parallelism for both CPU-style and GPU devices. In addition to grids, blocks, and threads, alpaka also provides an element construct that represents an n-dimensional set of inputs that is amenable to vectorization by the compiler. This extra level of parallelism is key to achieving good performance when mapping GPU-style kernels to CPU architectures that offer SIMD instructions as part of their instruction set architecture.
In addition to the optimized kernels via alpaka, users can also use the C++ User interface for the Platform independent Library Alpaka (cupla) [29] to port CUDA code to use the alpaka library. Cupla codes have a very similar syntax to regular CUDA kernels and can include calls to the CUDA API for data allocation and movement. While cupla introduces some host-side API call overhead compared to pure alpaka, it provides a suitable path to map existing codes to alpaka’s supported backends.
3.6 Kokkos
The Kokkos [30] C++ Performance Portability Ecosystem is a framework for writing modern C++ applications with portability across a variety of hardware. It is part of the Exascale Computing Project (ECP) and is used by many HPC users and packages. It supports several backends, such as CUDA, HIP, SYCL, and OpenMP offloading to target various accelerators, including NVIDIA and AMD GPUs.
The Kokkos abstraction layer maps C++ source code to the specific instructions required for the backend at build time. When the source code is compiled, the binary is built for the backends declared at configuration time.
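A sketch of a simple Kokkos kernel (the kernel and view names are illustrative); the same source could be built with, for example, the HIP backend enabled at configuration time (e.g. -DKokkos_ENABLE_HIP=ON) to target AMD GPUs:

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        const double a = 2.0;

        // Views are allocated in the default execution space's memory (e.g. GPU memory).
        Kokkos::View<double*> x("x", n);
        Kokkos::View<double*> y("y", n);

        // The parallel_for is dispatched to the backend selected at build time
        // (CUDA, HIP, SYCL, OpenMP, ...).
        Kokkos::parallel_for("daxpy", n, KOKKOS_LAMBDA(const int i) {
            y(i) += a * x(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```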