Vectorization

Applying vectorization at the multicore-processor level reduces the wall-clock time needed to process a \(512^3\) grid by 42% and improves performance by 72%.
Tested in the CPU environment described by Kim et al. (2021).
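
The two percentages can be related by a short calculation, under the assumption (not stated in the source) that "performance" means throughput, i.e. the inverse of wall-clock time. The symbols \(T_0\) and \(T_v\), for the wall-clock time without and with vectorization, are introduced here only for illustration:

\[
T_v = (1 - 0.42)\,T_0 = 0.58\,T_0
\quad\Longrightarrow\quad
\frac{T_0}{T_v} = \frac{1}{0.58} \approx 1.72
\]

Under that assumption, a 42% shorter run time corresponds to roughly 72% higher throughput, so the two figures describe the same measurement.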

CPU Parallel benchmark

Precondition

Parallelization with OpenMP and MPI scales best when the number of grid points is large (\(512^3\)). When the number of grid points is small (\(64^3\)), parallel performance degrades.

CPU
Note
[CPU Environment]
All the computations were executed on the Nurion manycore cluster at the Korea Institute of Science and Technology Information (KISTI). Nurion consists of 8,305 Cray CS500 nodes interconnected by the Intel Omni-Path Architecture. Each node has a 68-core Intel Xeon Phi 7250 processor with 16 GB of on-chip high-bandwidth memory and 96 GB of main memory. Each simulation is repeated for 10 time steps to average out any small fluctuations in the execution time. The executable was built using an Intel compiler (version 19.0.1.144) with the flags for full optimization (-O3) and automatic vectorization targeting the hardware architecture (-xMIC-AVX512).
This environment was described by Kim et al. (2021).
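
To illustrate the kind of loop that automatic vectorization can exploit, here is a minimal sketch in C. It is not taken from the PaScaL_TDMA source, and the function names `tdma_many` and `tdma_demo_max_error` are hypothetical. It solves many independent tridiagonal systems with the Thomas algorithm, storing the data so that the system index is the unit-stride direction. The inner loops then carry no dependence between iterations, which is the pattern a compiler can typically map to SIMD instructions. Whether a given compiler actually vectorizes these loops depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PaScaL_TDMA): solves n_sys independent
 * tridiagonal systems of n_row unknowns each with the Thomas algorithm.
 * Element (i, j) of every array lives at index i * n_sys + j, so the
 * system index j is the unit-stride direction.
 *   a: sub-diagonal (row 0 is not read)
 *   b: diagonal
 *   c: super-diagonal, overwritten (last row is not used in the solution)
 *   d: right-hand side on entry, solution on exit */
void tdma_many(int n_row, int n_sys,
               const double *restrict a, const double *restrict b,
               double *restrict c, double *restrict d)
{
    assert(n_row > 0 && n_sys > 0);

    /* Normalize the first row of every system. */
    for (int j = 0; j < n_sys; ++j) {
        c[j] /= b[j];
        d[j] /= b[j];
    }

    /* Forward elimination: sequential in i, independent in j. */
    for (int i = 1; i < n_row; ++i) {
        const size_t cur = (size_t)i * n_sys;
        const size_t prev = cur - n_sys;
        for (int j = 0; j < n_sys; ++j) {
            const double r = 1.0 / (b[cur + j] - a[cur + j] * c[prev + j]);
            c[cur + j] *= r;
            d[cur + j] = (d[cur + j] - a[cur + j] * d[prev + j]) * r;
        }
    }

    /* Back substitution: again sequential in i, independent in j. */
    for (int i = n_row - 2; i >= 0; --i) {
        const size_t cur = (size_t)i * n_sys;
        const size_t next = cur + n_sys;
        for (int j = 0; j < n_sys; ++j) {
            d[cur + j] -= c[cur + j] * d[next + j];
        }
    }
}

/* Builds a small problem with a known solution and returns the largest
 * absolute error after solving. All sizes and values are illustrative. */
double tdma_demo_max_error(void)
{
    enum { N_ROW = 5, N_SYS = 4, N = N_ROW * N_SYS };
    double a[N], b[N], c[N], d[N], x[N];

    for (int i = 0; i < N_ROW; ++i) {
        for (int j = 0; j < N_SYS; ++j) {
            const int k = i * N_SYS + j;
            a[k] = -1.0;
            b[k] = 4.0 + j;          /* diagonally dominant */
            c[k] = -1.0;
            x[k] = 1.0 + i + 0.5 * j; /* chosen exact solution */
        }
    }

    /* Right-hand side d = A x for each system. */
    for (int i = 0; i < N_ROW; ++i) {
        for (int j = 0; j < N_SYS; ++j) {
            const int k = i * N_SYS + j;
            d[k] = b[k] * x[k];
            if (i > 0)         d[k] += a[k] * x[k - N_SYS];
            if (i < N_ROW - 1) d[k] += c[k] * x[k + N_SYS];
        }
    }

    tdma_many(N_ROW, N_SYS, a, b, c, d);

    double max_err = 0.0;
    for (int k = 0; k < N; ++k) {
        const double diff = d[k] - x[k];
        const double err = diff < 0.0 ? -diff : diff;
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```

The recurrence along `i` is inherently sequential, so the sketch places the parallelism in `j` instead. The `restrict` qualifiers tell the compiler that the four arrays do not overlap, which removes one common obstacle to vectorizing the inner loops.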
GPU
Note
[GPU Environment]
We evaluated the computational performance and energy efficiency of the GPU implementation of PaScaL_TDMA 2.0 on the NEURON cluster at the Korea Institute of Science and Technology Information (KISTI). Each compute node of the cluster has two AMD EPYC 7543 processors (hosts) and eight NVLink-connected NVIDIA A100 GPUs (devices). The results were compared with those obtained on the NURION cluster at KISTI, which features an Intel Xeon Phi 7250 Knights Landing (KNL) processor per compute node. Intel oneAPI 22.2 and NVIDIA HPC SDK 22.7 were used to compile PaScaL_TDMA 2.0 on the NURION and NEURON clusters, respectively.
This environment was described by Yang et al. (2023).
