KOKKOS with GPUs

Overview

Teaching: 15 min
Exercises: 15 min

Questions
- How do I use KOKKOS together with a GPU?
- What is the performance like?

Objectives
- Use GPU acceleration through the KOKKOS package

Using GPU acceleration through the KOKKOS package
In this episode, we shall learn how to use GPU acceleration via the KOKKOS package in LAMMPS. In a previous episode, we learnt the basic syntax of the package command that is used to invoke the KOKKOS package in a LAMMPS run, and we briefly discussed its main arguments and the corresponding keywords. In this episode, we shall do practical exercises to gain further hands-on experience with those commands.
Command-line options to submit a KOKKOS GPU job in LAMMPS
In this episode, we'll learn to use the KOKKOS package with GPUs. As we have seen, to run the KOKKOS package the following three command-line switches are very important:

- -k on : enables KOKKOS at runtime
- -sf kk : appends the "/kk" suffix to KOKKOS-supported LAMMPS styles
- -pk kokkos : modifies the default KOKKOS package options

To invoke the GPU execution mode with KOKKOS, the -k on switch takes additional arguments for hardware settings, as shown below:

- -k on g Ngpu : specifies the number of GPU devices, Ngpu, that you want to use per node
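Putting these switches together, a minimal launch line looks like the sketch below (the launcher mpirun, the GPU count and the rank count are illustrative assumptions; adapt them to your system and resource manager):

# Minimal KOKKOS GPU invocation (illustrative): enable KOKKOS with 2 GPUs
# per node, one MPI rank per GPU, and append /kk to supported styles
mpirun -np 2 lmp -in in.lj -k on g 2 -sf kk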
Before you start

- Know your host: get the number of physical cores per node available to you.
- Know your device: know how many GPUs are available on your system and how to request them from your resource manager (SLURM, etc.).
- CUDA-aware MPI: check if you can use a CUDA-aware MPI runtime with your LAMMPS executable. If not, then you will need to add cuda/aware off to your <arguments>.
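A quick way to gather this information on a typical SLURM + NVIDIA system is sketched below (the partition name is taken from the job script later in this episode and may well differ on your machine):

# List the GPUs visible on the current node (requires the NVIDIA driver)
nvidia-smi --list-gpus

# Show sockets and cores per socket on the host
lscpu | grep -E 'Socket|Core'

# Inspect what a SLURM partition offers (partition name is illustrative)
scontrol show partition develgpus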
Creating a KOKKOS GPU job script

Create a job script to submit a LAMMPS job for the LJ system that you studied for the GPU package, such that it invokes KOKKOS GPU acceleration and
- uses 1 node,
- uses all available GPU devices on the node,
- uses the same number of MPI ranks per node as there are GPUs, and
- uses the default package options.
Solution

#!/bin/bash -x
# Ask for 1 node of resources for an MPI/GPU job for 10 minutes
#SBATCH --account=ecam
#SBATCH --nodes=1
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:10:00
# Configure the GPU usage (we request to use all 4 GPUs on a node)
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4
# Use this many MPI tasks per node (maximum 24)
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

module purge
module use /usr/local/software/jureca/OtherStages
module load Stages/Devel-2019a
module load intel-para/2019a
# Note we are loading a different LAMMPS package
module load LAMMPS/3Mar2020-gpukokkos

srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos cuda/aware off
If you run it, how does the execution time compare to the times you have seen for the GPU package?
A few tips on gaining speedup from KOKKOS/GPU

This information is collected from the LAMMPS website.

- Hardware compatibility: For better performance, you must use Kepler or later generations of GPUs.
- MPI tasks per GPU: You should use one MPI task per GPU because KOKKOS tries to run everything on the GPU, including the integrator and other fixes/computes. You may get better performance by assigning multiple MPI tasks per GPU if some styles used in the input script have not yet been KOKKOS-enabled.
- CUDA-aware MPI library: Using this can provide a significant performance gain. If it is not available, switch it off using the -pk kokkos cuda/aware off switch.
- neigh and newton: For KOKKOS/GPU, the default is neigh = full and newton = off. For the Maxwell and Kepler generations of GPUs, the default settings are typically the best. For the Pascal generation, setting neigh = half and newton = on might produce faster runs (see the example after this list).
- binsize: For many pair styles, setting the value of binsize to twice that used for the CPU styles can offer a speedup (and this is the default for the KOKKOS/GPU style).
- Avoid mixing KOKKOS and non-KOKKOS styles: In the LAMMPS input file, if you use styles that are not ported to use KOKKOS, you may experience a significant loss in performance. This penalty occurs because the data has to be copied back and forth between the host and the device repeatedly.
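For example, on a Pascal-generation GPU you might try overriding the neighbour-list and Newton defaults along the lines sketched below (the launcher, GPU count and input file are illustrative assumptions; whether this actually helps depends on your hardware and system):

# Override the KOKKOS/GPU defaults: half neighbour lists and Newton on,
# as suggested above for Pascal GPUs (4 GPUs, one MPI rank per GPU)
mpirun -np 4 lmp -in in.lj -k on g 4 -sf kk -pk kokkos neigh half newton on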
In the following discussion, we'll work through a few exercises to become familiar with some of these aspects.
Exercise: Performance penalty due to use of mixed styles

- First, let us take the input and job script for the LJ system from the last exercise. Make a copy of the job script that uses the following additional settings for KOKKOS:

  newton off
  neigh full
  comm device

  Use a number of MPI tasks equal to the number of devices. Measure the performance of this run in timesteps/s.
- Make a copy of the LJ input file called in.mod.lj and replace the line near the end of the file:

  thermo_style custom step time temp press pe ke etotal density

  with

  compute 1 all coord/atom cutoff 2.5
  compute 2 all reduce sum c_1
  variable acn equal c_2/atoms
  thermo_style custom step time temp press pe ke etotal density v_acn

- Using the same KOKKOS settings as before, and the identical number of GPUs and MPI tasks as previously, run the job script using the new input file. Measure the performance of this run in timesteps/s and compare the performance of these two runs. Comment on your observations.

Solution
Taking an example from an HPC system with 2x12 cores per node and 4 GPUs, using 1 MPI task per GPU, the following was observed.

First, we ran with in.lj. Second, we ran the modified input described above (in.mod.lj). The performance of both runs is measured in units of timesteps/s; we can get this information from the log/screen output files. The comparison of performance is given in this table:

Input                                                   Performance (timesteps/sec)   Performance loss (factor)
in.lj (all KOKKOS-enabled styles used)                  8.097                         -
in.mod.lj (non-KOKKOS style used: compute coord/atom)   3.022                         2.68

In in.mod.lj we have used a style that is not yet ported to KOKKOS. We can check this from the log/screen output files:

(1) pair lj/cut/kk, perpetual
    attributes: full, newton off, kokkos_device
    pair build: full/bin/kk/device
    stencil: full/bin/3d
    bin: kk/device
(2) compute coord/atom, occasional
    attributes: full, newton off
    pair build: full/bin/atomonly
    stencil: full/bin/3d
    bin: standard

In this case, the pair style is KOKKOS-enabled (pair lj/cut/kk) while the compute style compute coord/atom is not. Whenever you mix KOKKOS and non-KOKKOS styles like this in the input of a KOKKOS run, it costs you dearly, since the data has to be copied back to the host, incurring a performance penalty.
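A quick way to pull these figures out of a run is to search the log for the performance summary that LAMMPS prints at the end (assuming the default log file name, log.lammps):

# LAMMPS prints a summary line at the end of a run, e.g.
#   Performance: ... tau/day, 8.097 timesteps/s
grep 'timesteps/s' log.lammps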
We have already discussed that the primary aim of the KOKKOS package is to allow a single C++ code to run on both devices (like GPUs) and hosts (CPUs), with or without multi-threading. Targeting portability without losing the functionality or the performance of a code is the primary objective of KOKKOS.
Performance comparison of CPU and GPU package (using KOKKOS)

Let us now see how the current KOKKOS/GPU implementation within LAMMPS (version 3Mar20) achieves this goal by comparing its performance with the CPU and GPU packages. For this, we shall repeat the same set of tasks as described in episode 5. Take an LJ system with ~11 million atoms by choosing x = y = z = 140 and t = 500.

KOKKOS/GPU is also specially designed to run everything on the GPUs (in this case there are 4 visible devices). We shall offload the entire force computation and neighbour-list building to the GPUs using the <arguments>:

-k on g 4 -sf kk -pk kokkos newton off neigh full comm device

or, if CUDA-aware MPI is not available to you,

-k on g 4 -sf kk -pk kokkos newton off neigh full comm device cuda/aware off
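Substituted into the run line of the earlier job script, the first variant would look like this (the srun launcher and input file name follow the job script shown above):

# Offload force computation and neighbour-list builds to all 4 GPUs,
# with device-side communication buffers
srun lmp -in in.lj -k on g 4 -sf kk -pk kokkos newton off neigh full comm device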
We have created a plot to compare the performance of the KOKKOS/GPU runs with the CPU runs (i.e. without any accelerator package) and the GPU runs (i.e. with the GPU package enabled) for various numbers of nodes:

Discuss the main observations you can make from this plot.
Solution

There is only a marginal difference between the performance of the GPU and KOKKOS packages. The hardware portability provided by KOKKOS therefore makes it an attractive package to become familiar with, since it is actively maintained and developed and is likely to work reasonably well on the full spectrum of available HPC architectures (ARM CPUs, AMD graphics cards, ...) going forward.

A caveat on these results, however: at the time of their generation, mixed-precision support in the LAMMPS KOKKOS package was still under development. When running large numbers of atoms per GPU, KOKKOS is likely to be faster than the GPU package when both are compiled for double precision. There is likely to be an additional benefit from using single or mixed precision with the GPU package (depending significantly on the hardware in use and on the simulated system and pair style).
Key Points

- Know the capabilities of your host and device, and whether you can use a CUDA-aware MPI runtime, before starting a GPU run
- KOKKOS compares very well with the GPU package in double precision
- KOKKOS aims to be performance portable and is worth pursuing because of this