# Accelerated computing with CUDA C/C++

Dr. Jony Castagna

FAST team leader - Hartree Centre
NVidia Deep Learning Institute Ambassador



# **Agenda**

- What are GPUs?
- What is CUDA?
- CUDA Threads
- Memory management
- CUDA Streams
- Alternative programming models
- GPU implementations in HEP



## Why we have GPUs?



Hartree Centre

## Modern and future architectures

Single Core Processor



Long pipelined, out-of-order execution Many-core Processor



Short pipelined, cache coherent

**GPU** 



Shared instruction control, small cache

Field-Programmable Gate Arrays



Neuromorphic Computing



Quantum Computing





# First 10 of top 500 supercomputers

9/10 have **GPUs!** 

2 AMD 1 Intel 6 Nvidia

| Rank | System                                                                                                                                                                                        | Cores     | Rmax<br>(PFlop/s) | Rpeak<br>(PFlop/s) | Power<br>(kW) |
|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------------------|--------------------|---------------|
| 1    | Frontier - HPE Cray EX235a and Optimized 3rd<br>Generation EPY0 640 2GHz (MID Instinct MI250X,<br>Slingshot-11, HPE<br>D0E/SC/Oak Ridge National Laboratory<br>United States                  | 8,699,904 | 1,206.00          | 1,714.81           | 22,786        |
| 2    | Aurora - HPE Cray EX - Intel Exagence Compute Blac<br>Xeon CPU Max 9470 52C 2.4GHz (Intel Data Center GPU)<br>Max, Slingshot-11, Intel<br>DOE/SC/Argonne National Laboratory<br>United States | 9,264,128 | 1,012.00          | 1,980.01           | 38,698        |
| 3    | agle - Micro. 1t NDv5, Xeon Platinum 8480C 48C 2GHz,<br>NVIDIA H100, NVIDIA Infiniband NDR, Microsoft Azure<br>Acrosoft Azur<br>United States                                                 | 2,073,600 | 561.20            | 846.84             |               |
| 4    | Supercomputer Fugaku - Supercomputer Fugaku,<br>A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu<br>RIKEN Center for Computational Science<br>Japan                                             | 7,630,848 | 442.01            | 537.21             | 29,899        |
| 5    | LUMI - HPE Cray EXZZ a, AMD Optimed 3rd Generation<br>EPYC 64C 2GHz, AMU Instinct MI250X, 2 ingshot-11, HPE<br>EuroHPC/CSC<br>Finland                                                         | 2,752,704 | 379.70            | 531.51             | 7,107         |
| 6    | Alps - HPE Cray EV356n, NVIDIA Grace 720 3.16Hz,<br>NVIDIA Gr. 200 Superchip, Ningshot-11, HPE<br>Swiss M tional Supercomp Ling Centre (CSCS)<br>Switzerland                                  | 1,305,600 | 270.00            | 353.75             | 5,194         |
| 7    | Leonardo - BullSequana XH2000, Xeon Platinum 8358<br>32C 2.66Hz, NVIDIA A100, 2mm 64 GB, God-rail NVIDIA<br>HDR100 Infiniband, EVIJEN<br>EuroHPC/CINECA<br>Italy                              | 1,824,768 | 241.20            | 306.31             | 7,494         |
| 8    | MareNostrum 5 ACC - Bull S. Juna XH30s. Xeon<br>Platinum 8460Y+ 32C 2.3G 4z, NVIDIA H100 64 SB,<br>Infiniband NDR, EVIDEN<br>EuroHPC/BSC<br>Spain                                             | 663,040   | 175.30            | 249.44             | 4,159         |
| 9    | Sumpet - IBM Power Datem AC922, IBM POWER9 22C<br>3.07 (Hz., NVIDIA Volta 2, 100, Dual-rail Mellanox EDR<br>Infinibates-YBM<br>DOE/SC/Oak Ridge National Laboratory<br>United States          | 2,414,592 | 148.60            | 200.79             | 10,096        |
| 10   | Eos NVIDIA DOX SuperPOD - NVIDIA DOX H100, Xeon<br>Platinum 84608-56C 3.8GHz, NVIDIA H100, Infiniband<br>NDRAO0, Nvidia<br>NDIA Cossistion<br>United States                                   | 485,888   | 121.40            | 188.65             |               |





6 Nvidia

4 AMD

## Typical hybrid CPU-GPU supercomputer



Each node is made of

1 (or more) CPUs and

1 (or more ) GPUs

**Example: Frontier** 

1 CPUs AMD EPYC (64-core)

4 GPUs AMD Instinct 250X



## What are GPUs?



you don't necessarily need a cluster!



37,888 <u>Instinct</u> MI250X GPUs!



Science & Technology Facilities Council

## **NVidia GPUs**

# Tesla/Data Center (HPC)

- Tesla (1.x)
- Fermi (2.x)
- Kepler (3.x)
- Maxwell (5.x)
- Pascal (6.x)
- Volta (7.x)

**Turing (7.5)** 

- Ampere (8.x)

Lovelace (8.9)

- Hopper (9.x)
- Blackwell (10.x)

**Quadro** RTX (Visualization)

GeForce (Gaming)

Jetson/Tegra (edge/auto)

Architecture identifier also corresponding to the major number of Compute Capability index



## **NVidia H100**



NVidia Tesla H100: 14592 CUDA cores!

- 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
- 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
- 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
- 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
- 60 MB L2 Cache
- Fourth-Generation NVLink and PCIe Gen 5

H100 white paper: https://resources.nvidia.com/en-us-tensor-core



## **NVidia H100**



Full GPU has 144 SMs!

#



## The Grace-Hopper superchip



High bandwidth and memory coherence!



#### **GH200** White paper

https://www.aspsys.com/wp-content/uploads/2023/09/nvidia-grace-hopper-cpu-whitepaper.pdf

#### **Useful blog**

https://developer.nvidia.com/blog/simplifying-gpu-programming-for-hpc-with-the-nvidia-grace-hopper-superchip/

Hartree Centre
Science & Technology Facilities Council

# H100 vs H200

Table 2. NVIDIA MGX Grace Hopper Superchip vs. NVIDIA x86+Hopper

| Feature per GPU                                                                        | HGX H100<br>4-GPU<br>(x86) | NVIDIA MGX<br>GH200 with<br>HBM3 | NVIDIA MGX<br>GH200 with<br>HBM3e                                                  | NVIDIA DGX<br>GH200 |  |  |
|----------------------------------------------------------------------------------------|----------------------------|----------------------------------|------------------------------------------------------------------------------------|---------------------|--|--|
| CPU Memory<br>bandwidth (GB/s /<br>GPU)                                                | Up to 150                  | Up to 500                        | Up to 500                                                                          | Up to 500           |  |  |
| GPU Memory<br>bandwidth (GB/s /<br>GPU)                                                | 3000                       | 4000                             | 4800                                                                               | 4000                |  |  |
| CPU Memory 5% bandwidth to GPU Memory bandwidth ratio                                  |                            | 12.5%                            | 10.4%                                                                              | 12.5%               |  |  |
| GPU-CPU Link bi-<br>directional bandwidth<br>(GB/s / GPU)                              | 128<br>(x16 PCle<br>Gen5)  | 900<br>(NVLink-C2C)              | 900<br>(NVLink-C2C)                                                                | 900<br>(NVLink-C2C) |  |  |
| GPU-GPU bi-<br>directional bandwidth<br>inter node (GB/s / GPU) (InfiniBand<br>NDR400) |                            | 100<br>(InfiniBand<br>NDR400)    | 900<br>(NVLink 4 for<br>dual GH200<br>with HBM3e)<br>100<br>(InfiniBand<br>NDR400) | 900<br>(NVLink 4)   |  |  |

These improvements in CPU ratio, and NVLink-C2C and NVLink Switch System performance redefine how we achieve maximum performance from heterogeneous systems, enabling new applications, and efficient solutions to challenging problems.



## DGX-GH200



Figure 4. NVIDIA DGX GH200 with NVLink Switch System for strongscaling giant ML workloads

up to 256 GPUs!

1 exaFlops FP8!

~£10M!



# HPC and AI (I)





## HPC and AI (II)



Deep Learning does <u>NOT</u> need double precision!

Science & Technology Facilities Council



## Why is important to know the GPU hardware? (I)

Solve a Tridiagonal matrix in parallel:



Thomas (sequential) algorithm for 1 CPU



## Why is important to know the hardware? (II)

Solve a Tridiagonal matrix in parallel:



For MPI/OpenMP you can use a partition method

Hartree Centre
Science & Technology Facilities Council

## Why is important to know the hardware? (III)

Solve a Tridiagonal matrix in parallel:



Each thread simplify a pair of equations.

On GPU you MUST use a cyclic reduction method!

Hartree Centre

## Your role!





- counts from 1
- likes Fortran
- interested in solvingPDE
- does not like AI solving everything







# Computational scientist (usually):

- likes all languages
- wants to solve PDE efficiently (fast) and everywhere (portability)
- Al can help if combined with a physics background

## (usually):

- counts from 0
- likes C/C++/Python
- interested in performance/hardware
- likes an Al superintelligence

# Modern and future computational scientist!



### 1<sup>st</sup> Main Concepts: host (CPU) and device (GPU)







#### 2<sup>nd</sup> Main Concepts: software abstraction

| Technical specifications                                                  |  | Compute capability (version) |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
|---------------------------------------------------------------------------|--|------------------------------|-----|-----|-----|---------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|---------------|-----|
|                                                                           |  | 1.1                          | 1.2 | 1.3 | 2.x | 3.0                 | 3.2 | 3.5 | 3.7 | 5.0 | 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.0<br>(7.2?) | 7.5 |
| Maximum number of resident grids per device (concurrent kernel execution) |  | t.b.d.                       |     | 16  |     | 4                   | 32  |     | 16  | 128 | 32  | 16  | 128 |     |     |               |     |
| Maximum dimensionality of grid of thread blocks                           |  | 2                            |     |     |     | 3                   |     |     |     |     |     |     |     |     |     |               |     |
| Maximum x-dimension of a grid of thread blocks                            |  | 65535                        |     |     |     | 2 <sup>31</sup> – 1 |     |     |     |     |     |     |     |     |     |               |     |
| Maximum y-, or z-dimension of a grid of thread blocks                     |  | 65535                        |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
| Maximum dimensionality of thread block                                    |  | 3                            |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
| Maximum x- or y-dimension of a block                                      |  | 512 1024                     |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
| Maximum z-dimension of a block                                            |  | 64                           |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
| Maximum number of threads per block                                       |  | 512 1024                     |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |
| Warp size                                                                 |  | 32                           |     |     |     |                     |     |     |     |     |     |     |     |     |     |               |     |

you can have billions of CUDA threads despite you only have ~15000 CUDA cores on a single GPU!



### 3<sup>rd</sup> Main Concepts: Fine Grain Parallelism (I)





#### 3<sup>rd</sup> Main Concepts: Fine Grain Parallelism (I)





#### 3<sup>rd</sup> Main Concepts: Fine Grain Parallelism (II)



you can have millions of CUDA threads despite you only have ~15000 CUDA cores on a single GPU!



### 4<sup>th</sup> Main Concepts: memory coalescent



**Coalescent Access** 



**Uncoalescent Access** 



## Resume Main Concepts

Host (CPU) and device (GPU)

Software Abstraction

Fine Grain Parallelism Paradigm

Memory Coalescent Access

The real challenge is **NOT** porting to CUDA-C

The real challenge is to satisfy those main concepts: <u>an algorithm</u> change may be required!



Very often from the algorithm change will benefit also others architectures (vectorization, etc...)

# Questions?



## What is CUDA?

**Compute Unified Device Architecture**: is a parallel computing platform and application programming interface (API) model created by NVidia.

- CUDA is based on C/C++ language
- CUDA Fortran is a Fortran wrapper for CUDA C
- OpenACC are directives to offload kernels on GPU. It translates to CUDA-C.

#### Main References:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html



## Why CUDA?





## The future of parallel programming

#### Standard Languages | Directives | Specialized Languages

```
do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
enddo
```

GPU Accelerated C++ and Fortran

Incremental Performance Optimization with Directives

Maximize GPU Performance with CUDA C++/Fortran

The NVIDIA HPC compilers split execution of an application across multicore CPUs and NVIDIA GPUs using standard language constructs, directives, or CUDA.



### However...

**CPU** AMD EPYC 7713 processor 2 GHz 256 MB L3

£1.097.41 ex VA

1.316.89 inc VAT

#### **Specifications**

| PROCESSOR                |          |
|--------------------------|----------|
| Processor manufacturer ③ | AMD      |
| Processor model ②        | 7713     |
| Processor base frequency | 2 GHz    |
| Processor family ?       | AMD EPYC |
| Processor cores ②        | 64       |
|                          |          |

| DDR4-SDRAM   |
|--------------|
| 3200 MHz     |
| Octa-channel |
| 204.8 GB/s   |
|              |

price ratio ~ 10 energy ratio ~ 2 memory bandwidth ~ 15



and ideally >10x!!!



Hartree Centre



**GPU** 

The price of an AMD instinct MI250X is about 10k USD, so 10000 of them are about 100 million USD. This would suggest that the actual workhorse of the supercomputer is only 20% of its cost (since Frontier cost 600 million USD). 18 Jun 2022

#### AMD Instinct™ MI250X Accelerators

AMD Instinct™ MI250X accelerators are designed to supercharge HPC workloads and r era of exascale

**Peak Double Precision** 

95.7 TFLOPs

Matrix (FP64) Performance

#### **GPU Memory**

about 10k USD

**Dedicated Memory Size** 128 GB

**Dedicated Memory Type** HBM2e

**Memory Interface** 8192-bit

**Memory Clock** 1.6 GHz

**Peak Memory Bandwidth** 

3.2 TB/s

## How it works



Transfer CPU (Host) to **GPU** (Device)

is slow: try to avoid as much as possible!



## How it works

#### Standard C Code

#### C with CUDA extensions

```
void saxpy(int n, float a,
           float *x, float *y)
  int i = blockIdx.x*blockDim.x + threadIdx.x;
 if (i < n) y[i] = a^{i}x[i] + y[i];
int N = 1<<20:
cudaMemcpy(x, d_x, N, cudaMemcpyHostToDevice);
cudaMemcpy(y, d_y, N, cudaMemcpyHostToDevice);
// Perform 5AXPY on IM elements
saxpy << <4096,256>>> (N, 2.0, x, y);
cudaMemcpy(d_y, y, N, cudaMemcpyDeviceToHost);
```



## How it works

#### C with CUDA extensions

```
global
void saxpy(int n, float a,
           float *x, float *y)
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a^{a}x[i] + y[i];
int N = 1 << 20;
cudaMemcpy(x, d_x, N, cudaMemcpyHostToDevice);
cudaMemcpy(y, d_y, N, cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, x, y);
cudaMemcpy(d_y, y, N, cudaMemcpyDeviceToHost);
```





# CUDA compiler

#### **NVCC (NVidia CUDA compiler)**

- Very robust!
- Backward compatible
- Released with the CUDA toolkit (12.6) or NVidia HPC SDK (24.7)
- Current version 12.6

CUDA toolkit version: do <a href="NOT">NOT</a> confuse with CUDA architecture or Compute Capability index!



## CUDA software abstraction I



### CUDA software abstraction II



From CUDA compute capability 9.0 onwards only!

also these can be 3D!



# CUDA Development Ecosystem (I)





# CUDA Development Ecosystem (II)

### Tools & Ecosystem



#### Accelerated Solutions

GPUs are accelerating many applications across numerous industries.

Learn more >



#### **GPU-Accelerated Libraries**

Application accelerating can be as easy as calling a library function.

Learn more >



#### Language and APIs

GPU acceleration can be accessed from most popular programming languages.

Learn more >



#### Performance Analysis Tools

Find the best solutions for analyzing your application's performance profile.

Learn more >



#### **Debugging Solutions**

Powerful tools can help debug complex parallel applications in intuitive ways.

Learn more >



#### Data Center Tools

Software Tools for every step of the HPC and Al software life cycle.

Learn more >



#### Key Technologies

Learn more about parallel computing technologies and architectures.

Learn more >



#### Accelerated Web Services

Micro services with visual and intelligent capabilities using deep learning.

Learn more >



#### Cluster Management

Managing your cluster and job scheduling can be simple and intuitive.

Learn more >

# https://developer.nvidia.com/tools-ecosystem



# CUDA Development Ecosystem (III)



https://developer.nvidia.com/tools-ecosystem



### The CUDA tools

- NSight (debugger)
- NVPP (performance)
- Code samples
- ...and many more!!!





# Ex.: Deep Learning Frameworks





# Questions?



### TOPICS

# **CUDA** threads

GPU-accelerated vs. CPU-only Applications

**CUDA Kernel Execution** 

**Parallel Memory Access** 

GPU occupancy

Kernel occupancy

Appendix: Glossary

# GPU-accelerated vs. CPU-only Applications





...and all work is performed on CPU



...and all work is performed on CPU



allocated with cudaMallocManaged() **DATA GPU** Time

In accelerated applications data is

... where it can be accessed and worked on by the CPU











# **Coordinating Parallel Threads**











0 4

1 5

2 6

3 | '









0 4

1 | 5

2 6

3

Recall that each thread has access to the size of its block via blockDim.x





0 4

1 5

2 6

3 | 7

...and the index of its block within the grid via blockIdx.x





0 4

1 | 5

2 6

3 | |

...and its own index within its block via threadIdx.x





0 4

5 1

2 6

3

Using these variables, the formula threadIdx.x + blockIdx.x \* blockDim.x will map each thread to one element in the vector





| 0 4 |
|-----|
|-----|

| threadIdx.x | + | blockIdx.x | * | blockDim.x |
|-------------|---|------------|---|------------|
| 0           |   | 0          |   | 4          |

5 1

dataIndex

6

3









0 4

threadIdx.x + blockIdx.x \* blockDim.x

1 0 4

1 | 5

dataIndex

2

6

3

7











0 4

threadIdx.x + blockIdx.x \* blockDim.x

2 0 4

1 5

dataIndex

2

6

?

3

7









0 4

threadIdx.x + blockIdx.x \* blockDim.x

3 0 4

1 5

dataIndex

2

6

?

3

7









0 4

threadIdx.x + blockIdx.x \* blockDim.x

0 1 4

1 5

dataIndex

2 6

?

3

7









0 4

threadIdx.x + blockIdx.x \* blockDim.x

1 1 4

1 5

dataIndex

2 6

3

7











0 4

threadIdx.x + blockIdx.x \* blockDim.x

2 1 4

1 5

dataIndex

2

6

?

3

7











0 4

threadIdx.x + blockIdx.x \* blockDim.x

3 1 1 4

1 5

dataIndex

2

6

?

3

7











# Grid Size Work Amount Mismatch





GPU DATA 0

4

What if there are more threads than work to be done?

1

2

3











0 4

threadIdx.x blockIdx.x blockDim.x 0 4

dataIndex Can work N 5 4

3

1

2









0 4

threadIdx.x blockIdx.x blockDim.x 4

1

dataIndex Can work N

2

5 5

3





0 4

threadIdx.x + blockIdx.x \* blockDim.x

1 1 4

1

2

 dataIndex
 <</th>
 N
 =
 Can work

 5
 5
 false

3





0 4

2

threadIdx.x

blockIdx.x blockDim.x

4

Can work

1

2

dataIndex N 6 5

3





0 4

threadIdx.x + blockIdx.x \* blockDim.x

2 1 1 4

2

1

dataIndex < N = Can work

6 5 false

3





0 4

2

threadIdx.x blockIdx.x blockDim.x 4

1

dataIndex

Can work N 6 5

3

2





0 4

threadIdx.x + blockIdx.x \* blockDim.x

2 1 4

2

1

dataIndex < N = Can work

6 5 false

3

nerformW





# **Grid-Stride Loops**



Often there are more data elements than there are threads in the grid













... or else work is left undone









One way to address this programmatically is with a **grid-stride loop** 



































With all threads working in this way, all elements are covered











With all threads working in this way, all elements are covered











With all threads working in this way, all elements are covered











With all threads working in this way, all elements are covered











With all threads working in this way, all elements are covered









CUDA runs as many blocks in parallel at once as the GPU hardware supports, for massive parallelization























































### Occupancy

- GPU Occupancy

- Kernel Occupancy







NVIDIA GPUs contain functional units called **Streaming Multiprocessors**, or **SMs** 

Blocks of threads are scheduled to run on SMs







kernel<<<24, 4>>>()

Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM





Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM





Grid dimensions divisible by the number of SMs on a GPU can promote full SM utilization





### **Kernel Occupancy**

compile your code with the following nvcc option:

#### --ptxas-options=-v

you will get something like this:

ptxas info: 0 bytes gmem

ptxas info: Compiling entry function
' Z28MyKernelhS0 S0 PK3CDRS0 Ph' for 'sm 20'

ptxas info: Function properties for \_Z28MyKernelhS0\_S0\_PK3CDRS0\_Ph

24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info: Used 29 registers, 80 bytes cmem[0], 4 bytes cmem[16]

#### Remember:

In the H100 you have 256KB register per multiprocessor.

The total number of "threads X register per threads" cannot exceed this value!!





### **Kernal Occupancy**

#### Warp = 32 threads



|                                                                           |        |     | Compute capability (version) |                |       |                     |     |     |     |      |     |      |      |     |     |               |     |
|---------------------------------------------------------------------------|--------|-----|------------------------------|----------------|-------|---------------------|-----|-----|-----|------|-----|------|------|-----|-----|---------------|-----|
| Technical specifications                                                  |        | 1.1 | 1.2                          | 1.3            | 2.x   | 3.0                 | 3.2 | 3.5 | 3.7 | 5.0  | 5.2 | 5.3  | 6.0  | 6.1 | 6.2 | 7.0<br>(7.2?) | 7.5 |
| Maximum number of resident grids per device (concurrent kernel execution) | t.b.d. |     |                              | 1              | 6     | 4 32 16 128 32 16   |     |     |     | 128  | 3   |      |      |     |     |               |     |
| Maximum dimensionality of grid of thread blocks                           | 2      |     |                              | 3              |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum x-dimension of a grid of thread blocks                            | 65535  |     |                              |                |       | 2 <sup>31</sup> – 1 |     |     |     |      |     |      |      |     |     |               |     |
| Maximum y-, or z-dimension of a grid of thread blocks                     | 65535  |     |                              |                |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum dimensionality of thread block                                    | 3      |     |                              |                |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum x- or y-dimension of a block                                      | 512    |     |                              |                | 1024  |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum z-dimension of a block                                            | 64     |     |                              |                |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum number of threads per block                                       | 512    |     |                              | 1024           |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Warp size                                                                 | 32     |     |                              |                |       |                     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum number of resident blocks per multiprocessor                      | 8      |     |                              | 16             |       |                     |     |     | 32  |      |     |      |      | 16  |     |               |     |
| Maximum number of resident warps per multiprocessor                       | 2      | 24  | 3                            | 32             | 48 64 |                     |     |     |     |      |     |      | 32   |     |     |               |     |
| Maximum number of resident threads per multiprocessor                     | 7      | 68  | 10                           | 1024 1536 2048 |       |                     |     |     |     |      |     | 1024 |      |     |     |               |     |
| Number of 32-bit registers per multiprocessor                             | 8      | K   | 10                           | 6 K            | 32 K  | 64 K 128 K 64 K     |     |     |     |      |     |      |      |     |     |               |     |
| Maximum number of 32-bit registers per thread block                       | N/A    |     | 32 K                         | 64 K           | 32 K  |                     | 6   | 4 K |     | 32 K | 64  | l K  | 32 K | 64  | K   |               |     |
| Maximum number of 32-bit registers per thread                             | 124    |     |                              | 6              | 3     | 255                 |     |     |     |      |     |      |      |     |     |               |     |

Occupancy =

max number of active warps

max number of warps per SM

Block size should be a multiple of 32 (128,256 usually)!



### Spreadsheet calculator

https://xmartlabs.github.io/cuda-calculator/

Or google for:

CUDA spreadsheet calculator excel







### Glossary

- cudaMallocManaged(): CUDA function to allocate memory accessible by both the CPU and GPUs. Memory allocated this way is called *unified memory* and is automatically migrated between the CPU and GPUs as needed.
- cudaDeviceSynchronize(): CUDA function that will cause the CPU to wait until the GPU is finished working.
- Kernel: A CUDA function executed on a GPU.
- Thread: The unit of execution for CUDA kernels.
- Block: A collection of threads.
- Grid: A collection of blocks.
- Execution context: Special arguments given to CUDA kernels when launched using the <<<...>>> syntax. It defines the number of blocks in the grid, as well as the number of threads in each block.
- gridDim.x: CUDA variable available inside executing kernel that gives the number of blocks in the grid
- blockDim.x: CUDA variable available inside executing kernel that gives the number of threads in the thread's block
- blockIdx.x: CUDA variable available inside executing kernel that gives the index the thread's block within the grid
- threadIdx.x: CUDA variable available inside executing kernel that gives the index the thread within the block
- threadIdx.x + blockIdx.x \* blockDim.x: Common CUDA technique to map a thread to a data element
- Grid-stride loop: A technique for assigning a thread more than one data element to work on when there are more elements than the number of threads in the grid. The stride is calculated by gridDim.x \* blockDim.x, which is the number of threads in the grid.

# Questions?





#### How to access to the course online

- 1. WIFI Info: eduroam
- 2. Browser Recommendation: Chrome
- 3. websocketstest.courses.nvidia.com
- 4. <a href="https://learn.nvidia.com/dli-event">https://learn.nvidia.com/dli-event</a>
- 5. Create an Nvidia Developer Account (if you have not done yet)
- Event code: STFC\_CUDA\_AMBASSADOR\_SE24
- 7. Work through the Introduction Section and 'Start' launching your first GPU task



### **Memory Management**

- Type of Memories
- Unified memory behaviour
- Non-Unified memory
- cudaMemcpyAsync

# Type of Memories

#### **Device Memory:**

- Global Memory
- Texture Memory
- Local Memory
- Constant Memory



**DRAM** 

#### **On Chip Memory:**

- Shared Memory (L1 Cache)
- Registers





### **Device Memory**

| Memory   | Location | Access | Scope                | Lifetime    | CUDA qualifier                                    |
|----------|----------|--------|----------------------|-------------|---------------------------------------------------|
| Global   | DRAM     | RW     | All threads and host | Application | cudaMalloc<br>cudaMallocManaged<br>cudaMallocHost |
| Local    | DRAM     | RW     | 1 thread             | Thread      |                                                   |
| Constant | DRAM     | R      | All threads and host | Application | constant                                          |
| Texture  | DRAM     | R      | All threads and host | Application | texture                                           |

- Global memory is the largest memory on the GPU
- Local memory are local registers spilled into the global memory
- Constant memory is useful to store constant read in values (ex. g, c, R, etc.)
- Texture memory allows interpolation on 2D constant matrix values (ex. PVT table)



# On chip Memory

| Memory   | Location | Access | Scope                  | Lifetime | CUDA qualifier |
|----------|----------|--------|------------------------|----------|----------------|
| Register | On chip  | RW     | 1 thread               | Thread   |                |
| Shared   | On chip  | RW     | All threads in a block | Block    | shared         |

- Registers per thread can be found at compile time
- Shared memory can enhance performance when locality is high



# Unified Memory (2014)

Before CUDA 6

After CUDA 6



### Simplifies code writing

# Applies to Global Memory only!



#### CPU Code CUDA 6 Code with Unified Memory void sortfile(FILE ofp, int N) { void sortfile(FILE ofp, int N) { char \*data: char \*data; data = (char \*) malloc(N); cudaMallocManaged(&data, N); fread(data, 1, N, fp); fread(data, 1, N, fp); qsort(data, N, 1, compare); qs ort <<<...>>> (data ,N ,1, compare); cu daDeviceSynchronize(); use\_data(data); use\_data(data); free(data); cudaFree(data);

# Grace-Hopper Unified Memory (2022)

Careful!! This concept is NOT the unified memory referred here!



#### In CUDA:

- -gpu=unified
- -gpu=unified (implies managed)
  https://developer.nvidia.com/blog/uni
  fied-memory-cuda-beginners/

In OpenACC

-ta=tesla:managed





# New unified memory (I)





# New unified memory (II)







# New unified memory (III)







# Unified Memory Behavior

resident initially on the CPU or the GPU **DATA GPU** cudaMallocManaged() Time



When **UM** is allocated, it may not be

for the first time, a page fault will occur **DATA GPU** cudaMallocManaged() init() Time



When some work asks for the memory

The page fault will trigger the migration of the demanded memory



This process repeats anytime the memory is requested somewhere in the system where it is not resident **DATA GPU** work<<>>>>() cudaMallocManaged() init() Time





This moves the memory in larger batches, and prevents page faulting



### **Non-Unified Memory**

Memory can be allocated directly to the GPU with `cudaMalloc`



Memory can be allocated directly to the host with `cudaMallocHost`







# cudaMemcpyAsync

asynchronously transfer memory over a non-default stream **GPU** cudaMallocHost() cudaMemcpyAsync(HtoD) Time



`cudaMemcpyAsync` can















### **CUDA** streams

- Default stream
- Concurrent streams

## **Default Stream**

A **stream** is a series of instructions, and CUDA has a **default stream** 

### **DEFAULT STREAM**

By default, CUDA kernels run in the default stream

**DEFAULT STREAM** 

kernel 1



In any stream, including the default, an instruction in it (here a kernel launch) must complete before the next can begin

kernel 1 kernel 2

In any stream, including the default, an instruction in it (here a kernel launch) must complete before the next can begin



## **Concurrent streams**

Non-default streams can also be created for kernel execution

**NON-DEFAULT STREAM 1** 

**DEFAULT STREAM** 

Non-default streams can also be created for kernel execution



Kernels within any single stream must execute in order











The default stream is special: it blocks all kernels in all other streams



The default stream is special: it blocks all kernels in all other streams



The default stream is special: it blocks all kernels in all other streams





Time













Time

# Questions?



## Profiler and other tools

- NSight
- Debugger, cuda-memcheck, etc.

# The deployment cycle

 Analyze your code to determine most likely places needing parallelization or optimization.

 Parallelize your code by starting with the most time consuming parts, check for correctness and then analyze it again.

 Optimize your code to improve observed speed-up from parallelization.



# **NVidia tools**

Nsight Systems - Analyze application algorithm system-wide

Nsight Compute Debug/optimize CUDA
kernel

Nsight Graphics -Debug/optimize graphics workloads





# Profiling via command line

## Using Command Line Interface (CLI)

**NVIDIA Nsight Systems CLI provides** 

- Simple interface to collect data
- Can be copied to any system and analysed later
- Profiles both serial and parallel code
- For more info enter nsys --help on the terminal

To profile a serial application with NVIDIA Nsight Systems, we use NVIDIA Tools Extension (NVTX) API functions in addition to collecting backtraces while sampling.



# **Profiling using NVTX (I)**

## NVIDIA Tools Extension API (NVTX) library

#### What is it?

- A C-based Application Programming Interface (API) for annotating events
- Can be easily integrated to the application
- Can be used with NVIDIA Nsight Systems

#### Why?

- Allows manual instrumentation of the application
- Allows additional information for profiling (e.g: tracing of CPU events and time ranges)

#### How?

- Import the header only C library nvToolsExt.h
- Wrap the code region or a specific function with nvtxRangePush() and nvtxRangPop()



# Profiling using NVTX (II) SEQUENTIAL

```
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "laplace2d.h"
#include <nvtx3/nvToolsExt.h>
int main(int argc, char** argv)
    const int n = 4096;
    const int m = 4096;
    const int iter max = 1000:
    const double tol = 1.0e-6:
    double error = 1.0;
    double *restrict A = (double*)malloc(sizeof(double)*n*m);
    double *restrict Anew = (double*)malloc(sizeof(double)*n*m);
    nvtxRangePushA("init");
    initialize(A, Anew, m, n);
    nvtxRangePop();
    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
    double st = omp get wtime();
    int iter = 0;
    nvtxRangePushA("while"):
    while ( error > tol && iter < iter max )
        nvtxRangePushA("calc");
       error = calcNext(A, Anew, m, n);
        nvtxRangePop():
        nvtxRangePushA("swap");
       swap(A, Anew, m, n);
        nvtxRangePop();
       if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
       iter++:
    nvtxRangePop();
    double runtime = omp_get_wtime() - st;
   printf(" total: %f s\n", runtime);
    deallocate(A, Anew);
    return 0:
```

```
Selects the APIs to be traced (nvtx in this example)
       -t
                                      if true, generates summary of statistics after the collection
       --status
                                      Selects the backtrace method to use while sampling. The option dwarf
                                      uses DWARF's CFI (Call Frame Information).
       --force-overwrite
                                      if true, overwrites the existing results
                                      sets the output (adrep) filename
   gank@prm-dgx-32:-/Code/openacc-training-materials/labs/module4/English/C/solutions/parallels nsys profile -t nvtx --stats=true -b dwarf --force-overwrite true -o laplace-seq ./laplace-seq
Collecting data...
Jacobi relaxation Calculation: 4096 x 4096 mesh
   0. 0.250000
  100, 0.002397
      0.001204
      0.000804
  500, 0.000483
      0.000403
  700, 0.000345
      0.000302
      0.000269
total: 55.754501 s
 rocessing events...
Capturing symbol files...
Saving intermediate "/home/mozhgank/Code/openacc-training-materials/labs/module4/English/C/solutions/parallel/laplace-seq.qdstrm" file to disk...
Saved report file to "/home/mozhgank/Code/openacc-training-materials/labs/module4/English/C/solutions/paralle<mark>/</mark>/laplace-seq.qdrep
Exporting 70802 events: [=====
Exported successfully to /home/mozhgank/Code/openacc-training-materials/labs/module4/English/C/solutions/paralle1/laplace-seq.sqlite
Generating NVTX Push-Pop Range Statistics..
 VTX Push-Pop Range Statistics (nanoseconds)
            Total Time Instances
                                                                        Maximum Range ___
                                         Average
                                                                                                 NVTX range
                                                                   55754497966 while
  49.9
           55754497966
                                   55754497966.0
                                                    55754497966
           29577817696
                             1000
                                      29577817.7
                                                       29092956
                                                                      65008545 calc
                                                                                                   statistics
                                                      25761418
137489808
                                                                      60129514 swap
137489808 init
           26163892482
                                     26163892.5
137489808.0
    "calc" region (calcNext function) takes 26.6%
    "swap" region (swap function) takes 23.4% of
                    total execution time
                                                                                                                                          Open laplace-seq.qdrep with
                                                                                                                                         Nsight System GUI to view the
                                                                                                                                                        timeline
```



jacobi.c (starting and ending of ranges are **highlighted** with the **same** color) Profiling using NVTX (III) PARALLEL

```
#include <math.h>
#include <stdlib.h>
#define OFFSET(x, y, m) (((x)*(m)) + (y))
void initialize(double *restrict A, double *restrict Anew, int m, int n)
   memset(A, 0, n * m * sizeof(double));
   memset(Anew, 0, n * m * sizeof(double));
   for(int i = 0; i < m; i++){
        A[i] = 1.0;
        Anew[i] = 1.0;
double calcNext(double *restrict A, double *restrict Anew, int m, int n)
    double error = 0.0;
    #pragma acc parallel loop reduction(max:err)
   for( int j = 1; j < n-1; j++)
        #pragma acc loop
        for( int i = 1; i < m-1; i++ )
           Anew[OFFSET(j, i, m)] = 0.25 * (A[OFFSET(j, i+1, m)] + A[OFFSET(j, i-1, m)]
                                           + A[OFFSET(j-1, i, m)] + A[OFFSET(j+1, i, m)]);
           error = max( error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]));
   return error;
void swap(double *restrict A, double *restrict Anew, int m, int n)
    #pragma acc parallel loop
    for( int j = 1; j < n-1; j++)
        for( int i = 1; i < m-1; i++ )
           A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
void deallocate(double *restrict A, double *restrict Anew)
    free(A);
    free(Anew);
```



laplace2d.c (Parallelised using OpenACC parallel directives (pragmas highlighted)



# **Profiling using Nsight**

## Using Nsight Systems

Open the generated report files (\*.qdrep) from command line in the Nsight Systems profiler.

File > Open







# Other tools



#### Arm Forge Debugger (formerly Allinea DDT)

Provides application developers with a single tool that can debug hybrid MPI, OpenMP, CUDA and OpenACC applications on a single workstation or GPU cluster.



#### **CUDA-GDB**

Delivers a seamless debugging experience that allows you to debug both the CPU and GPU portions of your application simultaneously. Use CUDA-GDB on Linux or MacOS, from the command line, DDD or EMACS.



#### TotalView

A GUI-based tool that allows you to debug one or many processes/threads with complete control over program execution, from basic debugging operations like stepping through code to concurrent programs that take advantage of threads, OpenMP, MPI, or GPUs.



#### **COMPUTE SANITIZER**

Compute Sanitizer is a functional correctness checking suite included in the CUDA toolkit. This suite contains multiple tools that can perform different type of checks. The memcheck tool is capable of precisely detecting and attributing out of bounds and misaligned memory access errors in CUDA applications. The tool can

#### If you can, do not debug via print!!!

### You could use instead **Visual Studio Code with Nsight**



https://developer.nvidia.com/nsight-visualstudio-code-edition

# **Examples**

- Porting DL\_MESO to CUDA
- DL\_MESO on multi-GPU

# An Hartree application: porting DL\_MESO to CUDA







# Porting DL\_MESO to GPUs

#### What is DPD and DL\_MESO?

Dissipative Particle Dynamics (**DPD**)...



...coarse grain representation using beads



**DL\_MESO**: software package developed by Michael Seaton at DL



**Vesicle Formation** 



DL\_MESO: highly scalable mesoscale simulations Molecular Simulation 39 (10) pp. 796-821, 2013



## Main problem: memory access pattern



particle locations are stored in a continuous order... Very uncoalescent access to the memory!



## Reorganize the cell-linked array





## Speedup on single GPU

#### Speedup on GPU vs:

- AMD EPYC 7402 (Rome) **24 cores**
- Intel Xeon(R) W-2133 CPU @ 3.60GHz 6 cores



Castagna et al. "Towards Extreme Scale Dissipative Particle Dynamics Simulations using Multiple GPGPUs" Comput. Physi. Comm. (2020)



## Multi GPU version



in 3D you have 26 GPUs/GPU!!



# Overlap computation with communication



# Scaling on different supercomputers

Strong scaling: 4096 GPUs (Piz Daint)



proper overlap computation-communication has a strong impact on scaling!

# Largest simulation: 14 billion particles!

**Adding Load Balance** 



Animation **Water Drop** formation on 8 GPUs showing the impact of load balancing routine ALL (ALL is from Julich Supercomputer Centre)

D. Di Giusto and J. Castagna et al. "Scalable algorithm for many-body Dissipative Particle Dynamics using multiple General Purpose Graphic Processing Units" Comput. Physi. Comm. (2022)



### Resume

- DL\_MESO has been ported to single and multi-GPU
   nNVidia GPUs using CUDA language
- Good scaling up to 4096 GPU
- We can now run very large **DPD** simulations (**14 billions**)
- Load balance allows to run simulations without out of memory on the GPU, as well as save computational time
- 2 publications on **journal paper** (CPC)

# Questions?



