# Accelerated computing with CUDA C/C++

Dr. Jony Castagna

### FAST team leader - Hartree Centre NVidia Deep Learning Institute Ambassador



## Agenda

- What are GPUs?
- What is CUDA?
- CUDA Threads
- Memory management
- CUDA Streams
- Alternative programming models
- GPU implementations in HEP



## Why we have GPUs?





## **Modern and future architectures**



GPU



Shared instruction control, small cache

> Quantum Computing

> > $-\hat{\mathbf{z}} = |1\rangle$



Hartree Centre Science & Technology Facilities Council

## First 10 of top 500 supercomputers

### 2 AMD 1 Intel 6 Nvidia

| Rank | System                                                                                                                                                                                       | Cores     | Rmax<br>(PFlop/s) | Rpeak<br>(PFlop/s) | Power<br>(kW) |
|------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------------------|--------------------|---------------|
| 1    | Frontier - HPE Cray EX235a, 2000 Optimized 305<br>Generation EPYC 640 20Hz KMD Instinct MI250X,<br>Slingshot-11, HPE<br>DOE/SC/Oak Ridge National Laboratory<br>United States                | 8,699,904 | 1,206.00          | 1,714.81           | 22,786        |
| 2    | Aurora - HPE Cray EX - Intel Exagence Compute Black<br>Xeon CPU Max 9470 520 2.40Hz Intel Data Center GPU<br>Max, Slingshot-11, Intel<br>DOE/SC/Argonne National Laboratory<br>United States | 9,264,128 | 1,012.00          | 1,980.01           | 38,698        |
| 3    | ugle - Microsoft NDV5, Xeon Platinum 84800 480 20Hz,<br>NVIDIA H100, NV DIA Infiniband NDR, Microsoft Azure<br>Vicrosoft Azure<br>United States                                              | 2,073,600 | 561.20            | 846.84             |               |
| 4    | Supercomputer Fugaku - Supercomputer Fugaku,<br>A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu<br>RIKEN Center for Computational Science<br>Japan                                            | 7,630,848 | 442.01            | 537.21             | 29,899        |
| 5    | LUMI - HPE Cray EX22.43, AMD Optimized 3rd Generation<br>EPYC 64C 2GHz, AMU Instinct MI250X, 2 ingshot-11, HPE<br>EuroHPC/CSC<br>Finland                                                     | 2,752,704 | 379.70            | 531.51             | 7,107         |
| 6    | Alps - HPE Crav EV256n, NVIDIA Grace 720 3.16Hz,<br>NVIDIA Gr200 Superchip, Ningshot-11, HPE<br>Swiss Mitional Supercompiling Centre (CSCS)<br>Switzerland                                   | 1,305,600 | 270.00            | 353.75             | 5,194         |
| 7    | Leonardo - BullSequana XH2000, Xeon Platinum 8358<br>32C 2.66Hz, NVIDIA A100, cm4 64 GB, G. d-rail NVIDIA<br>HDR100 Infiniband, EVILEN<br>EuroHPC/CINECA<br>Italy                            | 1,824,768 | 241.20            | 306.31             | 7,494         |
| 8    | MareNostrum 5 ACC - BullSequena XH300 Xeon<br>Platinum 8460Y+ 320 2.30 Hz, NVIDIA H100 66 B,<br>Infiniband NDR, EVIDEN<br>EuroHPC/BSC<br>Spain                                               | 663,040   | 175.30            | 249.44             | 4,159         |
| 9    | Summit - IBM Powerspatern AC922, IBM POWER9 22C<br>3.07 Uz, NVIDIA Volta (* 100, Dual-rail Mellanox EDR<br>Infinibano, IDM<br>DOE/SC/Oak Ridge National Laboratory<br>United States          | 2,414,592 | 148.60            | 200.79             | 10,096        |
| 10   | Eos NVIDIA DOX Super POD - NVIDIA DOX H100, Xeon<br>Hatinum 84605 S6C 3.8GHz, NVIDIA H100, Infiniband<br>NDRA00, Nvidia<br>NUDIA Constration<br>United States                                | 485,888   | 121.40            | 188.65             |               |



## **Typical hybrid CPU-GPU supercomputer**



Each node is made of 1 (or more) CPUs and 1 (or more ) GPUs

Example: Frontier 1 CPUs AMD EPYC (64-core) 4 GPUs AMD Instinct 250X



## What are GPUs?



you don't necessarily need a cluster!



37,888 Instinct MI250X **GPUs!** 



Science & Technology Facilities Council

## NVidia GPUs

Tesla/Data Center (HPC)

- Tesla (1.x)
- Fermi (2.x)
- Kepler (3.x)
- Maxwell (5.x)
- Pascal (6.x)
- Volta (7.x) Turing (7.5)
- Ampere (8.x) Lovelace (8.9)
- Hopper (9.x)
- Blackwell (10.x)



Jetson/Tegra (edge/auto)

Architecture identifier also corresponding to the major number of Compute Capability index



## NVidia H100



#### NVidia Tesla H100: 14592 CUDA cores!

- 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
- 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
- 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
- 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
- 60 MB L2 Cache
- Fourth-Generation NVLink and PCIe Gen 5

H100 white paper: https://resources.nvidia. com/en-us-tensor-core



## NVidia H100



Full GPU has 144 SMs!



## **The Grace-Hopper superchip**



**GH200 White paper** 

https://www.aspsys.com/wpcontent/uploads/2023/09/nvi dia-grace-hopper-cpuwhitepaper.pdf

#### High bandwidth and memory coherence!



#### **Useful blog**

https://developer.nvidia.com/blog/simplifyi ng-gpu-programming-for-hpc-with-thenvidia-grace-hopper-superchip/



## H100 vs H200

| Table 2. | NVIDIA MGX | Grace Hopper | Superchip vs. | . NVIDIA | x86+Hopper |
|----------|------------|--------------|---------------|----------|------------|
|----------|------------|--------------|---------------|----------|------------|

| Feature per GPU                                                 | HGX H100<br>4-GPU<br>(x86)    | NVIDIA MGX<br>GH200 with<br>HBM3 | NVIDIA MGX<br>GH200 with<br>HBM3e                                                  | NVIDIA DGX<br>GH200 |
|-----------------------------------------------------------------|-------------------------------|----------------------------------|------------------------------------------------------------------------------------|---------------------|
| CPU Memory<br>bandwidth (GB/s /<br>GPU)                         | Up to 150                     | Up to 500                        | Up to 500                                                                          | Up to 500           |
| GPU Memory<br>bandwidth (GB/s /<br>GPU)                         | 3000                          | 4000                             | 4800                                                                               | 4000                |
| CPU Memory<br>bandwidth to GPU<br>Memory bandwidth<br>ratio     | 5%                            | 12.5%                            | 10.4%                                                                              | 12.5%               |
| GPU-CPU Link bi-<br>directional bandwidth<br>(GB/s / GPU)       | 128<br>(x16 PCle<br>Gen5)     | 900<br>(NVLink-C2C)              | 900<br>(NVLink-C2C)                                                                | 900<br>(NVLink-C2C) |
| GPU-GPU bi-<br>directional bandwidth<br>inter node (GB/s / GPU) | 100<br>(InfiniBand<br>NDR400) | 100<br>(InfiniBand<br>NDR400)    | 900<br>(NVLink 4 for<br>dual GH200<br>with HBM3e)<br>100<br>(InfiniBand<br>NDR400) | 900<br>(NVLink 4)   |

These improvements in CPU ratio, and NVLink-C2C and NVLink Switch System performance redefine how we achieve maximum performance from heterogeneous systems, enabling new applications, and efficient solutions to challenging problems.



## DGX-GH200



### up to 256 GPUs!

### 1 exaFlops FP8!

### ~£10M!

Figure 4. NVIDIA DGX GH200 with NVLink Switch System for strongscaling giant ML workloads



## HPC and AI (I)





## HPC and AI (II)



### Why is important to know the GPU hardware? (I)

### Solve a Tridiagonal matrix in parallel:



Thomas (sequential) algorithm for 1 CPU



### Why is important to know the hardware? (II)

### Solve a Tridiagonal matrix in parallel:



For MPI/OpenMP you can use a partition method



Hartree Centre

### Why is important to know the hardware? (III)

Solve a Tridiagonal matrix in parallel:



## Each thread simplify a pair of equations.

On GPU you MUST use a cyclic reduction method!

#### Scientist (usually):

- counts from 1
- likes Fortran
- interested in solving PDE
- does not like AI solving everything

Computational scientist (usually):

- likes all languages
- wants to solve PDE efficiently (fast) and everywhere (portability)
- AI can help if combined with a physics background

Your role!

Computer scientist (usually):

- counts from 0
- likes C/C++/Python
- interested in performance/hardware
- likes an AI superintelligence

## Modern and future computational scientist!



### 1<sup>st</sup> Main Concepts: host (CPU) and device (GPU)





### 2<sup>nd</sup> Main Concepts : software abstraction

| Technical specifications                                                     |     | Compute capability (version) |          |     |      |     |     |     |     |     |     |     |     |     |     |
|------------------------------------------------------------------------------|-----|------------------------------|----------|-----|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|                                                                              |     | 1.1                          | 1.2      | 1.3 | 2.x  | 3.0 | 3.2 | 3.5 | 3.7 | 5.0 | 5.2 | 5.3 | 6.0 | 6.1 | 6.2 |
| Maximum number of resident grids per device<br>(concurrent kernel execution) |     | t.b.d.                       |          | 1   | 16 4 |     | 32  |     | 16  | 128 | 32  | 16  | 128 |     |     |
| Maximum dimensionality of grid of thread blocks                              | 2   |                              |          |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum x-dimension of a grid of thread blocks                               |     | 65535 2 <sup>31</sup> - 1    |          |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum y-, or z-dimension of a grid of thread blocks                        |     | 65535                        |          |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum dimensionality of thread block                                       |     | 3                            |          |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum x- or y-dimension of a block                                         | 512 |                              | 512 1024 |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum z-dimension of a block                                               |     | 64                           |          |     |      |     |     |     |     |     |     |     |     |     |     |
| Maximum number of threads per block                                          | 512 |                              | 1024     |     |      |     |     |     |     |     |     |     |     |     |     |
| Warp size                                                                    |     | 32                           |          |     |      |     |     |     |     |     |     |     |     |     |     |

you can have billions of CUDA threads despite you only have ~15000 CUDA cores on a single GPU!



### 3<sup>rd</sup> Main Concepts : Fine Grain Parallelism (I)





### 3<sup>rd</sup> Main Concepts : Fine Grain Parallelism (I)





### 3<sup>rd</sup> Main Concepts : Fine Grain Parallelism (II)



1 CUDA threads per each element of the array!

## you can have millions of CUDA threads despite you only have ~15000 CUDA cores on a single GPU!



### 4<sup>th</sup> Main Concepts : memory coalescent







### **Uncoalescent Access**



### **Coalescent Access**

**Resume Main Concepts** 

- Host (CPU) and device (GPU)
- Software Abstraction
- Fine Grain Parallelism Paradigm

The real challenge is NOT porting to CUDA-C

The real challenge is to satisfy those main concepts: an algorithm change may be required!

Memory Coalescent Access

Very often from the algorithm change will benefit also others architectures (vectorization, etc...) Hartree



Science & Technology Facilities Counci

## **Questions?**



## What is CUDA?

**Compute Unified Device Architecture**: is a parallel computing platform and application programming interface (API) model created by NVidia.

- CUDA is based on C/C++ language
- CUDA Fortran is a Fortran wrapper for CUDA C
- OpenACC are directives to offload kernels on GPU. It translates to CUDA-C.

Main References:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html



## Why CUDA?





## The future of parallel programming

#### Standard Languages | Directives | Specialized Languages



The NVIDIA HPC compilers split execution of an application across multicore CPUs and NVIDIA GPUs using standard language constructs, directives, or CUDA.



### However...



| PROCESSOR                |                         | MEMORY                                        |              |  |  |  |
|--------------------------|-------------------------|-----------------------------------------------|--------------|--|--|--|
| Processor manufacturer 📀 | ssor manufacturer ⑦ AMD |                                               | DDR4-SDRAM   |  |  |  |
| Processor model (?)      | 7713                    | by processor                                  |              |  |  |  |
| Processor base frequency | 2 GHz                   | Memory clock speeds<br>supported by processor | 3200 MHz     |  |  |  |
| Processor family ⑦       | AMD EPYC                | Memory channels                               | Octa-channel |  |  |  |
| Processor cores 💿 64     |                         | Memory bandwidth (max)                        | 204.8 GB/s   |  |  |  |
|                          |                         |                                               |              |  |  |  |

price ratio ~ 10 energy ratio ~ 2 memory bandwidth ~ 15



Hartree Centre



#### about 10k USD

The price of an AMD instinct MI250X is about **10k USD**, so 10000 of them are about 100 million USD. This would suggest that the actual workhorse of the supercomputer is only 20% of its cost (since Frontier cost 600 million USD). 18 Jun 2022

**GPU** 

#### AMD Instinct<sup>™</sup> MI250X Accelerators

AMD Instinct^ MI250X accelerators are designed to supercharge HPC workloads and  $\mathfrak k$  era of exascale

| Peak Double Precision     | 95.7 TFLOPs |
|---------------------------|-------------|
| Matrix (FP64) Performance |             |

#### **GPU Memory**

| Dedicated Memory Size | 128 GB   |
|-----------------------|----------|
| Dedicated Memory Type | HBM2e    |
| Memory Interface      | 8192-bit |
| Memory Clock          | 1.6 GHz  |
| Peak Memory Bandwidth | 3.2 TB/s |

## How it works



Transfer CPU (Host) to GPU (Device) is slow: try to avoid as much as possible!



## How it works





## How it works

#### C with CUDA extensions

```
_global__
void saxpy(int n, float a,
                              float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}</pre>
```

```
int N = 1<<20;
cudaMemcpy(x, d_x, N, cudaMemcpyHostToDevice);
cudaMemcpy(y, d_y, N, cudaMemcpyHostToDevice);
```

```
// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, x, y);
```

```
cudaMemcpy(d_y, y, N, cudaMemcpyDeviceToHost);
```





## **CUDA** compiler

NVCC (NVidia CUDA compiler)

- Very robust!
- Backward compatible
- Released with the CUDA toolkit (12.5) or NVidia HPC SDK (24.5)
- Current version 12.5

CUDA toolkit version: do <u>NOT</u> confuse with CUDA architecture or Compute Capability index!



# Main Concepts I: software abstraction





# CUDA Development Ecosystem (I)





# CUDA Development Ecosystem (II)

### Tools & Ecosystem



#### Accelerated Solutions

GPUs are accelerating many applications across numerous industries. Learn more >



### Performance Analysis Tools

Find the best solutions for analyzing your application's performance profile. Learn more >



#### Key Technologies

Learn more about parallel computing technologies and architectures.

Learn more >



#### **GPU-Accelerated Libraries**

Application accelerating can be as easy as calling a library function. Learn more >



#### **Debugging Solutions**

Powerful tools can help debug complex parallel applications in intuitive ways. Learn more >

### Accelerated Web Services

Micro services with visual and intelligent capabilities using deep learning. Learn more >



### Language and APIs

GPU acceleration can be accessed from most popular programming languages. Learn more >



#### Data Center Tools

Software Tools for every step of the HPC and AI software life cycle. Learn more >



#### Cluster Management

Managing your cluster and job scheduling can be simple and intuitive.

Learn more >

### https://developer.nvidia.com/tools-ecosystem



# CUDA Development Ecosystem (III)

| End user application                            | ons 150+ SD                     | Ks HPC                    | DATABASES AERIAL<br>5G     | METROPOLIS ISAA<br>SMART CITY ROBOT                                                |                   | CLARA NVIDIA<br>SENOMICS DRIVE   |
|-------------------------------------------------|---------------------------------|---------------------------|----------------------------|------------------------------------------------------------------------------------|-------------------|----------------------------------|
|                                                 | APF                             | LICATION SPE              | CIFIC LIBRARIES            | & FRAMEWOR                                                                         | RKS               |                                  |
| Simulation<br>Libraries                         | Deep Learning<br>Libraries      | Training<br>(DLFW)        | Inference                  | RAS<br>Data Al                                                                     |                   | Visualization                    |
| Modulus                                         | CUTLASS                         | PyTorch                   | Triton Inference<br>Server | cuDF                                                                               | cuSignal          | Omniverse                        |
| AmgX                                            | cuDNN                           | TensorFlow                | TensorRT                   | cuxfilter                                                                          | cuGraph           | cuQuantum                        |
| PhysX                                           | DALI                            | MxNet TLT                 | RAPIDS<br>Spark            | cuSpatial                                                                          | cuML              | MANY<br>OTHERS                   |
|                                                 |                                 | DEVE                      | LOPMENT & AN               | ALYSIS                                                                             |                   |                                  |
| Programming<br>Models                           | Compilers                       | Core<br>Libraries         | Math<br>Libraries          | Communication &<br>Storage Libraries                                               | DPUs & DOCA       | Profilers &<br>Debuggers         |
|                                                 | NVC++<br>NVC NVCC               | libcu++                   | cuBLAS cuTENSOR            | HPC-X<br>MPI                                                                       | DPI FLOW          | NSight NSight<br>Systems Compute |
| ISO C++ ISO<br>Fortran                          |                                 |                           |                            | UCK SHMEM                                                                          | The second second | CUPTI compute                    |
| ISO C++                                         | NVFORTRAN                       | Thrust & CUB              | cuSPARSE cuSOLVER          | SHARP HCOLL                                                                        | RegEx DPA         | sanitize                         |
| ISO C++ Fortran OpenACC Open                    | NVFORTRAN<br>libnvv nvrt<br>M C | Thrust & CUB<br>cuNumeric | cuSPARSE cuSOLVER          | SHARP         HCDLL           NVSHMEM         NCCL           MAGNUM         cuFile | RegEx DPA         | cuda-gdb                         |
| ISO C++ Fortran OpenACC Open MP CUDA CUDA C++ & | libNVV NVRT                     | cuNumeric                 |                            | NVSHMEM NCCL<br>MAGNUM<br>10 cuFile                                                |                   | sanitize                         |

### https://developer.nvidia.com/tools-ecosystem



### The CUDA tools

- NSight (debugger)
- NVPP (performance)
- Code samples
- ...and many more!!!











# **Questions?**



### TOPICS

## **CUDA threads**

GPU-accelerated vs. CPU-only Applications

**CUDA Kernel Execution** 

**Parallel Memory Access** 

GPU occupancy

Kernel occupancy

Appendix: Glossary

### **GPU-accelerated vs. CPU-only Applications**

In **CPU-only applications** data is allocated on CPU

| DATA |              |
|------|--------------|
|      |              |
| CPU  | initialize() |
|      | Time         |



...and all work is performed on CPU

| DATA |                                       |
|------|---------------------------------------|
| CPU  | <pre>initialize() performWork()</pre> |
|      | Time                                  |



...and all work is performed on CPU





|      |            | In <b>accelerated applications</b> data is allocated with <b>cudaMallocManaged()</b> |
|------|------------|--------------------------------------------------------------------------------------|
| DATA | GPU<br>CPU |                                                                                      |
| GPU  |            |                                                                                      |
| CPU  |            | ••••••                                                                               |
|      | Time       |                                                                                      |



... where it can be accessed and worked on by the CPU

| DATA |              |
|------|--------------|
| GPU  |              |
| CPU  | initialize() |
|      | Time         |





















### **CUDA Kernel Execution**

| DATA | GPU<br>CPU   |                     |              |
|------|--------------|---------------------|--------------|
| GPU  |              | performWork()       |              |
| CPU  | initialize() | cpuWork() cpuWork() | verifyWork() |
|      |              |                     | Time         |



GPUs do work in parallel







GPU work is done in a thread







Many threads run in parallel







A collection of threads is a **block** 







There are many blocks







A collection of blocks is a grid







GPU functions are called kernels







Kernels are **launched** with an **execution configuration** 







The execution configuration defines the number of blocks in the grid







... as well as the number of threads in each block



GPU



Every block in the grid contains the same number of threads







### **CUDA-Provided Thread Hierarchy Variables**

Inside kernels definitions, CUDAprovided variables describe its executing thread, block, and grid



### GPU



gridDim.x is the number of blocks in the grid, in this case 2









blockIdx.x is the index of the current block within the grid, in this case 0









blockIdx.x is the index of the current block within the grid, in this case 1









Inside a kernel **blockDim.x** describes the number of threads in a block. In this case **4** 

performWork<<2, 4>>>()





All blocks in a grid contain the same number of threads



























performWork<<2, 4>>>()





























## **Coordinating Parallel Threads**

| DATA | GPU<br>CPU   |               |              |
|------|--------------|---------------|--------------|
| GPU  |              | performWork() |              |
| CPU  | initialize() | cpuWork()     | verifyWork() |
|      |              |               | Time         |



























Recall that each thread has access to the size of its block via **blockDim.x** 























Using these variables, the formula threadIdx.x + blockIdx.x \* blockDim.x will map each thread to one element in the vector



| 0        | 4 | threadIdx.x | + blockIdx.x |   | * | blockDim.x |
|----------|---|-------------|--------------|---|---|------------|
|          |   | 0           |              | 0 |   | 4          |
| 1        | 5 | dataIndex   |              |   |   |            |
| 2        | 6 | ?           |              |   |   |            |
| <b>ि</b> | 7 |             |              |   |   |            |
| 3        | 7 |             |              |   |   |            |

































| 0 | 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|---|-------------|---|------------|---|------------|
|   |   | 2           |   | 0          |   | 4          |
| 1 | 5 | dataIndex   |   |            |   |            |
| 2 | 6 | ?           |   |            |   |            |
| 3 | 7 |             |   |            |   |            |



















| 0 | 4    | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|------|-------------|---|------------|---|------------|
|   |      | 3           |   | 0          |   | 4          |
| 1 | 5    | dataIndex   | ] |            |   |            |
| 2 | 6    | ?           |   |            |   |            |
|   | ] [] |             | J |            |   |            |
| 3 | 7    |             |   |            |   |            |





















GPU

GPU DATA

























DATA



| ре<br>0 | <pre>performWork&lt;&lt;&lt;2, 4&gt;&gt;&gt;() 0</pre> |   |   |   |   |   |   |   |   |  |
|---------|--------------------------------------------------------|---|---|---|---|---|---|---|---|--|
|         | 0                                                      | 1 | 2 | 3 | I | 0 | 1 | 2 | 3 |  |
|         |                                                        |   |   |   |   |   |   |   |   |  |















## **Grid Size Work Amount Mismatch**























Code must check that the **dataIndex** calculated by threadIdx.x + blockIdx.x \* blockDim.x is less than  $\mathbf{N}$ , the number of data elements. GPU DATA performWork <<< 2, 4>>> ()**GPU** 



| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|-------------|---|------------|---|------------|
|   | 0           |   | 1          |   | 4          |
|   | dataIndex   | < | N          | = | Can work   |
|   | 4           |   | 5          |   | ?          |











| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|-------------|---|------------|---|------------|
|   | 1           |   | 1          |   | 4          |
|   | dataIndex   | < | N          | = | Can work   |
|   | 5           |   | 5          |   | ?          |







| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|-------------|---|------------|---|------------|
|   | 1           |   | 1          |   | 4          |
|   | dataIndex   | < | N          | = | Can work   |
|   | 5           |   | 5          |   | false      |







| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|-------------|---|------------|---|------------|
|   | 2           |   | 1          |   | 4          |
|   | dataIndex   | < | N          | = | Can work   |
|   | 6           |   | 5          |   | ?          |







|   | 6           |   | 5          |   | false      |
|---|-------------|---|------------|---|------------|
|   | dataIndex   | < | N          | = | Can work   |
|   |             |   |            |   |            |
|   | 2           |   | 1          |   | 4          |
| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |







| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |
|---|-------------|---|------------|---|------------|
|   | 2           |   | 1          |   | 4          |
|   | dataIndex   | < | N          | = | Can work   |
|   | 6           |   | 5          |   | ?          |







|   | 6           |   | 5          |   | false      |
|---|-------------|---|------------|---|------------|
|   | dataIndex   | < | N          | = | Can work   |
|   | 2           |   | 1          |   | 4          |
| 4 | threadIdx.x | + | blockIdx.x | * | blockDim.x |





# **Grid-Stride Loops**





Often there are more data elements than there are threads in the grid









In such scenarios threads cannot work on only one element





















One way to address this programmatically is with a grid-stride loop









In a grid-stride loop, the thread's first element is calculated as usual, with threadIdx.x + blockIdx.x \* blockDim.x









The thread then strides forward by the number of threads in the grid (blockDim.x \* gridDim.x), in this case 8









It continues in this way until its data index is greater than the number of data elements









It continues in this way until its data index is greater than the number of data elements









With all threads working in this way, all elements are covered









With all threads working in this way, all elements are covered



















With all threads working in this way, all elements are covered











With all threads working in this way, all elements are covered









With all threads working in this way, all elements are covered









With all threads working in this way, all elements are covered









With all threads working in this way, all elements are covered









With all threads working in this way, all elements are covered







CUDA runs as many blocks in parallel at once as the GPU hardware supports, for massive parallelization



LEARNING



DEEP LEARNING INSTITUTE









DEEP LEARNING INSTITUTE







| ╺╺┰┰┰┰┓┏┰┰┰┓ |  |  |  |
|--------------|--|--|--|
|              |  |  |  |

DEEP LEARNING INSTITUTE









DEEP LEARNING INSTITUTE







#### Occupancy

### - GPU Occupancy

### - Kernel Occupancy







NVIDIA GPUs contain functional units called **Streaming Multiprocessors**, or **SMs** 



Blocks of threads are scheduled to run on SMs









kernel<<<24, 4>>>() 

Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM







Depending on the number of SMs on a GPU, and the requirements of a block, more than one block can be scheduled on an SM







Grid dimensions divisible by the number of SMs on a GPU can promote full SM utilization



Here there are fallow SMs







### **Kernel Occupancy**

compile your code with the following nvcc option:

--ptxas-options=-v

you will get something like this:

ptxas info : 0 bytes gmem

ptxas info : Compiling entry function '\_Z28MyKernelhS0\_S0\_PK3CDRS0\_Ph' for 'sm\_20'

ptxas info : Function properties for \_Z28MyKernelhS0\_S0\_PK3CDRS0\_Ph

24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 29 registers, 80 bytes cmem[0], 4 bytes cmem[16]

#### **Remember:**

In the H100 you have 256KB register per multiprocessor.

The total number of "threads X register per threads" cannot exceed this value!!





### **Kernal Occupancy**

#### Warp = 32 threads

| IVI                                                  | M                                                                 |                                                                    |                                                                        |                                                                                              |                                                          |            |                                                                          |                                                                          |                                                                    |                                                                         |                                                                                              |                                 |                       |        |
|------------------------------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------|------------|--------------------------------------------------------------------------|--------------------------------------------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|---------------------------------|-----------------------|--------|
|                                                      |                                                                   |                                                                    |                                                                        |                                                                                              |                                                          | L1 Instruc | tion Cache                                                               |                                                                          |                                                                    |                                                                         |                                                                                              |                                 |                       |        |
|                                                      | L0 Instruction Cache                                              |                                                                    |                                                                        |                                                                                              |                                                          |            | L0 Instruction Cache                                                     |                                                                          |                                                                    |                                                                         |                                                                                              |                                 |                       |        |
|                                                      | Wa                                                                |                                                                    |                                                                        |                                                                                              | hread/clk)                                               |            | Warp Scheduler (32 thread/clk)                                           |                                                                          |                                                                    |                                                                         |                                                                                              |                                 |                       |        |
|                                                      | Di                                                                | spatc                                                              | h Unit                                                                 | (32 th                                                                                       | read/clk)                                                |            |                                                                          | Di                                                                       | spatc                                                              | h Unit                                                                  | (32 th                                                                                       | read/o                          | cik)                  |        |
|                                                      | Reg                                                               | jister                                                             | File (                                                                 | 16,384                                                                                       | 4 x 32-bit)                                              |            |                                                                          | Reg                                                                      | jister                                                             | File (                                                                  | 16,38                                                                                        | 4 x 32                          | 2-bit)                |        |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         |                                                          |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         | F                               |                       |        |
| FP64                                                 | INT                                                               |                                                                    | FP32                                                                   | FP32                                                                                         |                                                          |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         | $\vdash$                        |                       |        |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         |                                                          |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         |                                 |                       |        |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         | TENSOF                                                   | TENSOR     | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         | TEN                             | TENSOR                | TENSOR |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         | CORE                                                     | CORE       | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         |                                 | DRE                   | CORE   |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         |                                                          |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         |                                 |                       |        |
| FP64                                                 | INT                                                               | INT                                                                | FP32                                                                   | FP32                                                                                         |                                                          |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         | $\vdash$                        |                       |        |
| FP64                                                 | INT                                                               |                                                                    | FP32                                                                   | FP32                                                                                         | $\square$                                                |            | FP64                                                                     | INT                                                                      | INT                                                                | FP32                                                                    | FP32                                                                                         | $\square$                       |                       |        |
| LD/ LD/<br>ST ST                                     | LD/<br>ST                                                         | LD/<br>ST                                                          | LD/<br>ST                                                              | LD/<br>ST                                                                                    | LD/ LD/<br>ST ST                                         | SFU        | LD/ LD/<br>ST ST                                                         | LD/<br>ST                                                                | LD/<br>ST                                                          | LD/<br>ST                                                               | LD/<br>ST                                                                                    | LD/<br>ST                       | LD/<br>ST             | SFU    |
|                                                      | L0 Instruction Cache                                              |                                                                    |                                                                        |                                                                                              |                                                          |            |                                                                          |                                                                          |                                                                    |                                                                         |                                                                                              |                                 |                       |        |
| Warp Scheduler (32 thread/clk)                       |                                                                   |                                                                    |                                                                        |                                                                                              |                                                          |            |                                                                          | LOI                                                                      |                                                                    | tion C                                                                  |                                                                                              |                                 |                       |        |
|                                                      | Wa                                                                | _                                                                  | _                                                                      | _                                                                                            |                                                          |            |                                                                          | Wai                                                                      | _                                                                  | nstruc<br>nedule                                                        | _                                                                                            | _                               | /clk)                 | _      |
|                                                      |                                                                   | rp Sch                                                             | nedule                                                                 | r (32 t                                                                                      |                                                          |            |                                                                          |                                                                          | rp Sch                                                             | _                                                                       | r (32 t                                                                                      | hread                           |                       |        |
|                                                      | Di                                                                | rp Scl<br>spatc                                                    | hedule<br>h Unit                                                       | r (32 t<br>(32 th                                                                            | hread/clk)                                               |            |                                                                          | Di                                                                       | rp Sch<br>spatci                                                   | nedule                                                                  | r (32 t<br>(32 th                                                                            | hread<br>read/o                 | cik)                  |        |
| FP64                                                 | Di                                                                | rp Scl<br>spatc                                                    | hedule<br>h Unit                                                       | r (32 t<br>(32 th<br>16,384                                                                  | hread/clk)<br>read/clk)                                  |            | FP64                                                                     | Di                                                                       | rp Sch<br>spatci                                                   | hedule<br>h Unit<br>File ('                                             | r (32 t<br>(32 th                                                                            | hread<br>read/o                 | cik)                  |        |
| FP64<br>FP64                                         | Di<br>Reg                                                         | rp Sch<br>spatc<br>jister                                          | File (                                                                 | r (32 t<br>(32 th<br>16,384                                                                  | hread/clk)<br>read/clk)                                  |            | FP64<br>FP64                                                             | Di<br>Reg                                                                | rp Sch<br>spatci<br>gister                                         | File (*                                                                 | r (32 t<br>(32 th<br>16,384                                                                  | hread<br>read/o                 | cik)                  |        |
|                                                      | Di<br>Reg                                                         | rp Sch<br>spatc<br>lister<br>INT                                   | File (<br>FP32                                                         | r (32 th<br>(32 th<br>16,384<br>FP32                                                         | hread/clk)<br>read/clk)                                  |            |                                                                          | Di<br>Reg                                                                | rp Sch<br>spatc<br>jister<br>INT                                   | File (<br>File (<br>FP32                                                | r (32 th<br>(32 th<br>16,384<br>FP32                                                         | hread<br>read/o                 | cik)                  |        |
| FP64                                                 | Di<br>Reg<br>INT<br>INT                                           | rp Sch<br>spatc<br>lister<br>INT<br>INT                            | File (<br>FP32                                                         | r (32 th<br>(32 th<br>16,384<br>FP32<br>FP32<br>FP32                                         | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF         | TENSOR     | FP64                                                                     | Di<br>Reg<br>INT<br>INT                                                  | rp Sch<br>spatc<br>jister<br>INT<br>INT                            | File (*<br>File (*<br>FP32<br>FP32<br>FP32                              | r (32 t<br>(32 th<br>16,38<br>FP32<br>FP32                                                   | hread<br>read/<br>4 x 32<br>TEN | 2-bit)                | TENSOR |
| FP64<br>FP64                                         | Reg<br>INT<br>INT<br>INT                                          | ister                                                              | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                         | r (32 th<br>(32 th<br>16,384<br>FP32<br>FP32<br>FP32                                         | hread/clk)<br>read/clk)<br>4 x 32-bit)                   |            | FP64<br>FP64                                                             | Di<br>Reg<br>INT<br>INT<br>INT                                           | rp Sch<br>spatc<br>jister<br>INT<br>INT                            | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                          | r (32 th<br>(32 th<br>16,38-<br>FP32<br>FP32<br>FP32                                         | hread<br>read/<br>4 x 32<br>TEN | elk)<br>2-bit)        | TENSOR |
| FP64<br>FP64<br>FP64                                 | Di<br>Reg<br>INT<br>INT<br>INT                                    | ister<br>INT<br>INT<br>INT<br>INT                                  | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | r (32 th<br>(32 th<br>16,38<br>FP32<br>FP32<br>FP32<br>FP32                                  | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF         | TENSOR     | FP64<br>FP64<br>FP64                                                     | Di<br>Reg<br>INT<br>INT<br>INT                                           | ister<br>INT<br>INT<br>INT<br>INT                                  | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                  | r (32 th<br>(32 th<br>16,38<br>FP32<br>FP32<br>FP32<br>FP32                                  | hread<br>read/<br>4 x 32<br>TEN | 2-bit)                |        |
| FP64<br>FP64<br>FP64<br>FP64                         | Di<br>Reg<br>INT<br>INT<br>INT<br>INT                             | INT<br>INT<br>INT<br>INT<br>INT<br>INT                             | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | r (32 th<br>(32 th<br>16,38<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                  | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF         | TENSOR     | FP64<br>FP64<br>FP64<br>FP64                                             | Di<br>Reg<br>INT<br>INT<br>INT<br>INT                                    | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT                      | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32            | r (32 th<br>(32 th<br>16,38-<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                         | hread<br>read/<br>4 x 32<br>TEN | 2-bit)                |        |
| FP64<br>FP64<br>FP64<br>FP64<br>FP64                 | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT                      | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT                      | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32           | r (32 th<br>(32 th<br>16,384<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32         | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF         | TENSOR     | FP64<br>FP64<br>FP64<br>FP64<br>FP64                                     | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT                             | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT                      | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32            | r (32 th<br>(32 th<br>16,38<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32          | hread<br>read/<br>4 x 32<br>TEN | 2-bit)                |        |
| FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64         | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT               | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT               | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | r (32 th<br>(32 th<br>16,384<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32         | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF         | TENSOR     | FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64                             | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT                      | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT               | File (*<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | r (32 th<br>(32 th<br>16,38<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32          | hread<br>read/<br>4 x 32<br>TEN | 2-bit)                |        |
| FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64 | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | r (32 th<br>(32 th<br>16,38-<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | hread/clk)<br>read/clk)<br>4 x 32-bit)<br>TENSOF<br>CORE | TENSOR     | FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>LD/LD/<br>ST LD/ | Di<br>Reg<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT | File (<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32  | r (32 th<br>(32 th<br>16,38-<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | tread/<br>4 x 32                | 2-bit)<br>ISOR<br>DRE | CORE   |

| Technical specifications                                                     |          | Compute capability (version) |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
|------------------------------------------------------------------------------|----------|------------------------------|------|------|-------|----------|------|------|-------|------|-----|----------------|------------------|------|-----|---------------|-----|
|                                                                              |          | 1.1                          | 1.2  | 1.3  | 2.x   | 3.0      | 3.2  | 3.5  | 3.7   | 5.0  | 5.2 | 5.3            | 6.0              | 6.1  | 6.2 | 7.0<br>(7.2?) | 7.5 |
| Maximum number of resident grids per device<br>(concurrent kernel execution) |          | t.b                          | .d.  |      | 1     | 6        | 4    |      |       | 32   |     | 16             | 128              | 32   | 16  | 12            | 8   |
| Maximum dimensionality of grid of thread blocks                              |          | :                            | 2    |      |       |          |      |      |       |      |     | 3              |                  |      |     |               |     |
| Maximum x-dimension of a grid of thread blocks                               |          |                              | 6553 | 5    |       |          |      |      |       |      |     | 2 <sup>3</sup> | <sup>1</sup> – 1 |      |     |               |     |
| Maximum y-, or z-dimension of a grid of thread blocks                        | 65535    |                              |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Maximum dimensionality of thread block                                       |          | 3                            |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Maximum x- or y-dimension of a block                                         | 512 1024 |                              |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Maximum z-dimension of a block                                               |          | 64                           |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Maximum number of threads per block                                          |          | 512 1024                     |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Warp size                                                                    | 32       |                              |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |
| Maximum number of resident blocks per multiprocessor                         |          |                              | 8    |      | 16 32 |          |      |      |       |      | 16  |                |                  |      |     |               |     |
| Maximum number of resident warps per multiprocessor                          |          | 24 32                        |      |      | 48    | 48 64 32 |      |      |       |      |     |                |                  |      | 32  |               |     |
| Maximum number of resident threads per multiprocessor                        |          | 768 1024                     |      | 24   | 1536  |          |      | 2048 |       |      |     |                | 1024             |      |     |               |     |
| Number of 32-bit registers per multiprocessor                                | 8        | К                            | 16   | к    | 32 K  |          | 64 K |      | 128 K |      |     |                |                  | 64 ł | ¢   |               |     |
| Maximum number of 32-bit registers per thread block                          | N/A      |                              | 32 K | 64 K | 32 K  |          | 6    | 4 K  |       | 32 K | 64  | K              | 32 K             | 64   | к   |               |     |
| Maximum number of 32-bit registers per thread                                |          | 13                           | 24   |      | 6     | 63 255   |      |      |       |      |     |                |                  |      |     |               |     |
|                                                                              |          |                              |      |      |       |          |      |      |       |      |     |                |                  |      |     |               |     |

Occupancy =

#### max number of active warps

max number of warps per SM

Block size should be a multiple of 32 (128,256 usually)!



#### Spreadsheet calculator

https://xmartlabs.github.io/cuda-calculator/

Or google for:

### **CUDA spreadsheet calculator excel**





### Glossary

### Glossary

- cudaMallocManaged(): CUDA function to allocate memory accessible by both the CPU and GPUs. Memory allocated this way is called *unified memory* and is automatically migrated between the CPU and GPUs as needed.
- cudaDeviceSynchronize(): CUDA function that will cause the CPU to wait until the GPU is finished working.
- Kernel: A CUDA function executed on a GPU.
- Thread: The unit of execution for CUDA kernels.
- **Block:** A collection of threads.
- Grid: A collection of blocks.
- Execution context: Special arguments given to CUDA kernels when launched using the <<<...>>> syntax. It defines the number of blocks in the grid, as well as the number of threads in each block.
- gridDim.x: CUDA variable available inside executing kernel that gives the number of blocks in the grid
- blockDim.x: CUDA variable available inside executing kernel that gives the number of threads in the thread's block
- blockIdx.x: CUDA variable available inside executing kernel that gives the index the thread's block within the grid
- threadIdx.x: CUDA variable available inside executing kernel that gives the index the thread within the block
- threadIdx.x + blockIdx.x \* blockDim.x: Common CUDA technique to map a thread to a data element
- Grid-stride loop: A technique for assigning a thread more than one data element to work on when there are more elements than the number of threads in the grid. The stride is calculated by gridDim.x \* blockDim.x, which is the number of threads in the grid.



## **Questions**?





### How to access to the course online

- 1. WIFI Info: eduroam
- 2. Browser Recommendation: Chrome
- 3. websocketstest.courses.nvidia.com
- 4. https://learn.nvidia.com/dli-event
- 5. Create an Nvidia Developer Account (if you have not done yet)
- Event code: STFC\_CUDA\_AMBASSADOR\_MAY24
- 7. Work through the Introduction Section and 'Start' launching your first GPU task





### **Memory Management**

- Type of Memories
- Unified memory behaviour
- Non-Unified memory
- cudaMemcpyAsync

## Type of Memories

#### **Device Memory:**

- Global Memory
- Texture Memory
- Local Memory
- Constant Memory



#### **On Chip Memory:**

- Shared Memory (L1 Cache)
- Registers





### **Device Memory**

| Memory   | Location | Access | Scope                | Lifetime    | CUDA qualifier                                    |
|----------|----------|--------|----------------------|-------------|---------------------------------------------------|
| Global   | DRAM     | RW     | All threads and host | Application | cudaMalloc<br>cudaMallocManaged<br>cudaMallocHost |
| Local    | DRAM     | RW     | 1 thread             | Thread      |                                                   |
| Constant | DRAM     | R      | All threads and host | Application | constant                                          |
| Texture  | DRAM     | R      | All threads and host | Application | texture                                           |

- Global memory is the largest memory on the GPU
- Local memory are local registers spilled into the global memory
- Constant memory is useful to store constant read in values (ex. g, c, R, etc.)
- Texture memory allows interpolation on 2D constant matrix values (ex. PVT table)



## On chip Memory

| Memory   | Location | Access | Scope                  | Lifetime | CUDA qualifier |
|----------|----------|--------|------------------------|----------|----------------|
| Register | On chip  | RW     | 1 thread               | Thread   |                |
| Shared   | On chip  | RW     | All threads in a block | Block    | shared         |
|          |          |        |                        |          |                |

- Registers per thread can be found at compile time
- Shared memory can enhance performance when locality is high



## Unified Memory (2014)





## Grace-Hopper Unified Memory (2022)

# Careful!! This concept is NOT the unified memory referred here!



In CUDA: -gpu=unified -gpu=unified (implies managed) https://developer.nvidia.com/blog/uni fied-memory-cuda-beginners/

In OpenACC -ta=tesla:managed





### New unified memory (I)







## New unified memory (II)







## New unified memory (III)







### **Unified Memory Behavior**

When **UM** is allocated, it may not be resident initially on the CPU or the GPU

|     | <b>T</b> <u>?</u>   |  |
|-----|---------------------|--|
| DAT | GPU                 |  |
|     | СРО                 |  |
|     |                     |  |
| GPU |                     |  |
|     |                     |  |
| CPU | cudaMallocManaged() |  |
|     |                     |  |
|     | Time                |  |
|     |                     |  |



| When some work asks for the memory for the first time, a <b>page fault</b> will |
|---------------------------------------------------------------------------------|
| occur                                                                           |
|                                                                                 |

| DATA | ?                                       |       |
|------|-----------------------------------------|-------|
| GPU  |                                         |       |
| CPU  | <pre> cudaMallocManaged() init() </pre> | ••••• |
|      | Time                                    |       |



The page fault will trigger the migration of the demanded memory

|      | ,                                     |        |
|------|---------------------------------------|--------|
| DATA | GPU<br>CPU                            |        |
| GPU  |                                       |        |
| CPU  | <pre>cudaMallocManaged()</pre> init() | •••••• |
|      | Time                                  |        |









This process repeats anytime the memory is requested somewhere in the system where it is not resident

| DATA | ?                   |              |  |
|------|---------------------|--------------|--|
| GPU  |                     | work<<<>>>() |  |
| CPU  | cudaMallocManaged() | init()       |  |
|      |                     | Time         |  |



If it is known that the memory **will be** accessed somewhere it is not resident, asynchronous prefetching can be used

| DATA | ?                   |                           |         |
|------|---------------------|---------------------------|---------|
| GPU  |                     | work<<<>>>()              |         |
| CPU  | cudaMallocManaged() | init() cudaMemPrefetchAsy | nc(cpu) |
|      |                     | Time                      |         |



This moves the memory in larger batches, and prevents page faulting





### **Non-Unified Memory**

Memory can be allocated directly to the GPU with **`cudaMalloc`** 

| DATA | GPU<br>CPU   |
|------|--------------|
| GPU  |              |
| CPU  | cudaMalloc() |
|      | Time         |



Memory can be allocated directly to the host with **`cudaMallocHost`** 

| DATA | GPU<br>CPU       |
|------|------------------|
| GPU  |                  |
| CPU  | cudaMallocHost() |
|      | Time             |



|      |                                   | Memory allocated in either of these<br>ways can be <b>copied</b> to other locations<br>in the system with <b>`cudaMemCpy</b> ` |
|------|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| DATA | GPUCPU                            |                                                                                                                                |
| GPU  |                                   |                                                                                                                                |
| CPU  | cudaMallocHost() cudaMemcpy(HtoD) |                                                                                                                                |
|      | Time                              |                                                                                                                                |



|      |                                   | Copying leaves 2 copies in of in the system |
|------|-----------------------------------|---------------------------------------------|
|      |                                   |                                             |
| DATA | GPU<br>CPU                        |                                             |
|      |                                   |                                             |
| GPU  |                                   |                                             |
|      |                                   |                                             |
| CPU  | cudaMallocHost() cudaMemcpy(HtoD) |                                             |
|      | Time                              |                                             |



# cudaMemcpyAsync

|     |                                        | `cudaMemcpyAsync` can<br>asynchronously transfer memory over<br>a non-default stream |
|-----|----------------------------------------|--------------------------------------------------------------------------------------|
| ΔΤΑ | GPU<br>CPU                             |                                                                                      |
| GPU |                                        |                                                                                      |
| CPU | cudaMallocHost() cudaMemcpyAsync(HtoD) |                                                                                      |
|     | Time                                   |                                                                                      |





| DATA | GPU<br>CPU           |  |
|------|----------------------|--|
| GPU  |                      |  |
| CPU  | cudaMallocHost() cpy |  |
|      | Time                 |  |

























## **CUDA streams**

- Default stream
- Concurrent streams

# **Default Stream**

A **stream** is a series of instructions, and CUDA has a **default stream** 

## **DEFAULT STREAM**



#### By default, CUDA kernels run in the default stream

|          | DEFAULT STREAM |               |
|----------|----------------|---------------|
| kernel 1 |                |               |
|          |                | $\rightarrow$ |



In any stream, including the default, an instruction in it (here a kernel launch) must complete before the next can begin

|          |          | DEFAULT STREAM |  |
|----------|----------|----------------|--|
| kernel 1 | kernel 2 |                |  |
|          |          |                |  |



In any stream, including the default, an instruction in it (here a kernel launch) must complete before the next can begin

| DEFAULT STREAM |          |          |          |          |
|----------------|----------|----------|----------|----------|
| kernel 1       | kernel 2 | kernel 3 | kernel 4 | kernel 5 |
|                |          | Time     |          |          |





## **Concurrent streams**

Non-default streams can also be created for kernel execution

| NON-DEFAULT STREAM 1 |   |
|----------------------|---|
|                      |   |
| DEFAULT STREAM       |   |
|                      |   |
|                      | Ì |



Non-default streams can also be created for kernel execution

| NON-DEFAULT STREAM 2 |
|----------------------|
|                      |
|                      |
|                      |
| NON-DEFAULT STREAM 1 |
|                      |
|                      |
|                      |
| DEFAULT STREAM       |
|                      |
|                      |
|                      |
|                      |























|          | NON-DEFAULT STREAM 2 |  |
|----------|----------------------|--|
|          |                      |  |
|          | NON-DEFAULT STREAM 1 |  |
| kernel 1 |                      |  |
|          | DEFAULT STREAM       |  |
|          |                      |  |
|          |                      |  |















| DEFAULT STREAM 2 |
|------------------|
|                  |
| DEFAULT STREAM 1 |
|                  |
|                  |
|                  |
| Time             |
|                  |























# **Questions?**



## **Profiler and other tools**

- NSight
- Debugger, cuda-memcheck, etc.

## The deployment cycle

- Analyze your code to determine most likely places needing parallelization or optimization.
- Parallelize your code by starting with the most time consuming parts, check for correctness and then analyze it again.
- Optimize your code to improve observed speed-up from parallelization.





## **NVidia tools**





## **Profiling via command line**

### Using Command Line Interface (CLI)

NVIDIA Nsight Systems CLI provides

- Simple interface to collect data
- Can be copied to any system and analysed later
- Profiles both serial and parallel code
- For more info enter nsys --help on the terminal

To profile a serial application with NVIDIA Nsight Systems, we use NVIDIA Tools Extension (NVTX) API functions in addition to collecting backtraces while sampling.



## **Profiling using NVTX (I)**

### NVIDIA Tools Extension API (NVTX) library

#### What is it?

- A C-based Application Programming Interface (API) for annotating events
- Can be easily integrated to the application
- Can be used with NVIDIA Nsight Systems

### Why?

- Allows manual instrumentation of the application
- Allows additional information for profiling (e.g: tracing of CPU events and time ranges)

#### How?

- Import the header only C library nvToolsExt.h
- Wrap the code region or a specific function with nvtxRangePush() and nvtxRangPop()



## Profiling using NVTX (II) SEQUENTIAL

| tinclude <string.h><br/>tinclude <stdio.h></stdio.h></string.h>                                                                       | -t                                                                                                                | Selects the APIs to be traced (nvtx in this example)                                                      |                                                              |
|---------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
| tinclude <stdlib.h><br/>tinclude <omp.h></omp.h></stdlib.h>                                                                           | status                                                                                                            | if true, generates summary of statistics after the collection                                             |                                                              |
| include "laplace2d.h"<br>include <nvtx3 nvtoolsext.h=""></nvtx3>                                                                      | - b                                                                                                               | Selects the backtrace method to use while sampling. The option dwarf                                      |                                                              |
| nt main(int argc, char** argv)                                                                                                        |                                                                                                                   | uses DWARF's CFI (Call Frame Information).                                                                |                                                              |
| const int n = 4096;<br>const int m = 4096;                                                                                            | force-overwrite                                                                                                   | if true, overwrites the existing results                                                                  |                                                              |
| const int iter_max = 1000;                                                                                                            | -0                                                                                                                | sets the output (qdrep) filename                                                                          |                                                              |
| const double tol = 1.0e-6;<br>double error = 1.0;                                                                                     |                                                                                                                   |                                                                                                           |                                                              |
| <pre>double *restrict A = (double*)malloc(sizeof(double)*n*m);<br/>double *restrict Anew = (double*)malloc(sizeof(double)*n*m);</pre> | Collecting data<br>Jacobi relaxation Calculation: 4096 x                                                          | aining-materials/labs/module4/English/C/solutions/parallelS nsys profile -t nvtxstats=true -<br>4096 mesh | b dwarfforce-overwrite true -o laplace-seq ./laplace-        |
| <pre>nvtxRangePushA("init");<br/>initialize(A, Anew, m, n);<br/>nvtxRangePop();</pre>                                                 | 0, 0.250000<br>100, 0.002397<br>200, 0.001204<br>300, 0.000804<br>400, 0.000603<br>500, 0.000483<br>600, 0.000403 |                                                                                                           |                                                              |
| printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);                                                                        | 600, 0.000403<br>700, 0.000345<br>800, 0.000302                                                                   |                                                                                                           |                                                              |
| <pre>double st = omp_get_wtime();<br/>int iter = 0;</pre>                                                                             | 900, 0.000269<br>total: 55.754501 s<br>Processing events                                                          |                                                                                                           |                                                              |
| <pre>nvtxRangePushA("while");<br/>while ( error &gt; tol &amp;&amp; iter &lt; iter max )</pre>                                        |                                                                                                                   | ode/openacc-training-materials/labs/module4/English/C/solutions/parallel/laplace-seq.qdstrm" fi           | le to disk                                                   |
| <pre>nvtxRangePushA("calc");</pre>                                                                                                    | Saved report file to "/home/mozhgank/                                                                             | ======================================                                                                    |                                                              |
| error = calcNext(A, Anew, m, n);<br>nvtxRangePop();                                                                                   | Exported successfully to                                                                                          | -<br>materials/labs/module4/English/C/solutions/parallel/laplace-seq.sqlite                               |                                                              |
| <pre>nvtxRangePushA("swap");</pre>                                                                                                    | Generating NVTX Push-Pop Range Statis<br>NVTX Push-Pop Range Statistics (nanos                                    | tics                                                                                                      |                                                              |
| swap(A, Anew, m, n);<br>nvtxRangePop();                                                                                               | Time(%) Total Time Instances                                                                                      | Average Minimum Maximum Range                                                                             |                                                              |
| if(iter % 100 == 0)                                                                                                                   | 49.9         55754497966         1           26.5         29577817696         1000                                | 55754497966.0 55754497966 55754497966 while NVTX range                                                    | $\langle \rangle$                                            |
| iter++;<br>}                                                                                                                          | 23.4         26163892482         1000           0.1         137489808         1                                   | 26163892.5 25761418 60129514 swap<br>137489808.0 137489808 137489808 init Statistics                      |                                                              |
| nvtxRangePop();                                                                                                                       |                                                                                                                   |                                                                                                           |                                                              |
| <pre>double runtime = omp_get_wtime() - st;</pre>                                                                                     | "calc" region (calcNext fu                                                                                        | inction) takes 26.6%                                                                                      |                                                              |
| <pre>printf(" total: %f s\n", runtime);</pre>                                                                                         | "swap" region (swap func                                                                                          | ction) takes 23.4% of                                                                                     |                                                              |
| deallocate(A, Anew);                                                                                                                  | total executi                                                                                                     | on time                                                                                                   |                                                              |
| return 0;                                                                                                                             |                                                                                                                   |                                                                                                           | Open laplace-seq.qdrep with<br>Nsight System GUI to view the |
| jacobi.c                                                                                                                              | 1                                                                                                                 |                                                                                                           | timeline                                                     |
| jacobi.c<br>(starting and ending of ranges are<br>highlighted with the same color)                                                    |                                                                                                                   |                                                                                                           | umenne                                                       |

## **Profiling using NVTX (III) PARALLEL**

total execution time

#include <math.h> #include <stdlib.h> #define OFFSET(x, y, m) (((x)\*(m)) + (y)) void initialize(double \*restrict A, double \*restrict Anew, int m, int n) { memset(A, 0, n \* m \* sizeof(double)); memset(Anew, 0, n \* m \* sizeof(double)); for(int i = 0; i < m; i++){</pre> A[i] = 1.0;Anew[i] = 1.0; double calcNext(double \*restrict A, double \*restrict Anew, int m, int n) double error = 0.0; #pragma acc parallel loop reduction(max:err) for( int j = 1; j < n-1; j++)</pre> #pragma acc loop for( int i = 1; i < m-1; i++ )</pre> Anew[OFFSET(j, i, m)] = 0.25 \* ( A[OFFSET(j, i+1, m)] + A[OFFSET(j, i-1, m)] + A[OFFSET(j-1, i, m)] + A[OFFSET(j+1, i, m)]); error = max( error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i , m)])); 3 return error; void swap(double \*restrict A, double \*restrict Anew, int m, int n) #pragma acc parallel loop for( int j = 1; j < n-1; j++)</pre> { #pragma acc loor for( int i = 1; i < m-1; i++ )</pre> A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)]; void deallocate(double \*restrict A, double \*restrict Anew) free(A); free(Anew);

laplace2d.c (Parallelised using OpenACC parallel directives (pragmas highlighted)



This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)

## **Profiling using Nsight**

### Using Nsight Systems

Open the generated report files (\*.qdrep) from command line in the Nsight Systems profiler.

File > Open





NVIDIA System Profiler 4.0

File Yew Help

Select device for profiling.



## **Other tools**



#### Arm Forge Debugger (formerly Allinea DDT)

Provides application developers with a single tool that can debug hybrid MPI, OpenMP, CUDA and OpenACC applications on a single workstation or GPU cluster.



#### CUDA-GDB

Delivers a seamless debugging experience that allows you to debug both the CPU and GPU portions of your application simultaneously. Use CUDA-GDB on Linux or MacOS, from the command line, DDD or EMACS.



#### TotalView

A GUI-based tool that allows you to debug one or many processes/threads with complete control over program execution, from basic debugging operations like stepping through code to concurrent programs that take advantage of threads, OpenMP, MPI, or GPUs.



#### **COMPUTE SANITIZER**

Compute Sanitizer is a functional correctness checking suite included in the CUDA toolkit. This suite contains multiple tools that can perform different type of checks. The memcheck tool is capable of precisely detecting and attributing out of bounds and misaligned memory access errors in CUDA applications. The tool can

#### If you can, do not debug via print!!!

#### You could use instead Visual Studio Code with Nsight

| Outcome         Operation         Operation |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Interpretation         Control                                              |
| 227<br>228 for(int 1 + 0; 1 < num_lights; ++1) =<br>10 % +1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Lock + # X Callock + # X Callock + # X Balapoint                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Officialized 4 contractinged long     (MVCA interval)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ▶ ● blocklidk (s=0 y=1 z=0) const sint3           S ● blocklidk (s=0 y=1 z=0)         const sint3         S ● optioPathTasce.cu, line 207         break always                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| ► W threadds (xx+5 yx9 zx0) cost unt3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| b @ gidQim [stokytokizi] cont.dm]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |

#### https://developer.nvidia.com/nsight-visualstudio-code-edition



## - Porting DL\_MESO to CUDA

- DL\_MESO on multi-GPU

## An Hartree application: porting DL\_MESO to CUDA







## Porting DL\_MESO to GPUs

#### What is DPD and DL\_MESO?





...coarse grain representation using beads





DL\_MESO: highly scalable mesoscale simulations Molecular Simulation 39 (10) pp. 796-821, 2013



### Main problem: memory access pattern



particle locations are stored in a continuous order... Very uncoalescent access to the memory!

### Reorganize the cell-linked array





### Speedup on single GPU

Speedup on GPU vs:

- AMD EPYC 7402 (Rome) 24 cores
- Intel Xeon(R) W-2133 CPU @ 3.60GHz 6 cores



Castagna et al. "Towards Extreme Scale Dissipative Particle Dynamics Simulations using Multiple GPGPUs" Comput. Physi. Comm. (2020)

## Multi GPU version





### Overlap computation with communication





### Scaling on different supercomputers



proper overlap computation-communication has a strong impact on scaling!

## Largest simulation: 14 billion particles!

Water drop between two surfaces on 8 GPUs 100 CPU 1 CPU 2 CPU



Animation **Water Drop** formation on 8 GPUs showing the impact of load balancing routine ALL (ALL is from Julich Supercomputer Centre)

D. Di Giusto and J. Castagna et al. "Scalable algorithm for many-body Dissipative Particle Dynamics using multiple General Purpose Graphic Processing Units" Comput. Physi. Comm. (2022)

#### **Adding Load Balance**



### Resume

- DL\_MESO has been ported to single and multi-GPU
   nNVidia GPUs using CUDA language
- Good scaling up to 4096 GPU
- We can now run very large **DPD** simulations (**14 billions**)
- Load balance allows to run simulations without out of memory on the GPU, as well as save computational time
- 2 publications on journal paper (CPC)



# **Questions?**



