$ nvaccelinfo
CUDA Driver Version: 11040
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.199.02 Thu May 11 11:46:56 UTC 2023
Device Number: 0
Device Name: NVIDIA T400
Device Revision Number: 7.5
Global Memory Size: 1967259648
Number of Multiprocessors: 6
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
**Warp Size: 32**
**Maximum Threads per Block: 1024**
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1425 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5001 MHz
Memory Bus Width: 64 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 1024
Async Engines: 3
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc75
Since the warp size is 32, the number of threads per block should be chosen as a multiple of 32.
<aside> 🎯 The minimum is 1 thread per block, but the number should be a multiple of 32 (one warp), because the warp scheduler issues threads in groups of 32.
</aside>
The maximum number of threads per block is 1024.
The number of blocks should then be:
$Number\ Of\ Blocks = \cfrac{Problem\ Size}{Threads\ Per\ Block}$
rounded up, so that the whole problem is covered even when the division is not exact.
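For example, with a hypothetical problem of $1\,000\,000$ elements and $256$ threads per block:
$Number\ Of\ Blocks = \left\lceil \cfrac{1\,000\,000}{256} \right\rceil = 3907$
so the last block is only partially filled.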
To decide the actual block size, the key is benchmarking: try a few candidate sizes and measure which runs fastest on the target GPU.
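A minimal sketch of such a benchmark, assuming a hypothetical kernel `step_kernel` and the `blocksPerGrid` / `threadsPerBlock` configuration declared in the next step, times a single launch with CUDA events:

```c
// Benchmarking sketch (illustrative): time one launch of a hypothetical kernel.
// step_kernel, dev_temp1, dev_temp2, ni, nj, threadDim are assumed to exist elsewhere.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
step_kernel<<<blocksPerGrid, threadsPerBlock>>>(dev_temp1, dev_temp2, ni, nj);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("block %dx%d: %.3f ms\n", threadDim, threadDim, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Repeating this for several values of `threadDim` (e.g. 8, 16, 32) shows which configuration performs best on this particular card.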
First of all, we have to declare the grid of blocks and the number of threads per block:
```c
// 2D grid: enough blocks to cover ni x nj elements, threadDim x threadDim threads per block
dim3 blocksPerGrid((ni + threadDim - 1) / threadDim, (nj + threadDim - 1) / threadDim);
dim3 threadsPerBlock(threadDim, threadDim);
```
The `(ni + threadDim - 1)` part of this formula is used to handle cases where `ni` and `nj` are not evenly divisible by `threadDim`.
The idea is to ensure that you have enough blocks to cover the entire grid.
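With these dimensions the kernel is launched over the 2D grid. The sketch below is illustrative (the kernel name `step_kernel` and its body are assumptions, not the original code); it shows how each thread derives its element indices and why a bounds check is needed for the partially filled blocks at the edges:

```c
// Illustrative kernel: each thread handles one (i, j) element of the grid
// and the guard skips the excess threads of the last, partially filled blocks.
__global__ void step_kernel(const float *in, float *out, int ni, int nj)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    if (i < ni && j < nj) {                   // guard against out-of-range threads
        out[j * ni + i] = in[j * ni + i];     // placeholder computation
    }
}

// Launch with the dimensions declared above
step_kernel<<<blocksPerGrid, threadsPerBlock>>>(dev_temp1, dev_temp2, ni, nj);
```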
Then allocate, in the GPU memory, the space for the arrays used in the calculation, and copy the arrays from the host to the device (the GPU):
```c
// Allocate the right space in the memory of the GPU
cudaMalloc((void **)&dev_temp1, size);
cudaMalloc((void **)&dev_temp2, size);

// Copy the input arrays from the host to the device
cudaMemcpy(dev_temp1, temp1, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_temp2, temp2, size, cudaMemcpyHostToDevice);

... // perform the computations on the GPU ...

// Copy the final result back to the host
cudaMemcpy(temp1, dev_temp1, size, cudaMemcpyDeviceToHost);
```
Calls to `cudaDeviceSynchronize()` are useless after these copies, because `cudaMemcpy` is a blocking (synchronous) call: it returns only once the transfer has completed.
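In other words, a blocking copy already acts as a synchronization point on the default stream; a kernel launched just before it (sketched here with the hypothetical `step_kernel`) is guaranteed to have finished before the copied data is read on the host:

```c
step_kernel<<<blocksPerGrid, threadsPerBlock>>>(dev_temp1, dev_temp2, ni, nj);
// No cudaDeviceSynchronize() needed here: a blocking cudaMemcpy on the default
// stream waits for the preceding kernel to finish, then blocks until the copy is done.
cudaMemcpy(temp1, dev_temp1, size, cudaMemcpyDeviceToHost);
```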
Finally, change the pointer swapping so that it swaps the device pointers instead of the host pointers:
```c
// Swap the device pointers between iterations
float *temp_tmp_dev = dev_temp1;
dev_temp1 = dev_temp2;
dev_temp2 = temp_tmp_dev;
```
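Putting the pieces together, a minimal sketch of the main loop (the kernel name `step_kernel` and the iteration count `nsteps` are illustrative assumptions) launches the kernel, swaps the device pointers at each iteration, and copies the result back to the host only once at the end:

```c
for (int step = 0; step < nsteps; ++step) {
    // Read from dev_temp1, write to dev_temp2
    step_kernel<<<blocksPerGrid, threadsPerBlock>>>(dev_temp1, dev_temp2, ni, nj);

    // Swap the device pointers: the output of this step becomes the input of the next
    float *temp_tmp_dev = dev_temp1;
    dev_temp1 = dev_temp2;
    dev_temp2 = temp_tmp_dev;
}

// After the loop, the most recent result is in dev_temp1 (because of the final swap)
cudaMemcpy(temp1, dev_temp1, size, cudaMemcpyDeviceToHost);

// Release the device memory
cudaFree(dev_temp1);
cudaFree(dev_temp2);
```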