
3/07/2025

Check whether my torch build supports the GPU

checkgpu.py

```python
import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check if CUDA/ROCm is available (unified API in newer PyTorch)
print(f"Is GPU available: {torch.cuda.is_available()}")

# Check how many GPUs are available
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    # Print device properties for each GPU
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"\nDevice {i}: {props.name}")
        print(f" Total memory: {props.total_memory / 1024**3:.2f} GB")
        if hasattr(props, 'major'):
            print(f" Compute capability: {props.major}.{props.minor}")

# Try a simple operation on GPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")  # Use the first GPU
    x = torch.ones(5, 5, device=device)
    y = x + 1
    print("\nGPU computation test:")
    print(y)
    print("GPU computation successful! study.marearts.com")
else:
    print("\nNo GPUs available for PyTorch.")
```

🙏

Thank you!

9/17/2024

Creating a custom CUDA kernel that directly utilizes tensor cores



Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through the Warp Matrix Multiply-Accumulate (WMMA) API exposed in CUDA. Here's an overview of how to create a kernel that runs on tensor cores:

1. Use the CUDA WMMA API:
The WMMA API (in the `nvcuda::wmma` namespace) allows you to program tensor cores directly in your CUDA kernels.

2. Include necessary headers:
```cpp
#include <mma.h>
#include <cuda_fp16.h>
```

3. Use appropriate data types:
Tensor cores work with specific data types like half precision floating point (`__half`).

4. Define matrix fragments:
Use `nvcuda::wmma::fragment` to define matrix fragments that will be processed by tensor cores.

5. Load, compute, and store operations:
Use WMMA load, multiply-accumulate, and store operations.

Here's an example kernel that uses tensor cores via the WMMA API:



```cuda
#include <mma.h>
#include <cuda_fp16.h>

// Define matrix dimensions
const int M = 16;
const int N = 16;
const int K = 16;

__global__ void wmma_example(half *a, half *b, float *c) {
    // Declare the fragments
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, M, N, K, half, nvcuda::wmma::col_major> a_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, M, N, K, half, nvcuda::wmma::col_major> b_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::accumulator, M, N, K, float> c_frag;

    // Initialize the output to zero
    nvcuda::wmma::fill_fragment(c_frag, 0.0f);

    // Load the inputs
    nvcuda::wmma::load_matrix_sync(a_frag, a, K);
    nvcuda::wmma::load_matrix_sync(b_frag, b, K);

    // Perform the matrix multiplication
    nvcuda::wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the output
    nvcuda::wmma::store_matrix_sync(c, c_frag, N, nvcuda::wmma::mem_row_major);
}

// Host function to launch the kernel
void launch_wmma_kernel(half *a, half *b, float *c) {
    dim3 gridDim(1);
    dim3 blockDim(32); // One warp
    wmma_example<<<gridDim, blockDim>>>(a, b, c);
}
```

Key points about this example:

1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).

To use this kernel:

1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`, as shown in the host-side sketch below.
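
For completeness, here is a rough host-side sketch of step 3: it allocates the 16x16 matrices, fills them with example values, and calls `launch_wmma_kernel` from the code above. It assumes it lives in the same `.cu` file as the kernel (so it can reuse the file-scope `M`, `N`, `K`); the fill values and the printed sanity check are only illustrative.

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

// Hedged host-side sketch: assumes it is compiled in the same .cu file as
// wmma_example / launch_wmma_kernel above, so M, N, K (all 16) are visible.
int main() {
    // Host buffers: half-precision inputs, single-precision output.
    std::vector<half> h_a(M * K), h_b(K * N);
    std::vector<float> h_c(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) h_a[i] = __float2half(1.0f); // example data
    for (int i = 0; i < K * N; ++i) h_b[i] = __float2half(2.0f); // example data

    // Device buffers.
    half *d_a, *d_b;
    float *d_c;
    cudaMalloc(&d_a, M * K * sizeof(half));
    cudaMalloc(&d_b, K * N * sizeof(half));
    cudaMalloc(&d_c, M * N * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), M * K * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), K * N * sizeof(half), cudaMemcpyHostToDevice);

    // Launch the tensor-core kernel and wait for it to finish.
    launch_wmma_kernel(d_a, d_b, d_c);
    cudaDeviceSynchronize();

    // Copy back and sanity-check one element: with all-ones times all-twos
    // inputs, every output element should be 2 * K = 32.
    cudaMemcpy(h_c.data(), d_c, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```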

Important considerations:

1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.

Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.

8/22/2024

ROCm HIP asynchronous operation sample code

 





HIP (Heterogeneous-Compute Interface for Portability) provides similar functionality to CUDA streams for asynchronous execution. The concepts and usage are very similar, making it easier to port CUDA code to HIP. Here's an overview of HIP's equivalent features for asynchronous execution:

1. HIP Streams:
In HIP, streams are represented by the `hipStream_t` type, which is analogous to CUDA's `cudaStream_t`.

2. Creating and Destroying Streams:
```cpp
hipStream_t stream;
hipError_t hipStreamCreate(hipStream_t* stream);
hipError_t hipStreamDestroy(hipStream_t stream);
```

3. Asynchronous Memory Operations:
```cpp
hipError_t hipMemcpyAsync(void* dst, const void* src, size_t count, hipMemcpyKind kind, hipStream_t stream);
hipError_t hipMemsetAsync(void* dst, int value, size_t count, hipStream_t stream);
```

4. Launching Kernels on Streams:
```cpp
hipLaunchKernelGGL(kernel, dim3(gridSize), dim3(blockSize), 0, stream, /* kernel arguments */);
```

5. Stream Synchronization:
```cpp
hipError_t hipStreamSynchronize(hipStream_t stream);
hipError_t hipDeviceSynchronize();
```

6. Stream Query:
```cpp
hipError_t hipStreamQuery(hipStream_t stream);
```

7. Stream Callbacks:
```cpp
hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void* userData, unsigned int flags);
```

8. Stream Priorities:
```cpp
hipError_t hipStreamCreateWithPriority(hipStream_t* stream, unsigned int flags, int priority);
```

Here's a simple example demonstrating asynchronous execution with HIP streams:

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 1000000
#define STREAMS 4

__global__ void vectorAdd(float* a, float* b, float* c, int numElements) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < numElements) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    float *h_a, *h_b, *h_c;
    float *d_a, *d_b, *d_c;
    size_t size = N * sizeof(float);

    // Allocate host memory
    h_a = (float*)malloc(size);
    h_b = (float*)malloc(size);
    h_c = (float*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_a[i] = rand() / (float)RAND_MAX;
        h_b[i] = rand() / (float)RAND_MAX;
    }

    // Allocate device memory
    hipMalloc(&d_a, size);
    hipMalloc(&d_b, size);
    hipMalloc(&d_c, size);

    // Create streams
    hipStream_t streams[STREAMS];
    for (int i = 0; i < STREAMS; i++) {
        hipStreamCreate(&streams[i]);
    }

    // Launch kernel on multiple streams
    int streamSize = N / STREAMS;
    dim3 blockSize(256);
    dim3 gridSize((streamSize + blockSize.x - 1) / blockSize.x);

    for (int i = 0; i < STREAMS; i++) {
        int offset = i * streamSize;
        hipMemcpyAsync(&d_a[offset], &h_a[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
        hipMemcpyAsync(&d_b[offset], &h_b[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
        hipLaunchKernelGGL(vectorAdd, gridSize, blockSize, 0, streams[i], &d_a[offset], &d_b[offset], &d_c[offset], streamSize);
        hipMemcpyAsync(&h_c[offset], &d_c[offset], streamSize * sizeof(float), hipMemcpyDeviceToHost, streams[i]);
    }

    // Synchronize all streams
    for (int i = 0; i < STREAMS; i++) {
        hipStreamSynchronize(streams[i]);
    }

    // Verify results
    for (int i = 0; i < N; i++) {
        if (fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5) {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(1);
        }
    }

    printf("Test PASSED\n");

    // Clean up
    for (int i = 0; i < STREAMS; i++) {
        hipStreamDestroy(streams[i]);
    }
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
```

This example demonstrates how to use multiple streams to overlap computation and data transfer, just as you would with CUDA streams. The key points are:

1. Creating multiple streams
2. Using `hipMemcpyAsync` for asynchronous data transfer
3. Launching kernels on specific streams
4. Synchronizing streams after all operations are queued

By using streams, you can potentially improve performance by overlapping operations and utilizing the GPU more efficiently.
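
As a small addendum, item 7 above (stream callbacks) is not exercised by the example. Here is a hedged, minimal sketch of how `hipStreamAddCallback` can be wired in, assuming a working HIP setup; the callback name and the tag string are made up for illustration.

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>

// Hypothetical host callback: runs once all work queued on the stream
// before hipStreamAddCallback() has completed.
void myStreamCallback(hipStream_t stream, hipError_t status, void* userData) {
    const char* tag = (const char*)userData;
    printf("Stream callback fired (%s), status = %d\n", tag, (int)status);
}

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);

    // Queue some asynchronous work on the stream (an async memset here).
    float* d_buf;
    hipMalloc(&d_buf, 1024 * sizeof(float));
    hipMemsetAsync(d_buf, 0, 1024 * sizeof(float), stream);

    // Register the callback; it fires after the memset above finishes.
    static char tag[] = "demo";
    hipStreamAddCallback(stream, myStreamCallback, tag, 0);

    hipStreamSynchronize(stream);
    hipFree(d_buf);
    hipStreamDestroy(stream);
    return 0;
}
```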

hipcc install on CUDA system, version 2

Please follow this process:

1. First, add the ROCm repository (if you haven't already):
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

2. Update your package list:
sudo apt update

3. Install only the HIP compiler and development tools:
sudo apt install hip-base hip-doc
This should install the basic HIP tools without the full runtime that caused issues before.

4. After installation, add the HIP binaries to your PATH. Add this line to your ~/.bashrc file:
export PATH=$PATH:/opt/rocm/bin

5. Then, apply the changes:
source ~/.bashrc

6. Verify the installation:
hipcc --version
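
If you want a quick compile test on top of `hipcc --version`, a throwaway file like the one below can be used (the file name `hip_check.cpp` is arbitrary, and on a machine with only the base tools installed you may still need a CUDA or ROCm runtime present for the link step):

```cpp
// hip_check.cpp -- hypothetical smoke test for the hipcc installation.
// Compile with:  hipcc hip_check.cpp -o hip_check
#include <hip/hip_runtime.h>
#include <stdio.h>

int main() {
    int deviceCount = 0;
    hipError_t err = hipGetDeviceCount(&deviceCount);
    if (err != hipSuccess) {
        printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    printf("HIP sees %d device(s)\n", deviceCount);
    return 0;
}
```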

Install HIP (ROCm) compiler on CUDA system.


Try this process.


1. First, add the ROCm repository to your system. For Ubuntu, you can use these commands:

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list


2. Update your package list:

sudo apt update


3. Install the HIP runtime and compiler for CUDA:

sudo apt install hip-runtime-nvidia hip-dev


4. Set up environment variables. Add these lines to your `~/.bashrc` file:

export HIP_PLATFORM=nvidia
export PATH=$PATH:/opt/rocm/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib

Then run `source ~/.bashrc` to apply the changes.


5. Verify the installation:

hipconfig --full


6. Now try compiling your code again:

hipcc vector_add.cpp -o vector_add
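
The `vector_add.cpp` in step 6 refers to your own source file; for reference, here is a minimal hedged sketch of what such a file could look like (the names, sizes, and launch configuration are arbitrary):

```cpp
// vector_add.cpp -- hypothetical minimal HIP program used to test hipcc.
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device data
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);
    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    // 256 threads per block, enough blocks to cover n elements
    dim3 block(256), grid((n + 255) / 256);
    hipLaunchKernelGGL(vectorAdd, grid, block, 0, 0, d_a, d_b, d_c, n);
    hipDeviceSynchronize();

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```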


1/30/2024

Checking that torch + CUDA are installed correctly

 

 

Run this script 

```python
import torch
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

def check_cuda_setup():
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")

    if cuda_available:
        cuda_version = torch.version.cuda
        print(f"CUDA version (PyTorch): {cuda_version}")

    try:
        # Attempt to create a CUDA extension
        ext = CUDAExtension(
            name='test_ext',
            sources=[]
        )
        print("CUDAExtension can be created successfully.")
    except Exception as e:
        print(f"Error creating CUDAExtension: {e}")

    try:
        # Attempt to create a BuildExtension object
        build_ext = BuildExtension()
        print("BuildExtension can be created successfully.")
    except Exception as e:
        print(f"Error creating BuildExtension: {e}")

if __name__ == "__main__":
    check_cuda_setup()
```

If it reports 'CUDA available: False', you need to fix your system.

Thank you.


2/19/2023

How to Install OpenCV 4.7 with CUDA, cuDNN, TBB, CUDA Video Codec, and Extra Modules in Linux

 


Refer to the bash script below.


```bash
#!/bin/bash

# Install dependencies
sudo apt-get update
sudo apt-get install build-essential cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libdc1394-22-dev
sudo apt-get install libcanberra-gtk-module libcanberra-gtk3-module

# Install CUDA 11
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.57.02_linux.run
sudo sh cuda_11.4.0_470.57.02_linux.run --silent --toolkit --override
echo 'export PATH=/usr/local/cuda-11.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Download and extract TBB (libraries and headers)
wget https://github.com/oneapi-src/oneTBB/releases/download/v2022.0.0/oneapi-tbb-2022.0.0-lin.tgz
tar -xf oneapi-tbb-2022.0.0-lin.tgz
sudo cp -r oneapi-tbb-2022.0.0/lib/* /usr/local/lib/
sudo cp -r oneapi-tbb-2022.0.0/include/* /usr/local/include/
echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Download and extract OpenCV 4.7 and the OpenCV extra modules
wget https://github.com/opencv/opencv/archive/4.7.0.zip
unzip 4.7.0.zip
cd opencv-4.7.0

wget https://github.com/opencv/opencv_contrib/archive/4.7.0.zip
unzip 4.7.0.zip

# Build and install OpenCV 4.7 with CUDA, cuDNN, TBB, CUDA video codec, and the extra modules
# (opencv_contrib was extracted inside opencv-4.7.0, hence the ../opencv_contrib-4.7.0 path)
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib-4.7.0/modules -D WITH_CUDA=ON -D WITH_TBB=ON -D WITH_NVCUVID=ON -D WITH_GSTREAMER=ON -D WITH_GSTREAMER_0_10=OFF -D WITH_LIBV4L=ON -D WITH_CUDNN=ON -D CUDA_ARCH_BIN=7.5 ..
make -j$(nproc)
sudo make install
echo 'export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH' >> ~/.bashrc
source ~/.bashrc

# Compile and run the sample code
cd ../../
wget https://raw.githubusercontent.com/spmallick/learnopencv/master/Averaging4kVideo/Averaging4kVideo.cpp
g++ Averaging4kVideo.cpp -o Averaging4kVideo `pkg-config --cflags --libs opencv4`
./Averaging4kVideo
```
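
To quickly confirm that the resulting build can actually see your GPU, a small hedged check like the one below can be compiled the same way as the sample above (the file name `cuda_check.cpp` is arbitrary):

```cpp
// cuda_check.cpp -- hypothetical sanity check for an OpenCV + CUDA build.
// Build:  g++ cuda_check.cpp -o cuda_check `pkg-config --cflags --libs opencv4`
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main() {
    int n = cv::cuda::getCudaEnabledDeviceCount();
    std::cout << "CUDA-enabled devices visible to OpenCV: " << n << std::endl;
    if (n > 0) {
        // Print a short description of the first CUDA device.
        cv::cuda::printShortCudaDeviceInfo(0);
    }
    return 0;
}
```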



thank you.

🙇🏻‍♂️

www.marearts.com

10/20/2022

Re-install the NVIDIA driver (CUDA) in Ubuntu

Run these commands in a terminal:

 

```bash
sudo apt clean
sudo apt update
sudo apt purge nvidia-*
sudo apt autoremove
sudo apt install -y cuda
```


Thank you.


www.marearts.com

9/22/2021

PyTorch CUDA device syntax

The following ways of moving a tensor or module to a device vary slightly in syntax, but they are equivalent:

| Device | `.to(name)` | `.to(device)` | `.cuda()` / `.cpu()` |
|---|---|---|---|
| CPU | `to('cpu')` | `to(torch.device('cpu'))` | `cpu()` |
| Current GPU | `to('cuda')` | `to(torch.device('cuda'))` | `cuda()` |
| Specific GPU | `to('cuda:1')` | `to(torch.device('cuda:1'))` | `cuda(device=1)` |

Note: the current cuda device is 0 by default, but this can be set with torch.cuda.set_device().

7/24/2021

Check whether PyTorch and TensorFlow can use the GPU

 

Test whether TensorFlow can use the GPU

#method 1
import tensorflow as tf
tf.test.is_built_with_cuda()
> True

#method 2
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
> ..

Test whether PyTorch can use the GPU

#method 3
import torch
torch.cuda.is_available()
>>> True

torch.cuda.current_device()
>>> 0

torch.cuda.device(0)
>>> <torch.cuda.device at 0x7efce0b03be0>

torch.cuda.device_count()
>>> 1

torch.cuda.get_device_name(0)
>>> 'GeForce GTX 950M'



Thank you.
www.MareArts.com

6/28/2021

CUDA_ARCH_BIN table for GPU type


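If your GPU is not listed below, you can also query its compute capability directly at runtime. A minimal hedged sketch (compile with nvcc):

```cpp
// Hedged sketch: print each GPU's compute capability, i.e. the value
// you would pass as CUDA_ARCH_BIN (for example 7.5).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```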


Tegra

Jetson Products

| GPU | Compute Capability |
|---|---|
| Jetson AGX Xavier | 7.2 |
| Jetson Nano | 5.3 |
| Jetson TX2 | 6.2 |
| Jetson TX1 | 5.3 |
| Tegra X1 | 5.3 |

Tesla

Tesla Workstation Products

| GPU | Compute Capability |
|---|---|
| Tesla K80 | 3.7 |
| Tesla K40 | 3.5 |
| Tesla K20 | 3.5 |
| Tesla C2075 | 2.0 |
| Tesla C2050/C2070 | 2.0 |

Tesla

Tesla NVIDIA Data Center Products

| GPU | Compute Capability |
|---|---|
| NVIDIA A100 | 8.0 |
| NVIDIA T4 | 7.5 |
| NVIDIA V100 | 7.0 |
| Tesla P100 | 6.0 |
| Tesla P40 | 6.1 |
| Tesla P4 | 6.1 |
| Tesla M60 | 5.2 |
| Tesla M40 | 5.2 |
| Tesla K80 | 3.7 |
| Tesla K40 | 3.5 |
| Tesla K20 | 3.5 |
| Tesla K10 | 3.0 |

Quadro

Quadro Desktop Products

| GPU | Compute Capability |
|---|---|
| Quadro RTX 8000 | 7.5 |
| Quadro RTX 6000 | 7.5 |
| Quadro RTX 5000 | 7.5 |
| Quadro RTX 4000 | 7.5 |
| Quadro GV100 | 7.0 |
| Quadro GP100 | 6.0 |
| Quadro P6000 | 6.1 |
| Quadro P5000 | 6.1 |
| Quadro P4000 | 6.1 |
| Quadro P2200 | 6.1 |
| Quadro P2000 | 6.1 |
| Quadro P1000 | 6.1 |
| Quadro P620 | 6.1 |
| Quadro P600 | 6.1 |
| Quadro P400 | 6.1 |
| Quadro M6000 24GB | 5.2 |
| Quadro M6000 | 5.2 |
| Quadro K6000 | 3.5 |
| Quadro M5000 | 5.2 |
| Quadro K5200 | 3.5 |
| Quadro K5000 | 3.0 |
| Quadro M4000 | 5.2 |
| Quadro K4200 | 3.0 |
| Quadro K4000 | 3.0 |
| Quadro M2000 | 5.2 |
| Quadro K2200 | 3.0 |
| Quadro K2000 | 3.0 |
| Quadro K2000D | 3.0 |
| Quadro K1200 | 5.0 |
| Quadro K620 | 5.0 |
| Quadro K600 | 3.0 |
| Quadro K420 | 3.0 |
| Quadro 410 | 3.0 |
| Quadro Plex 7000 | 2.0 |

Quadro

Quadro Mobile Products

| GPU | Compute Capability |
|---|---|
| RTX 5000 | 7.5 |
| RTX 4000 | 7.5 |
| RTX 3000 | 7.5 |
| T2000 | 7.5 |
| T1000 | 7.5 |
| P620 | 6.1 |
| P520 | 6.1 |
| Quadro P5200 | 6.1 |
| Quadro P4200 | 6.1 |
| Quadro P3200 | 6.1 |
| Quadro P5000 | 6.1 |
| Quadro P4000 | 6.1 |
| Quadro P3000 | 6.1 |
| Quadro P2000 | 6.1 |
| Quadro P1000 | 6.1 |
| Quadro P600 | 6.1 |
| Quadro P500 | 6.1 |
| Quadro M5500M | 5.2 |
| Quadro M2200 | 5.2 |
| Quadro M1200 | 5.0 |
| Quadro M620 | 5.2 |
| Quadro M520 | 5.0 |
| Quadro K6000M | 3.0 |
| Quadro K5200M | 3.0 |
| Quadro K5100M | 3.0 |
| Quadro M5000M | 5.0 |
| Quadro K500M | 3.0 |
| Quadro K4200M | 3.0 |
| Quadro K4100M | 3.0 |
| Quadro M4000M | 5.0 |
| Quadro K3100M | 3.0 |
| Quadro M3000M | 5.0 |
| Quadro K2200M | 3.0 |
| Quadro K2100M | 3.0 |
| Quadro M2000M | 5.0 |
| Quadro K1100M | 3.0 |
| Quadro M1000M | 5.0 |
| Quadro K620M | 5.0 |
| Quadro K610M | 3.5 |
| Quadro M600M | 5.0 |
| Quadro K510M | 3.5 |
| Quadro M500M | 5.0 |

NVS

NVS Desktop Products

| GPU | Compute Capability |
|---|---|
| NVIDIA NVS 810 | 5.0 |
| NVIDIA NVS 510 | 3.0 |
| NVIDIA NVS 315 | 2.1 |
| NVIDIA NVS 310 | 2.1 |

NVS

NVS Mobile Products

| GPU | Compute Capability |
|---|---|
| NVS 5400M | 2.1 |
| NVS 5200M | 2.1 |
| NVS 4200M | 2.1 |

GeForce

GeForce and TITAN Products

| GPU | Compute Capability |
|---|---|
| GeForce RTX 3090 | 8.6 |
| GeForce RTX 3080 | 8.6 |
| GeForce RTX 3070 | 8.6 |
| NVIDIA TITAN RTX | 7.5 |
| GeForce RTX 2080 Ti | 7.5 |
| GeForce RTX 2080 | 7.5 |
| GeForce RTX 2070 | 7.5 |
| GeForce RTX 2060 | 7.5 |
| NVIDIA TITAN V | 7.0 |
| NVIDIA TITAN Xp | 6.1 |
| NVIDIA TITAN X | 6.1 |
| GeForce GTX 1080 Ti | 6.1 |
| GeForce GTX 1080 | 6.1 |
| GeForce GTX 1070 Ti | 6.1 |
| GeForce GTX 1070 | 6.1 |
| GeForce GTX 1060 | 6.1 |
| GeForce GTX 1050 | 6.1 |
| GeForce GTX TITAN X | 5.2 |
| GeForce GTX TITAN Z | 3.5 |
| GeForce GTX TITAN Black | 3.5 |
| GeForce GTX TITAN | 3.5 |
| GeForce GTX 980 Ti | 5.2 |
| GeForce GTX 980 | 5.2 |
| GeForce GTX 970 | 5.2 |
| GeForce GTX 960 | 5.2 |
| GeForce GTX 950 | 5.2 |
| GeForce GTX 780 Ti | 3.5 |
| GeForce GTX 780 | 3.5 |
| GeForce GTX 770 | 3.0 |
| GeForce GTX 760 | 3.0 |
| GeForce GTX 750 Ti | 5.0 |
| GeForce GTX 750 | 5.0 |
| GeForce GTX 690 | 3.0 |
| GeForce GTX 680 | 3.0 |
| GeForce GTX 670 | 3.0 |
| GeForce GTX 660 Ti | 3.0 |
| GeForce GTX 660 | 3.0 |
| GeForce GTX 650 Ti BOOST | 3.0 |
| GeForce GTX 650 Ti | 3.0 |
| GeForce GTX 650 | 3.0 |
| GeForce GTX 560 Ti | 2.1 |
| GeForce GTX 550 Ti | 2.1 |
| GeForce GTX 460 | 2.1 |
| GeForce GTS 450 | 2.1 |
| GeForce GTS 450* | 2.1 |
| GeForce GTX 590 | 2.0 |
| GeForce GTX 580 | 2.0 |
| GeForce GTX 570 | 2.0 |
| GeForce GTX 480 | 2.0 |
| GeForce GTX 470 | 2.0 |
| GeForce GTX 465 | 2.0 |
| GeForce GT 740 | 3.0 |
| GeForce GT 730 | 3.5 |
| GeForce GT 730 DDR3, 128bit | 2.1 |
| GeForce GT 720 | 3.5 |
| GeForce GT 705* | 3.5 |
| GeForce GT 640 (GDDR5) | 3.5 |
| GeForce GT 640 (GDDR3) | 2.1 |
| GeForce GT 630 | 2.1 |
| GeForce GT 620 | 2.1 |
| GeForce GT 610 | 2.1 |
| GeForce GT 520 | 2.1 |
| GeForce GT 440 | 2.1 |
| GeForce GT 440* | 2.1 |
| GeForce GT 430 | 2.1 |
| GeForce GT 430* | 2.1 |