
3/07/2025

Check whether my torch build supports the GPU

checkgpu.py

```python
import torch

# Check PyTorch version
print(f"PyTorch version: {torch.__version__}")

# Check if CUDA/ROCm is available (unified API in newer PyTorch)
print(f"Is GPU available: {torch.cuda.is_available()}")

# Check how many GPUs are available
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    # Print device properties for each GPU
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"\nDevice {i}: {props.name}")
        print(f" Total memory: {props.total_memory / 1024**3:.2f} GB")
        if hasattr(props, 'major'):
            print(f" Compute capability: {props.major}.{props.minor}")

# Try a simple operation on GPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")  # Use the first GPU
    x = torch.ones(5, 5, device=device)
    y = x + 1
    print("\nGPU computation test:")
    print(y)
    print("GPU computation successful! study.marearts.com")
else:
    print("\nNo GPUs available for PyTorch.")
```

🙏

Thank you!

9/17/2024

Creating a custom CUDA kernel that directly utilizes tensor cores



Creating a custom CUDA kernel that directly utilizes tensor cores is an advanced topic, as tensor cores are typically accessed through higher-level libraries like cuBLAS or cuDNN. However, NVIDIA does provide a way to use tensor cores in custom kernels through the Warp Matrix Multiply-Accumulate (WMMA) API exposed in CUDA. Here's an overview of how to create a kernel that runs on tensor cores:

1. Use the CUDA WMMA API:
The WMMA API (in the `nvcuda::wmma` namespace) allows you to program tensor cores directly in your CUDA kernels.

2. Include necessary headers:
```cpp
#include <mma.h>
#include <cuda_fp16.h>
```

3. Use appropriate data types:
Tensor cores work with specific data types like half precision floating point (`__half`).

4. Define matrix fragments:
Use `nvcuda::wmma::fragment` to define matrix fragments that will be processed by tensor cores.

5. Load, compute, and store operations:
Use WMMA load, multiply-accumulate, and store operations.

Here's an example kernel that uses tensor cores via the WMMA API:



```cuda
#include <mma.h>
#include <cuda_fp16.h>

// Define matrix dimensions
const int M = 16;
const int N = 16;
const int K = 16;

__global__ void wmma_example(half *a, half *b, float *c) {
    // Declare the fragments
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, M, N, K, half, nvcuda::wmma::col_major> a_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, M, N, K, half, nvcuda::wmma::col_major> b_frag;
    nvcuda::wmma::fragment<nvcuda::wmma::accumulator, M, N, K, float> c_frag;

    // Initialize the output to zero
    nvcuda::wmma::fill_fragment(c_frag, 0.0f);

    // Load the inputs
    nvcuda::wmma::load_matrix_sync(a_frag, a, K);
    nvcuda::wmma::load_matrix_sync(b_frag, b, K);

    // Perform the matrix multiplication
    nvcuda::wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the output
    nvcuda::wmma::store_matrix_sync(c, c_frag, N, nvcuda::wmma::mem_row_major);
}

// Host function to launch the kernel
void launch_wmma_kernel(half *a, half *b, float *c) {
    dim3 gridDim(1);
    dim3 blockDim(32); // One warp
    wmma_example<<<gridDim, blockDim>>>(a, b, c);
}
```

Key points about this example:

1. We're using 16x16 matrices as this is a common size for tensor core operations.
2. The kernel uses `nvcuda::wmma::fragment` to define matrix fragments.
3. `load_matrix_sync`, `mma_sync`, and `store_matrix_sync` are used to load data, perform matrix multiplication, and store results using tensor cores.
4. The kernel operates on half-precision input (`half`) and produces single-precision output (`float`).

To use this kernel:

1. Compile with a CUDA compiler that supports tensor cores (CUDA 9.0 or later).
2. Use appropriate GPU architecture flags (e.g., `-arch=sm_70` for Volta, `-arch=sm_75` for Turing).
3. Allocate memory and copy data to the GPU before calling `launch_wmma_kernel`, as shown in the host-side sketch below.
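
For completeness, here is a rough host-side sketch of step 3: it allocates the 16x16 matrices, fills them with example values, and calls `launch_wmma_kernel` from the code above. It assumes it lives in the same `.cu` file as the kernel (so it can reuse the file-scope `M`, `N`, `K`); the fill values and the printed sanity check are only illustrative.

```cpp
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

// Hedged host-side sketch: assumes it is compiled in the same .cu file as
// wmma_example / launch_wmma_kernel above, so M, N, K (all 16) are visible.
int main() {
    // Host buffers: half-precision inputs, single-precision output.
    std::vector<half> h_a(M * K), h_b(K * N);
    std::vector<float> h_c(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) h_a[i] = __float2half(1.0f); // example data
    for (int i = 0; i < K * N; ++i) h_b[i] = __float2half(2.0f); // example data

    // Device buffers.
    half *d_a, *d_b;
    float *d_c;
    cudaMalloc(&d_a, M * K * sizeof(half));
    cudaMalloc(&d_b, K * N * sizeof(half));
    cudaMalloc(&d_c, M * N * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), M * K * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), K * N * sizeof(half), cudaMemcpyHostToDevice);

    // Launch the tensor-core kernel and wait for it to finish.
    launch_wmma_kernel(d_a, d_b, d_c);
    cudaDeviceSynchronize();

    // Copy back and sanity-check one element: with all-ones times all-twos
    // inputs, every output element should be 2 * K = 32.
    cudaMemcpy(h_c.data(), d_c, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```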

Important considerations:

1. Error checking is omitted for brevity but should be included in production code.
2. This is a basic example. Real-world usage often involves tiling and more complex memory access patterns for larger matrices.
3. Performance tuning is crucial. The exact dimensions and data types should be chosen based on your specific use case and target GPU architecture.
4. Not all operations can be efficiently mapped to tensor cores. They're most beneficial for large matrix multiplications common in deep learning workloads.

Remember, while this approach gives you direct control over tensor core usage, in many cases, using higher-level libraries like cuBLAS or cuDNN is more practical and can automatically leverage tensor cores when appropriate.

8/22/2024

ROCm HIP asynchronous operation sample code

 





HIP (Heterogeneous-Compute Interface for Portability) provides similar functionality to CUDA streams for asynchronous execution. The concepts and usage are very similar, making it easier to port CUDA code to HIP. Here's an overview of HIP's equivalent features for asynchronous execution:

1. HIP Streams:
In HIP, streams are represented by the `hipStream_t` type, which is analogous to CUDA's `cudaStream_t`.

2. Creating and Destroying Streams:
```cpp
hipStream_t stream;
hipError_t hipStreamCreate(hipStream_t* stream);
hipError_t hipStreamDestroy(hipStream_t stream);
```

3. Asynchronous Memory Operations:
```cpp
hipError_t hipMemcpyAsync(void* dst, const void* src, size_t count, hipMemcpyKind kind, hipStream_t stream);
hipError_t hipMemsetAsync(void* dst, int value, size_t count, hipStream_t stream);
```

4. Launching Kernels on Streams:
```cpp
hipLaunchKernelGGL(kernel, dim3(gridSize), dim3(blockSize), 0, stream, /* kernel arguments */);
```

5. Stream Synchronization:
```cpp
hipError_t hipStreamSynchronize(hipStream_t stream);
hipError_t hipDeviceSynchronize();
```

6. Stream Query:
```cpp
hipError_t hipStreamQuery(hipStream_t stream);
```

7. Stream Callbacks:
```cpp
hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void* userData, unsigned int flags);
```

8. Stream Priorities:
```cpp
hipError_t hipStreamCreateWithPriority(hipStream_t* stream, unsigned int flags, int priority);
```

Here's a simple example demonstrating asynchronous execution with HIP streams:

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 1000000
#define STREAMS 4

__global__ void vectorAdd(float* a, float* b, float* c, int numElements) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < numElements) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    float *h_a, *h_b, *h_c;
    float *d_a, *d_b, *d_c;
    size_t size = N * sizeof(float);

    // Allocate host memory
    h_a = (float*)malloc(size);
    h_b = (float*)malloc(size);
    h_c = (float*)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_a[i] = rand() / (float)RAND_MAX;
        h_b[i] = rand() / (float)RAND_MAX;
    }

    // Allocate device memory
    hipMalloc(&d_a, size);
    hipMalloc(&d_b, size);
    hipMalloc(&d_c, size);

    // Create streams
    hipStream_t streams[STREAMS];
    for (int i = 0; i < STREAMS; i++) {
        hipStreamCreate(&streams[i]);
    }

    // Launch kernel on multiple streams
    int streamSize = N / STREAMS;
    dim3 blockSize(256);
    dim3 gridSize((streamSize + blockSize.x - 1) / blockSize.x);

    for (int i = 0; i < STREAMS; i++) {
        int offset = i * streamSize;
        hipMemcpyAsync(&d_a[offset], &h_a[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
        hipMemcpyAsync(&d_b[offset], &h_b[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
        hipLaunchKernelGGL(vectorAdd, gridSize, blockSize, 0, streams[i], &d_a[offset], &d_b[offset], &d_c[offset], streamSize);
        hipMemcpyAsync(&h_c[offset], &d_c[offset], streamSize * sizeof(float), hipMemcpyDeviceToHost, streams[i]);
    }

    // Synchronize all streams
    for (int i = 0; i < STREAMS; i++) {
        hipStreamSynchronize(streams[i]);
    }

    // Verify results
    for (int i = 0; i < N; i++) {
        if (fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5) {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(1);
        }
    }

    printf("Test PASSED\n");

    // Clean up
    for (int i = 0; i < STREAMS; i++) {
        hipStreamDestroy(streams[i]);
    }
    hipFree(d_a);
    hipFree(d_b);
    hipFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
```

This example demonstrates how to use multiple streams to overlap computation and data transfer, just as you would with CUDA streams. The key points are:

1. Creating multiple streams
2. Using `hipMemcpyAsync` for asynchronous data transfer
3. Launching kernels on specific streams
4. Synchronizing streams after all operations are queued

By using streams, you can potentially improve performance by overlapping operations and utilizing the GPU more efficiently.
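
As a small addendum, item 7 above (stream callbacks) is not exercised by the example. Here is a hedged, minimal sketch of how `hipStreamAddCallback` can be wired in, assuming a working HIP setup; the callback name and the tag string are made up for illustration.

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>

// Hypothetical host callback: runs once all work queued on the stream
// before hipStreamAddCallback() has completed.
void myStreamCallback(hipStream_t stream, hipError_t status, void* userData) {
    const char* tag = (const char*)userData;
    printf("Stream callback fired (%s), status = %d\n", tag, (int)status);
}

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);

    // Queue some asynchronous work on the stream (an async memset here).
    float* d_buf;
    hipMalloc(&d_buf, 1024 * sizeof(float));
    hipMemsetAsync(d_buf, 0, 1024 * sizeof(float), stream);

    // Register the callback; it fires after the memset above finishes.
    static char tag[] = "demo";
    hipStreamAddCallback(stream, myStreamCallback, tag, 0);

    hipStreamSynchronize(stream);
    hipFree(d_buf);
    hipStreamDestroy(stream);
    return 0;
}
```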

hipcc install on CUDA system, version 2

Please follow this process:

1. First, add the ROCm repository (if you haven't already):
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

2. Update your package list:
sudo apt update

3. Install only the HIP compiler and development tools:
sudo apt install hip-base hip-doc
This should install the basic HIP tools without the full runtime that caused issues before.

4. After installation, add the HIP binaries to your PATH. Add this line to your ~/.bashrc file:
export PATH=$PATH:/opt/rocm/bin

5. Then, apply the changes:
source ~/.bashrc

6. Verify the installation:
hipcc --version
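
If you want a quick compile test on top of `hipcc --version`, a throwaway file like the one below can be used (the file name `hip_check.cpp` is arbitrary, and on a machine with only the base tools installed you may still need a CUDA or ROCm runtime present for the link step):

```cpp
// hip_check.cpp -- hypothetical smoke test for the hipcc installation.
// Compile with:  hipcc hip_check.cpp -o hip_check
#include <hip/hip_runtime.h>
#include <stdio.h>

int main() {
    int deviceCount = 0;
    hipError_t err = hipGetDeviceCount(&deviceCount);
    if (err != hipSuccess) {
        printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    printf("HIP sees %d device(s)\n", deviceCount);
    return 0;
}
```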

Install HIP (ROCm) compiler on CUDA system.


Try this process.


1. First, add the ROCm repository to your system. For Ubuntu, you can use these commands:

wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list


2. Update your package list:

sudo apt update


3. Install the HIP runtime and compiler for CUDA:

sudo apt install hip-runtime-nvidia hip-dev


4. Set up environment variables. Add these lines to your `~/.bashrc` file:

export HIP_PLATFORM=nvidia
export PATH=$PATH:/opt/rocm/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib

Then run `source ~/.bashrc` to apply the changes.


5. Verify the installation:

hipconfig --full


6. Now try compiling your code again:

hipcc vector_add.cpp -o vector_add
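
The `vector_add.cpp` in step 6 refers to your own source file; for reference, here is a minimal hedged sketch of what such a file could look like (the names, sizes, and launch configuration are arbitrary):

```cpp
// vector_add.cpp -- hypothetical minimal HIP program used to test hipcc.
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device data
    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);
    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    // 256 threads per block, enough blocks to cover n elements
    dim3 block(256), grid((n + 255) / 256);
    hipLaunchKernelGGL(vectorAdd, grid, block, 0, 0, d_a, d_b, d_c, n);
    hipDeviceSynchronize();

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```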


1/30/2024

Checking that torch + CUDA are installed correctly

 

 

Run this script 

```python
import torch
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

def check_cuda_setup():
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")

    if cuda_available:
        cuda_version = torch.version.cuda
        print(f"CUDA version (PyTorch): {cuda_version}")

    try:
        # Attempt to create a CUDA extension
        ext = CUDAExtension(
            name='test_ext',
            sources=[]
        )
        print("CUDAExtension can be created successfully.")
    except Exception as e:
        print(f"Error creating CUDAExtension: {e}")

    try:
        # Attempt to create a BuildExtension object
        build_ext = BuildExtension()
        print("BuildExtension can be created successfully.")
    except Exception as e:
        print(f"Error creating BuildExtension: {e}")

if __name__ == "__main__":
    check_cuda_setup()
```

If it reports 'CUDA available: False', you need to fix your system.

Thank you.


2/19/2023

How to Install OpenCV 4.7 with CUDA, cuDNN, TBB, CUDA Video Codec, and Extra Modules in Linux

 


Refer to the bash script below.


```bash
#!/bin/bash

# Install dependencies
sudo apt-get update
sudo apt-get install build-essential cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libdc1394-22-dev
sudo apt-get install libcanberra-gtk-module libcanberra-gtk3-module

# Install CUDA 11
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.57.02_linux.run
sudo sh cuda_11.4.0_470.57.02_linux.run --silent --toolkit --override
echo 'export PATH=/usr/local/cuda-11.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Download and extract TBB (libraries and headers)
wget https://github.com/oneapi-src/oneTBB/releases/download/v2022.0.0/oneapi-tbb-2022.0.0-lin.tgz
tar -xf oneapi-tbb-2022.0.0-lin.tgz
sudo cp -r oneapi-tbb-2022.0.0/lib/* /usr/local/lib/
sudo cp -r oneapi-tbb-2022.0.0/include/* /usr/local/include/
echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Download and extract OpenCV 4.7 and the OpenCV extra modules
wget https://github.com/opencv/opencv/archive/4.7.0.zip
unzip 4.7.0.zip
cd opencv-4.7.0

wget https://github.com/opencv/opencv_contrib/archive/4.7.0.zip
unzip 4.7.0.zip

# Build and install OpenCV 4.7 with CUDA, cuDNN, TBB, CUDA video codec, and the extra modules
# (opencv_contrib was extracted inside opencv-4.7.0, hence the ../opencv_contrib-4.7.0 path)
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib-4.7.0/modules -D WITH_CUDA=ON -D WITH_TBB=ON -D WITH_NVCUVID=ON -D WITH_GSTREAMER=ON -D WITH_GSTREAMER_0_10=OFF -D WITH_LIBV4L=ON -D WITH_CUDNN=ON -D CUDA_ARCH_BIN=7.5 ..
make -j$(nproc)
sudo make install
echo 'export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH' >> ~/.bashrc
source ~/.bashrc

# Compile and run the sample code
cd ../../
wget https://raw.githubusercontent.com/spmallick/learnopencv/master/Averaging4kVideo/Averaging4kVideo.cpp
g++ Averaging4kVideo.cpp -o Averaging4kVideo `pkg-config --cflags --libs opencv4`
./Averaging4kVideo
```
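
To quickly confirm that the resulting build can actually see your GPU, a small hedged check like the one below can be compiled the same way as the sample above (the file name `cuda_check.cpp` is arbitrary):

```cpp
// cuda_check.cpp -- hypothetical sanity check for an OpenCV + CUDA build.
// Build:  g++ cuda_check.cpp -o cuda_check `pkg-config --cflags --libs opencv4`
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main() {
    int n = cv::cuda::getCudaEnabledDeviceCount();
    std::cout << "CUDA-enabled devices visible to OpenCV: " << n << std::endl;
    if (n > 0) {
        // Print a short description of the first CUDA device.
        cv::cuda::printShortCudaDeviceInfo(0);
    }
    return 0;
}
```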



thank you.

🙇🏻‍♂️

www.marearts.com

10/20/2022

Re-install the NVIDIA driver (CUDA) in Ubuntu

Run these commands in a terminal:

 

```bash
sudo apt clean
sudo apt update
sudo apt purge nvidia-*
sudo apt autoremove
sudo apt install -y cuda
```


Thank you.


www.marearts.com

9/22/2021

PyTorch CUDA device syntax

The following ways of moving a tensor or module to a device vary slightly in syntax, but they are equivalent:

| Device | `.to(name)` | `.to(device)` | `.cuda()` / `.cpu()` |
|---|---|---|---|
| CPU | `to('cpu')` | `to(torch.device('cpu'))` | `cpu()` |
| Current GPU | `to('cuda')` | `to(torch.device('cuda'))` | `cuda()` |
| Specific GPU | `to('cuda:1')` | `to(torch.device('cuda:1'))` | `cuda(device=1)` |

Note: the current cuda device is 0 by default, but this can be set with torch.cuda.set_device().

7/24/2021

Check whether PyTorch and TensorFlow can use the GPU

 

Test whether TensorFlow can use the GPU

#method 1
import tensorflow as tf
tf.test.is_built_with_cuda()
> True

#method 2
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
> ..

Test whether PyTorch can use the GPU

#method 3
import torch
torch.cuda.is_available()
>>> True

torch.cuda.current_device()
>>> 0

torch.cuda.device(0)
>>> <torch.cuda.device at 0x7efce0b03be0>

torch.cuda.device_count()
>>> 1

torch.cuda.get_device_name(0)
>>> 'GeForce GTX 950M'



Thank you.
www.MareArts.com

6/28/2021

CUDA_ARCH_BIN table for GPU type


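If your GPU is not listed below, you can also query its compute capability directly at runtime. A minimal hedged sketch (compile with nvcc):

```cpp
// Hedged sketch: print each GPU's compute capability, i.e. the value
// you would pass as CUDA_ARCH_BIN (for example 7.5).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```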


Tegra

Jetson Products

| GPU | Compute Capability |
|---|---|
| Jetson AGX Xavier | 7.2 |
| Jetson Nano | 5.3 |
| Jetson TX2 | 6.2 |
| Jetson TX1 | 5.3 |
| Tegra X1 | 5.3 |

Tesla

Tesla Workstation Products

| GPU | Compute Capability |
|---|---|
| Tesla K80 | 3.7 |
| Tesla K40 | 3.5 |
| Tesla K20 | 3.5 |
| Tesla C2075 | 2.0 |
| Tesla C2050/C2070 | 2.0 |

Tesla

Tesla NVIDIA Data Center Products

| GPU | Compute Capability |
|---|---|
| NVIDIA A100 | 8.0 |
| NVIDIA T4 | 7.5 |
| NVIDIA V100 | 7.0 |
| Tesla P100 | 6.0 |
| Tesla P40 | 6.1 |
| Tesla P4 | 6.1 |
| Tesla M60 | 5.2 |
| Tesla M40 | 5.2 |
| Tesla K80 | 3.7 |
| Tesla K40 | 3.5 |
| Tesla K20 | 3.5 |
| Tesla K10 | 3.0 |

Quadro

Quadro Desktop Products

| GPU | Compute Capability |
|---|---|
| Quadro RTX 8000 | 7.5 |
| Quadro RTX 6000 | 7.5 |
| Quadro RTX 5000 | 7.5 |
| Quadro RTX 4000 | 7.5 |
| Quadro GV100 | 7.0 |
| Quadro GP100 | 6.0 |
| Quadro P6000 | 6.1 |
| Quadro P5000 | 6.1 |
| Quadro P4000 | 6.1 |
| Quadro P2200 | 6.1 |
| Quadro P2000 | 6.1 |
| Quadro P1000 | 6.1 |
| Quadro P620 | 6.1 |
| Quadro P600 | 6.1 |
| Quadro P400 | 6.1 |
| Quadro M6000 24GB | 5.2 |
| Quadro M6000 | 5.2 |
| Quadro K6000 | 3.5 |
| Quadro M5000 | 5.2 |
| Quadro K5200 | 3.5 |
| Quadro K5000 | 3.0 |
| Quadro M4000 | 5.2 |
| Quadro K4200 | 3.0 |
| Quadro K4000 | 3.0 |
| Quadro M2000 | 5.2 |
| Quadro K2200 | 3.0 |
| Quadro K2000 | 3.0 |
| Quadro K2000D | 3.0 |
| Quadro K1200 | 5.0 |
| Quadro K620 | 5.0 |
| Quadro K600 | 3.0 |
| Quadro K420 | 3.0 |
| Quadro 410 | 3.0 |
| Quadro Plex 7000 | 2.0 |

Quadro

Quadro Mobile Products

| GPU | Compute Capability |
|---|---|
| RTX 5000 | 7.5 |
| RTX 4000 | 7.5 |
| RTX 3000 | 7.5 |
| T2000 | 7.5 |
| T1000 | 7.5 |
| P620 | 6.1 |
| P520 | 6.1 |
| Quadro P5200 | 6.1 |
| Quadro P4200 | 6.1 |
| Quadro P3200 | 6.1 |
| Quadro P5000 | 6.1 |
| Quadro P4000 | 6.1 |
| Quadro P3000 | 6.1 |
| Quadro P2000 | 6.1 |
| Quadro P1000 | 6.1 |
| Quadro P600 | 6.1 |
| Quadro P500 | 6.1 |
| Quadro M5500M | 5.2 |
| Quadro M2200 | 5.2 |
| Quadro M1200 | 5.0 |
| Quadro M620 | 5.2 |
| Quadro M520 | 5.0 |
| Quadro K6000M | 3.0 |
| Quadro K5200M | 3.0 |
| Quadro K5100M | 3.0 |
| Quadro M5000M | 5.0 |
| Quadro K500M | 3.0 |
| Quadro K4200M | 3.0 |
| Quadro K4100M | 3.0 |
| Quadro M4000M | 5.0 |
| Quadro K3100M | 3.0 |
| Quadro M3000M | 5.0 |
| Quadro K2200M | 3.0 |
| Quadro K2100M | 3.0 |
| Quadro M2000M | 5.0 |
| Quadro K1100M | 3.0 |
| Quadro M1000M | 5.0 |
| Quadro K620M | 5.0 |
| Quadro K610M | 3.5 |
| Quadro M600M | 5.0 |
| Quadro K510M | 3.5 |
| Quadro M500M | 5.0 |

NVS

NVS Desktop Products

| GPU | Compute Capability |
|---|---|
| NVIDIA NVS 810 | 5.0 |
| NVIDIA NVS 510 | 3.0 |
| NVIDIA NVS 315 | 2.1 |
| NVIDIA NVS 310 | 2.1 |

NVS

NVS Mobile Products

| GPU | Compute Capability |
|---|---|
| NVS 5400M | 2.1 |
| NVS 5200M | 2.1 |
| NVS 4200M | 2.1 |

GeForce

GeForce and TITAN Products

| GPU | Compute Capability |
|---|---|
| GeForce RTX 3090 | 8.6 |
| GeForce RTX 3080 | 8.6 |
| GeForce RTX 3070 | 8.6 |
| NVIDIA TITAN RTX | 7.5 |
| GeForce RTX 2080 Ti | 7.5 |
| GeForce RTX 2080 | 7.5 |
| GeForce RTX 2070 | 7.5 |
| GeForce RTX 2060 | 7.5 |
| NVIDIA TITAN V | 7.0 |
| NVIDIA TITAN Xp | 6.1 |
| NVIDIA TITAN X | 6.1 |
| GeForce GTX 1080 Ti | 6.1 |
| GeForce GTX 1080 | 6.1 |
| GeForce GTX 1070 Ti | 6.1 |
| GeForce GTX 1070 | 6.1 |
| GeForce GTX 1060 | 6.1 |
| GeForce GTX 1050 | 6.1 |
| GeForce GTX TITAN X | 5.2 |
| GeForce GTX TITAN Z | 3.5 |
| GeForce GTX TITAN Black | 3.5 |
| GeForce GTX TITAN | 3.5 |
| GeForce GTX 980 Ti | 5.2 |
| GeForce GTX 980 | 5.2 |
| GeForce GTX 970 | 5.2 |
| GeForce GTX 960 | 5.2 |
| GeForce GTX 950 | 5.2 |
| GeForce GTX 780 Ti | 3.5 |
| GeForce GTX 780 | 3.5 |
| GeForce GTX 770 | 3.0 |
| GeForce GTX 760 | 3.0 |
| GeForce GTX 750 Ti | 5.0 |
| GeForce GTX 750 | 5.0 |
| GeForce GTX 690 | 3.0 |
| GeForce GTX 680 | 3.0 |
| GeForce GTX 670 | 3.0 |
| GeForce GTX 660 Ti | 3.0 |
| GeForce GTX 660 | 3.0 |
| GeForce GTX 650 Ti BOOST | 3.0 |
| GeForce GTX 650 Ti | 3.0 |
| GeForce GTX 650 | 3.0 |
| GeForce GTX 560 Ti | 2.1 |
| GeForce GTX 550 Ti | 2.1 |
| GeForce GTX 460 | 2.1 |
| GeForce GTS 450 | 2.1 |
| GeForce GTS 450* | 2.1 |
| GeForce GTX 590 | 2.0 |
| GeForce GTX 580 | 2.0 |
| GeForce GTX 570 | 2.0 |
| GeForce GTX 480 | 2.0 |
| GeForce GTX 470 | 2.0 |
| GeForce GTX 465 | 2.0 |
| GeForce GT 740 | 3.0 |
| GeForce GT 730 | 3.5 |
| GeForce GT 730 DDR3, 128bit | 2.1 |
| GeForce GT 720 | 3.5 |
| GeForce GT 705* | 3.5 |
| GeForce GT 640 (GDDR5) | 3.5 |
| GeForce GT 640 (GDDR3) | 2.1 |
| GeForce GT 630 | 2.1 |
| GeForce GT 620 | 2.1 |
| GeForce GT 610 | 2.1 |
| GeForce GT 520 | 2.1 |
| GeForce GT 440 | 2.1 |
| GeForce GT 440* | 2.1 |
| GeForce GT 430 | 2.1 |
| GeForce GT 430* | 2.1 |