8/22/2024

ROCm HIP asynchronous operation sample code

 





HIP (Heterogeneous-Compute Interface for Portability) provides similar functionality to CUDA streams for asynchronous execution. The concepts and usage are very similar, making it easier to port CUDA code to HIP. Here's an overview of HIP's equivalent features for asynchronous execution:

1. HIP Streams:
In HIP, streams are represented by the `hipStream_t` type, which is analogous to CUDA's `cudaStream_t`.

2. Creating and Destroying Streams:
```cpp
hipStream_t stream;
hipError_t hipStreamCreate(hipStream_t* stream);
hipError_t hipStreamDestroy(hipStream_t stream);
```

3. Asynchronous Memory Operations:
```cpp
hipError_t hipMemcpyAsync(void* dst, const void* src, size_t count, hipMemcpyKind kind, hipStream_t stream);
hipError_t hipMemsetAsync(void* dst, int value, size_t count, hipStream_t stream);
```

4. Launching Kernels on Streams:
```cpp
hipLaunchKernelGGL(kernel, dim3(gridSize), dim3(blockSize), 0, stream, /* kernel arguments */);
```

5. Stream Synchronization:
```cpp
hipError_t hipStreamSynchronize(hipStream_t stream);
hipError_t hipDeviceSynchronize();
```

6. Stream Query:
```cpp
hipError_t hipStreamQuery(hipStream_t stream);
```

7. Stream Callbacks:
```cpp
hipError_t hipStreamAddCallback(hipStream_t stream, hipStreamCallback_t callback, void* userData, unsigned int flags);
```

8. Stream Priorities:
```cpp
hipError_t hipStreamCreateWithPriority(hipStream_t* stream, unsigned int flags, int priority);
```

Here's a simple example demonstrating asynchronous execution with HIP streams:

```cpp
#include <hip/hip_runtime.h>
#include <stdio.h>

#define N 1000000
#define STREAMS 4

__global__ void vectorAdd(float* a, float* b, float* c, int numElements) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx < numElements) {
c[idx] = a[idx] + b[idx];
}
}

int main() {
float *h_a, *h_b, *h_c;
float *d_a, *d_b, *d_c;
size_t size = N * sizeof(float);

// Allocate host memory
h_a = (float*)malloc(size);
h_b = (float*)malloc(size);
h_c = (float*)malloc(size);

// Initialize host arrays
for (int i = 0; i < N; i++) {
h_a[i] = rand() / (float)RAND_MAX;
h_b[i] = rand() / (float)RAND_MAX;
}

// Allocate device memory
hipMalloc(&d_a, size);
hipMalloc(&d_b, size);
hipMalloc(&d_c, size);

// Create streams
hipStream_t streams[STREAMS];
for (int i = 0; i < STREAMS; i++) {
hipStreamCreate(&streams[i]);
}

// Launch kernel on multiple streams
int streamSize = N / STREAMS;
dim3 blockSize(256);
dim3 gridSize((streamSize + blockSize.x - 1) / blockSize.x);

for (int i = 0; i < STREAMS; i++) {
int offset = i * streamSize;
hipMemcpyAsync(&d_a[offset], &h_a[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
hipMemcpyAsync(&d_b[offset], &h_b[offset], streamSize * sizeof(float), hipMemcpyHostToDevice, streams[i]);
hipLaunchKernelGGL(vectorAdd, gridSize, blockSize, 0, streams[i], &d_a[offset], &d_b[offset], &d_c[offset], streamSize);
hipMemcpyAsync(&h_c[offset], &d_c[offset], streamSize * sizeof(float), hipMemcpyDeviceToHost, streams[i]);
}

// Synchronize all streams
for (int i = 0; i < STREAMS; i++) {
hipStreamSynchronize(streams[i]);
}

// Verify results
for (int i = 0; i < N; i++) {
if (fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(1);
}
}

printf("Test PASSED\n");

// Clean up
for (int i = 0; i < STREAMS; i++) {
hipStreamDestroy(streams[i]);
}
hipFree(d_a);
hipFree(d_b);
hipFree(d_c);
free(h_a);
free(h_b);
free(h_c);

return 0;
}
```

This example demonstrates how to use multiple streams to overlap computation and data transfer, similar to the CUDA example in the article you referenced. The key points are:

1. Creating multiple streams
2. Using `hipMemcpyAsync` for asynchronous data transfer
3. Launching kernels on specific streams
4. Synchronizing streams after all operations are queued

By using streams, you can potentially improve performance by overlapping operations and utilizing the GPU more efficiently.

No comments:

Post a Comment