To process all arrays by reasonably small number of threads in cuda ( the explaination of tid = blockDim.x * gridDim.x )

see the this source code
...
#define N 10 //(33*1024)

__global__ void add(int *c){
 int tid = threadIdx.x + blockIdx.x * gridDim.x;

 if(tid < N)
  c[tid] = 1;
 
 
 while( tid < N)
 {
  c[tid] = 1;
  tid += blockDim.x * gridDim.x;
 }
 
}




int main(void)
{
 int c[N];
 int *dev_c;
 cudaMalloc( (void**)&dev_c, N*sizeof(int) );

 for(int i=0; i< N; ++i)
 {
  c[i] = -1;
 }

 cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

 add<<< 2, 2>>>(dev_c);
 cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );


 for(int i=0; i< N; ++i)
 {
  printf("c[%d] = %d \n" ,i, c[i] );
 }

 cudaFree( dev_c );

}


---

Why we do not create 10 threads ex) add<<<2>>> or add<5>>>
Because we have to create reasonably small number of threads, if N is larger than 10 ex) 33*1024.

This source code is example of this case.
arrays are 10, cuda threads are 4.
How to access all 10 arrays only by 4 threads.


see the page about meaning of threadIdx, blockIdx, blockDim, gridDim in the cuda detail.
(1D) -> http://study.marearts.com/2015/03/meaning-of-threadidx-blockidx-blockdim.html


In this source code, 
gridDim.x -> 2     //this means number of block of x
gridDim.y -> 1     //this means number of block of y
blockDim.x -> 2   //this means number of thread of x in a block
blockDim.y -> 1   //this means number of thread of y in a block

Our number of thread are 4, because 2*2(blocks * thread).

In add kernel function, we can access 0, 1, 2, 3 index of thread

int tid = threadIdx.x + blockIdx.x * blockDim.x;
①0+0*2=0
②1+0*2=1
③0+1*2=2
④1+1*2=3 

How to access rest of index 4, 5, 6, 7, 8, 9.
There is a calculation in while loop
while(tid
{
   c[tid] = 1;
   tid += blockDim.x + gridDim.x;
}

** first call of kernel **
#1 loop: 0+2*2=4
#2 loop: 4+2*2=8 
#3 loop: 8+2*2=12 ( but this value is false, while out!)

** second call of kernel **
#1 loop: 1+2*2=5
#2 loop: 5+2*2=9
#3 loop: 9+2*2=13 ( but this value is false, while out!)

** third call of kernel **
#1 loop: 2+2*2=6
#2 loop: 6+2*2=10 ( but this value is false, while out!)

** fourth call of kernel **
#1 loop: 3+2*2=7
#2 loop: 7+2*2=11 ( but this value is false, while out!)

So, all index of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 can access by tid value.







Comments

Popular posts from this blog

(OpenCV Study) Background subtractor MOG, MOG2, GMG example source code (BackgroundSubtractorMOG, BackgroundSubtractorMOG2, BackgroundSubtractorGMG)

OpenCV Stitching example (Stitcher class, Panorama)

AMP dose run on my pc?, AMP(Accelerated Massive Parallelism)

Example source code of extract HOG feature from images, save descriptor values to xml file, using opencv (using HOGDescriptor )

Optical Flow sample source code using OpenCV

Real-time N camera stitching Class.

OpenCV meanShiftFiltering example source code ( cpu: pyrMeanShiftFiltering, gpu:meanShiftFiltering, gpu:meanShiftSegmentation )

8 point algorithm (Matlab source code) / The method to get the Fundamental Matrix and the Essential matrix

Image warping (using opencv findHomography, warpPerspective)

Video Stabilization example source code, (using cvFindHomography, cvWarpPerspective functions in openCV)