
3/16/2015

To process all elements of an array with a reasonably small number of threads in CUDA (an explanation of tid += blockDim.x * gridDim.x)

See this source code:
...
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#define N 10 //(33*1024)

__global__ void add(int *c){
 // global index of this thread: thread index within its block
 // plus the offset of the block (blockIdx.x * threads per block)
 int tid = threadIdx.x + blockIdx.x * blockDim.x;

 // grid-stride loop: each thread handles tid, tid + total threads, ...
 while( tid < N)
 {
  c[tid] = 1;
  tid += blockDim.x * gridDim.x;  //total number of threads in the grid
 }
}




int main(void)
{
 int c[N];
 int *dev_c;
 cudaMalloc( (void**)&dev_c, N*sizeof(int) );

 for(int i=0; i< N; ++i)
 {
  c[i] = -1;
 }

 cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

 add<<< 2, 2>>>(dev_c);
 cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );


 for(int i=0; i< N; ++i)
 {
  printf("c[%d] = %d \n" ,i, c[i] );
 }

 cudaFree( dev_c );

}
...


---

Why don't we simply create 10 threads, e.g. add<<<2, 5>>> or add<<<5, 2>>>?
Because we want to keep the number of threads reasonably small, in case N is much larger than 10, e.g. 33*1024.

This source code is an example of that case:
the array has 10 elements, but only 4 CUDA threads are created.
How can all 10 elements be reached by only 4 threads?


See this page for the detailed meaning of threadIdx, blockIdx, blockDim, and gridDim in CUDA.
(1D) -> http://study.marearts.com/2015/03/meaning-of-threadidx-blockidx-blockdim.html


In this source code,
gridDim.x -> 2     //the number of blocks in x
gridDim.y -> 1     //the number of blocks in y
blockDim.x -> 2   //the number of threads per block in x
blockDim.y -> 1   //the number of threads per block in y

The total number of threads is 4, because 2 blocks * 2 threads per block.

In the add kernel function, the four threads initially get tid values 0, 1, 2, 3:

int tid = threadIdx.x + blockIdx.x * blockDim.x;
①0+0*2=0
②1+0*2=1
③0+1*2=2
④1+1*2=3 

How do the threads reach the remaining indices 4, 5, 6, 7, 8, 9?
That is what the while loop does:
while( tid < N )
{
   c[tid] = 1;
   tid += blockDim.x * gridDim.x;
}

** thread with initial tid = 0 **
#1 loop: 0+2*2=4
#2 loop: 4+2*2=8
#3 loop: 8+2*2=12 (12 is not < N, so the while loop exits)

** thread with initial tid = 1 **
#1 loop: 1+2*2=5
#2 loop: 5+2*2=9
#3 loop: 9+2*2=13 (13 is not < N, so the while loop exits)

** thread with initial tid = 2 **
#1 loop: 2+2*2=6
#2 loop: 6+2*2=10 (10 is not < N, so the while loop exits)

** thread with initial tid = 3 **
#1 loop: 3+2*2=7
#2 loop: 7+2*2=11 (11 is not < N, so the while loop exits)

So every index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 is reached by some tid value.
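
The same pattern works for any N and any launch configuration. Here is a minimal sketch of the idea as a reusable kernel; the kernel name fill_value and the launch numbers in the comment are only illustrative, not from the original code.

...
__global__ void fill_value(int *c, int n, int value){
 // start at this thread's global index
 int tid = threadIdx.x + blockIdx.x * blockDim.x;
 // each iteration jumps by the total number of threads in the grid
 int stride = blockDim.x * gridDim.x;

 while( tid < n )
 {
  c[tid] = value;
  tid += stride;
 }
}

// example launch: fill_value<<< 128, 256 >>>(dev_c, 33*1024, 1);
...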







3/12/2015

Meaning of threadIdx, blockIdx, blockDim, gridDim in CUDA (2D)

This article explains how to compute the thread index when you create blocks and threads with two dimensions.

Please refer to this page for how to access the thread index in 1D.
-> http://feelmare.blogspot.com/2015/03/meaning-of-threadidx-blockidx-blockdim.html

If you launch the kernel like this:

dim3 blocks(2,3);
dim3 threads(3,2);
Kernel<<< blocks, threads >>>(...);

The threads are created as shown in the following figure.


36 threads are created (2*3 blocks * 3*2 threads per block); gridDim is (2,3) and blockDim is (3,2).

The problem now is:
how do we access, for example, the 15th thread?
See this figure:



Do you see it?
We have to do an indexing calculation, because threadIdx.x and threadIdx.y only give the index within their own block.

For more detail, refer to the figure below, which lists the tid value computed for each thread.
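
As a concrete example (using the launch above, where blockDim = (3,2) and gridDim = (2,3)), take the thread with blockIdx = (1,1) and threadIdx = (0,1). The numbers below simply follow the formula used in the kernel code.

x = threadIdx.x + blockIdx.x * blockDim.x = 0 + 1*3 = 3
y = threadIdx.y + blockIdx.y * blockDim.y = 1 + 1*2 = 3
tid = x + y * blockDim.x * gridDim.x = 3 + 3*3*2 = 21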




#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include < stdio.h>



#define N 40

__global__ void increase(int *c){
 // global x and y coordinates of this thread across the whole grid
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;
 // flatten (x, y) into one index; a full row is blockDim.x * gridDim.x wide
 int tid = x + y*blockDim.x * gridDim.x;
 if(tid < N)
  c[tid] = tid;
}



int main(void)
{
 int c[N];
 int *dev_c;

 cudaMalloc( (void**)&dev_c, N*sizeof(int) );

 for(int i=0; i< N; ++i)
 {
  c[i] = -1;
 }

 cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

 dim3 blocks(2,3);
 dim3 threads(3,2);
 increase<<< blocks, threads>>>(dev_c);

 cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );

 for(int i=0; i< N; ++i)
 {
  printf("c[%d] = %d \n" ,i, c[i] );
 }

 cudaFree( dev_c );
}

...


In this source code only 36 threads are created, so the last four elements (c[36] to c[39]) keep their initial value of -1.
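
If you also wanted those last four elements filled, you could reuse the stride idea from the newer post above. Below is a minimal sketch under that assumption; the kernel name increase_all is mine, not part of the original code.

...
__global__ void increase_all(int *c){
 int x = threadIdx.x + blockIdx.x * blockDim.x;
 int y = threadIdx.y + blockIdx.y * blockDim.y;
 int tid = x + y * blockDim.x * gridDim.x;
 // total number of threads in the whole 2D grid (6*6 = 36 here)
 int stride = (blockDim.x * gridDim.x) * (blockDim.y * gridDim.y);

 while( tid < N )
 {
  c[tid] = tid;
  tid += stride;
 }
}
...

With the same <<< blocks, threads >>> launch, the thread with tid 0 also writes c[36], the thread with tid 1 writes c[37], and so on.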





Meaning of threadIdx, blockIdx, blockDim, gridDim in CUDA (1D)


When we first study CUDA, thread indexing is very confusing,
so I tried to clear it up.

First, let's get a sense of it by looking at this example:
...
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#define N 15

__global__ void increase(int *c){
 // global thread index: thread index within the block plus the block offset
 int tid = threadIdx.x + blockIdx.x * blockDim.x;
 
 if(tid < N)
  c[tid] = tid;
}

int main(void)
{
 int c[N];
 int *dev_c;

 cudaMalloc( (void**)&dev_c, N*sizeof(int) );

 for(int i=0; i< N; ++i)
 {
  c[i] = -1;
 }

 cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

 increase<<< 4, 3>>>(dev_c);

 cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );

 for(int i=0; i< N; ++i)
 {
  printf("c[%d] = %d \n" ,i, c[i] );
 }

 cudaFree( dev_c );
}
...

The result of this example: c[0] through c[11] are set to their own index, while c[12], c[13], and c[14] keep the initial value -1.

In the source code, the kernel function (increase) is launched with <<< 4, 3 >>>, which creates 12 threads.
The 12 threads execute at the same time,
so each one needs to know "which thread am I?".
That is what threadIdx and blockIdx are for.

But we still have to calculate the global thread index, because threadIdx is only unique within its own block,
like this: " int tid = threadIdx.x + blockIdx.x * blockDim.x; "

threadIdx is the index of the current thread within its block.
blockIdx is the index of the current block within the grid.

gridDim is the number of blocks in the grid.
blockDim is the number of threads in a block.
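
For the <<< 4, 3 >>> launch above, these take concrete values (a small worked example of my own, following the formula in the code):

gridDim.x = 4 and blockDim.x = 3
blockIdx.x runs from 0 to 3, threadIdx.x runs from 0 to 2
the last thread of the last block gets tid = 2 + 3*3 = 11, which is why the 12 threads cover c[0] .. c[11]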


Are you even more confused?

Here is my final explanation; see the figure below.




Next time, I will introduce a 2D kernel.
http://study.marearts.com/2015/03/meaning-of-threadidx-blockidx-blockdim_12.html
Thank you.