Discussion:
[PyCUDA] Questions on pinned memory
Walter White
2015-10-05 09:55:20 UTC
Hello,

I have a question about pinned memory and hope that you can help me.

I found out that copying data from device to host takes up
a large part of my runtime, so I read about the issue
and came across "pinned memory".

There are several examples on the mailing list but I am not
sure if I am doing this the right way.

Do I need to initialize with drv.ctx_flags.MAP_HOST
or is this automatically activated if one of the
functions below is used?

drv.init()
dev = drv.Device(0)
ctx = dev.make_context(drv.ctx_flags.SCHED_AUTO | drv.ctx_flags.MAP_HOST)


Is drv.mem_host_register_flags.DEVICEMAP also needed if
the context is initialized with drv.ctx_flags.MAP_HOST ?

I found several methods that should do this
but none of them seems to work.
Are they all equivalent?

--
x = drv.register_host_memory(x, flags=drv.mem_host_register_flags.DEVICEMAP)
x_gpu_ptr = np.intp(x.base.get_device_pointer())

--
x = drv.pagelocked_empty(shape=x.shape, dtype=np.float32,
                         mem_flags=drv.mem_host_register_flags.DEVICEMAP)
--

from pycuda.tools import PageLockedMemoryPool
pool = PageLockedMemoryPool()
x_ptr = pool.allocate(dest.shape, np.float32)
--


If I use
np.intp(x.base.get_device_pointer())
and
drv.memcpy_dtoh(a_gpu, x_ptr)

there is an error message

"BufferError: Object is not writable."

Kind regards,
Joe
Andreas Kloeckner
2015-10-05 14:39:13 UTC
Post by Walter White
Hello,
I have a question about pinned memory and hope that you can help me.
I found out that copying data from device to host takes up
a large part of my runtime, so I read about the issue
and came across "pinned memory".
There are several examples on the mailing list but I am not
sure if I am doing this the right way.
Do I need to initialize with drv.ctx_flags.MAP_HOST
or is this automatically activated if one of the
functions below is used?
drv.init()
dev = drv.Device(0)
ctx = dev.make_context(drv.ctx_flags.SCHED_AUTO | drv.ctx_flags.MAP_HOST)
No, it is not automatic; the MAP_HOST context flag is necessary.
Post by Walter White
Is drv.mem_host_register_flags.DEVICEMAP also needed if
the context is initialized with drv.ctx_flags.MAP_HOST ?
I found several methods that should do this
but none of them seems to work.
Are they all equivalent?
--
x = drv.register_host_memory(x, flags=drv.mem_host_register_flags.DEVICEMAP)
x_gpu_ptr = np.intp(x.base.get_device_pointer())
--
x = drv.pagelocked_empty(shape=x.shape, dtype=np.float32,
                         mem_flags=drv.mem_host_register_flags.DEVICEMAP)
--
from pycuda.tools import PageLockedMemoryPool
pool = PageLockedMemoryPool()
x_ptr = pool.allocate(dest.shape, np.float32)
--
The former two are equivalent. The latter just uses 'page-locked' memory
(which *can* be pinned, but normally isn't).
Post by Walter White
If I use
np.intp(x.base.get_device_pointer())
and
drv.memcpy_dtoh(a_gpu, x_ptr)
there is an error message
"BufferError: Object is not writable."
This is a sign that it worked--the memory is no longer writable host-side.
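For reference, here is a minimal end-to-end sketch assembled from the
snippets above (untested; the array size is hypothetical, and the
host_alloc_flags enum used for the pagelocked_empty variant is my reading
of the driver module, so treat it as an assumption):

import numpy as np
import pycuda.driver as drv

drv.init()
dev = drv.Device(0)
# MAP_HOST at context creation is what enables mapped host memory.
ctx = dev.make_context(drv.ctx_flags.SCHED_AUTO | drv.ctx_flags.MAP_HOST)

try:
    # Route 1: pin and map an existing numpy array.
    x = np.zeros(1024, dtype=np.float32)  # hypothetical size
    x = drv.register_host_memory(
        x, flags=drv.mem_host_register_flags.DEVICEMAP)
    x_gpu_ptr = np.intp(x.base.get_device_pointer())

    # Route 2: allocate fresh mapped, page-locked memory. Note the flag
    # here comes from host_alloc_flags, not mem_host_register_flags.
    y = drv.pagelocked_empty(
        (1024,), np.float32, mem_flags=drv.host_alloc_flags.DEVICEMAP)
    y_gpu_ptr = np.intp(y.base.get_device_pointer())

    # Kernels given x_gpu_ptr / y_gpu_ptr read and write host memory
    # directly; synchronize before touching x or y on the host.
    ctx.synchronize()
finally:
    ctx.pop()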

Andreas
Joe
2015-10-11 06:29:47 UTC
Hello,

I have a question and hope that you can help me.

A block is working on a common problem; each thread iterates
through a part of the problem.
Now if some condition is met, a thread should write its threadId
to a 1D output array which is smaller than the total number of threads.

I would rather not store all of the results as integers,
since the condition is only met in very rare cases.

The two options I found would be

1.) store all results in a bitfield which is as long as the number
of threads and use bitwise atomicAnd.

2.) share a common index within a block and use the
return value of atomicAdd to store the threadId there.

Is one of these ideas to be preferred? Or do you have
better suggestions for how to do this?

Kind regards,
Joe
Andreas Kloeckner
2015-10-11 17:30:42 UTC
Post by Walter White
Hello,
I have a question and hope that you can help me.
A block is working on a common problem; each thread iterates
through a part of the problem.
Now if some condition is met, a thread should write its threadId
to a 1D output array which is smaller than the total number of threads.
I would rather not store all of the results as integers,
since the condition is only met in very rare cases.
The two options I found would be
1.) store all results in a bitfield which is as long as the number
of threads and use bitwise atomicAnd.
2.) share a common index within a block and use the
return value of atomicAdd to store the threadId there.
Is one of these ideas to be preferred? Or do you have
better suggestions for how to do this?
This sounds tricky. A reasonable design might be to allocate space so
that every block has room to write out twice or three times its expected
number of outputs, use a scan within each block to compute indices, and
have some sort of failure indication (+do-over) if the allocated output
space overruns.
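
As a rough illustration of that shape (not tested code; the condition,
the fixed block size of 256, and the out_per_block budget are all
made-up placeholders), a per-block compaction with a naive
Hillis-Steele scan might look like this:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void collect(const float *data, int n,
                        int *out, int out_per_block, int *overflow)
{
    __shared__ int flags[256];            // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // 1 if this thread has something to emit (placeholder condition).
    int have = (gid < n && data[gid] > 0.99f) ? 1 : 0;
    flags[tid] = have;
    __syncthreads();

    // Inclusive Hillis-Steele scan over the flags.
    for (int off = 1; off < blockDim.x; off *= 2)
    {
        int v = (tid >= off) ? flags[tid - off] : 0;
        __syncthreads();
        flags[tid] += v;
        __syncthreads();
    }

    int idx = flags[tid] - have;          // exclusive index for this thread
    int total = flags[blockDim.x - 1];    // outputs this block wants to write

    if (total > out_per_block)
    {
        if (tid == 0) overflow[blockIdx.x] = 1;   // signal a do-over
    }
    else if (have)
        out[blockIdx.x * out_per_block + idx] = gid;
}
""")
collect = mod.get_function("collect")

n, block, out_per_block = 1 << 20, 256, 64
data = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
out = gpuarray.empty(((n // block) * out_per_block,), np.int32)
out.fill(-1)                              # -1 marks unused output slots
overflow = gpuarray.zeros((n // block,), np.int32)
collect(data, np.int32(n), out, np.int32(out_per_block), overflow,
        block=(block, 1, 1), grid=(n // block, 1))
# Host side: wherever overflow is 1, re-run that block with a larger budget.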

Andreas

Joe
2015-10-14 19:38:50 UTC
Hello,

In the meantime I made a first attempt at this.
Unfortunately it is still very slow.
I am using atomicAdd with a shared variable at the moment.
I read somewhere that it is recommended to use shared
instead of global memory because it is faster; is this true in general?

This is roughly the code I am using:

if (condition)
{
    cnt = atomicAdd(&shrd_cnt, 1);
    out[cnt] = i;
}

Anything not good about it?

I am still thinking about how it could be done without
atomicAdd. One idea would be to initialize the
"out" array with e.g. -1, and then check the array before
each write to see whether another thread has already
added something there.

while (out[j] == -1)
{
    j++;
}
out[j] = i;

But I think this is prone to errors if the memory is accessed
at the same time.

Andreas, you mentioned to do "a scan withing each block".
Do you mean something like each thread checks how many indices
it wants to add to the global list and then add them in a second
step?

Kind regards,
Joe
Post by Andreas Kloeckner
Post by Walter White
Hello,
I have a question and hope that you can help me.
A block is working on a common problem; each thread iterates
through a part of the problem.
Now if some condition is met, a thread should write its threadId
to a 1D output array which is smaller than the total number of threads.
I would rather not store all of the results as integers,
since the condition is only met in very rare cases.
The two options I found would be
1.) store all results in a bitfield which is as long as the number
of threads and use bitwise atomicAnd.
2.) share a common index within a block and use the
return value of atomicAdd to store the threadId there.
Is one of these ideas to be preferred? Or do you have
better suggestions for how to do this?
This sounds tricky. A reasonable design might be to allocate space so
that every block has room to write out twice or three times its expected
number of outputs, use a scan within each block to compute indices, and
have some sort of failure indication (+do-over) if the allocated output
space overruns.
Andreas
Joe
2015-10-18 12:27:00 UTC
Hello,

In the meantime I added a scan function to find out how many
indices will be written by each thread.
These results are written to shared memory, and that works
fine.

However, the final writing of the results to global memory
is very slow and takes up nearly all of the runtime (18 seconds out of 20).

Is there something I am missing in the following code?

int cnt;
// holds the index where this thread starts to write to the global array
// (computed by each thread earlier)

thrd_chk_start and thrd_chk_end delimit the range of data that
each thread processes.
Typically (thrd_chk_end - thrd_chk_start) is between 25 and 100.


for (int i = thrd_chk_start; i < thrd_chk_end; i++)
{
    if (condition)
    {
        out[(hIdx * nearNeigh_n) + cnt] = i;
        cnt += 1;
    }
}

The line with out[...] is very slow; does anyone know if there is
a reason for that? Perhaps that the indices are not known to the
compiler beforehand?
All other writes to global memory are far faster than this.

I tried blockDim.x between 64 and 1024 but there is not much difference.

Kind regards,
Joe
Post by Walter White
Hello,
In the meantime I made a first attempt at this.
Unfortunately it is still very slow.
I am using atomicAdd with a shared variable at the moment.
I read somewhere that it is recommended to use shared
instead of global memory because it is faster; is this true in general?
if (condition)
{
    cnt = atomicAdd(&shrd_cnt, 1);
    out[cnt] = i;
}
Anything not good about it?
I am still thinking about how it could be done without
atomicAdd. One idea would be to initialize the
"out" array with e.g. -1, and then check the array before
each write to see whether another thread has already
added something there.
while (out[j] == -1)
{
    j++;
}
out[j] = i;
But I think this is prone to errors if the memory is accessed
at the same time.
Andreas, you mentioned to do "a scan withing each block".
Do you mean something like each thread checks how many indices
it wants to add to the global list and then add them in a second
step?
Kind regards,
Joe
Post by Andreas Kloeckner
Post by Walter White
Hello,
I have a question and hope that you can help me.
A block is working on a common problem; each thread iterates
through a part of the problem.
Now if some condition is met, a thread should write its threadId
to a 1D output array which is smaller than the total number of threads.
I would rather not store all of the results as integers,
since the condition is only met in very rare cases.
The two options I found would be
1.) store all results in a bitfield which is as long as the number
of threads and use bitwise atomicAnd.
2.) share a common index within a block and use the
return value of atomicAdd to store the threadId there.
Is one of these ideas to be preferred? Or do you have
better suggestions for how to do this?
This sounds tricky. A reasonable design might be to allocate space so
that every block has room to write out twice or three times its expected
number of outputs, use a scan within each block to compute indices, and
have some sort of failure indication (+do-over) if the allocated output
space overruns.
Andreas
Andreas Kloeckner
2015-10-18 23:27:56 UTC
Post by Joe
In the meantime I added a scan function to find out how many
indices will be written by each thread.
These results are written to shared memory, and that works
fine.
However, the final writing of the results to global memory
is very slow and takes up nearly all of the runtime (18 seconds out of 20).
Is there something I am missing in the following code?
int cnt;
// holds the index where this thread starts to write to the global array
// (computed by each thread earlier)
thrd_chk_start and thrd_chk_end delimit the range of data that
each thread processes.
Typically (thrd_chk_end - thrd_chk_start) is between 25 and 100.
for (int i = thrd_chk_start; i < thrd_chk_end; i++)
{
    if (condition)
    {
        out[(hIdx * nearNeigh_n) + cnt] = i;
        cnt += 1;
    }
}
The line with out[...] is very slow; does anyone know if there is
a reason for that? Perhaps that the indices are not known to the
compiler beforehand?
All other writes to global memory are far faster than this.
Depending on how scattered these writes are, it might be helpful to turn
off caching for them. See the CUDA docs for how.
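For example, with PyCUDA the compiler flags can be passed through
SourceModule; the ptxas default-cache-modifier options below are my best
guess at the relevant knob, so check `ptxas --help` for your toolkit
version before relying on them:

from pycuda.compiler import SourceModule

# Sketch: make global loads bypass L1 (cg) and mark global stores as
# streaming (cs) by default for every kernel in this module.
# `kernel_source` stands in for your actual kernel string.
mod = SourceModule(kernel_source,
                   options=["-Xptxas", "-dlcm=cg",
                            "-Xptxas", "-dscm=cs"])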

Andreas
