indices will be written by a specific thread.
fine.
is very slow and takes up nearly all of the time (18 seconds out of 20).
each thread processes.
Typically (thrd_chk_end - thrd_chk_start) is between 25 and 100.
All other writes to global memory are far faster than this.
I tried blockDim.x values between 64 and 1024, but there is not much difference.
On 14.10.2015 at 21:38, Joe wrote:
Hello,
In the meantime I have made a first attempt at this issue.
Unfortunately it is still very slow.
I am using atomicAdd with a shared variable at the moment.
I read somewhere that it is recommended to use
shared instead of global memory because it is faster. Is this true in general?
if (condition)
{
    // atomicAdd returns the old counter value, i.e. this thread's slot
    cnt = atomicAdd(&shrd_cnt, 1);
    out[cnt] = i;
}
Is there anything wrong with this?
I am still thinking about how it could be done without
atomicAdd. One idea would be to initialize the
"out" array with e.g. -1
and then check the array before each write to see whether one
of the other threads has already added something:
// skip slots that are already filled (anything other than -1)
while (out[j] != -1)
{
    j++;
}
out[j] = i;
But I think this is prone to errors if the memory is accessed
at the same time.
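A minimal sketch of the shared-counter idea above, assuming one element per thread and a placeholder condition (`data[i] > threshold`). The names `collect_hits`, `find_hits`, `blk_hits`, `blk_cnt` and `blk_base` are hypothetical and not from the original post. Each block collects its hits with shared-memory atomics and then reserves a range of the global list with one global atomicAdd. This also avoids a problem the snippet above would have if "out" is one global array indexed by a per-block counter: different blocks would write to the same slots. How much faster shared atomics are than global ones depends on the GPU architecture. The polling loop on -1, by contrast, is a race: two threads can see the same free slot and both write to it.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

// Each block gathers its hits in shared memory, then claims a contiguous
// range of the global output with a single global atomicAdd.
__global__ void collect_hits(const int *data, int n, int threshold,
                             int *out, int *out_count)
{
    extern __shared__ int blk_hits[];  // blockDim.x slots, enough for 1 hit/thread
    __shared__ int blk_cnt;            // number of hits found by this block
    __shared__ int blk_base;           // start of this block's range in out

    if (threadIdx.x == 0) blk_cnt = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold) {        // placeholder condition
        int slot = atomicAdd(&blk_cnt, 1);     // shared-memory atomic
        blk_hits[slot] = i;
    }
    __syncthreads();

    if (threadIdx.x == 0)
        blk_base = atomicAdd(out_count, blk_cnt);  // one global atomic per block
    __syncthreads();

    // Copy the block's hits into its reserved range of the global list.
    for (int k = threadIdx.x; k < blk_cnt; k += blockDim.x)
        out[blk_base + k] = blk_hits[k];
}

// Hypothetical host wrapper; returns the hit indices in ascending order.
std::vector<int> find_hits(const std::vector<int> &h_data, int threshold)
{
    const int n = static_cast<int>(h_data.size());
    if (n == 0) return {};

    int *d_data = nullptr, *d_out = nullptr, *d_count = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));       // worst case: every element hits
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    collect_hits<<<blocks, threads, threads * sizeof(int)>>>(
        d_data, n, threshold, d_out, d_count);
    assert(cudaGetLastError() == cudaSuccess); // minimal launch check only

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> hits(count);
    cudaMemcpy(hits.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_out); cudaFree(d_count);

    // The order in which atomics hand out slots is not deterministic.
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

If each thread loops over a chunk of 25 to 100 elements, as described earlier in the thread, the shared buffer would need a larger capacity and an overflow check, which this sketch leaves out.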
Andreas, you mentioned doing "a scan within each block".
Do you mean something like each thread checks how many indices
it wants to add to the global list and then adds them in a second
step?
Kind regards,
Joe
Post by Andreas Kloeckner
Post by Walter White
Hello,
I have a question and hope that you can help me.
A block is working on a common problem, with each thread iterating
through its own part of the problem.
Now if some condition is met, a thread should write its threadId
to a 1D output which is smaller than the total number of threads.
I would rather not store all of the results as integers,
since the condition is only met in very rare cases.
The two options I found would be
1.) to store all results in a bitfield which is as long as there are
threads and use bitwise atomicAnd.
2.) to share a common index within a block and use the
return value of atomicAdd to store the threadId there.
Is one of these ideas to be preferred? Or do you have
better suggestions to do this?
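Option 1 could look roughly like the sketch below. It assumes the bitfield starts out all zero and uses atomicOr to set a bit, since atomicAnd can only clear bits and would fit the inverted convention (all bits start at one, a hit clears its bit). The names `mark_hits` and `hit_bitfield` and the condition `data[i] > threshold` are placeholders, not taken from the post.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// One bit per global thread index; bit i is set when element i is a hit.
__global__ void mark_hits(const int *data, int n, int threshold,
                          unsigned int *bits)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold)               // placeholder condition
        atomicOr(&bits[i / 32], 1u << (i % 32));    // 32 threads share a word
}

// Hypothetical host wrapper; returns the bitfield as 32-bit words.
std::vector<unsigned int> hit_bitfield(const std::vector<int> &h_data,
                                       int threshold)
{
    const int n = static_cast<int>(h_data.size());
    if (n == 0) return {};
    const int words = (n + 31) / 32;

    int *d_data = nullptr;
    unsigned int *d_bits = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_bits, words * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bits, 0, words * sizeof(unsigned int));  // all bits cleared

    const int threads = 256;
    mark_hits<<<(n + threads - 1) / threads, threads>>>(
        d_data, n, threshold, d_bits);
    assert(cudaGetLastError() == cudaSuccess);  // minimal launch check only

    std::vector<unsigned int> bits(words);
    cudaMemcpy(bits.data(), d_bits, words * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bits);
    return bits;
}
```

The bitfield costs one bit per thread instead of one integer, but the consumer still has to walk all words to find the rare set bits.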
This sounds tricky. A reasonable design might be to allocate space so
that every block has room to write out twice or three times its expected
number of outputs, use a scan within each block to compute indices, and
have some sort of failure indication (+do-over) if the allocated output
space overruns.
Andreas
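One way to read this suggestion, as a sketch under stated assumptions: every thread contributes a 0/1 flag, a prefix sum (scan) over the flags within the block gives each hit its write position, and each block writes into its own region of `cap` slots. If a block finds more hits than `cap`, it raises a flag and the host repeats the launch with more room. The kernel name, the host wrapper, the doubling policy and the condition `data[i] > threshold` are all hypothetical; the scan is a simple Hillis-Steele version chosen for brevity, not speed.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

__global__ void scan_compact(const int *data, int n, int threshold,
                             int *out, int *counts, int cap, int *overflow)
{
    extern __shared__ int s[];                 // 2 * blockDim.x ints
    int *cur = s, *nxt = s + blockDim.x;       // double buffer for the scan
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    int flag = (i < n && data[i] > threshold) ? 1 : 0;  // placeholder condition

    // Inclusive Hillis-Steele scan of the flags.
    cur[t] = flag;
    __syncthreads();
    for (int off = 1; off < (int)blockDim.x; off <<= 1) {
        nxt[t] = cur[t] + (t >= off ? cur[t - off] : 0);
        __syncthreads();
        int *tmp = cur; cur = nxt; nxt = tmp;
    }

    int pos = cur[t] - flag;                   // exclusive prefix = write slot
    int total = cur[blockDim.x - 1];           // hits in this block
    if (flag && pos < cap)
        out[blockIdx.x * cap + pos] = i;       // no atomics needed
    if (t == 0) {
        counts[blockIdx.x] = total < cap ? total : cap;
        if (total > cap) *overflow = 1;        // failure indication
    }
}

// Hypothetical host wrapper: retries with doubled capacity on overflow.
// Assumes cap >= 1.
std::vector<int> compact_with_retry(const std::vector<int> &h_data,
                                    int threshold, int cap)
{
    const int n = static_cast<int>(h_data.size());
    if (n == 0) return {};
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *d_data = nullptr, *d_counts = nullptr, *d_overflow = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_counts, blocks * sizeof(int));
    cudaMalloc(&d_overflow, sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<int> result;
    for (;;) {
        const size_t slots = (size_t)blocks * cap;
        int *d_out = nullptr;
        cudaMalloc(&d_out, slots * sizeof(int));
        cudaMemset(d_overflow, 0, sizeof(int));
        scan_compact<<<blocks, threads, 2 * threads * sizeof(int)>>>(
            d_data, n, threshold, d_out, d_counts, cap, d_overflow);
        assert(cudaGetLastError() == cudaSuccess);  // minimal launch check only

        int overflow = 0;
        cudaMemcpy(&overflow, d_overflow, sizeof(int), cudaMemcpyDeviceToHost);
        if (!overflow) {
            std::vector<int> counts(blocks), out(slots);
            cudaMemcpy(counts.data(), d_counts, blocks * sizeof(int),
                       cudaMemcpyDeviceToHost);
            cudaMemcpy(out.data(), d_out, slots * sizeof(int),
                       cudaMemcpyDeviceToHost);
            for (int b = 0; b < blocks; ++b)        // gather the used slots
                for (int k = 0; k < counts[b]; ++k)
                    result.push_back(out[(size_t)b * cap + k]);
            cudaFree(d_out);
            break;
        }
        cudaFree(d_out);
        cap *= 2;                                   // do-over with more room
    }
    cudaFree(d_data); cudaFree(d_counts); cudaFree(d_overflow);
    return result;
}
```

Because the scan preserves thread order, the indices come back sorted without any extra step, and the retry loop terminates once `cap` reaches the block size.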