Marco Tazzari
2015-01-21 14:42:10 UTC
I am trying to implement in Python the following pattern for **multi-CPU and
single-GPU** computation using the **pycuda** and **pyfft** packages.
I would like to have **several processes** (e.g. launched with
multiprocessing.Pool()), with **each of them** able to perform **FFTs using
the GPU (using NVIDIA CUDA)**.
However, I have the following problem:
If I run too many processes or too many FFTs per process, **the whole
script hangs without terminating** (and without computing all the
requested FFTs). After further investigation, I suspect this is due to the
**memory limit** of the GPU (currently 2048 MB on an NVIDIA GeForce GT 750M).
It seems that the multiprocessing pool is not able to regain control.
Is there any way to avoid this?
Since each process requires less than 2048 MB, I would like to have
something like a **queue** where each process can *book* the usage of the
GPU and, when a process releases the context, the next process in the queue
starts using it.
Is this doable?
Alternatively, is it possible to ensure that only one process uses the GPU at
a given time?
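One way to picture the "booking" idea is a single `multiprocessing.Lock` shared by all pool workers, so that only the worker holding the lock enters the GPU section. The following is a minimal sketch under that assumption, not a tested fix for the hang: `init_worker`, `gpu_task`, `run_serialized` and the placeholder `fake_gpu_fft` are hypothetical names, and the real pycuda/pyfft calls would take the place of the placeholder.

```python
import multiprocessing

_gpu_lock = None  # set in each worker process by init_worker()


def init_worker(lock):
    """Pool initializer: store the shared lock in a module-level global."""
    global _gpu_lock
    _gpu_lock = lock


def fake_gpu_fft(par):
    """Hypothetical placeholder for the GPU section of one task.

    In the real script this is where the context would be created,
    the FFT executed and the context released again.
    """
    return par


def gpu_task(par):
    """Run the GPU section while holding the lock, one process at a time."""
    with _gpu_lock:
        return fake_gpu_fft(par)


def run_serialized(n_tasks, pool_size):
    """Map gpu_task over range(n_tasks) with a lock shared by all workers."""
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=pool_size,
                                initializer=init_worker, initargs=(lock,))
    try:
        return pool.map(gpu_task, range(n_tasks))
    finally:
        pool.close()
        pool.join()
```

Calling `run_serialized(16, 4)` from under an `if __name__ == '__main__':` guard would return the task parameters in order. The lock only serializes access; whether GPU memory is actually returned between tasks still depends on how each process releases its context.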
I have tried the following solutions separately, but they do not work (or
perhaps I have not implemented them correctly):
1. synchronize the stream, with proc_stream.synchronize()
2. clear the context caches, with pycuda.tools.clear_context_caches()
3. change the compute mode, with cuda.compute_mode = cuda.compute_mode.EXCLUSIVE
**Note:** Solution 2 seems to free some memory, but it makes the
computation much slower and does not solve the problem: e.g. when the
number of FFTs to be computed is increased, the script shows the same behaviour.
Here is the code. To start with a simple task, each process computes one FFT
(one can then use the batch option in execute() to do more FFTs in a row).

    import multiprocessing

    import numpy as np
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray
    from pycuda.tools import make_default_context
    from pyfft.cuda import Plan


    def main():
        # generate a simple matrix (e.g. an image with a signal at the center)
        size = 4096
        center = size // 2  # integer division, so it can be used as an index
        in_matrix = np.zeros((size, size), dtype='complex64')
        in_matrix[center:center+2, center:center+2] = 10.

        pool_size = 4  # integer up to multiprocessing.cpu_count()
        pool = multiprocessing.Pool(processes=pool_size)
        func = FuncWrapper(in_matrix, size)
        nffts = 16  # total number of FFTs to be computed
        par = np.arange(nffts)

        results = pool.map(func, par)
        pool.close()
        pool.join()
        print(results)
And here is the function wrapper:

    class FuncWrapper(object):
        def __init__(self, matrix, size):
            self.in_matrix = matrix
            self.size = size
            print("Func initialized with matrix size=%i" % size)

        def __call__(self, par):
            proc_id = multiprocessing.current_process().name

            # take control over the GPU
            cuda.init()
            context = make_default_context()
            device = context.get_device()
            proc_stream = cuda.Stream()

            # move data to GPU
            # multiplication self.in_matrix*par is just to have each process
            # computing different matrices
            in_map_gpu = gpuarray.to_gpu(self.in_matrix*par)

            # create Plan, execute FFT and get back the result from GPU
            plan = Plan((self.size, self.size), dtype=np.complex64,
                        fast_math=False, normalize=False,
                        wait_for_finish=True,
                        stream=proc_stream)
            plan.execute(in_map_gpu, wait_for_finish=True)
            result = in_map_gpu.get()

            # free memory on GPU
            del in_map_gpu

            mem = np.array(cuda.mem_get_info())/1.e6
            print("%s free=%f\ttot=%f" % (proc_id, mem[0], mem[1]))

            # release context
            context.pop()
            return par
Now, with nffts=16 and pool_size=4, the script terminates correctly and gives
this output:

    Func initialized with matrix size=4096
    PoolWorker-1 free=1481.019392 tot=2147.024896
    PoolWorker-2 free=1331.011584 tot=2147.024896
    PoolWorker-3 free=1181.003776 tot=2147.024896
    PoolWorker-4 free=1030.631424 tot=2147.024896
    PoolWorker-1 free=881.074176 tot=2147.024896
    PoolWorker-2 free=731.746304 tot=2147.024896
    PoolWorker-3 free=582.418432 tot=2147.024896
    PoolWorker-4 free=433.090560 tot=2147.024896
    PoolWorker-1 free=582.754304 tot=2147.024896
    PoolWorker-2 free=718.946304 tot=2147.024896
    PoolWorker-3 free=881.254400 tot=2147.024896
    PoolWorker-4 free=1030.684672 tot=2147.024896
    PoolWorker-1 free=868.028416 tot=2147.024896
    PoolWorker-2 free=731.713536 tot=2147.024896
    PoolWorker-3 free=582.402048 tot=2147.024896
    PoolWorker-4 free=433.090560 tot=2147.024896
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
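The free-memory figures in this output can be checked against the size of the input. A 4096 x 4096 complex64 matrix uses 8 bytes per element, and the sketch below compares that with the drop in free memory between the first four reports. The values are copied from the output above; reading the remainder as per-call overhead is an assumption, not something the log confirms.

```python
# Size of one input matrix: 4096 x 4096 elements, complex64 = 8 bytes each.
size = 4096
bytes_per_element = 8
matrix_bytes = size * size * bytes_per_element
matrix_mb = matrix_bytes / 1.e6  # same MB convention as the script (1e6 bytes)

# First four "free" values (MB) reported in the nffts=16 run.
free_mb = [1481.019392, 1331.011584, 1181.003776, 1030.631424]

# Drop in free memory from one report to the next.
drops_mb = [a - b for a, b in zip(free_mb, free_mb[1:])]

print("matrix size: %.1f MB" % matrix_mb)
for d in drops_mb:
    print("drop: %.1f MB (%.1f MB beyond the matrix itself)"
          % (d, d - matrix_mb))
```

Each of the first calls lowers the free memory by about 150 MB, roughly 16 MB more than the 134.2 MB matrix, and the memory is not returned immediately after `context.pop()`.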
But with nffts=18 and pool_size=4, the script does not terminate and gives
this output, remaining stuck at the last line:

    Func initialized with matrix size=4096
    PoolWorker-1 free=1416.392704 tot=2147.024896
    PoolWorker-2 free=982.544384 tot=2147.024896
    PoolWorker-1 free=1101.037568 tot=2147.024896
    PoolWorker-2 free=682.991616 tot=2147.024896
    PoolWorker-3 free=815.747072 tot=2147.024896
    PoolWorker-4 free=396.918784 tot=2147.024896
    PoolWorker-3 free=503.046144 tot=2147.024896
    PoolWorker-4 free=397.144064 tot=2147.024896
    PoolWorker-1 free=531.361792 tot=2147.024896
    PoolWorker-1 free=397.246464 tot=2147.024896
    PoolWorker-2 free=518.610944 tot=2147.024896
    PoolWorker-2 free=397.021184 tot=2147.024896
    PoolWorker-3 free=517.193728 tot=2147.024896
    PoolWorker-4 free=397.021184 tot=2147.024896
    PoolWorker-3 free=504.336384 tot=2147.024896
    PoolWorker-4 free=149.123072 tot=2147.024896
    PoolWorker-1 free=283.340800 tot=2147.024896
    ...on hold...
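A quick tally of the stalled run, using only the numbers printed above, shows how far it got before hanging. This is a sketch for reading the log, not a diagnosis; the variable names are illustrative.

```python
from collections import Counter

# (worker number, free MB) pairs copied from the nffts=18 output above.
reports = [
    (1, 1416.392704), (2, 982.544384), (1, 1101.037568), (2, 682.991616),
    (3, 815.747072), (4, 396.918784), (3, 503.046144), (4, 397.144064),
    (1, 531.361792), (1, 397.246464), (2, 518.610944), (2, 397.021184),
    (3, 517.193728), (4, 397.021184), (3, 504.336384), (4, 149.123072),
    (1, 283.340800),
]
nffts = 18

completed = len(reports)          # tasks that printed a memory report
pending = nffts - completed       # tasks that never reported back
per_worker = Counter(worker for worker, _ in reports)
lowest_free = min(free for _, free in reports)
matrix_mb = 4096 * 4096 * 8 / 1.e6  # one complex64 input matrix, in MB

print("completed=%d pending=%d" % (completed, pending))
print("reports per worker: %s" % dict(per_worker))
print("lowest free=%.1f MB, one matrix=%.1f MB" % (lowest_free, matrix_mb))
```

The tally gives 17 reports for 18 tasks (five from PoolWorker-1, four from each of the others), so exactly one FFT never reports back. The lowest free value, 149.1 MB, is only about 15 MB above the 134.2 MB needed for a single input matrix.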
Many thanks for your help!
--
View this message in context: http://pycuda.2962900.n2.nabble.com/Compute-several-FFT-with-GPU-using-Python-multiprocessing-and-pyfft-how-to-avoid-GPU-memory-leak-tp7575533.html
Sent from the PyCuda mailing list archive at Nabble.com.