Hanno Klemm
2015-09-15 15:52:02 UTC
Hi,
I am a newbie with regard to GPU computation. Before embarking on
porting a calculation to the GPU, I wanted to ask whether a significant
speedup is likely for this scenario, and how best to go about it in
PyCUDA.
I have a problem where I have to evaluate the probability density of a
few multivariate Gaussians (on the order of 10) for many vectors
(hundreds of thousands to millions). The length of the vectors is
usually on the order of 100-600.
Currently I am doing this in Python with numpy (backed by MKL). I
pre-compute the covariance matrices and their determinants, and then I
calculate:
def calc_observation_likelihoods(self, x, y, t):
    data = self.observations.retrieve_traces(x, y)  # Here I get the data
    self.calc_inverses(t)  # Here I retrieve the cached covariance matrices
    log_likelihoods = self.log_likelihoods  # Just a previously defined array
    # Here comes the part I would like to speed up
    for f_ind in range(Mk):
        # inverse_structure is currently a dictionary holding the
        # required information
        log_det, U = self.inverse_structure[f_ind]
        mu = self.likelihood_means[:, f_ind]  # Also cached
        # This scaling is required to prevent over/underflow
        dev = (data - mu) * self.scaling
        rank = dev.shape[0]
        # This is the line I spend most time on:
        maha = np.sum(np.square(np.dot(dev, U)), axis=-1)
        res = -0.5 * (rank * _LOG_2PI + log_det + maha)
        log_likelihoods[f_ind] = res
    log_likelihood_max = log_likelihoods.max()
    log_likelihoods -= log_likelihood_max
    likelihoods = np.exp(log_likelihoods)
    likelihoods /= likelihoods.sum()
    return likelihoods
The problem looks to me like something where a big speedup from the GPU
should be possible; however, I suppose I would have to shuffle a lot of
data around. I would therefore be grateful for some pointers on whether
this looks like a problem where trying the GPU is promising.
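To make the question concrete, here is a rough and untested sketch of
what I imagine for that hot line, using PyCUDA's gpuarray together with
scikit-cuda's cuBLAS wrappers (the helper name gpu_mahalanobis is just
mine for illustration):

    import numpy as np
    import pycuda.autoinit  # sets up the CUDA context on import
    import pycuda.gpuarray as gpuarray
    import skcuda.linalg as culinalg
    import skcuda.misc as cumisc

    culinalg.init()

    def gpu_mahalanobis(dev, U):
        # dev: (n_vectors, dim) array of centred, scaled deviations
        # U:   (dim, dim) factor of the inverse covariance matrix
        dev_gpu = gpuarray.to_gpu(dev.astype(np.float32))
        U_gpu = gpuarray.to_gpu(U.astype(np.float32))
        prod_gpu = culinalg.dot(dev_gpu, U_gpu)  # cuBLAS SGEMM
        sq_gpu = prod_gpu * prod_gpu             # element-wise square
        maha_gpu = cumisc.sum(sq_gpu, axis=1)    # row-wise reduction
        return maha_gpu.get()                    # copy the result back

As written this would still re-upload dev for every Gaussian, though;
since data itself is the same for all Mk Gaussians, I suppose the
centring and scaling would also have to happen on the card so that the
big array crosses the PCIe bus only once per call. Does that sound like
a sensible structure, or would a hand-written kernel be clearly better?

For reference, here is the deviceQuery output for the card I would be
using: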
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro 6000"
CUDA Driver Version / Runtime Version 7.0 / 7.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 6143 MBytes (6441598976 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Max Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1494 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Quadro 6000
Result = PASS
Many thanks,
Hanno
--
Hanno Klemm
***@phys.ethz.ch