are ampblas_sgemm and ampblas_dgemm thread safe?

Nov 14, 2013 at 1:57 AM

i am working on a code base that is very heavy in linear algebra.

using visual c++ 2013 and intel mkl 11, i can put my blas calls in a parallel for loop, and this scales my data parallel code across my cpus. ex:

parallel_for (unsigned int(0), max, [&](unsigned int iter) {
    // call mkl dgemm(...) and store result in array[iter]
});

however, i do not see the same scaling behavior when i substitute ampblas_dgemm. ex:

parallel_for (unsigned int(0), max, [&](unsigned int iter) {
    // call ampblas_dgemm(...) and store result in array[iter]
});

of course, i have tried increasing the number of threads used in the parallel_for.

do you have any suggestions about how i can achieve this same scaling on my gpu (geforce gtx 780m)?

Nov 14, 2013 at 2:09 AM

alternately, if you can suggest another way to batch a set of calls to ampblas_dgemm(...), i would be very appreciative.

Nov 25, 2013 at 9:43 PM

Hi bioman,

I don't have much practical experience with Intel MKL 11; however, from what I can read in their documentation, BLAS level 3 routines (which include dgemm) are internally parallelized to fully utilize all CPU cores, so further parallelization with parallel_for should not bring any benefit and may even hurt performance due to oversubscription. One explanation I can see for the scaling you are reporting is that parallelization is disabled in your MKL configuration, so parallel_for distributes single-threaded dgemm computations across multiple threads running on multiple cores. Alternatively, you are performing more sequential work in the loop than just storing the result, and the speed-up comes from parallelizing that part. I will assume the former.
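To make the first scenario concrete, here is a minimal sketch in standard C++ of what I assume is happening: each dgemm call runs sequentially, and the outer loop spreads independent calls over CPU threads. The names naive_gemm and multiply_all are hypothetical stand-ins of mine, not MKL or PPL functions; naive_gemm takes the place of a single-threaded dgemm, and std::thread takes the place of parallel_for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a sequential (single-threaded) dgemm:
// returns C = A * B for square n x n row-major matrices.
std::vector<double> naive_gemm(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t n) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Computes as[iter] * bs[iter] for every iter, spreading the independent
// products over `workers` CPU threads (a stand-in for parallel_for).
std::vector<std::vector<double>> multiply_all(
        const std::vector<std::vector<double>>& as,
        const std::vector<std::vector<double>>& bs,
        std::size_t n, unsigned workers) {
    assert(as.size() == bs.size());
    workers = std::max(1u, workers);
    std::vector<std::vector<double>> results(as.size());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Each thread handles a strided subset of the iterations and
            // writes only to its own result slots, so no locking is needed.
            for (std::size_t iter = w; iter < as.size(); iter += workers)
                results[iter] = naive_gemm(as[iter], bs[iter], n);
        });
    }
    for (auto& t : pool) t.join();
    return results;
}
```

With a sequential inner routine, this outer loop is the only source of parallelism, which would be consistent with the scaling you describe. If the inner routine were itself multithreaded, the same loop would create more runnable threads than cores.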

C++ AMP BLAS algorithms parallelize work on their own as well. Their goal is to fully utilize an accelerator, in your case the NVIDIA GPU (note that you can control which accelerator is used with ampblas_set_current_accelerator_view). Since the GPU is a single resource shared across your system, dispatching work to it from multiple CPU threads will not give a significant speed-up: some CPU-side preparatory work will be parallelized, but if the computation is large enough, the core execution on the accelerator will simply be queued up. For C++ AMP BLAS to be used efficiently, the accelerator-side execution should account for most of the run time, so you should not see any speed-up from using parallel_for and ampblas_dgemm together, unless you have multiple GPUs to dispatch the work to (although in that case you need to manage the current accelerator_view manually). Therefore, if I understand your question correctly, there is no further scaling you can achieve in C++ AMP BLAS.

Regarding your follow-up question about batching a set of calls to ampblas_dgemm, could you clarify what kind of "batching" you mean, perhaps with a use-case example?
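To illustrate what I mean by a use case, here is one possible reading of "batching", sketched in standard C++ and based purely on an assumption about your workload: that many of the products share the same left-hand matrix A. In that case the right-hand matrices can be placed side by side, so a single larger multiplication replaces many small ones. The functions gemm and batched_shared_left are hypothetical helpers written for this sketch, not part of C++ AMP BLAS; gemm stands in for one dgemm call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one dgemm call (row-major storage):
// returns C (m x n) = A (m x k) * B (k x n).
std::vector<double> gemm(const std::vector<double>& a,
                         const std::vector<double>& b,
                         std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<double> c(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}

// Computes A * B_i for every B_i (each k x n) with a single gemm call by
// concatenating the B_i column-wise into one k x (count * n) matrix.
std::vector<std::vector<double>> batched_shared_left(
        const std::vector<double>& a,
        const std::vector<std::vector<double>>& bs,
        std::size_t m, std::size_t k, std::size_t n) {
    const std::size_t count = bs.size();
    const std::size_t wide = count * n;

    // Pack: B_i occupies columns [i * n, (i + 1) * n) of the wide matrix.
    std::vector<double> packed(k * wide);
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t r = 0; r < k; ++r)
            for (std::size_t c = 0; c < n; ++c)
                packed[r * wide + i * n + c] = bs[i][r * n + c];

    // One large multiplication instead of `count` small ones.
    const std::vector<double> product = gemm(a, packed, m, k, wide);

    // Unpack: slice the wide result back into one m x n matrix per B_i.
    std::vector<std::vector<double>> out(count, std::vector<double>(m * n));
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t r = 0; r < m; ++r)
            for (std::size_t c = 0; c < n; ++c)
                out[i][r * n + c] = product[r * wide + i * n + c];
    return out;
}
```

Whether this applies depends entirely on your data. If every call has a different A and B, this particular trick does not help, which is why an example of your use case would be useful.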

Thank you.

Dec 14, 2013 at 1:18 AM

thanks for the response. in my case, i get a performance boost by launching concurrent calls to mkl's dgemm in a c++ parallel_for loop. while cuda also allows concurrent kernels, i'm assuming that c++ amp will get this feature in the future.