I don't have much practical experience with Intel MKL 11, but from what I can read in its documentation, BLAS level 3 routines (which include dgemm) are internally parallelized to fully utilize all CPU cores, so further parallelization with parallel_for should not bring any benefit and may even hurt performance due to oversubscription. One explanation for the scaling you are reporting is that multi-threading is disabled in your MKL configuration, so parallel_for distributes single-threaded dgemm computations across multiple cores; alternatively, the loop body performs more sequential work than just storing the result, and you are seeing a speed-up from parallelizing that part. I will assume the former.
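To check which case you are in, you could compare the two strategies directly. The sketch below is illustrative, not tested against your setup: it assumes Intel MKL's documented threading controls (mkl_set_num_threads, mkl_get_max_threads) and the PPL parallel_for you are already using; the matrix size and batch count are placeholder assumptions.

```cpp
// Sketch only: requires Intel MKL and an MSVC toolchain with PPL.
#include <mkl.h>   // cblas_dgemm, mkl_set_num_threads, mkl_get_max_threads
#include <ppl.h>   // concurrency::parallel_for
#include <vector>

void run_batch(int batch, int n)
{
    std::vector<std::vector<double>> A(batch), B(batch), C(batch);
    for (int i = 0; i < batch; ++i) {
        A[i].assign(n * n, 1.0);
        B[i].assign(n * n, 1.0);
        C[i].assign(n * n, 0.0);
    }

    // Option 1: keep the outer loop serial and let MKL parallelize
    // each dgemm call internally across all cores.
    mkl_set_num_threads(mkl_get_max_threads());
    for (int i = 0; i < batch; ++i)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A[i].data(), n, B[i].data(), n, 0.0, C[i].data(), n);

    // Option 2: force single-threaded dgemm and parallelize across
    // calls instead. Mixing parallel_for with multi-threaded MKL
    // combines both levels of parallelism and risks oversubscription.
    mkl_set_num_threads(1);
    concurrency::parallel_for(0, batch, [&](int i) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A[i].data(), n, B[i].data(), n, 0.0, C[i].data(), n);
    });
}
```

If option 2 is what you are effectively running, the near-linear scaling you observe would be consistent with single-threaded dgemm calls being spread across cores.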
The C++ AMP BLAS algorithms parallelize work on their own as well. Their goal is to fully utilize an accelerator, in your case the NVIDIA GPU (note that you can control which accelerator is used with accelerator::set_default). Since the GPU is a single resource shared across your system, dispatching work to it from multiple CPU threads will not yield a significant speed-up: some CPU-side preparatory work will run in parallel, but the core execution on the accelerator is simply queued up if the computation is large enough. For efficient use of C++ AMP BLAS, that accelerator-side part should dominate the run time, so you should not see any speed-up from using parallel_for and ampblas_dgemm together, unless you have multiple GPUs to dispatch the work to (although in that case you need to do more manual management by setting the current accelerator_view per thread). Therefore, if I understand your question correctly, there is no further scaling you can achieve in C++ AMP BLAS.
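For reference, here is a minimal sketch of inspecting and selecting accelerators with the C++ AMP runtime. It uses only the documented accelerator class; whether ampblas_dgemm picks up the process-wide default in your version is an assumption you should verify against the ampblas sources.

```cpp
// Sketch only: requires an MSVC toolchain with C++ AMP support.
#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;

    // Enumerate all accelerators visible to the C++ AMP runtime
    // (physical GPUs plus software devices such as WARP and REF).
    for (const accelerator& acc : accelerator::get_all())
        std::wcout << acc.description << L" (" << acc.device_path << L")\n";

    // Pick one as the process-wide default by its device path.
    // This must be done before the default accelerator is first used.
    accelerator::set_default(accelerator::get_all()[0].device_path);
}
```

With multiple GPUs, the per-thread variant of this idea would be to hand each worker thread its own accelerator_view rather than relying on the single default.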
For your follow-up question about batching a set of calls to ampblas_dgemm, could you clarify what kind of "batching" you mean, perhaps with a use-case example?