CNTK very slow with Tesla K40

Mar 27, 2015 at 9:08 PM
Dongyu, I am configuring a new machine for use with CNTK. It has dual Xeon processors, a Tesla K40, and an NVIDIA Quadro K620 for graphics. The OS is Windows 7 Professional 64-bit SP1. I installed CUDA 7.0 and ACML, downloaded the CNTK sources as of yesterday,
and built cn.exe as a 64-bit release build using MSVC 2013. I ran the speech demo, which I have previously run on a different machine, under three conditions: CPU computation, K40 computation, and Quadro computation. Times on the Quadro were a little faster than with
the dual Xeon CPUs, but the K40 was more than 5 times slower than either the CPU or the Quadro. GPU-Z shows that the K40 is fully loaded when that card is in use; the Quadro, on the other hand, was only lightly loaded. I also tested the K40 with a MATLAB GPU benchmark,
and on that test it is indeed much faster than the CPU or the Quadro, especially for large matrices. So there is no problem with the hardware, the CUDA drivers, or the toolkit. I don't understand why CNTK is so slow with the K40. There must be an issue with
the software. Can you please look into this? Richard
Apr 1, 2015 at 1:58 AM
Further testing shows that the problem is caused by the dual-Xeon system having the GPU in a PCIe slot connected to CPU2. Data sent to the GPU sometimes has to pass from memory attached to one processor to the other processor over the QPI bus before it finally reaches the GPU. The bottleneck seems to be the QPI bus. This may be an inherent issue with dual-Xeon configurations.
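For anyone who wants to check for the same bottleneck, here is a minimal Python sketch of a timing harness. It is not part of CNTK. The names `measure_throughput` and `slowdown_factor` are hypothetical, and the `bytes(src)` copy is only a host-memory stand-in: the assumption is that you would replace it with a real host-to-device copy from whatever GPU library you have, and run the script once with the process pinned to each CPU socket.

```python
import time


def measure_throughput(copy_fn, nbytes, repeats=5):
    """Return the best observed throughput in GB/s.

    copy_fn is any zero-argument callable that moves nbytes per call.
    (Hypothetical helper; swap in a real host-to-device copy.)
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        copy_fn()
        best = min(best, time.perf_counter() - start)
    if best <= 0:
        # The timer could not resolve the call; report an unbounded rate.
        return float("inf")
    return nbytes / best / 1e9


def slowdown_factor(near_gbps, far_gbps):
    """How many times slower the far path is than the near path."""
    return near_gbps / far_gbps


if __name__ == "__main__":
    nbytes = 64 * 1024 * 1024
    src = bytearray(nbytes)
    # Stand-in copy within host memory; NOT a GPU transfer.
    rate = measure_throughput(lambda: bytes(src), nbytes)
    print(f"{rate:.2f} GB/s")
```

If the rate measured from the socket that owns the GPU's PCIe slot is clearly higher than the rate from the other socket, that would be consistent with the cross-socket path being the limiting factor.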
Marked as answer by rbhodges on 3/31/2015 at 5:58 PM
Apr 1, 2015 at 9:25 AM
Yes. Hardware setup is very important.