[mvapich-discuss] problems with MV2_USE_CUDA=1
Igor Podladtchikov
igor.podladtchikov at spectraseis.com
Thu Jun 28 12:54:16 EDT 2012
Hi,
I downloaded the latest MVAPICH2 release about two weeks ago and I'm having trouble with the CUDA support.
I installed it on a stand-alone node with four Tesla C1060s and tried running the benchmarks, which error out.
$ is the command and > the shell output:
$ mpirun_rsh -np 2 guppy guppy MV2_USE_CUDA=1 ./osu_bw D D
> [guppy:mpispawn_0][child_handler] MPI process (rank: 0, pid: 17710) exited with status 1
I know C1060s don't support UVA, but I expected MVAPICH2 to fall back to "regular" host-staged communication if the GPU doesn't support it. The final goal is to install MVAPICH2 on our cluster, which uses M2070s in production, but I need a proof of concept first.
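(As a side note, here is a minimal sketch of how one could confirm what the cards report, assuming the CUDA 4.x runtime API and compiling with nvcc; I'd expect unifiedAddressing to read 0 on these compute-capability-1.3 devices:)

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int dev = 0; dev < n; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* unifiedAddressing is 1 only if the device shares a unified
           address space with the host (UVA) */
        printf("device %d (%s): unifiedAddressing = %d\n",
               dev, prop.name, prop.unifiedAddressing);
    }
    return 0;
}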
I isolated the problem with dummy code:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv) {
    // init mpi
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    // say hi
    printf("%s: rank %d size %d\n", name, rank, size);
    // finalize
    MPI_Finalize();
    return 0;
}
I compiled the code like this:
mpicc mvapich_test.c -o mvtest
And ran it like this:
$ mpirun_rsh -np 2 guppy guppy ./mvtest
> guppy: rank 1 size 2
> guppy: rank 0 size 2
So far so good, right?
Then I add MV2_USE_CUDA=1 to my launch command:
$ mpirun_rsh -np 2 guppy guppy MV2_USE_CUDA=1 ./mvtest
> [cli_0]: [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
> aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
> [guppy:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
> [guppy:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [guppy:mpispawn_0][child_handler] MPI process (rank: 1, pid: 17755) exited with status 1
> [guppy:mpispawn_0][child_handler] MPI process (rank: 0, pid: 17754) exited with status 1
So I'm not doing anything with the GPUs yet, but if I understand correctly, your MPI_Init implementation attempts to create a context on the GPU and fails for some reason?
All my other CUDA apps run fine on this node, including MPI-based GPU solvers. I can even run them with your mpirun.
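(To rule out plain context creation as the culprit, here is a minimal standalone sketch of what I assume a CUDA-aware MPI_Init would do first; CUDA runtime API assumed, compile with nvcc:)

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaError_t err = cudaSetDevice(0);  /* pick device 0 */
    if (err == cudaSuccess)
        err = cudaFree(0);               /* forces lazy context creation */
    printf("context creation on device 0: %s\n",
           err == cudaSuccess ? "ok" : cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}

This succeeds on the node, which is why I suspect the failure is specific to what MPI_Init does beyond creating a context.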
The only full example I was able to find on how to use your MV2_USE_CUDA=1 was here:
http://cudamusing.blogspot.com/
and his stuff just works, so that doesn't help.
I really hope this is something simple and I'm just missing something obvious. I read your user guide, including the FAQ and Troubleshooting sections, and tried this and that for about a week; I hope you can give me some clues.
Here's some system info:
$ cat /etc/*release*
> CentOS release 5.5 (Final)
$ uname -a
> Linux guppy 2.6.18-194.26.1.el5 #1 SMP Tue Nov 9 12:54:20 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
$ mpiname -a
> MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
>
> Compilation
> CC: gcc -DNDEBUG -DNVALGRIND -O2
> CXX: c++ -DNDEBUG -DNVALGRIND -O2
> F77:
> FC:
>
> Configuration
> --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64 --enable-shared --disable-f77 --disable-fc --without-hwloc
$ cudaquery
> Using cuda version 4020 (Driver API v2)
> Using cuda runtime version 4020 (Runtime API v2)
> Found 4 devices.
> Tesla C1060 (id 0 cc 1.3) : 4294770688 bytes (4.000 GB)
> Tesla C1060 (id 1 cc 1.3) : 4294770688 bytes (4.000 GB)
> Tesla C1060 (id 2 cc 1.3) : 4294770688 bytes (4.000 GB)
> Tesla C1060 (id 3 cc 1.3) : 4294770688 bytes (4.000 GB)
$ ll /usr/lib64/libcuda*
> lrwxrwxrwx 1 root root 12 Jun 22 14:42 /usr/lib64/libcuda.so -> libcuda.so.1
> lrwxrwxrwx 1 root root 17 Jun 22 14:42 /usr/lib64/libcuda.so.1 -> libcuda.so.295.41
> -rwxr-xr-x 1 root root 8612596 Jun 22 14:42 /usr/lib64/libcuda.so.295.41
$ ll /usr/local/cuda/lib64
> lrwxrwxrwx 1 root root 14 Jun 26 17:13 libcublas.so -> libcublas.so.4
> lrwxrwxrwx 1 root root 18 Jun 26 17:13 libcublas.so.4 -> libcublas.so.4.2.9
> -rwxr-xr-x 1 root root 109211936 Jun 26 17:13 libcublas.so.4.2.9
> lrwxrwxrwx 1 root root 14 Jun 26 17:13 libcudart.so -> libcudart.so.4
> lrwxrwxrwx 1 root root 18 Jun 26 17:13 libcudart.so.4 -> libcudart.so.4.2.9
> -rwxr-xr-x 1 root root 369600 Jun 26 17:13 libcudart.so.4.2.9
> lrwxrwxrwx 1 root root 13 Jun 26 17:13 libcufft.so -> libcufft.so.4
> lrwxrwxrwx 1 root root 17 Jun 26 17:13 libcufft.so.4 -> libcufft.so.4.2.9
> -rwxr-xr-x 1 root root 31161488 Jun 26 17:13 libcufft.so.4.2.9
> lrwxrwxrwx 1 root root 13 Jun 26 17:13 libcuinj.so -> libcuinj.so.4
> lrwxrwxrwx 1 root root 17 Jun 26 17:13 libcuinj.so.4 -> libcuinj.so.4.2.9
> -rwxr-xr-x 1 root root 150480 Jun 26 17:13 libcuinj.so.4.2.9
> lrwxrwxrwx 1 root root 14 Jun 26 17:13 libcurand.so -> libcurand.so.4
> lrwxrwxrwx 1 root root 18 Jun 26 17:13 libcurand.so.4 -> libcurand.so.4.2.9
> -rwxr-xr-x 1 root root 27315384 Jun 26 17:13 libcurand.so.4.2.9
> lrwxrwxrwx 1 root root 16 Jun 26 17:13 libcusparse.so -> libcusparse.so.4
> lrwxrwxrwx 1 root root 20 Jun 26 17:13 libcusparse.so.4 -> libcusparse.so.4.2.9
> -rwxr-xr-x 1 root root 195959968 Jun 26 17:13 libcusparse.so.4.2.9
> lrwxrwxrwx 1 root root 11 Jun 26 17:13 libnpp.so -> libnpp.so.4
> lrwxrwxrwx 1 root root 15 Jun 26 17:13 libnpp.so.4 -> libnpp.so.4.2.9
> -rwxr-xr-x 1 root root 55095288 Jun 26 17:13 libnpp.so.4.2.9
$ nvidia-smi
Thu Jun 28 10:37:37 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41 Driver Version: 295.41 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C1060 | 0000:02:00.0 Off | N/A N/A |
| 35% 69 C P8 N/A / N/A | 0% 3MB / 4095MB | 0% E. Thread |
|-------------------------------+----------------------+----------------------|
| 1. Tesla C1060 | 0000:03:00.0 Off | N/A N/A |
| 35% 53 C P8 N/A / N/A | 0% 3MB / 4095MB | 0% E. Thread |
|-------------------------------+----------------------+----------------------|
| 2. Tesla C1060 | 0000:83:00.0 Off | N/A N/A |
| 35% 61 C P8 N/A / N/A | 0% 3MB / 4095MB | 0% E. Thread |
|-------------------------------+----------------------+----------------------|
| 3. Tesla C1060 | 0000:84:00.0 Off | N/A N/A |
| 35% 63 C P8 N/A / N/A | 0% 3MB / 4095MB | 0% E. Thread |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
Looking forward to your reply!
Cheers
Igor Podladtchikov
Spectraseis
1899 Wynkoop St, Suite 350
Denver, CO 80202
Tel. +1 303 658 9172 (direct)
Tel. +1 303 330 8296 (cell)
www.spectraseis.com