[mvapich-discuss] Fatal error in MPI_Init
Mohamad Amirul Abdullah
amirul.abdullah at mimos.my
Thu Dec 19 03:58:34 EST 2013
Hi,
I have two machines, each with an NVIDIA K20c, connected with Mellanox ConnectX-3 InfiniBand. I'm trying to use GPUDirect with CUDA-aware MPI, so I installed MVAPICH2 2.0b, but I can't seem to run even a simple MPI program with it. I have enabled debugging in MPI but don't know how to interpret the debug output. I hope you can help me.
Running the application
comp@gpu0:/home/comp/Desktop/test$ mpirun_rsh -np 2 -hostfile machinefile a.out
Starting MPI..
Starting MPI..
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(446).......:
MPID_Init(365)..............: channel initialization failed
MPIDI_CH3_Init(314).........:
MPIDI_CH3I_RDMA_init(170)...:
rdma_setup_startup_ring(389): cannot open hca device
[gpu0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error
[gpu0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 27061) exited with status 1
[gpu1:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu1:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu1:mpispawn_1][child_handler] MPI process (rank: 1, pid: 16237) exited with status 1
[gpu1:mpispawn_1][report_error] connect() failed: Connection refused (111)
comp@gpu1-System-Product-Name:/home/gpu1/Desktop/test$
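The failing frame in the stack is rdma_setup_startup_ring(389): "cannot open hca device", i.e. MVAPICH2 could not open the InfiniBand HCA from userspace. A rough checklist for narrowing that down, using the stock OFED/verbs tools (a sketch; it assumes infiniband-diags and libibverbs utilities are installed, which is not shown in the session above):

```shell
# 1. The verbs character devices must exist and be readable by the MPI user.
ls -l /dev/infiniband/

# 2. ibv_devinfo should list mlx4_0; the port in use should show
#    "state: PORT_ACTIVE". "PORT_INIT" means the link is physically up
#    but no subnet manager has configured it yet.
ibv_devinfo

# 3. Check whether a subnet manager is answering on the fabric.
sminfo
```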
MVAPICH Settings
comp@gpu0:/home/comp/Desktop/test$ mpiname -a
MVAPICH2 2.0b Fri Nov 8 11:17:40 EST 2013 ch3:mrail
Compilation
CC: gcc -g
CXX: g++ -g
F77: no -L/lib -L/lib -g
FC: no -g
Configuration
--disable-fast --enable-g=dbg --enable-cuda --with-cuda=/usr/local/cuda --disable-fc --disable-f77
Dependencies of a.out
comp@gpu0:/home/comp/Desktop/test$ ldd a.out
linux-vdso.so.1 => (0x00007ffffb5ff000)
libmpich.so.10 => /usr/local/lib/libmpich.so.10 (0x00007fce31052000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce30c7f000)
libmpl.so.1 => /usr/local/lib/libmpl.so.1 (0x00007fce30a79000)
libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007fce30868000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007fce30533000)
libcudart.so.5.5 => /usr/local/cuda/lib64/libcudart.so.5.5 (0x00007fce302e5000)
libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fce2f681000)
libibmad.so.5 => /usr/lib/libibmad.so.5 (0x00007fce2f466000)
librdmacm.so.1 => /usr/lib/librdmacm.so.1 (0x00007fce2f252000)
libibumad.so.3 => /usr/lib/libibumad.so.3 (0x00007fce2f04b000)
libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0x00007fce2ee3c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce2ec33000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce2e937000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce2e71a000)
/lib64/ld-linux-x86-64.so.2 (0x00007fce3178d000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007fce2e4fb000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce2e2f7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fce2dff7000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fce2dddf000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007fce2dbdc000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007fce2d9d5000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fce2d7bf000)
InfiniBand stats
comp@gpu0:/home/comp/Desktop/test$ ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.30.3110
Hardware version: 1
Node GUID: 0xf4521403007f6060
System image GUID: 0xf4521403007f6063
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf4521403007f6061
Link layer: InfiniBand
Port 2:
State: Initializing
Physical state: LinkUp
Rate: 56
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf4521403007f6062
Link layer: InfiniBand
Host OS info
comp@gpu0:/home/comp/Desktop/test$ uname -a
Linux gpu0 3.7.10-030710-generic #201302271235 SMP Wed Feb 27 17:36:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Code sample
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int myrank;
    printf("Starting MPI..\n");
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("check rank :%i\n", myrank);
    if (myrank == 0) {
        printf("hello1\n");
    } else {
        printf("hello2\n");
    }
    MPI_Finalize();
    return 0;
}
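For reference, the build and launch would have looked roughly like this with the MVAPICH2 wrappers (a sketch; the source filename is an assumption, and the launch line matches the session above):

```shell
# Build with the MVAPICH2 compiler wrapper so the matching libmpich is linked
# ("test.c" is an assumed filename, not taken from the session).
mpicc -g test.c -o a.out

# Launch one rank per host listed in the machinefile, as in the failing run.
mpirun_rsh -np 2 -hostfile machinefile ./a.out
```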
Regards,
-Amirul-