[mvapich-discuss] Fatal error in MPI_Init

Mohamad Amirul Abdullah amirul.abdullah at mimos.my
Thu Dec 19 03:58:34 EST 2013


Hi,

I have two machines, each with an NVIDIA K20c GPU, connected via Mellanox ConnectX-3 InfiniBand. I am trying to use GPUDirect with CUDA-aware MPI, so I installed MVAPICH2 2.0b, but I cannot get even a simple MPI program to run with it. I have enabled debugging in MPI but do not know how to interpret the debug output. I hope you can help me.
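For context, the eventual goal is a CUDA-aware launch. Based on the MVAPICH2 user guide, I expect that once plain MPI works the run command will look roughly like the following (MV2_USE_CUDA is the documented run-time switch; the hostfile and binary are the same ones used below):

comp@gpu0:/home/comp/Desktop/test$ mpirun_rsh -np 2 -hostfile machinefile MV2_USE_CUDA=1 a.out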

Running the application
comp@gpu0:/home/comp/Desktop/test$ mpirun_rsh -np 2 -hostfile machinefile a.out
Starting MPI..
Starting MPI..
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(446).......:
MPID_Init(365)..............: channel initialization failed
MPIDI_CH3_Init(314).........:
MPIDI_CH3I_RDMA_init(170)...:
rdma_setup_startup_ring(389): cannot open hca device

[gpu0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[gpu0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[gpu0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 27061) exited with status 1
[gpu1:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[gpu1:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[gpu1:mpispawn_1][child_handler] MPI process (rank: 1, pid: 16237) exited with status 1
[gpu1:mpispawn_1][report_error] connect() failed: Connection refused (111)
comp@gpu1-System-Product-Name:/home/gpu1/Desktop/test$

MVAPICH Settings
comp@gpu0:/home/comp/Desktop/test$ mpiname -a
MVAPICH2 2.0b Fri Nov  8 11:17:40 EST 2013 ch3:mrail

Compilation
CC: gcc    -g
CXX: g++   -g
F77: no -L/lib -L/lib   -g
FC: no   -g

Configuration
--disable-fast --enable-g=dbg --enable-cuda --with-cuda=/usr/local/cuda --disable-fc --disable-f77

Dependency in a.out
comp@gpu0:/home/comp/Desktop/test$ ldd a.out
    linux-vdso.so.1 =>  (0x00007ffffb5ff000)
    libmpich.so.10 => /usr/local/lib/libmpich.so.10 (0x00007fce31052000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce30c7f000)
    libmpl.so.1 => /usr/local/lib/libmpl.so.1 (0x00007fce30a79000)
    libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007fce30868000)
    libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007fce30533000)
    libcudart.so.5.5 => /usr/local/cuda/lib64/libcudart.so.5.5 (0x00007fce302e5000)
    libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fce2f681000)
    libibmad.so.5 => /usr/lib/libibmad.so.5 (0x00007fce2f466000)
    librdmacm.so.1 => /usr/lib/librdmacm.so.1 (0x00007fce2f252000)
    libibumad.so.3 => /usr/lib/libibumad.so.3 (0x00007fce2f04b000)
    libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0x00007fce2ee3c000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce2ec33000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce2e937000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce2e71a000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fce3178d000)
    libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007fce2e4fb000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce2e2f7000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fce2dff7000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fce2dddf000)
    libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007fce2dbdc000)
    libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007fce2d9d5000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fce2d7bf000)

Infiniband stats
comp@gpu0:/home/comp/Desktop/test$ ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.30.3110
    Hardware version: 1
    Node GUID: 0xf4521403007f6060
    System image GUID: 0xf4521403007f6063
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02514868
        Port GUID: 0xf4521403007f6061
        Link layer: InfiniBand
    Port 2:
        State: Initializing
        Physical state: LinkUp
        Rate: 56
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02514868
        Port GUID: 0xf4521403007f6062
        Link layer: InfiniBand
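If it helps, I can also post the verbs-level view of the HCA from both machines, e.g. with the ibv_devinfo tool that ships with libibverbs:

comp@gpu0:/home/comp/Desktop/test$ ibv_devinfo -d mlx4_0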

Host OS info
comp@gpu0:/home/comp/Desktop/test$ uname -a
Linux gpu0 3.7.10-030710-generic #201302271235 SMP Wed Feb 27 17:36:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Code sample
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int myrank;

  printf("Starting MPI..\n");
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  printf("check rank :%i\n", myrank);

  if (myrank == 0) {
    printf("hello1\n");
  } else {
    printf("hello2\n");
  }

  MPI_Finalize();
  return 0;
}
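For completeness, a.out was built with the mpicc wrapper from this MVAPICH2 install, roughly as follows (the source file name is just my local one):

comp@gpu0:/home/comp/Desktop/test$ mpicc -g -o a.out test.c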


Regards,
-Amirul-


