[mvapich-discuss] Fatal error in MPI_Init

Sreeram Potluri potluri.2 at osu.edu
Thu Dec 19 08:27:42 EST 2013


Hi Amirul,

The port on the IB HCA is still in the Initializing state. This is most
likely because the opensmd (subnet manager) service is not running, or
needs to be restarted. Once that is fixed, the "State" field for the
connected port should show "Active".
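As a rough sketch (exact service name and init system vary by distribution and OFED version), you could restart the subnet manager on one node of the fabric and then re-check the port state:

```shell
# Restart the OpenSM subnet manager daemon (at least one node on the
# fabric, or a managed switch, must run an SM for ports to go Active)
sudo /etc/init.d/opensmd restart        # SysV-style init scripts
# or, on systemd-based systems:
# sudo systemctl restart opensmd

# Confirm a subnet manager is now visible on the fabric
sminfo

# Re-check the connected port; "State" should change from
# "Initializing" to "Active" and a non-zero "Base lid" should appear
ibstat mlx4_0 2
```

These are standard infiniband-diags/OFED commands; if `opensmd` is not installed, installing the `opensm` package (or enabling the SM on your switch) is the usual fix.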

Best
Sreeram Potluri


On Thu, Dec 19, 2013 at 3:58 AM, Mohamad Amirul Abdullah <
amirul.abdullah at mimos.my> wrote:

>  Hi,
>
> I have two machines with NVIDIA K20c GPUs, connected with Mellanox
> ConnectX-3 InfiniBand. I'm trying to use GPUDirect with CUDA-aware MPI, so
> I installed MVAPICH2 2.0b, but I seem to have problems running a simple
> MPI program with it. I have enabled debugging in MPI but don't know how
> to interpret the debug output. I hope you can help me.
>
> *Running the application*
> comp at gpu0:/home/comp/Desktop/test$ mpirun_rsh -np 2 -hostfile machinefile
> a.out
> Starting MPI..
> Starting MPI..
> [cli_0]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(446).......:
> MPID_Init(365)..............: channel initialization failed
> MPIDI_CH3_Init(314).........:
> MPIDI_CH3I_RDMA_init(170)...:
> rdma_setup_startup_ring(389): cannot open hca device
>
> [gpu0:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6.
> MPI process died?
> [gpu0:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
>
> [gpu0:mpispawn_0][child_handler] MPI process (rank: 0, pid: 27061) exited
> with status 1
> [gpu1:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5.
> MPI process died?
> [gpu1:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [gpu1:mpispawn_1][child_handler] MPI process (rank: 1, pid: 16237) exited
> with status 1
> [gpu1:mpispawn_1][report_error] connect() failed: Connection refused (111)
> comp at gpu1-System-Product-Name:/home/gpu1/Desktop/test$
>
> *MVAPICH Settings*
> comp at gpu0:/home/comp/Desktop/test$ mpiname -a
> MVAPICH2 2.0b Fri Nov  8 11:17:40 EST 2013 ch3:mrail
>
> Compilation
> CC: gcc    -g
> CXX: g++   -g
> F77: no -L/lib -L/lib   -g
> FC: no   -g
>
> Configuration
> --disable-fast --enable-g=dbg --enable-cuda --with-cuda=/usr/local/cuda
> --disable-fc --disable-
>
> *f77 Dependency in a.out*
> comp at gpu0:/home/comp/Desktop/test$ ldd a.out
>     linux-vdso.so.1 =>  (0x00007ffffb5ff000)
>     libmpich.so.10 => /usr/local/lib/libmpich.so.10 (0x00007fce31052000)
>     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce30c7f000)
>     libmpl.so.1 => /usr/local/lib/libmpl.so.1 (0x00007fce30a79000)
>     libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6
> (0x00007fce30868000)
>     libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6
> (0x00007fce30533000)
>     libcudart.so.5.5 => /usr/local/cuda/lib64/libcudart.so.5.5
> (0x00007fce302e5000)
>     libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fce2f681000)
>     libibmad.so.5 => /usr/lib/libibmad.so.5 (0x00007fce2f466000)
>     librdmacm.so.1 => /usr/lib/librdmacm.so.1 (0x00007fce2f252000)
>     libibumad.so.3 => /usr/lib/libibumad.so.3 (0x00007fce2f04b000)
>     libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0x00007fce2ee3c000)
>     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce2ec33000)
>     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce2e937000)
>     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> (0x00007fce2e71a000)
>     /lib64/ld-linux-x86-64.so.2 (0x00007fce3178d000)
>     libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1
> (0x00007fce2e4fb000)
>     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce2e2f7000)
>     libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> (0x00007fce2dff7000)
>     libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fce2dddf000)
>     libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6
> (0x00007fce2dbdc000)
>     libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6
> (0x00007fce2d9d5000)
>     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
> (0x00007fce2d7bf000)
>
> *Infiniband stats*
> comp at gpu0:/home/comp/Desktop/test$ ibstat
> CA 'mlx4_0'
>     CA type: MT4099
>     Number of ports: 2
>     Firmware version: 2.30.3110
>     Hardware version: 1
>     Node GUID: 0xf4521403007f6060
>     System image GUID: 0xf4521403007f6063
>     Port 1:
>         State: Down
>         Physical state: Disabled
>         Rate: 10
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x02514868
>         Port GUID: 0xf4521403007f6061
>         Link layer: InfiniBand
>     Port 2:
>         State: Initializing
>         Physical state: LinkUp
>         Rate: 56
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x02514868
>         Port GUID: 0xf4521403007f6062
>         Link layer: InfiniBand
>
> *Host OS info*
> comp at gpu0:/home/comp/Desktop/test$ uname -a
> Linux gpu0 3.7.10-030710-generic #201302271235 SMP Wed Feb 27 17:36:27 UTC
> 2013 x86_64 x86_64 x86_64 GNU/Linux
>
> *Code sample*
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>   int myrank;
>   printf("Starting MPI..\n");
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>   printf("check rank :%i\n", myrank);
>
>   if (myrank == 0) {
>     printf("hello1\n");
>   } else {
>     printf("hello2\n");
>   }
>
>   MPI_Finalize();
>   return 0;
> }
>
>
> Regards,
> -Amirul-
>
>    ------------------------------------------------------------------
> -
> -
> DISCLAIMER:
>
> This e-mail (including any attachments) is for the addressee(s)
> only and may contain confidential information. If you are not the
> intended recipient, please note that any dealing, review,
> distribution, printing, copying or use of this e-mail is strictly
> prohibited. If you have received this email in error, please notify
> the sender immediately and delete the original message.
> MIMOS Berhad is a research and development institution under
> the purview of the Malaysian Ministry of Science, Technology and
> Innovation. Opinions, conclusions and other information in this e-
> mail that do not relate to the official business of MIMOS Berhad
> and/or its subsidiaries shall be understood as neither given nor
> endorsed by MIMOS Berhad and/or its subsidiaries and neither
> MIMOS Berhad nor its subsidiaries accepts responsibility for the
> same. All liability arising from or in connection with computer
> viruses and/or corrupted e-mails is excluded to the fullest extent
> permitted by law.
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

