[mvapich-discuss] Problem initializing IB device (gen2)

Roland Fehrenbacher Roland.Fehrenbacher at transtec.de
Fri Jan 26 04:28:38 EST 2007


>>>>> "Sayantan" == Sayantan Sur <surs at cse.ohio-state.edu> writes:

Hi Sayantan,

    >> Anybody got an idea how to debug the error below?
    >> 

    Sayantan> Could you try this patch out to see if there is any
    Sayantan> additional error message?

That actually just showed the message "success" ;-) After digging
deeper and putting debug statements into the IB libraries, it finally
turned out that there was an old installation of OFED under
/usr/local, from which the headers were picked up when compiling
mvapich. So the compile-time headers did not match the run-time
libraries, which was bound to lead to erratic behaviour. Now
everything is working fine. Thanks for your help.
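
In case someone else runs into the same thing: a quick way to spot
such a stale copy is to look for infiniband/verbs.h under every
include root the compiler might search. The sketch below does just
that. The function names and the directory list are assumptions made
up for illustration, not something shipped with mvapich or OFED, so
adjust them to your system.

```shell
#!/bin/sh
# Print every copy of infiniband/verbs.h found under the given include roots.
# (Hypothetical helper; the roots to pass in depend on your installation.)
find_verbs_headers() {
    for root in "$@"; do
        if [ -f "$root/infiniband/verbs.h" ]; then
            printf '%s\n' "$root/infiniband/verbs.h"
        fi
    done
}

# Count the copies. More than one hit means the compiler may be picking up
# headers that do not belong to the libibverbs used at run time.
count_verbs_headers() {
    find_verbs_headers "$@" | wc -l | tr -d ' '
}

# Example call; these two directories are only a guess at common locations.
find_verbs_headers /usr/include /usr/local/include
```

If this prints more than one path, check which one comes first in the
include search order of the mvapich build. Running
"ldd ./cpi | grep libibverbs" alongside shows which shared library the
binary resolves at run time, so the two can be compared.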

Roland

    Sayantan> Thanks, Sayantan.

    Sayantan> Index: viapriv.c
    Sayantan> ===================================================================
    Sayantan> --- viapriv.c (revision 879)
    Sayantan> +++ viapriv.c (working copy)
    Sayantan> @@ -624,6 +624,7 @@
    Sayantan>
    Sayantan>      if(ibv_post_recv(c->vi, &(v->desc.u.rr), &bad_wr)) {
    Sayantan> +        perror("ibv_post_recv");
    Sayantan>          error_abort_all(IBV_RETURN_ERR, "Error posting recv\n");
    Sayantan>      }

    >> Thanks,
    >> 
    >> Roland
    >> 
    >> 
    >>>>>>> "Roland" == Roland Fehrenbacher
    >>>>>>> <Roland.Fehrenbacher at transtec.de> writes:
    >>>>>>> 
    >> Hi,
    >> 
    >> Meanwhile I got past the previous error. I noticed that I had
    >> linked in the static version of libibverbs by mistake (strange
    >> that it makes a difference). When I used the dynamic version,
    >> the error changed to
    >> 
    >> $ mpiexec -comm mpich-ib ./cpi
    >> [2] Abort: Error posting recv at line 628 in file viapriv.c
    >> [3] Abort: Error posting recv at line 628 in file viapriv.c
    >> [0] Abort: Error posting recv
    >> mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 192.168.42.105, rank 2, killing all.
    >>  at line 628 in file viapriv.c
    >> [1] Abort: Error posting recv at line 628 in file viapriv.c
    >> mpiexec: Warning: tasks 0-3 exited with status 253.
    >> 
    >> Again, the same happens when using mpirun_rsh.
    >> 
    >> Roland
    >> 
    Roland> Hi, I have problems getting my IB adapters initialized
    Roland> when running an mvapich binary:
    >>
    Roland> $ mpiexec -comm mpich-ib ./cpi
    Roland> [0] Abort: Error getting HCA context at line 260 in file viainit.c
    Roland> [1] Abort: Error getting HCA context at line 260 in file viainit.c
    Roland> [2] Abort: Error getting HCA context at line 260 in file viainit.c
    Roland> [3] Abort: Error getting HCA context at line 260 in file viainit.c
    Roland> mpiexec: Warning: tasks 0-3 exited with status 255.
    >>
    Roland> The same happens when using mpirun_rsh.
    >>
    Roland> I'm using mvapich 0.9.8 compiled against OFED 1.1.
    >>
    Roland> Basic IB connectivity works as shown by a ping over the IB
    Roland> network between the two test nodes I use. I can also run
    Roland> ib_rdma_bw, ib_rdma_lat, etc. programs from the OFED
    Roland> release without any problem.
    >>
    Roland> Loaded IB modules are:
    >>
    Roland> beo-104:~# lsmod | grep ib_
    Roland> ib_ipoib               49944  0
    Roland> ib_sa                  17292  1 ib_ipoib
    Roland> ib_uverbs              41520  0
    Roland> ib_umad                18480  4
    Roland> ib_mthca              120880  0
    Roland> ib_mad                 39588  3 ib_sa,ib_umad,ib_mthca
    Roland> ib_core                56192  6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad
    >>
    Roland> and the following devices exist:
    >>
    Roland> beo-104:~# ls -l /dev/infiniband/
    Roland> total 0
    Roland> crw-rw---- 1 root root 231,  64 Jan 22 14:11 issm0
    Roland> crw-rw---- 1 root root 231,   0 Jan 22 14:11 umad0
    Roland> crw-rw-rw- 1 root root 231, 192 Jan 22 14:11 uverbs0
    >>
    Roland> I have read section "7.2.2 Error getting HCA Context" of
    Roland> the Mvapich User Guide, but this didn't get me any
    Roland> further.
    >>
    Roland> What is going wrong?
    >>
    Roland> Thanks,
    >>
    Roland> Roland
    >>
    Roland> _______________________________________________
    Roland> mvapich-discuss mailing list
    Roland> mvapich-discuss at cse.ohio-state.edu
    Roland> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
    >> 
    >> 
    >> 

    Sayantan> -- http://www.cse.ohio-state.edu/~surs



