[mvapich-discuss] Problem initializing IB device (gen2)

Sayantan Sur surs at cse.ohio-state.edu
Thu Jan 25 00:57:11 EST 2007


Hi Roland,

> Anybody got an idea how to debug the error below?
>   

Could you try the patch below and see whether it produces any additional
error message?

Thanks,
Sayantan.

Index: viapriv.c
===================================================================
--- viapriv.c   (revision 879)
+++ viapriv.c   (working copy)
@@ -624,6 +624,7 @@


     if(ibv_post_recv(c->vi, &(v->desc.u.rr), &bad_wr)) {
+        perror("ibv_post_recv");
         error_abort_all(IBV_RETURN_ERR,
                 "Error posting recv\n");
     }

> Thanks,
>
> Roland
>
>   
>>>>>> "Roland" == Roland Fehrenbacher <Roland.Fehrenbacher at transtec.de> writes:
>>>>>>             
> Hi,
>
> Meanwhile I got past the previous error. I noticed that I had linked
> in the static version of libibverbs by mistake (strange that it makes
> a difference). When I used the dynamic version, the error changed to:
>
> $ mpiexec -comm mpich-ib ./cpi
> [2] Abort: Error posting recv
>  at line 628 in file viapriv.c
> [3] Abort: Error posting recv
>  at line 628 in file viapriv.c
> [0] Abort: Error posting recv
> mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 192.168.42.105, rank 2, killing all.
>  at line 628 in file viapriv.c
> [1] Abort: Error posting recv
>  at line 628 in file viapriv.c
> mpiexec: Warning: tasks 0-3 exited with status 253.
>
> Again, the same happens when using mpirun_rsh.
>
> Roland
>
>     Roland> Hi, I have problems getting my IB adapters initialized
>     Roland> when running an mvapich binary:
>
>     Roland> $ mpiexec -comm mpich-ib ./cpi
>     Roland> [0] Abort: Error getting HCA context
>     Roland>  at line 260 in file viainit.c
>     Roland> [1] Abort: Error getting HCA context
>     Roland>  at line 260 in file viainit.c
>     Roland> [2] Abort: Error getting HCA context
>     Roland>  at line 260 in file viainit.c
>     Roland> [3] Abort: Error getting HCA context
>     Roland>  at line 260 in file viainit.c
>     Roland> mpiexec: Warning: tasks 0-3 exited with status 255.
>
>     Roland> The same happens when using mpirun_rsh.
>
>     Roland> I'm using mvapich 0.9.8 compiled against OFED 1.1.
>
>     Roland> Basic IB connectivity works as shown by a ping over the IB
>     Roland> network between the two test nodes I use. I can also run
>     Roland> ib_rdma_bw, ib_rdma_lat, etc. programs from the OFED
>     Roland> release without any problem.
>
>     Roland> Loaded IB modules are:
>
>     Roland> beo-104:~# lsmod | grep ib_
>     Roland> ib_ipoib               49944  0
>     Roland> ib_sa                  17292  1 ib_ipoib
>     Roland> ib_uverbs              41520  0
>     Roland> ib_umad                18480  4
>     Roland> ib_mthca              120880  0
>     Roland> ib_mad                 39588  3 ib_sa,ib_umad,ib_mthca
>     Roland> ib_core                56192  6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad
>
>     Roland> And the following devices exist:
>
>     Roland> beo-104:~# ls -l /dev/infiniband/
>     Roland> total 0
>     Roland> crw-rw----  1 root root 231,  64 Jan 22 14:11 issm0
>     Roland> crw-rw----  1 root root 231,   0 Jan 22 14:11 umad0
>     Roland> crw-rw-rw-  1 root root 231, 192 Jan 22 14:11 uverbs0
>
>     Roland> I have read section "7.2.2 Error getting HCA Context" of
>     Roland> the MVAPICH User Guide, but this didn't get me any
>     Roland> further.
>
>     Roland> What is going wrong?
>
>     Roland> Thanks,
>
>     Roland> Roland
>
>     Roland> _______________________________________________
>     Roland> mvapich-discuss mailing list
>     Roland> mvapich-discuss at cse.ohio-state.edu
>     Roland> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

-- 
http://www.cse.ohio-state.edu/~surs


