[mvapich-discuss] VAPI_PORT_ERROR

Amit H Kumar AHKumar at odu.edu
Mon Nov 27 08:31:04 EST 2006



mvapich-discuss-bounces at cse.ohio-state.edu wrote on 11/22/2006 08:03:19 PM:

> Hello Axel,
>
> Thanks for reporting this. We haven't seen this error before. I suspect
> that it might be due to some InfiniBand cables being loose. Can you run
> Pallas (IMB) benchmarks on the cluster to make sure that all global
> communications are OK? Also, if possible, could you (or the sysadmin
> of your cluster) consider moving to the OpenFabrics Enterprise
> Distribution (OFED)? That is the latest and greatest InfiniBand
> software. It would be great if you could move to OFED.

Maybe a stupid question:
Can we have both the vendor-supplied IB software and OFED installed
without any conflict, as long as the corresponding applications are
compiled with the appropriate libraries?

Thank you,
Amit


>
> Thanks,
> Sayantan.
>
> Axel Rimanek wrote:
> > Hello,
> > I'm currently working on the InfiniBand cluster of TU-Muenchen with the
> > following software installed:
> >
> > SuSE Professional 9.1,
> > Kernel 2.6.5
> > gcc version 3.3.3
> > OSU MVAPICH VERSION 0.9.7-SingleRail
> >
> > When I execute my parallel program, I get the following error after some
> > time (intensive calculation for 90 min, which works fine):
> >
> > [40] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> > (VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
> > [24] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> > (VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
> > mpirun_rsh: Abort signaled from [40]
> > [8] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> > (VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
> > [56] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> > (VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
> > done.
> >
> > When I try to restart the program, I instantly get
> >
> > [24] Abort: [opt09:24] Got completion with error, code=VAPI_RETRY_EXC_ERR,
> > vendor code=85
> >  at line 2044 in file viacheck.c
> > mpirun_rsh: Abort signaled from [24]
> > done.
> >
> > I first ran the program on half of the available machines of the
> > cluster and have now repeated everything on the rest. I got the same
> > error messages, and now the program constantly returns the last set of
> > errors.
> >
> > I tried osu_bw afterwards on all machines and it works...
> >
> > Thx,
> > Axel
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
> --
> http://www.cse.ohio-state.edu/~surs


