[mvapich-discuss] VAPI_PORT_ERROR

Axel Rimanek Axel at Rimanek.de
Wed Nov 22 15:09:27 EST 2006


Hello,
I'm currently working on the InfiniBand cluster at TU-Muenchen, which has the
following software installed:

SuSE Professional 9.1
Kernel 2.6.5
gcc version 3.3.3
OSU MVAPICH VERSION 0.9.7-SingleRail 

When I run my parallel program, it works fine for about 90 minutes of
intensive computation and then aborts with the following errors:

[40] Abort: Got an asynchronous event: VAPI_PORT_ERROR
(VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
[24] Abort: Got an asynchronous event: VAPI_PORT_ERROR
(VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
mpirun_rsh: Abort signaled from [40]
[8] Abort: Got an asynchronous event: VAPI_PORT_ERROR
(VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
[56] Abort: Got an asynchronous event: VAPI_PORT_ERROR
(VAPI_EV_SYNDROME_NONE) at line 199 in file viainit.c
done.

When I try to restart the program, I immediately get:

[24] Abort: [opt09:24] Got completion with error, code=VAPI_RETRY_EXC_ERR,
vendor code=85
 at line 2044 in file viacheck.c
mpirun_rsh: Abort signaled from [24]
done.

I first ran the program on half of the cluster's machines and have now
repeated everything on the remaining half. I got the same error messages
there, and now every run fails with the restart error shown above
(VAPI_RETRY_EXC_ERR).

Afterwards I ran osu_bw on all machines, and it works fine.

Thx,
Axel
