[mvapich-discuss] viacheck.c error?

Aquarijen aquarijen at gmail.com
Thu Mar 15 16:31:59 EDT 2007


Hi Abhinav, Matt and Everyone,

As was suggested as a starting point, I upgraded to OFED 1.1 and made
sure that the IBV_EVENT_CLIENT_REREGISTER event was in verbs.h.
I also installed mvapich 0.9.9, but had difficulty running any
programs, including cpi, so I went back to 0.9.7, which is included
with OFED 1.1.  I am getting errors very similar to the ones I was
getting before the OFED upgrade.  I can run cpi, cpip, hello++ and
simpleio with no problems, and I get the expected output.

None of the osu benchmarks work for me, however.
All of them abort with the following error message:

[0] Abort: [b09n010.oic.ornl.gov:0] Got completion with error, code=4
 at line 2365 in file viacheck.c
done.


Has anyone seen this before?  I'm puzzled.  If it is a connectivity
issue, how can I tell, and why would cpi and the other very simple
programs run?  Non-IB batch programs using mpich run successfully.  We
use ssh for everything and have disabled rsh.  The average user cannot
ssh to a compute node from the head node, but once on a compute node,
he/she can ssh to other compute nodes without supplying a password.
For mpich, we use mpiexec and this works well for us, but we'd really
like to use the InfiniBand.  :(  Please point me in the right
direction if this is not an mvapich-related problem.  I administer this
cluster, so asking my sysadmin is not an option. ;)

Thanks so much for any assistance you can give!

Jennifer



On 2/22/07, Abhinav Vishnu <vishnu at cse.ohio-state.edu> wrote:
> Hi jen,
>
> > I'm not 100% sure of what information will be most helpful, but the
> > error output for osu_bw (as an example) is:
> > --------------------------------------------------
> > Connection closed by 172.16.4.36
> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1
> > at line 2355 in file viacheck.c
> > done.
> > ---------------------------------------------------
> >
> I think the problem is occurring because your ssh connection
> got terminated during the execution of the application. As a result,
> any process that tries to communicate with a process on the node
> that died will get a "completion with error" during data
> transmission. IMHO, your sysadmin should be able to help you find
> out why the ssh connection is being terminated.
>
> >
> > The BUILD_ID file of my ofed is:
> > ---------------------------------------------------------------------
> > OFED-1.0
> >
> > openib-1.0 (REV=8031)
> > # User space
> > https://openib.org/svn/gen2/branches/1.0/src/userspace
> > # Kernel space
> > https://openib.org/svn/gen2/branches/1.0/ofed/tags/1.0/linux-kernel
> > Git:
> > ref: refs/heads/for-2.6.17
> > commit 959eb39297e8c82f61fbfc283ad4ff11c883bf1e
> >
> > # MPI
> > mpi_osu-0.9.7-mlx2.1.0.tgz
> > openmpi-1.1b1-1.src.rpm
> > mpitests-1.0-0.src.rpm
> > ---------------------------------------------------------------------------
> >
> >
> > So could the problem be that it is OFED 1.0?
> >
> >
> > -------------------------------------------------------------------------------
> >
> > enum ibv_event_type {
> >        IBV_EVENT_CQ_ERR,
> >        IBV_EVENT_QP_FATAL,
> >        IBV_EVENT_QP_REQ_ERR,
> >        IBV_EVENT_QP_ACCESS_ERR,
> >        IBV_EVENT_COMM_EST,
> >        IBV_EVENT_SQ_DRAINED,
> >        IBV_EVENT_PATH_MIG,
> >        IBV_EVENT_PATH_MIG_ERR,
> >        IBV_EVENT_DEVICE_FATAL,
> >        IBV_EVENT_PORT_ACTIVE,
> >        IBV_EVENT_PORT_ERR,
> >        IBV_EVENT_LID_CHANGE,
> >        IBV_EVENT_PKEY_CHANGE,
> >        IBV_EVENT_SM_CHANGE,
> >        IBV_EVENT_SRQ_ERR,
> >        IBV_EVENT_SRQ_LIMIT_REACHED,
> >        IBV_EVENT_QP_LAST_WQE_REACHED
> > };
> > ------------------------------------------------------------------------------------
> >
> >
> > Soooo, I assume my new mission is to get ofed 1.1?   :}
> >
>
> Yes, I guess this would be the safest bet. Please let us know
> the outcome of your experimentation.
>
> Thanks,
>
> :- Abhinav
> > Thanks!!!!
> > Jen
> >
> >
>
>


-- 
When I play with my cat, who knows whether she is not amusing herself
with me more than I with her.
Michel de Montaigne

