[mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED

Matthew Koop koop at cse.ohio-state.edu
Mon Jan 7 12:27:38 EST 2008


Michael,

Also, is your code making any system calls or forking?

Matt
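
[Editor's sketch] The question about forking presumably matters because fork() in a process that holds registered (pinned) InfiniBand memory is a known source of trouble. A rough first check is to scan the application source for calls that fork or spawn a shell. The snippet below is a heuristic sketch only: the function name `find_fork_like_calls` and the pattern list are illustrative assumptions, not part of MVAPICH, and a plain text match will also flag comments and miss calls made inside libraries.

```python
import re

# Heuristic patterns (assumed, not exhaustive) for calls that create a child
# process or spawn a shell from C or Fortran source.
FORK_LIKE = re.compile(
    r"\b(?:fork|vfork|system|popen|posix_spawn|exec[lv]p?e?)\s*\(",
    re.IGNORECASE,
)


def find_fork_like_calls(source_text):
    """Return (line_number, stripped_line) pairs for lines that appear to
    call fork(), system(), popen(), or a related function."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if FORK_LIKE.search(line):
            hits.append((number, line.strip()))
    return hits
```

Running it over each source file's text (for example `find_fork_like_calls(open(path).read())`) gives a list of candidate lines to inspect by hand, including Fortran statements such as `CALL SYSTEM(...)`.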

On Mon, 7 Jan 2008, Matthew Koop wrote:

> Michael,
>
> Do other, simpler benchmarks work (e.g. osu_benchmarks/osu_bw)?
>
> If they do, this is something we'd like to take a closer look at. I'd be
> interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue:
>
> e.g.
>   mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec
>
>
> Matt
>
> On Mon, 7 Jan 2008, Michael Ethier wrote:
>
> > Hello,
> >
> >
> >
> > I am new to this forum and hoping someone can help solve the following
> > problem for me.
> >
> >
> >
> > We have a modeling application that initializes and runs fine using an
> > ordinary Ethernet connection.
> >
> >
> >
> > When we compile using the Infiniband software package (mvapich-0.9.9)
> > and run, the application fails with the following at the end:
> >
> >
> >
> > [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error
> > IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
> >  at line 388 in file viacheck.c
> > [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
> > code=16
> >  at line 2552 in file viacheck.c
> > mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 :
> > moorcroft8 ]
> > forrtl: error (78): process killed (SIGTERM)
> > forrtl: error (78): process killed (SIGTERM)
> > done.
> >
> >
> >
> > This seems to occur during the initialization phase, when communication
> > starts between different nodes.
> >
> > If I set the hostfile to list the same node repeatedly, so that all the
> > CPUs used are on one node, it initializes fine and runs.
> >
> >
> >
> > We are using Redhat Enterprise 4 Update 5 on x86_64
> >
> >
> >
> > uname -a
> >
> > Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007
> > x86_64 x86_64 x86_64 GNU/Linux
> >
> >
> >
> > In addition we are using mvapich-0.9.9 for our Infiniband software
> > package, and Intel 9.1:
> >
> >
> >
> > [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version
> >
> > icc (ICC) 9.1 20070510
> >
> > Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
> >
> >
> >
> > [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version
> >
> > ifort (IFORT) 9.1 20070510
> >
> > Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
> >
> >
> >
> > We are using the rsh communication protocol for this:
> >
> > /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........
> >
> >
> >
> > Can anyone suggest how this problem can be solved?
> >
> >
> >
> > Thank You in advance,
> >
> > Mike
> >
> >
> >
> >
>
>


