[mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED

Matthew Koop koop at cse.ohio-state.edu
Mon Jan 7 12:26:01 EST 2008


Michael,

Do other, simpler benchmarks work (e.g. osu_benchmarks/osu_bw)?
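
For example (the osu_benchmarks path below is just a guess based on your
install prefix; adjust it to wherever osu_bw actually lives on your
systems):

  mpirun_rsh -rsh -np 2 moorcrofth moorcroft8 \
      /usr/mpi/intel/mvapich-0.9.9/osu_benchmarks/osu_bw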

If they do, this is something we'd like to take a closer look at. I'd be
interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue:

e.g.
  mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec
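
(VIADEV_USE_COALESCE controls message coalescing on the send side; if
turning it off makes the IBV_WC_LOC_LEN_ERR go away, that narrows down
where the bad completion is coming from.)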


Matt

On Mon, 7 Jan 2008, Michael Ethier wrote:

> Hello,
>
> I am new to this forum and am hoping someone can help me solve the
> following problem.
>
> We have a modeling application that initializes and runs fine using an
> ordinary Ethernet connection.
>
> When we compile using the Infiniband software package (mvapich-0.9.9)
> and run, the application fails with the following at the end:
>
> [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error
> IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
>
>  at line 388 in file viacheck.c
>
> [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
> code=16
>
>  at line 2552 in file viacheck.c
>
> mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 :
> moorcroft8 ]
>
> forrtl: error (78): process killed (SIGTERM)
>
> forrtl: error (78): process killed (SIGTERM)
>
> done.
>
> This seems to occur during the initialization phase, when communication
> starts between different nodes.
>
> If I set the hostfile so that all the CPUs used are on a single node, it
> initializes and runs fine.
>
> We are using Red Hat Enterprise Linux 4 Update 5 on x86_64
>
> uname -a
>
> Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007
> x86_64 x86_64 x86_64 GNU/Linux
>
> In addition we are using mvapich-0.9.9 for our Infiniband software
> package, and Intel 9.1:
>
> [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version
>
> icc (ICC) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
>
> [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version
>
> ifort (IFORT) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
>
> We are using the rsh communication protocol for this:
>
> /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........
>
> Can anyone suggest how this problem can be solved?
>
> Thank You in advance,
>
> Mike