[mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED

Michael Ethier methier at CGR.Harvard.edu
Mon Jan 7 13:10:03 EST 2008


Hi Matthew,

The osu_bw test ran OK, as seen below. I added the VIADEV_USE_COALESCE=0
variable both on the command line and in the environment, and it made no
difference; I still get the same errors.

#!/bin/tcsh
setenv VIADEV_USE_COALESCE 0
/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile ./hostfile \
    VIADEV_USE_COALESCE=0 ./raflesi -f ./EDRAFLES_IN

Thank You,
Mike


The benchmark test: the foo.test script contains

#!/bin/tcsh
/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile ./hostfile \
    VIADEV_USE_COALESCE=0 \
    /usr/mpi/intel/mvapich-0.9.9/tests/osu_benchmarks-2.2/osu_bw

[gb16 at moorcrofth run]$ ./foo.test
# OSU MPI Bandwidth Test (Version 2.2)
# Size          Bandwidth (MB/s)
1               0.135198
2               0.273329
4               0.540415
8               1.087788
16              2.179976
32              4.371585
64              8.668233
128             17.290726
256             34.458536
512             68.269511
1024            129.384822
2048            239.992676
4096            392.348909
8192            542.819870
16384           452.196563
32768           625.604678
65536           764.094184
131072          836.010006
262144          871.899242
524288          890.772813
1048576         901.838432
2097152         906.494955
4194304         909.296621

[gb16 at moorcrofth run]$ more ./hostfile
moorcrofth
moorcroft8
moorcroft11
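
A minimal point-to-point test along these lines (a sketch only; the file
name ibtest.c, the ~4 MB message size, and the message tag are arbitrary
choices, not anything from our application) forces one large send from
rank 0 to each other rank, so with the hostfile above the traffic has to
cross nodes over InfiniBand. It can be compiled with the same mpicc:

/* ibtest.c - minimal sketch: rank 0 sends a ~4 MB message to every other
 * rank; with ranks placed on different hosts from ./hostfile this forces
 * inter-node traffic. Buffer contents do not matter for this test. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, dest;
    const int count = 1 << 20;          /* 1M ints, about 4 MB per message */
    int *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = (int *) malloc(count * sizeof(int));
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        /* One large send to each of the other ranks. */
        for (dest = 1; dest < nprocs; dest++)
            MPI_Send(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD);
        printf("rank 0: all sends completed\n");
    } else {
        MPI_Recv(buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank %d: receive completed\n", rank);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Compiled and launched the same way as the benchmark above, e.g.:

/usr/mpi/intel/mvapich-0.9.9/bin/mpicc -o ibtest ibtest.c
/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile ./hostfile \
    VIADEV_USE_COALESCE=0 ./ibtest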

-----Original Message-----
From: Matthew Koop [mailto:koop at cse.ohio-state.edu] 
Sent: Monday, January 07, 2008 12:26 PM
To: Michael Ethier
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event
IBV_EVENT_QP_LAST_WQE_REACHED

Michael,

Do other, simpler benchmarks work (e.g. osu_benchmarks/osu_bw)?

If they do, this is something we'd like to take a closer look at. I'd be
interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue:

e.g.
  mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec


Matt

On Mon, 7 Jan 2008, Michael Ethier wrote:

> Hello,
>
>
>
> I am new to this forum and hoping someone can help solve the following
> problem for me.
>
>
>
> We have a modeling application that initializes and runs fine using an
> ordinary Ethernet connection.
>
>
>
> When we compile using the InfiniBand software package (mvapich-0.9.9)
> and run, the application fails with the following at the end:
>
>
>
> [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error
> IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
>
>  at line 388 in file viacheck.c
>
> [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
> code=16
>
>  at line 2552 in file viacheck.c
>
> mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 :
> moorcroft8 ]
>
> forrtl: error (78): process killed (SIGTERM)
>
> forrtl: error (78): process killed (SIGTERM)
>
> done.
>
>
>
> This seems to occur at the initialization phase, when communication
> starts between different nodes.
>
> If I set the hostfile to contain the same node, so that all the CPUs
> used are on one node, it initializes and runs fine.
>
>
>
> We are using Red Hat Enterprise Linux 4 Update 5 on x86_64
>
>
>
> uname -a
>
> Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007
> x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> In addition, we are using mvapich-0.9.9 for our InfiniBand software
> package, and Intel 9.1:
>
>
>
> [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version
>
> icc (ICC) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
>
>
>
> [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version
>
> ifort (IFORT) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation.  All rights reserved.
>
>
>
> We are using the rsh communication protocol for this:
>
> /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........
>
>
>
> Can anyone suggest how this problem can be solved?
>
>
>
> Thank You in advance,
>
> Mike
>
>
>
>
