[mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
Michael Ethier
methier at CGR.Harvard.edu
Mon Jan 7 13:10:03 EST 2008
Hi Matthew,
The osu_bw test ran OK, as seen below. I added the VIADEV_USE_COALESCE=0
variable both to the command line and to the environment, and it made no
difference; I still get the same errors.
#!/bin/tcsh
setenv VIADEV_USE_COALESCE 0
/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile \
./hostfile VIADEV_USE_COALESCE=0 ./raflesi -f ./EDRAFLES_IN
Thank You,
Mike
The benchmark test:
The foo.test script contains:
#!/bin/tcsh
/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile \
./hostfile VIADEV_USE_COALESCE=0 \
/usr/mpi/intel/mvapich-0.9.9/tests/osu_benchmarks-2.2/osu_bw
[gb16 at moorcrofth run]$ ./foo.test
# OSU MPI Bandwidth Test (Version 2.2)
# Size Bandwidth (MB/s)
1 0.135198
2 0.273329
4 0.540415
8 1.087788
16 2.179976
32 4.371585
64 8.668233
128 17.290726
256 34.458536
512 68.269511
1024 129.384822
2048 239.992676
4096 392.348909
8192 542.819870
16384 452.196563
32768 625.604678
65536 764.094184
131072 836.010006
262144 871.899242
524288 890.772813
1048576 901.838432
2097152 906.494955
4194304 909.296621
[gb16 at moorcrofth run]$ more ./hostfile
moorcrofth
moorcroft8
moorcroft11
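
The osu_bw run above measures bandwidth between one pair of ranks, so it may
not have exercised every pair of nodes in the hostfile. A minimal sketch of a
pairwise check follows. It assumes the mpirun_rsh and osu_bw paths quoted in
this thread and the "h1 h2" host syntax from Matt's example; the function
names and the DRY_RUN switch are hypothetical, introduced only for
illustration.

```shell
#!/bin/sh
# Sketch: print (or run) one two-process osu_bw command per pair of hosts.
# Paths come from the messages in this thread; DRY_RUN is a hypothetical
# switch added for illustration (1 = only print the commands).
MPIRUN=${MPIRUN:-/usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh}
OSU_BW=${OSU_BW:-/usr/mpi/intel/mvapich-0.9.9/tests/osu_benchmarks-2.2/osu_bw}
DRY_RUN=${DRY_RUN:-1}

# Emit every unordered pair of the hosts given as arguments, one per line.
host_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            echo "$first $other"
        done
    done
}

# Run (or, with DRY_RUN=1, just print) osu_bw for one pair of hosts.
run_pair() {
    cmd="$MPIRUN -rsh -np 2 $1 $2 VIADEV_USE_COALESCE=0 $OSU_BW"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        echo "== $1 <-> $2 =="
        $cmd
    fi
}

# Read hosts from a hostfile (one per line) and test every pair.
test_all_pairs() {
    host_pairs $(cat "$1") | while read -r a b; do
        # Keep the launched command from consuming the list of pairs on stdin.
        run_pair "$a" "$b" < /dev/null
    done
}
```

With the three-host hostfile shown above, calling test_all_pairs ./hostfile
prints three commands, one each for moorcrofth/moorcroft8,
moorcrofth/moorcroft11 and moorcroft8/moorcroft11. Setting DRY_RUN=0 runs
them instead, which could help show whether the failure follows a particular
node.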
-----Original Message-----
From: Matthew Koop [mailto:koop at cse.ohio-state.edu]
Sent: Monday, January 07, 2008 12:26 PM
To: Michael Ethier
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
Michael,
Do other, simpler benchmarks work (e.g. osu_benchmarks/osu_bw)?
If they do, this is something we'd like to take a closer look at. I'd be
interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue:
e.g.
mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec
Matt
On Mon, 7 Jan 2008, Michael Ethier wrote:
> Hello,
>
>
>
> I am new to this forum and hoping someone can help solve the following
> problem for me.
>
>
>
> We have a modeling application that initializes and runs fine using an
> ordinary Ethernet connection.
>
>
>
> When we compile using the Infiniband software package (mvapich-0.9.9)
> and run, the application fails with the following at the end:
>
>
>
> [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error
> IBV_WC_LOC_LEN_ERR, code=1, dest rank=1
>
> at line 388 in file viacheck.c
>
> [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
> code=16
>
> at line 2552 in file viacheck.c
>
> mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 :
> moorcroft8 ]
>
> forrtl: error (78): process killed (SIGTERM)
>
> forrtl: error (78): process killed (SIGTERM)
>
> done.
>
>
>
> This seems to occur during the initialization phase, when communication
> starts between different nodes.
>
> If I set the hostfile to list the same node repeatedly, so that all the
> CPUs used are on one node, it initializes and runs fine.
>
>
>
> We are using Redhat Enterprise 4 Update 5 on x86_64
>
>
>
> uname -a
>
> Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007
> x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> In addition we are using mvapich-0.9.9 for our Infiniband software
> package, and Intel 9.1:
>
>
>
> [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version
>
> icc (ICC) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
>
>
>
> [gb16 at moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version
>
> ifort (IFORT) 9.1 20070510
>
> Copyright (C) 1985-2007 Intel Corporation. All rights reserved.
>
>
>
> We are using the rsh communication protocol for this:
>
> /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........
>
>
>
> Can anyone suggest how this problem can be solved?
>
>
>
> Thank You in advance,
>
> Mike
>
>
>
>