[mvapich-discuss] IBV_EVENT_QP_LAST_WQE_REACHED error

Abhinav Vishnu vishnu at cse.ohio-state.edu
Tue Oct 2 15:48:16 EDT 2007


Hi Steve,

Thanks for your detailed reply.

May I suggest a fresh install of MVAPICH, downloaded from our webpage?
I am not sure what the dependencies are with respect to your
installation, but it looks like it may be something simple.

MVAPICH can be downloaded from our webpage:

http://mvapich.cse.ohio-state.edu/

The build and usage instructions are available here:

Build:

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-90004.4

Usage:

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-180005

There are a couple of other things I would suggest. To use an
environment variable, please specify it explicitly on the command line:

[smjones at compute-5-0 Test_for_Steve]$ /share/apps/mvapich/intel/bin/mpirun_rsh -ssh -np 16 -hostfile  $PBS_NODEFILE VIADEV_USE_SHMEM_COLL=0 ~/NGA/bin/arts
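
As a minimal sketch of the command above (assuming a POSIX shell), the
variable goes between the hostfile argument and the executable. The helper
name build_mpirun_cmd and the MPIRUN variable are hypothetical and used only
for illustration; the function prints the command so it can be inspected
before it is run.

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun_rsh command line with the environment
# variable passed explicitly before the executable, as suggested above.
MPIRUN=${MPIRUN:-/share/apps/mvapich/intel/bin/mpirun_rsh}

build_mpirun_cmd() {
    # $1 = process count, $2 = hostfile, $3 = executable
    printf '%s -ssh -np %s -hostfile %s VIADEV_USE_SHMEM_COLL=0 %s\n' \
        "$MPIRUN" "$1" "$2" "$3"
}

# Print the command for inspection; nothing is executed here.
build_mpirun_cmd 16 "${PBS_NODEFILE:-hosts}" "$HOME/NGA/bin/arts"
```

Having the variable set only in the login shell (as in the "env |grep VIA"
check) may not be enough, which is why it is placed on the command line here.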

Also, in addition to your application (which I assume you compiled
with the version installed earlier), can you compile simple MPI
programs like osu_benchmarks and see whether they work?
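
A rough sketch of such a sanity check, assuming the benchmark source
osu_latency.c is in the current directory and that mpicc lives next to
mpirun_rsh in the install tree (both are assumptions, as is the helper name
sanity_cmds). It only prints the two commands:

```shell
#!/bin/sh
# Hypothetical paths: point MPICC/MPIRUN at the fresh install and SRC at the
# benchmark source file.
MPICC=${MPICC:-/share/apps/mvapich/intel/bin/mpicc}
MPIRUN=${MPIRUN:-/share/apps/mvapich/intel/bin/mpirun_rsh}
SRC=${SRC:-osu_latency.c}

sanity_cmds() {
    exe=${SRC%.c}
    # Compile the benchmark with the MPI compiler wrapper.
    echo "$MPICC -o $exe $SRC"
    # osu_latency is a two-process test, so -np 2 is used here.
    echo "$MPIRUN -ssh -np 2 -hostfile ${PBS_NODEFILE:-hosts} ./$exe"
}

# Print the commands for inspection; nothing is compiled or run here.
sanity_cmds
```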

Finally, it would be helpful to run the application with 2 to 16
processes (2x1, 2x2, 2x4, 2x8) to see whether there are other issues
with the code.
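
The sweep above could be scripted roughly as follows. This is a sketch under
assumptions: the names sweep_cmds, APP, HOSTFILE and RUN are hypothetical, and
the hostfile is assumed to be ordered so that each process count is spread
over the two nodes (the script does not arrange that itself).

```shell
#!/bin/sh
# Hypothetical scaling sweep: 2 nodes with 1, 2, 4 and 8 processes per node.
MPIRUN=${MPIRUN:-/share/apps/mvapich/intel/bin/mpirun_rsh}
APP=${APP:-$HOME/NGA/bin/arts}
HOSTFILE=${PBS_NODEFILE:-hosts}

sweep_cmds() {
    for ppn in 1 2 4 8; do
        np=$((2 * ppn))
        echo "$MPIRUN -ssh -np $np -hostfile $HOSTFILE VIADEV_USE_SHMEM_COLL=0 $APP"
    done
}

# Dry run by default; set RUN=1 to execute each command and stop at the
# first failure, which shows the smallest process count that breaks.
sweep_cmds | while read -r cmd; do
    echo "+ $cmd"
    if [ "${RUN:-0}" = 1 ]; then
        sh -c "$cmd" </dev/null || { echo "failed: $cmd"; break; }
    fi
done
```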

Please keep us updated on your experiments.

Thanks and regards,

:- Abhinav
 
> >Can you do these couple of checks?
> >1. Making sure that the IB installation is the same on both the nodes.
> >2. Using mpirun_rsh instead of mpiexec.
> >3. Disabling shared memory collectives by using the environment variable
> >  VIADEV_USE_SHMEM_COLL=0,
> 
> Hi Amith.
> 
> I checked the installation on nodes, switched to mpirun_rsh, and used  
> the environment variable. I also ran ldd to verify libs that are being  
> called within the session that's failing. No changes.
> 
> Let me know what else I can do to debug.
> 
> Thanks.
> 
> Steve
> 
> [smjones at compute-5-0 Test_for_Steve]$ env |grep VIA
> VIADEV_USE_SHMEM_COLL=0
> 
> [smjones at compute-5-0 Test_for_Steve]$ /share/apps/mvapich/intel/bin/mpirun_rsh -ssh -np 16 -hostfile $PBS_NODEFILE ~/NGA/bin/arts
>  No input file name was detected, using "input".
>         Step        Time        CFLmax      Umax        Vmax        Wmax    Divergence
> mpirun_rsh: Abort signaled from [0]
> [0:compute-5-0.local] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>  at line 2551 in file viacheck.c
> done.
> 
> [smjones at compute-5-0 Test_for_Steve]$ /share/apps/mvapich/intel/bin/mpirun_rsh -np 16 -hostfile $PBS_NODEFILE ~/NGA/bin/arts
>  No input file name was detected, using "input".
>         Step        Time        CFLmax      Umax        Vmax        Wmax    Divergence
> [0:compute-5-0.local] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
> mpirun_rsh: Abort signaled from [0]
>  at line 2551 in file viacheck.c
> done.
> 
> [smjones at compute-5-0 Test_for_Steve]$ mpiexec -npernode 1 ldd ~/NGA/bin/arts
>         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95573000)
>         libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9567f000)
>         libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95789000)
>         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003133900000)
>         librt.so.1 => /lib64/tls/librt.so.1 (0x0000003134300000)
>         libm.so.6 => /lib64/tls/libm.so.6 (0x0000003133700000)
>         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003133200000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003134100000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x0000003133500000)
>         /lib64/ld-linux-x86-64.so.2 (0x0000003133000000)
>         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95573000)
>         libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9567f000)
>         libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95789000)
>         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003817c00000)
>         librt.so.1 => /lib64/tls/librt.so.1 (0x0000003818200000)
>         libm.so.6 => /lib64/tls/libm.so.6 (0x0000003817a00000)
>         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003817500000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003818800000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x0000003817800000)
>         /lib64/ld-linux-x86-64.so.2 (0x0000003817300000)
> 