[mvapich-discuss] IBV_EVENT_QP_LAST_WQE_REACHED error
Abhinav Vishnu
vishnu at cse.ohio-state.edu
Tue Oct 2 15:48:16 EDT 2007
Hi Steve,
Thanks for your detailed reply.
May I suggest that you do a fresh install of MVAPICH, downloaded from
our webpage? I am not sure what the dependencies are with respect to
your installation, but it looks like it may be something simple.
MVAPICH can be downloaded from our webpage:
http://mvapich.cse.ohio-state.edu/
The build and usage instructions are available here:
Build:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-90004.4
Usage:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-180005
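There are a couple of other things I would suggest. To use an
environment variable with mpirun_rsh, please specify it explicitly on
the command line, before the executable. Setting it only in the shell
(as in your "env | grep VIA" output) may not pass it on to the remote
processes: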
[smjones at compute-5-0 Test_for_Steve]$ /share/apps/mvapich/intel/bin/mpirun_rsh -ssh -np 16 -hostfile $PBS_NODEFILE VIADEV_USE_SHMEM_COLL=0 ~/NGA/bin/arts
Also, in addition to your application (which I assume you compiled
with the version installed earlier), can you compile simple MPI
programs such as the OSU benchmarks and see whether they work?
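As a rough sketch of that check (not an official procedure), the
commands could look like the following. It assumes mpicc lives next to
mpirun_rsh under /share/apps/mvapich/intel/bin and that the benchmark
source is at osu_benchmarks/osu_latency.c. Both locations, and the
MV_BIN, OSU_SRC and DRY_RUN variables, are assumptions made for this
example; adjust them to your installation.

```shell
#!/bin/sh
# Sketch: build and run one OSU benchmark with the same MVAPICH install.
# MV_BIN is the prefix seen in this thread; OSU_SRC is an assumed location.
MV_BIN=${MV_BIN:-/share/apps/mvapich/intel/bin}
OSU_SRC=${OSU_SRC:-osu_benchmarks/osu_latency.c}
DRY_RUN=${DRY_RUN:-1}    # hypothetical switch: 1 = only print the commands

# run CMD...: print the command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then "$@"; fi
}

# osu_check HOSTFILE: compile osu_latency, then run it on 2 processes.
osu_check() {
    hostfile=$1
    run "$MV_BIN/mpicc" -o osu_latency "$OSU_SRC" &&
    run "$MV_BIN/mpirun_rsh" -ssh -np 2 -hostfile "$hostfile" ./osu_latency
}
```

Calling osu_check "$PBS_NODEFILE" with DRY_RUN=0 inside the job would
perform the actual build and run. If this small test also aborts, the
problem is more likely in the installation than in the application.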
Finally, it would be helpful to run the application with 2 to 16
processes (2x1, 2x2, 2x4, 2x8) to see whether there are other issues
with the code.
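A minimal sketch of such a sweep is below. It reads "2xN" as two nodes
with N processes each, and writes a hostfile with exactly one line per
process so the placement is explicit. The make_hostfile and sweep
helpers and the DRY_RUN switch are hypothetical names made up for this
example; the mpirun_rsh options are the ones already used in this
thread.

```shell
#!/bin/sh
# Sketch: run the application on 2 nodes with 1, 2, 4 and 8 processes each.
MPIRUN=${MPIRUN:-/share/apps/mvapich/intel/bin/mpirun_rsh}
APP=${APP:-$HOME/NGA/bin/arts}
DRY_RUN=${DRY_RUN:-1}    # hypothetical switch: 1 = only print the commands

# make_hostfile NODEFILE PPN OUT: write PPN lines for each of the first
# two distinct hosts in NODEFILE, so OUT ends up with 2*PPN entries.
make_hostfile() {
    sort -u "$1" | head -n 2 | while read -r host; do
        i=0
        while [ "$i" -lt "$2" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done > "$3"
}

# sweep NODEFILE: print (and optionally run) the 2x1, 2x2, 2x4, 2x8 cases.
sweep() {
    nodefile=$1
    for ppn in 1 2 4 8; do
        np=$((2 * ppn))
        hf=$(mktemp)
        make_hostfile "$nodefile" "$ppn" "$hf"
        echo "== 2x$ppn: $MPIRUN -ssh -np $np -hostfile $hf VIADEV_USE_SHMEM_COLL=0 $APP"
        if [ "$DRY_RUN" != 1 ]; then
            "$MPIRUN" -ssh -np "$np" -hostfile "$hf" \
                VIADEV_USE_SHMEM_COLL=0 "$APP" \
                || echo "2x$ppn failed with status $?"
        fi
        rm -f "$hf"
    done
}
```

Running sweep "$PBS_NODEFILE" with DRY_RUN=0 inside the PBS job would
execute the four cases in order. Noting the smallest configuration that
still aborts would help narrow the problem down.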
Please keep us updated on your experiments.
Thanks and regards,
:- Abhinav
> >Can you do these couple of checks?
> >1. Making sure that the IB installation is the same on both the nodes.
> >2. Using mpirun_rsh instead of mpiexec.
> >3. Disabling shared memory collectives by using the environment variable
> > VIADEV_USE_SHMEM_COLL=0,
>
> Hi Amith.
>
> I checked the installation on nodes, switched to mpirun_rsh, and used
> the environment variable. I also ran ldd to verify libs that are being
> called within the session that's failing. No changes.
>
> Let me know what else I can do to debug.
>
> Thanks.
>
> Steve
>
> [smjones at compute-5-0 Test_for_Steve]$ env |grep VIA
> VIADEV_USE_SHMEM_COLL=0
>
> [smjones at compute-5-0 Test_for_Steve]$
> /share/apps/mvapich/intel/bin/mpirun_rsh -ssh -np 16 -hostfile
> $PBS_NODEFILE ~/NGA/bin/arts
> No input file name was detected, using "input".
> Step Time CFLmax Umax Vmax
> Wmax Divergence
> mpirun_rsh: Abort signaled from [0]
> [0:compute-5-0.local] Abort: [0] Got FATAL event
> IBV_EVENT_QP_LAST_WQE_REACHED, code=16
> at line 2551 in file viacheck.c
> done.
>
> [smjones at compute-5-0 Test_for_Steve]$
> /share/apps/mvapich/intel/bin/mpirun_rsh -np 16 -hostfile
> $PBS_NODEFILE ~/NGA/bin/arts
> No input file name was detected, using "input".
> Step Time CFLmax Umax Vmax
> Wmax Divergence
> [0:compute-5-0.local] Abort: [0] Got FATAL event
> IBV_EVENT_QP_LAST_WQE_REACHED, code=16
> mpirun_rsh: Abort signaled from [0]
> at line 2551 in file viacheck.c
> done.
>
> [smjones at compute-5-0 Test_for_Steve]$ mpiexec -npernode 1 ldd ~/NGA/bin/arts
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95573000)
> libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9567f000)
> libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95789000)
> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003133900000)
> librt.so.1 => /lib64/tls/librt.so.1 (0x0000003134300000)
> libm.so.6 => /lib64/tls/libm.so.6 (0x0000003133700000)
> libc.so.6 => /lib64/tls/libc.so.6 (0x0000003133200000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003134100000)
> libdl.so.2 => /lib64/libdl.so.2 (0x0000003133500000)
> /lib64/ld-linux-x86-64.so.2 (0x0000003133000000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95573000)
> libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9567f000)
> libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95789000)
> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003817c00000)
> librt.so.1 => /lib64/tls/librt.so.1 (0x0000003818200000)
> libm.so.6 => /lib64/tls/libm.so.6 (0x0000003817a00000)
> libc.so.6 => /lib64/tls/libc.so.6 (0x0000003817500000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003818800000)
> libdl.so.2 => /lib64/libdl.so.2 (0x0000003817800000)
> /lib64/ld-linux-x86-64.so.2 (0x0000003817300000)