[mvapich-discuss] Re: VAPI_PROTOCOL_R3' failed

Sangamesh B forum.san at gmail.com
Fri May 30 08:41:00 EDT 2008


Hi,

  My answers are inline:

On Thu, May 29, 2008 at 1:20 AM, wei huang <huanwei at cse.ohio-state.edu>
wrote:

> Hi Sangamesh,
>
> Would you please let us know more information so that we can look further
> into this issue?
>
> *) Is mvapich2-1.0.2 being used here?
>
 [root at compute-0-8 mvapich2-1.0.2]# /opt/mvapich2/bin/mpich2version
Version:           mvapich2-1.0
Device:            osu_ch3:mrail
Configure Options: '--prefix=/opt/mvapich2' '--with-device=osu_ch3:mrail'
'--with-rdma=gen2' '--with-pm=mpd' '--disable-romio' '--without-mpe'
'CC=gcc' 'CFLAGS=-D_X86_64_ -D_SMP_ -DUSE_HEADER_CACHING  -DONE_SIDED
-DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS  -DRDMA_CM
-I/usr//include -O2' 'CXX=g++' 'F77=gfortran' 'F90=gfortran'
'FFLAGS=-L/usr//lib64'
CC:  gcc -D_X86_64_ -D_SMP_ -DUSE_HEADER_CACHING  -DONE_SIDED
-DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS  -DRDMA_CM
-I/usr//include -O2
CXX: g++
F77: gfortran -L/usr//lib64
F90: gfortran


> *) Are you using the default compiling scripts and default environment
> variables;
>
Yes.
We built mvapich2 with the make.mvapich2.ofa script, the GNU compilers (gcc
and gfortran), and OFED-1.2.5.5.
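
For reference, the configure options recorded by mpich2version above
correspond roughly to the following direct configure invocation (the
make.mvapich2.ofa script sets these up; this line is reconstructed from that
recorded output, not taken from our build logs):

  ./configure --prefix=/opt/mvapich2 --with-device=osu_ch3:mrail \
      --with-rdma=gen2 --with-pm=mpd --disable-romio --without-mpe \
      CC=gcc CXX=g++ F77=gfortran F90=gfortran FFLAGS='-L/usr//lib64' \
      CFLAGS='-D_X86_64_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -I/usr//include -O2'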

>
> *) Is your application using thread at some stage or not?


> *) Would you please apply the attached patch which prints out some
> information wrt the assertion failure?
>
After applying the patch:
 [root at compute-0-8 mvapich2-1.0.2]# patch -p0 <patch1
patching file src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c
Hunk #1 succeeded at 223 with fuzz 1.

Then I recompiled mvapich2 and DL-POLY and submitted an 8-process job on 2
nodes. It ran up to 65000 iterations and then gave:

$ /opt/mvapich2.ofa.patch/bin/mpiexec -np 8
/home/bhaskar/dl_poly_EX/execute/DLPOLY_NEW
Unexpected protocol type 3
Unexpected protocol type 3
Rndv buf (nil) (alloc 0), size 13248, offset 0, r_addr (nil), d_entry (nil)
DLPOLY_NEW: ch3_rndvtransfer.c:235: MPIDI_CH3_Rndv_transfer: Assertion
`rndv->protocol == VAPI_PROTOCOL_R3' failed.
Rndv buf (nil) (alloc 0), size 13248, offset 0, r_addr (nil), d_entry (nil)
DLPOLY_NEW: ch3_rndvtransfer.c:235: MPIDI_CH3_Rndv_transfer: Assertion
`rndv->protocol == VAPI_PROTOCOL_R3' failed.
rank 1 in job 1  compute-0-8.local_36493   caused collective abort of all
ranks
  exit status of rank 1: killed by signal 9


> *) Would you please try to set the following environment variables
> (separately) during your run to see if any one of them helps?
>
> -env MV2_USE_RDMA_FAST_PATH 0
> -env MV2_USE_SRQ 0
> -env MV2_USE_COALESCE 0
> -env MV2_USE_SHM_COLL 0
>
I tried these before applying the patch: no success.
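
For reference, each variable was passed on the mpiexec command line as you
suggested, one at a time, roughly like this (same binary and node layout as
in the runs described above):

  /opt/mvapich2/bin/mpiexec -np 8 -env MV2_USE_RDMA_FAST_PATH 0 \
      /home/bhaskar/dl_poly_EX/execute/DLPOLY_NEW

and likewise for MV2_USE_SRQ, MV2_USE_COALESCE and MV2_USE_SHM_COLL.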

When the job was run on one node, it progressed up to 65000 iterations and
then stopped. At that point I attached gdb to the process; the gdb output may
give some clue:

[bhaskar at compute-0-12 test_NEW]$ gdb infin_check 25063
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
infin_check: No such file or directory.
Attaching to process 25063
Reading symbols from /home/bhaskar/dl_poly_2.18/execute/infin_check...(no
debugging symbols found)...done.
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging symbols
found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2a95ae74e0 (LWP 25063)]
[New Thread 0x40200960 (LWP 25067)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/lib64/librdmacm.so.1...done.
Loaded symbols for /usr/lib64/librdmacm.so.1
Reading symbols from /usr/lib64/libibverbs.so.1...done.
Loaded symbols for /usr/lib64/libibverbs.so.1
Reading symbols from /usr/lib64/libibumad.so.1...done.
Loaded symbols for /usr/lib64/libibumad.so.1
Reading symbols from /usr/lib64/libgfortran.so.0...done.
Loaded symbols for /usr/lib64/libgfortran.so.0
Reading symbols from /lib64/tls/libm.so.6...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/lib64/libibcommon.so.1
Reading symbols from /usr/lib64/libcxgb3-rdmav2.so...done.
Loaded symbols for /usr/lib64/libcxgb3-rdmav2.so
Reading symbols from /usr/lib64/libmthca-rdmav2.so...done.
Loaded symbols for /usr/lib64/libmthca-rdmav2.so
Reading symbols from /usr/lib64/libipathverbs-rdmav2.so...done.
Loaded symbols for /usr/lib64/libipathverbs-rdmav2.so
Reading symbols from /usr/lib64/libmlx4-rdmav2.so...done.
Loaded symbols for /usr/lib64/libmlx4-rdmav2.so
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2

0x00000000005aafd8 in MPIDI_CH3I_SMP_pull_header ()


(gdb) bt
#0  0x00000000005aafd8 in MPIDI_CH3I_SMP_pull_header ()
#1  0x00000000005abe6e in MPIDI_CH3I_SMP_read_progress ()
#2  0x00000000005a8370 in MPIDI_CH3I_Progress ()
#3  0x0000000000597bd1 in MPIC_Wait ()
#4  0x0000000000598245 in MPIC_Recv ()
#5  0x0000000000593c55 in MPIR_Bcast ()
#6  0x000000000059432e in PMPI_Bcast ()
#7  0x0000000000592797 in PMPI_Allreduce ()
#8  0x000000000059073a in pmpi_allreduce_ ()
#9  0x000000000058ac96 in gdsum_ ()
#10 0x0000000000418b98 in __utility_module__global_sum_forces ()
#11 0x000000000047f1d6 in __forces_module__force_manager ()
#12 0x000000000058a7fd in __driver_module__molecular_dynamics ()
#13 0x000000000058dfb1 in MAIN__ ()
#14 0x00000000005ecd9e in main ()
(gdb)


(gdb) frame
#0  0x00000000005aafd8 in MPIDI_CH3I_SMP_pull_header ()
(gdb) frame 1
#1  0x00000000005abe6e in MPIDI_CH3I_SMP_read_progress ()
(gdb) frame 2
#2  0x00000000005a8370 in MPIDI_CH3I_Progress ()
(gdb) frame 3
#3  0x0000000000597bd1 in MPIC_Wait ()
(gdb) frame 4
#4  0x0000000000598245 in MPIC_Recv ()
(gdb) frame 5
#5  0x0000000000593c55 in MPIR_Bcast ()
(gdb) frame 6
#6  0x000000000059432e in PMPI_Bcast ()
(gdb) frame 7
#7  0x0000000000592797 in PMPI_Allreduce ()
(gdb) frame 8
#8  0x000000000059073a in pmpi_allreduce_ ()
(gdb) frame 9
#9  0x000000000058ac96 in gdsum_ ()
(gdb) frame 10
#10 0x0000000000418b98 in __utility_module__global_sum_forces ()
(gdb) frame 11
#11 0x000000000047f1d6 in __forces_module__force_manager ()
(gdb) frame 12
#12 0x000000000058a7fd in __driver_module__molecular_dynamics ()
(gdb) frame 13
#13 0x000000000058dfb1 in MAIN__ ()
(gdb) frame 14
#14 0x00000000005ecd9e in main ()
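
Since gdb reports two threads in the process, I can also collect per-thread
backtraces if that would be useful, e.g.:

  (gdb) info threads
  (gdb) thread apply all bt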

I hope this information helps.

Please let me know if any further gdb tests need to be run.

> Thanks
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Wed, 28 May 2008, Dhabaleswar Panda wrote:
>
> > We are taking a look at this error and will get back to you soon.
> >
> > DK
> >
> > On Wed, 28 May 2008, Sangamesh B wrote:
> >
> > > There was no reply from the list.
> > >
> > > Following is some more info about the DLPOLY + mvapich2 + ofed-1.2.5.5 +
> > > Mellanox HCA job, which gets into a hang after a certain number of
> > > iterations.
> > >
> > > The same job with mpich2 + ethernet runs fine without any problems and
> > > produces the final result.
> > >
> > > With mvapich2, the job runs up to some iterations and stops calculating.
> > > It doesn't give any error at this point, but the output file, which is
> > > updated at each iteration, shows no further progress.
> > >
> > > One more point: I repeatedly submitted the same mvapich2 job. In each
> > > case it stops at the same iteration.
> > >
> > > Any mvapich2 variables have to be set?
> > >
> > > Thanks,
> > > Sangamesh
> > >
> > > On Tue, May 27, 2008 at 4:30 PM, Sangamesh B <forum.san at gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > >    A DL-POLY application job on a 5-node Infiniband cluster gave the
> > > > following error:
> > > >
> > > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > > infin_check: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> > > > `rndv->protocol == VAPI_PROTOCOL_R3' failed.
> > > > rank 1 in job 20  compute-0-12.local_32785   caused collective abort of
> > > > all ranks
> > > >   exit status of rank 1: killed by signal 9
> > > >
> > > > The job runs for 20-30 minutes and then gives the above error.
> > > >
> > > > This is with mvapich2 + ofed-1.2.5.5 + Mellanox HCAs.
> > > >
> > > > Any idea what might be wrong?
> > > >
> > > > Thanks,
> > > > Sangamesh
> > > >
> > >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>