[mvapich-discuss] Crashes running wrf with mvapich2 1.8

Devendar Bureddy bureddy at cse.ohio-state.edu
Wed Aug 29 09:11:49 EDT 2012


Hi all.  I'm updating the list to let everyone know that there was a
process synchronization issue in intra-node communication.

We have provided a fix for this problem in the 1.8 branch.  The latest
tarball
http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.8/mvapich2-latest.tar.gz
contains this fix.
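
For anyone who wants to try it, a typical way to pick up the fix (illustrative
commands only; substitute your own install prefix and configure options) is:

  wget http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.8/mvapich2-latest.tar.gz
  tar xzf mvapich2-latest.tar.gz
  cd mvapich2-*            # the extracted directory name depends on the snapshot
  ./configure --prefix=/path/to/install    # plus your usual configure options
  make && make install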

Thanks, Craig, for reporting this issue and for helping us test the
patch exhaustively.

-Devendar
On Thu, Aug 23, 2012 at 3:14 PM, Craig Tierney <craig.tierney at noaa.gov> wrote:

> On 8/22/12 9:32 PM, Devendar Bureddy wrote:
> > Hi Craig
> >
> > Thanks for reporting the issue. Can you add '--enable-fast=none' to
> > your debug configure flags and set MV2_DEBUG_SHOW_BACKTRACE=1 at run
> > time, to see whether that shows any useful information?
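
For example, something along these lines (illustrative only; the launcher
invocation and process count are placeholders, and the remaining configure
options are whatever you normally use):

  ./configure CC=icc CXX=icpc F77=ifort FC=ifort --enable-g=debug --enable-fast=none ...
  MV2_DEBUG_SHOW_BACKTRACE=1 mpiexec -n 512 ./wrf.exe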
> >
> > Did you get a chance to run with mvapich2-1.7 at any point? That
> > information might help us narrow down the issue.
> >
> >  -Devendar
> >
>
> Devendar,
>
> It took a lot longer to get the trace this time (1 out of 127 runs).  Since
> the trace goes through BCAST, it matches the trace below.  Here it is:
>
> Assertion failed in file ch3u_handle_recv_pkt.c at line 215: pkt->type <= MPIDI_CH3_PKT_END_CH3
> [s48:mpi_rank_511][print_backtrace]   0: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(print_backtrace+0x30) [0x7f92e5392488]
> [s48:mpi_rank_511][print_backtrace]   1: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3_Abort+0x26) [0x7f92e52fa976]
> [s48:mpi_rank_511][print_backtrace]   2: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPID_Abort+0x1a4) [0x7f92e53fa8d0]
> [s48:mpi_rank_511][print_backtrace]   3: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Assert_fail+0x6a) [0x7f92e52d8b7a]
> [s48:mpi_rank_511][print_backtrace]   4: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3U_Handle_recv_pkt+0x7c) [0x7f92e531f090]
> [s48:mpi_rank_511][print_backtrace]   5: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(+0xf9f42) [0x7f92e530df42]
> [s48:mpi_rank_511][print_backtrace]   6: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3I_SMP_read_progress+0x149) [0x7f92e530ea5d]
> [s48:mpi_rank_511][print_backtrace]   7: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3I_Progress+0x13c) [0x7f92e5303012]
> [s48:mpi_rank_511][print_backtrace]   8: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Wait+0x5a) [0x7f92e53c5378]
> [s48:mpi_rank_511][print_backtrace]   9: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Recv+0x21b) [0x7f92e53c2c5d]
> [s48:mpi_rank_511][print_backtrace]  10: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Recv_ft+0x7a) [0x7f92e53c553e]
> [s48:mpi_rank_511][print_backtrace]  11: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(+0xd2886) [0x7f92e52e6886]
> [s48:mpi_rank_511][print_backtrace]  12: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_intra_MV2+0xc3d) [0x7f92e52ec76d]
> [s48:mpi_rank_511][print_backtrace]  13: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_MV2+0x61) [0x7f92e52eca3f]
> [s48:mpi_rank_511][print_backtrace]  14: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_impl+0xbb) [0x7f92e52e4eff]
> [s48:mpi_rank_511][print_backtrace]  15: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPI_Bcast+0xcdc) [0x7f92e52e5ffe]
> [s48:mpi_rank_511][print_backtrace]  16: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x13d7ce2]
> [s48:mpi_rank_511][print_backtrace]  17: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4bd009]
> [s48:mpi_rank_511][print_backtrace]  18: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x112cd7f]
> [s48:mpi_rank_511][print_backtrace]  19: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x112de46]
> [s48:mpi_rank_511][print_backtrace]  20: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0xe366d2]
> [s48:mpi_rank_511][print_backtrace]  21: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0xe2f83f]
> [s48:mpi_rank_511][print_backtrace]  22: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x908f55]
> [s48:mpi_rank_511][print_backtrace]  23: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x606670]
> [s48:mpi_rank_511][print_backtrace]  24: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x602683]
> [s48:mpi_rank_511][print_backtrace]  25: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4080a7]
> [s48:mpi_rank_511][print_backtrace]  26: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x407652]
> [s48:mpi_rank_511][print_backtrace]  27: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4075dc]
> [s48:mpi_rank_511][print_backtrace]  28: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f92e3ae8cdd]
> [s48:mpi_rank_511][print_backtrace]  29: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4074d9]
> [cli_511]: aborting job:
> internal ABORT - process 511
>
> Craig
>
>
>
> > On Wed, Aug 22, 2012 at 8:17 PM, Craig Tierney <craig.tierney at noaa.gov> wrote:
> >>
> >> I am trying to run WRF (V3.0.1.1, yes I know it is old) with mvapich2
> >> 1.8.  I am having problems running it consistently.  I get crashes on
> >> the order of 2-5% of the time.  I have been trying to gather debugging
> >> information to help figure out the problem, but I am having challenges.
> >> Here is what I know:
> >>
> >> - Mvapich2 1.6 does not have this problem
> >> - Mvapich2 1.8 and nightly builds do show this problem
> >> - It happens at different core counts, from 256 up into the 1000s
> >> -- I haven't tried to run it smaller because of my problem size
> >> - I have seen higher failure rates correlate with higher core counts
> >>
> >> When I build mvapich2 without debugging, I can get a traceback like:
> >>
> >> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> >> Image              PC                Routine            Line      Source
> >> libmpich.so.3      00007F2EF0F2994A  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F2A1D7  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F24A5C  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F1E95F  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F69856  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F697A9  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F6967A  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F15B4B  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F13AA2  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F13916  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F138F8  Unknown            Unknown   Unknown
> >> libmpich.so.3      00007F2EF0F13876  Unknown            Unknown   Unknown
> >> wrf.exe            00000000013D7F32  Unknown            Unknown   Unknown
> >> wrf.exe            00000000004BD259  wrf_dm_bcast_byte  1627      module_dm.f90
> >> wrf.exe            000000000112CA5A  module_ra_rrtm_mp  6860      module_ra_rrtm.f90
> >> wrf.exe            000000000112E096  module_ra_rrtm_mp  6551      module_ra_rrtm.f90
> >> wrf.exe            0000000000E36922  module_physics_in  898       module_physics_init.f90
> >> wrf.exe            0000000000E2FA8F  module_physics_in  410       module_physics_init.f90
> >> wrf.exe            00000000009091A5  start_domain_em_   641       start_em.f90
> >> wrf.exe            00000000006068C0  start_domain_      152       start_domain.f90
> >> wrf.exe            00000000006028D3  med_initialdata_i  138       mediation_wrfmain.f90
> >> wrf.exe            00000000004082F7  module_wrf_top_mp  241       module_wrf_top.f90
> >> wrf.exe            00000000004078A2  MAIN__             21        wrf.f90
> >> wrf.exe            000000000040782C  Unknown            Unknown   Unknown
> >> libc.so.6          00007F2EEF74CCDD  Unknown            Unknown   Unknown
> >> wrf.exe            0000000000407729  Unknown            Unknown   Unknown
> >>
> >> In module_dm.f90, the code that is being called is:
> >>
> >>    CALL BYTE_BCAST ( buf , size, local_communicator )
> >>
> >> Which is a call to:
> >>
> >> BYTE_BCAST ( char * buf, int * size, int * Fcomm )
> >> {
> >> #ifndef STUBMPI
> >>     MPI_Comm *comm, dummy_comm ;
> >>
> >>     comm = &dummy_comm ;
> >>     *comm = MPI_Comm_f2c( *Fcomm ) ;
> >> # ifdef crayx1
> >>     if (*size % sizeof(int) == 0) {
> >>        MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
> >>     } else {
> >>        MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
> >>     }
> >> # else
> >>     MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
> >> # endif
> >> #endif
> >> }
> >>
> >> I am not sure why BYTE_BCAST isn't showing up in the stack trace; I will
> >> recompile WRF to find out.
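
For what it's worth, the same call pattern can be exercised outside WRF with a
small standalone test along these lines (a hypothetical sketch, not taken from
WRF; it only mirrors the convert-Fortran-handle-then-MPI_Bcast path):

/* byte_bcast_test.c: hypothetical standalone sketch, not from WRF.
 * It mirrors BYTE_BCAST: convert a Fortran communicator handle with
 * MPI_Comm_f2c, then broadcast a raw byte buffer from rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static void byte_bcast(char *buf, int *size, MPI_Fint *Fcomm)
{
    MPI_Comm comm = MPI_Comm_f2c(*Fcomm);     /* same conversion WRF does */
    MPI_Bcast(buf, *size, MPI_BYTE, 0, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[4096];                           /* arbitrary payload size */
    memset(buf, rank == 0 ? 'x' : 0, sizeof(buf));

    /* Mimic the Fortran caller: pass the communicator as an integer handle. */
    MPI_Fint fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);
    int size = (int) sizeof(buf);
    byte_bcast(buf, &size, &fcomm);

    if (buf[0] != 'x')
        printf("rank %d: did not receive the broadcast payload\n", rank);

    MPI_Finalize();
    return 0;
}

Looping that over many iterations on a few hundred ranks might help show
whether the failure is WRF-specific or reproducible with a plain MPI_Bcast of
MPI_BYTE.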
> >>
> >> But the crash is happening in the mpich library.  However, when I compile
> >> the library with debugging information, I no longer get stack traces.  I get
> >> no information except that when mpiexec exits I get error code 44544.
> >>
> >> The latest test is with r5609, built with:
> >>
> >> ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes --with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug
> >>
> >> Any suggestions on where to go next would be appreciated.
> >>
> >> Craig
> >>
> >>
> >
> >
> >
>
>


-- 
Devendar