[mvapich-discuss] Crashes running wrf with mvapich2 1.8

Craig Tierney craig.tierney at noaa.gov
Wed Aug 22 21:17:51 EDT 2012


I am trying to run WRF (V3.0.1.1, yes I know it is old) with mvapich2 1.8, but I cannot run it
reliably: roughly 2-5% of runs crash.  I have been trying to collect debugging information to
pin down the problem, but with limited success.  Here is what I know:

- Mvapich2 1.6 does not have this problem
- Mvapich2 1.8 and the nightly builds do show this problem
- It happens at core counts from 256 up into the thousands
-- I haven't tried smaller runs because of my problem size
- Failure rates appear to correlate with higher core counts

When I build mvapich2 without debugging, I get a traceback like:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libmpich.so.3      00007F2EF0F2994A  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F2A1D7  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F24A5C  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F1E95F  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F69856  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F697A9  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F6967A  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F15B4B  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F13AA2  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F13916  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F138F8  Unknown               Unknown  Unknown
libmpich.so.3      00007F2EF0F13876  Unknown               Unknown  Unknown
wrf.exe            00000000013D7F32  Unknown               Unknown  Unknown
wrf.exe            00000000004BD259  wrf_dm_bcast_byte        1627  module_dm.f90
wrf.exe            000000000112CA5A  module_ra_rrtm_mp        6860  module_ra_rrtm.f90
wrf.exe            000000000112E096  module_ra_rrtm_mp        6551  module_ra_rrtm.f90
wrf.exe            0000000000E36922  module_physics_in         898  module_physics_init.f90
wrf.exe            0000000000E2FA8F  module_physics_in         410  module_physics_init.f90
wrf.exe            00000000009091A5  start_domain_em_          641  start_em.f90
wrf.exe            00000000006068C0  start_domain_             152  start_domain.f90
wrf.exe            00000000006028D3  med_initialdata_i         138  mediation_wrfmain.f90
wrf.exe            00000000004082F7  module_wrf_top_mp         241  module_wrf_top.f90
wrf.exe            00000000004078A2  MAIN__                     21  wrf.f90
wrf.exe            000000000040782C  Unknown               Unknown  Unknown
libc.so.6          00007F2EEF74CCDD  Unknown               Unknown  Unknown
wrf.exe            0000000000407729  Unknown               Unknown  Unknown

In module_dm.f90, the code that is being called is:

   CALL BYTE_BCAST ( buf , size, local_communicator )

Which is a call to:

void BYTE_BCAST ( char * buf, int * size, int * Fcomm )
{
#ifndef STUBMPI
    MPI_Comm *comm, dummy_comm ;

    comm = &dummy_comm ;
    *comm = MPI_Comm_f2c( *Fcomm ) ;
# ifdef crayx1
    if (*size % sizeof(int) == 0) {
       MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
    } else {
       MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
    }
# else
    MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
# endif
#endif
}

I am not sure why BYTE_BCAST itself isn't showing up in the stack trace; I will recompile WRF to find out.

But the crash is happening in the mpich library.  However, when I compile the library with
debugging information, I no longer get stack traces.  The only information I get is that
mpiexec exits with error code 44544.
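One fallback when the runtime prints nothing is to let the ranks drop core files and pull a backtrace out of one afterward; a sketch, assuming a Linux shell (the mpiexec and gdb invocations are illustrative for this setup and left commented out):

```shell
# Enable core dumps before launching, then inspect any core with gdb.
ulimit -c unlimited   # allow core files of any size
ulimit -c             # verify the new limit took effect

# mpiexec -np 256 ./wrf.exe
# gdb ./wrf.exe core.<pid>    # then: bt full
```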

The latest test is with r5609, built with:

./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes
--with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug

Any suggestions on where to go next would be appreciated.

Craig



