[mvapich-discuss] Crashes running wrf with mvapich2 1.8
Craig Tierney
craig.tierney at noaa.gov
Wed Aug 22 21:17:51 EDT 2012
I am trying to run WRF (V3.0.1.1; yes, I know it is old) with mvapich2 1.8, but I cannot get it
to run consistently. It crashes roughly 2-5% of the time. I have been trying to collect
debugging information to help track down the problem, but without much success. Here is what
I know:
- Mvapich2 1.6 does not have this problem
- Mvapich2 1.8 and nightly builds do show this problem
- It happens at various core counts, from 256 up into the thousands
-- I haven't tried smaller runs because of my problem size
- Failure rates appear to increase with core count
When I build mvapich2 without debugging, I get a traceback like:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmpich.so.3 00007F2EF0F2994A Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F2A1D7 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F24A5C Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F1E95F Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F69856 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F697A9 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F6967A Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F15B4B Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F13AA2 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F13916 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F138F8 Unknown Unknown Unknown
libmpich.so.3 00007F2EF0F13876 Unknown Unknown Unknown
wrf.exe 00000000013D7F32 Unknown Unknown Unknown
wrf.exe 00000000004BD259 wrf_dm_bcast_byte 1627 module_dm.f90
wrf.exe 000000000112CA5A module_ra_rrtm_mp 6860 module_ra_rrtm.f90
wrf.exe 000000000112E096 module_ra_rrtm_mp 6551 module_ra_rrtm.f90
wrf.exe 0000000000E36922 module_physics_in 898 module_physics_init.f90
wrf.exe 0000000000E2FA8F module_physics_in 410 module_physics_init.f90
wrf.exe 00000000009091A5 start_domain_em_ 641 start_em.f90
wrf.exe 00000000006068C0 start_domain_ 152 start_domain.f90
wrf.exe 00000000006028D3 med_initialdata_i 138 mediation_wrfmain.f90
wrf.exe 00000000004082F7 module_wrf_top_mp 241 module_wrf_top.f90
wrf.exe 00000000004078A2 MAIN__ 21 wrf.f90
wrf.exe 000000000040782C Unknown Unknown Unknown
libc.so.6 00007F2EEF74CCDD Unknown Unknown Unknown
wrf.exe 0000000000407729 Unknown Unknown Unknown
In module_dm.f90, the code being called is:
CALL BYTE_BCAST ( buf , size, local_communicator )
which is a call to:
BYTE_BCAST ( char * buf, int * size, int * Fcomm )
{
#ifndef STUBMPI
    MPI_Comm *comm, dummy_comm ;
    comm = &dummy_comm ;
    *comm = MPI_Comm_f2c( *Fcomm ) ;
# ifdef crayx1
    if (*size % sizeof(int) == 0) {
        MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
    } else {
        MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
    }
# else
    MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
# endif
#endif
}
I am not sure why BYTE_BCAST isn't showing up by name in the stack trace; I will recompile WRF
to find out. Either way, the crash is happening inside the mpich library. However, when I compile
the library with debugging information, I no longer get stack traces. The only information I get
is that mpiexec exits with error 44544.
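One way to read the 44544 value, offered as a sketch rather than a diagnosis: if that number is a raw wait(2)-style status rather than a plain exit code (an assumption, since nothing above says how mpiexec reports it), then the exit code sits in the second byte. The helper name exit_code_from_status below is hypothetical; WIFEXITED and WEXITSTATUS are the standard POSIX macros from <sys/wait.h>.

```c
#include <sys/wait.h>

/* Hypothetical helper: extract the exit code from a raw wait(2)-style
 * status word. Returns -1 if the status does not describe a normal exit
 * (for example, termination by a signal).
 * Assumption: the value being decoded really is a wait status. */
int exit_code_from_status(int status)
{
    if (!WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

On Linux, exit_code_from_status(44544) gives 174, since 44544 is 174 * 256. If the assumption holds, that matches the "forrtl: severe (174)" in the traceback from the non-debug build, which would suggest the debug build is hitting the same segmentation fault and only the traceback output is being lost.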
The latest test is with r5609, built with:
./configure CC=icc CXX=icpc F77=ifort FC=ifort \
    --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 \
    --with-ib-libpath=/usr/lib64 --enable-romio=yes \
    --with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug
Any suggestions on where to go next would be appreciated.
Craig