[mvapich-discuss] Crashes running wrf with mvapich2 1.8
Devendar Bureddy
bureddy at cse.ohio-state.edu
Wed Aug 22 23:32:36 EDT 2012
Hi Craig,
Thanks for reporting the issue. Can you add '--enable-fast=none' to
your debug configure flags and add MV2_DEBUG_SHOW_BACKTRACE=1 to your
run-time environment, and see whether that shows any useful information?
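For example (the install prefix, process count, and hostfile below are only
placeholders; adjust them to your setup):

  ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=<install-dir> \
      --enable-g=debug --enable-fast=none
  mpirun_rsh -np 256 -hostfile ./hosts MV2_DEBUG_SHOW_BACKTRACE=1 ./wrf.exe
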
Did you get a chance to run with mvapich2-1.7 at any point? That
information might help us narrow down the issue.
-Devendar
On Wed, Aug 22, 2012 at 8:17 PM, Craig Tierney <craig.tierney at noaa.gov> wrote:
>
> I am trying to run WRF (V3.0.1.1, yes I know it is old) with mvapich2 1.8. I am having trouble running
> it reliably: I get crashes on the order of 2-5% of the time. I have been trying to collect
> debugging information to help figure out the problem, but I am running into difficulties. Here is what
> I know:
>
> - Mvapich2 1.6 does not have this problem
> - Mvapich2 1.8 and the nightly builds do show this problem
> - It happens at core counts ranging from 256 into the thousands
> -- I haven't tried running smaller because of my problem size
> - Failure rates appear to correlate with higher core counts
>
> When I build mvapich2 without debugging, I can get a traceback like:
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine            Line    Source
> libmpich.so.3      00007F2EF0F2994A  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F2A1D7  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F24A5C  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F1E95F  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F69856  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F697A9  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F6967A  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F15B4B  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F13AA2  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F13916  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F138F8  Unknown            Unknown Unknown
> libmpich.so.3      00007F2EF0F13876  Unknown            Unknown Unknown
> wrf.exe            00000000013D7F32  Unknown            Unknown Unknown
> wrf.exe            00000000004BD259  wrf_dm_bcast_byte  1627    module_dm.f90
> wrf.exe            000000000112CA5A  module_ra_rrtm_mp  6860    module_ra_rrtm.f90
> wrf.exe            000000000112E096  module_ra_rrtm_mp  6551    module_ra_rrtm.f90
> wrf.exe            0000000000E36922  module_physics_in  898     module_physics_init.f90
> wrf.exe            0000000000E2FA8F  module_physics_in  410     module_physics_init.f90
> wrf.exe            00000000009091A5  start_domain_em_   641     start_em.f90
> wrf.exe            00000000006068C0  start_domain_      152     start_domain.f90
> wrf.exe            00000000006028D3  med_initialdata_i  138     mediation_wrfmain.f90
> wrf.exe            00000000004082F7  module_wrf_top_mp  241     module_wrf_top.f90
> wrf.exe            00000000004078A2  MAIN__             21      wrf.f90
> wrf.exe            000000000040782C  Unknown            Unknown Unknown
> libc.so.6          00007F2EEF74CCDD  Unknown            Unknown Unknown
> wrf.exe            0000000000407729  Unknown            Unknown Unknown
>
> In module_dm.f90, the code that is being called is:
>
> CALL BYTE_BCAST ( buf , size, local_communicator )
>
> Which is a call to:
>
> BYTE_BCAST ( char * buf, int * size, int * Fcomm )
> {
> #ifndef STUBMPI
>   MPI_Comm *comm, dummy_comm ;
>
>   comm = &dummy_comm ;
>   *comm = MPI_Comm_f2c( *Fcomm ) ;
> # ifdef crayx1
>   if (*size % sizeof(int) == 0) {
>     MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
>   } else {
>     MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
>   }
> # else
>   MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
> # endif
> #endif
> }
>
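> In case it helps isolate this, below is a minimal stand-alone sketch of the same pattern
> (my own test code, not part of WRF; the buffer size and iteration count are arbitrary).
> It broadcasts a byte buffer through a communicator that is round-tripped via
> MPI_Comm_c2f/MPI_Comm_f2c, the way BYTE_BCAST receives local_communicator:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(int argc, char **argv)
> {
>     int rank, i;
>     int size = 4 * 1024 * 1024;   /* arbitrary test buffer size */
>     char *buf;
>     MPI_Fint fcomm;
>     MPI_Comm comm;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     buf = malloc(size);
>     if (rank == 0) memset(buf, 0xab, size);
>
>     /* Round-trip the communicator through a Fortran handle, as WRF does
>        when it hands local_communicator down to BYTE_BCAST. */
>     fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);
>     comm  = MPI_Comm_f2c(fcomm);
>
>     for (i = 0; i < 100; i++)     /* repeat to raise the odds of hitting the crash */
>         MPI_Bcast(buf, size, MPI_BYTE, 0, comm);
>
>     if (rank != 0 && (unsigned char)buf[size-1] != 0xab)
>         fprintf(stderr, "rank %d: bad data after bcast\n", rank);
>
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }
>
> If that also fails at the same core counts, it would point at the MPI_BYTE broadcast path itself
> rather than at anything WRF-specific.
>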
> I am not sure why this call isn't showing up in the stack trace; I will recompile WRF to find out.
>
> But the crash is happening in the mpich library. However, when I compile the library with
> debugging information, I no longer get stack traces. I get no information at all, except that
> mpiexec exits with error 44544.
>
> The latest test is with r5609, built with:
>
> ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes \
>     --with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug
>
> Any suggestions on where to go next would be appreciated.
>
> Craig
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Devendar