[mvapich-discuss] Crashes running wrf with mvapich2 1.8

Craig Tierney craig.tierney at noaa.gov
Thu Aug 23 00:16:15 EDT 2012


On 8/22/12 9:32 PM, Devendar Bureddy wrote:
> Hi Craig
> 
> Thanks for reporting the issue. Can you add  '--enable-fast=none' to
> your debug config flags and add  MV2_DEBUG_SHOW_BACKTRACE=1 to your
> run-time flags to see if this shows any useful information.
> 
> Did you get a chance to run with mvapich2-1.7 at any time? This
> information might help us narrow down the issue.
> 

Devendar,

I will rebuild mvapich2 with the option you suggested, and I will give
1.7 a try as well.

Craig
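[Editor's note: the suggested run-time setting could be applied like this; the export/echo lines are just a minimal sketch, and the commented launch line is a placeholder (process count and executable name are hypothetical):]

```shell
# Ask MVAPICH2 to print a backtrace when a process dies on a fatal error
export MV2_DEBUG_SHOW_BACKTRACE=1

# Placeholder launch line -- adjust to the actual job:
# mpiexec -n 256 ./wrf.exe

echo "MV2_DEBUG_SHOW_BACKTRACE=$MV2_DEBUG_SHOW_BACKTRACE"
```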

>  -Devendar
> 
> On Wed, Aug 22, 2012 at 8:17 PM, Craig Tierney <craig.tierney at noaa.gov> wrote:
>>
>> I am trying to run WRF (V3.0.1.1 -- yes, I know it is old) with mvapich2 1.8, and I am having
>> trouble running it consistently: it crashes on the order of 2-5% of the time.  I have been trying
>> to gather debugging information to pin down the problem, but I am running into challenges.  Here
>> is what I know:
>>
>> - Mvapich2 1.6 does not have this problem
>> - Mvapich2 1.8 and the nightly builds do show this problem
>> - It happens at core counts from 256 into the 1000s
>> -- I haven't tried running smaller because of my problem size
>> - Failure rates appear to correlate with higher core counts
>>
>> When I build mvapich2 without debugging, I get a traceback like:
>>
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image              PC                Routine            Line        Source
>> libmpich.so.3      00007F2EF0F2994A  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F2A1D7  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F24A5C  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F1E95F  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F69856  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F697A9  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F6967A  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F15B4B  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F13AA2  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F13916  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F138F8  Unknown               Unknown  Unknown
>> libmpich.so.3      00007F2EF0F13876  Unknown               Unknown  Unknown
>> wrf.exe            00000000013D7F32  Unknown               Unknown  Unknown
>> wrf.exe            00000000004BD259  wrf_dm_bcast_byte        1627  module_dm.f90
>> wrf.exe            000000000112CA5A  module_ra_rrtm_mp        6860  module_ra_rrtm.f90
>> wrf.exe            000000000112E096  module_ra_rrtm_mp        6551  module_ra_rrtm.f90
>> wrf.exe            0000000000E36922  module_physics_in         898  module_physics_init.f90
>> wrf.exe            0000000000E2FA8F  module_physics_in         410  module_physics_init.f90
>> wrf.exe            00000000009091A5  start_domain_em_          641  start_em.f90
>> wrf.exe            00000000006068C0  start_domain_             152  start_domain.f90
>> wrf.exe            00000000006028D3  med_initialdata_i         138  mediation_wrfmain.f90
>> wrf.exe            00000000004082F7  module_wrf_top_mp         241  module_wrf_top.f90
>> wrf.exe            00000000004078A2  MAIN__                     21  wrf.f90
>> wrf.exe            000000000040782C  Unknown               Unknown  Unknown
>> libc.so.6          00007F2EEF74CCDD  Unknown               Unknown  Unknown
>> wrf.exe            0000000000407729  Unknown               Unknown  Unknown
>>
>> In module_dm.f90, the code that is being called is:
>>
>>    CALL BYTE_BCAST ( buf , size, local_communicator )
>>
>> Which is a call to:
>>
>> void BYTE_BCAST ( char * buf, int * size, int * Fcomm )
>> {
>> #ifndef STUBMPI
>>     MPI_Comm *comm, dummy_comm ;
>>
>>     comm = &dummy_comm ;
>>     *comm = MPI_Comm_f2c( *Fcomm ) ;
>> # ifdef crayx1
>>     if (*size % sizeof(int) == 0) {
>>        MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
>>     } else {
>>        MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
>>     }
>> # else
>>     MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
>> # endif
>> #endif
>> }
>>
>> I am not sure why this isn't showing up in the stack trace; I will recompile WRF to find out.
>>
>> But the crash is happening in the mpich library.  However, when I compile the library with
>> debugging information, I no longer get stack traces; I get no information except that mpiexec
>> reports error 44544 when it exits.
>>
>> The latest test is with r5609, built with:
>>
>> ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes
>> --with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug
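[Editor's note: with the earlier suggestion applied, the configure line above would gain '--enable-fast=none' so internal error checking is not compiled out. A sketch, reusing the paths from the original line (adjust the prefix as needed); nothing here is run, it is just the command as it might be typed:]

```shell
./configure CC=icc CXX=icpc F77=ifort FC=ifort \
    --prefix=/apps/mvapich2/1.8-r5609-intel \
    --with-rdma=gen2 --with-ib-libpath=/usr/lib64 \
    --enable-romio=yes --with-file-system=lustre+panfs \
    --enable-shared --enable-debuginfo \
    --enable-g=debug --enable-fast=none
```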
>>
>> Any suggestions on where to go next would be appreciated.
>>
>> Craig
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 
> 


