[mvapich-discuss] Crashes running wrf with mvapich2 1.8
Craig Tierney
craig.tierney at noaa.gov
Thu Aug 23 15:14:52 EDT 2012
On 8/22/12 9:32 PM, Devendar Bureddy wrote:
> Hi Craig
>
> Thanks for reporting the issue. Can you add '--enable-fast=none' to
> your debug config flags and add MV2_DEBUG_SHOW_BACKTRACE=1 to your
> run-time flags to see if this shows any useful information?
>
> Did you get a chance to run with mvapich2-1.7 at any point? That
> information might help us narrow down the issue.
>
> -Devendar
>
Devendar,
It took a lot longer to get the trace this time (1 failure out of 127 runs). Since the
trace goes through MPI_Bcast, it matches the trace in my earlier message below. Here it is:
Assertion failed in file ch3u_handle_recv_pkt.c at line 215: pkt->type <= MPIDI_CH3_PKT_END_CH3
[s48:mpi_rank_511][print_backtrace] 0: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(print_backtrace+0x30) [0x7f92e5392488]
[s48:mpi_rank_511][print_backtrace] 1: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3_Abort+0x26) [0x7f92e52fa976]
[s48:mpi_rank_511][print_backtrace] 2: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPID_Abort+0x1a4) [0x7f92e53fa8d0]
[s48:mpi_rank_511][print_backtrace] 3: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Assert_fail+0x6a) [0x7f92e52d8b7a]
[s48:mpi_rank_511][print_backtrace] 4: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3U_Handle_recv_pkt+0x7c) [0x7f92e531f090]
[s48:mpi_rank_511][print_backtrace] 5: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(+0xf9f42) [0x7f92e530df42]
[s48:mpi_rank_511][print_backtrace] 6: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3I_SMP_read_progress+0x149) [0x7f92e530ea5d]
[s48:mpi_rank_511][print_backtrace] 7: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIDI_CH3I_Progress+0x13c) [0x7f92e5303012]
[s48:mpi_rank_511][print_backtrace] 8: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Wait+0x5a) [0x7f92e53c5378]
[s48:mpi_rank_511][print_backtrace] 9: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Recv+0x21b) [0x7f92e53c2c5d]
[s48:mpi_rank_511][print_backtrace] 10: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIC_Recv_ft+0x7a) [0x7f92e53c553e]
[s48:mpi_rank_511][print_backtrace] 11: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(+0xd2886) [0x7f92e52e6886]
[s48:mpi_rank_511][print_backtrace] 12: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_intra_MV2+0xc3d) [0x7f92e52ec76d]
[s48:mpi_rank_511][print_backtrace] 13: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_MV2+0x61) [0x7f92e52eca3f]
[s48:mpi_rank_511][print_backtrace] 14: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPIR_Bcast_impl+0xbb) [0x7f92e52e4eff]
[s48:mpi_rank_511][print_backtrace] 15: /apps/mvapich2/1.8-r5609-intel/lib/libmpich.so.3(MPI_Bcast+0xcdc) [0x7f92e52e5ffe]
[s48:mpi_rank_511][print_backtrace] 16: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x13d7ce2]
[s48:mpi_rank_511][print_backtrace] 17: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4bd009]
[s48:mpi_rank_511][print_backtrace] 18: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x112cd7f]
[s48:mpi_rank_511][print_backtrace] 19: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x112de46]
[s48:mpi_rank_511][print_backtrace] 20: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0xe366d2]
[s48:mpi_rank_511][print_backtrace] 21: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0xe2f83f]
[s48:mpi_rank_511][print_backtrace] 22: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x908f55]
[s48:mpi_rank_511][print_backtrace] 23: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x606670]
[s48:mpi_rank_511][print_backtrace] 24: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x602683]
[s48:mpi_rank_511][print_backtrace] 25: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4080a7]
[s48:mpi_rank_511][print_backtrace] 26: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x407652]
[s48:mpi_rank_511][print_backtrace] 27: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4075dc]
[s48:mpi_rank_511][print_backtrace] 28: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f92e3ae8cdd]
[s48:mpi_rank_511][print_backtrace] 29: /lfs1/jetmgmt/ctierney/noaa-bm/sjet/wrf_arw/WRFV3.0.1.1-mvapich18/main/wrf.exe() [0x4074d9]
[cli_511]: aborting job:
internal ABORT - process 511
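
My rough reading of that assertion (this is a simplified sketch, not the actual MVAPICH2
source; the enum and names below are stand-ins) is that the CH3 layer checks the type field
of each received packet header against the last valid packet-type value, so a failure here
means the header read from the shared-memory (SMP) channel did not look like any known
packet, i.e. it was probably corrupted or overwritten:

#include <assert.h>

/* Simplified stand-in for the CH3 packet-type enum: the real one lists every
   packet kind and ends with an MPIDI_CH3_PKT_END_CH3 sentinel. */
typedef enum {
    PKT_EAGER_SEND = 0,
    PKT_RNDV_REQ_TO_SEND,
    /* ... many more packet kinds ... */
    PKT_END_CH3              /* sentinel: last valid CH3 packet type */
} pkt_type_t;

typedef struct { pkt_type_t type; } pkt_header_t;

static void handle_recv_pkt(const pkt_header_t *pkt)
{
    /* Mirrors the failing check: a type past the sentinel means the bytes we
       just read off the channel are not a valid packet header. */
    assert(pkt->type <= PKT_END_CH3);
}

int main(void)
{
    pkt_header_t good = { PKT_EAGER_SEND };
    handle_recv_pkt(&good);   /* passes; a corrupted header would abort here */
    return 0;
}
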
Craig
> On Wed, Aug 22, 2012 at 8:17 PM, Craig Tierney <craig.tierney at noaa.gov> wrote:
>>
>> I am trying to run WRF (V3.0.1.1; yes, I know it is old) with mvapich2 1.8, and I cannot get it to
>> run reliably: it crashes on the order of 2-5% of the time. I have been trying to gather debugging
>> information to help figure out the problem, but I am running into challenges. Here is what
>> I know:
>>
>> - Mvapich2 1.6 does not have this problem
>> - Mvapich2 1.8 and the nightly builds do show this problem
>> - It happens at different core counts, from 256 up into the thousands
>> -- I haven't tried smaller runs because of my problem size
>> - Failure rates appear to be higher at higher core counts
>>
>> When I built mvapich2 without debugging, I got a traceback like this:
>>
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> Image PC Routine Line Source
>> libmpich.so.3 00007F2EF0F2994A Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F2A1D7 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F24A5C Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F1E95F Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F69856 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F697A9 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F6967A Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F15B4B Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F13AA2 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F13916 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F138F8 Unknown Unknown Unknown
>> libmpich.so.3 00007F2EF0F13876 Unknown Unknown Unknown
>> wrf.exe 00000000013D7F32 Unknown Unknown Unknown
>> wrf.exe 00000000004BD259 wrf_dm_bcast_byte 1627 module_dm.f90
>> wrf.exe 000000000112CA5A module_ra_rrtm_mp 6860 module_ra_rrtm.f90
>> wrf.exe 000000000112E096 module_ra_rrtm_mp 6551 module_ra_rrtm.f90
>> wrf.exe 0000000000E36922 module_physics_in 898 module_physics_init.f90
>> wrf.exe 0000000000E2FA8F module_physics_in 410 module_physics_init.f90
>> wrf.exe 00000000009091A5 start_domain_em_ 641 start_em.f90
>> wrf.exe 00000000006068C0 start_domain_ 152 start_domain.f90
>> wrf.exe 00000000006028D3 med_initialdata_i 138 mediation_wrfmain.f90
>> wrf.exe 00000000004082F7 module_wrf_top_mp 241 module_wrf_top.f90
>> wrf.exe 00000000004078A2 MAIN__ 21 wrf.f90
>> wrf.exe 000000000040782C Unknown Unknown Unknown
>> libc.so.6 00007F2EEF74CCDD Unknown Unknown Unknown
>> wrf.exe 0000000000407729 Unknown Unknown Unknown
>>
>> In module_dm.f90, the code that is being called is:
>>
>> CALL BYTE_BCAST ( buf , size, local_communicator )
>>
>> which is a call to:
>>
>> BYTE_BCAST ( char * buf, int * size, int * Fcomm )
>> {
>> #ifndef STUBMPI
>>     MPI_Comm *comm, dummy_comm ;
>>
>>     /* Convert the Fortran communicator handle to a C MPI_Comm. */
>>     comm = &dummy_comm ;
>>     *comm = MPI_Comm_f2c( *Fcomm ) ;
>> # ifdef crayx1
>>     /* On the Cray X1, broadcast as ints when the size allows it. */
>>     if (*size % sizeof(int) == 0) {
>>       MPI_Bcast ( buf, *size/sizeof(int), MPI_INT, 0, *comm ) ;
>>     } else {
>>       MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
>>     }
>> # else
>>     /* Everywhere else: a plain byte broadcast from rank 0. */
>>     MPI_Bcast ( buf, *size, MPI_BYTE, 0, *comm ) ;
>> # endif
>> #endif
>> }
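>>
>> In case it helps with isolating the problem, here is a minimal standalone sketch of the same
>> pattern (my own sketch, not WRF code; the 1 MB buffer size and iteration count are arbitrary):
>> repeated byte broadcasts from rank 0 over a communicator that makes the round trip through a
>> Fortran handle, the way BYTE_BCAST does:
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     char *buf;
>>     int size = 1 << 20;              /* arbitrary 1 MB payload */
>>     int i, rank;
>>     MPI_Fint fcomm;
>>     MPI_Comm comm;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     /* Round-trip the communicator through a Fortran handle, as WRF does. */
>>     fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);
>>     comm  = MPI_Comm_f2c(fcomm);
>>
>>     buf = malloc(size);
>>     for (i = 0; i < 1000; i++) {
>>         if (rank == 0) buf[0] = (char) i;
>>         MPI_Bcast(buf, size, MPI_BYTE, 0, comm);
>>     }
>>
>>     free(buf);
>>     MPI_Finalize();
>>     return 0;
>> }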
>>
>> I am not sure why BYTE_BCAST isn't showing up in the stack trace; I will recompile WRF to find out.
>>
>> But the crash is happening in the mpich library. However, when I compile the library with debugging
>> information, I no longer get stack traces. I get no information at all, except that mpiexec exits
>> with error 44544.
>>
>> The latest test is with r5609, built with:
>>
>> ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/apps/mvapich2/1.8-r5609-intel --with-rdma=gen2 --with-ib-libpath=/usr/lib64 --enable-romio=yes
>> --with-file-system=lustre+panfs --enable-shared --enable-debuginfo --enable-g=debug
>>
>> Any suggestions on where to go next would be appreciated.
>>
>> Craig
>>
>>