[mvapich-discuss] Not Getting MV2 Debug Backtraces when Application Crashes on Large Number of Cores

sindimo at gmail.com sindimo at gmail.com
Thu Dec 12 05:27:18 EST 2013


Dear MV2 Support,

We are currently running on Red Hat Linux 6.2 64-bit with MVAPICH2 1.8.1,
compiled with the Intel compiler 12.1.0, over QLogic InfiniBand QDR (PSM). We
are trying to debug an MPI code problem that occurs when running on a large
number of cores (>6000). The application is a mix of Fortran, C, and C++,
most MPI calls are in Fortran, and we compile the application with debug and
traceback flags (e.g. -g -traceback).
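For reference, the compile lines for the mixed-language sources look roughly
like the following (the file names and optimization level here are only
illustrative, not our exact build commands):

mpif77 -g -traceback -O2 -c solver_mod.f
mpicc  -g -traceback -O2 -c io_layer.c
mpicxx -g -traceback -O2 -c output_writer.cpp
mpif77 -g -traceback -O2 solver_mod.o io_layer.o output_writer.o -lstdc++ -o myapp.exe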


MVAPICH2 was compiled with the options below, which include debug support,
since we usually use TotalView and DDT for debugging:




 [sindimo at superbeast ~]$ mpiname -a

MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:psm



Compilation

CC:  /usr/local/intel/ics12/ics12/bin/icc    -g -DNDEBUG -DNVALGRIND -O2

CXX:  /usr/local/intel/ics12/ics12/bin/icpc   -g -DNDEBUG -DNVALGRIND -O2

F77:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2

FC:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2



Configuration

--prefix=/usr/local/mpi/mvapich2/intel12/1.8.1 --enable-g=dbg
--enable-romio --enable-sharedlibs=gcc --enable-shared --enable-debuginfo
--with-file-system=panfs+nfs+ufs --with-device=ch3:psm
--with-psm-include=/usr/include --with-psm=/usr/lib64 --enable-shared
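
In case it matters, one way we can double-check that the installed library
still carries debug information is something along these lines (the exact
shared-library file name under the install prefix is assumed here):

[sindimo at superbeast ~]$ objdump -h /usr/local/mpi/mvapich2/intel12/1.8.1/lib/libmpich.so | grep debug

which should list .debug_info/.debug_line sections if -g made it into the
build.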





Currently some of the processes crash, and we are not getting any backtraces,
either from the MPI layer or from the Intel debug and traceback flags.



We went through the MVAPICH2 user guide and are using the environment
variables below when launching the job, hoping to get some kind of debug
backtrace:

mpirun_rsh  -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1
MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_FORK_VERBOSE=2 myapp.exe
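
As a sanity check, an equivalent way to make sure the core-size limit and the
MV2 debug settings reach every rank would be a small wrapper script around
the binary, roughly like this (the script name is just illustrative):

#!/bin/sh
# run_myapp.sh - raise the core-file limit and set the MV2 debug
# variables in each rank's environment, then exec the real binary
ulimit -c unlimited
export MV2_DEBUG_SHOW_BACKTRACE=1
export MV2_DEBUG_CORESIZE=unlimited
exec ./myapp.exe "$@"

and then launching with: mpirun_rsh -np 6020 -hostfile myhosts ./run_myapp.sh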



The only information we get when the crash occurs is the typical error below,
indicating that some of the processes have died:

[superbeast381:mpispawn_37][read_size] Unexpected End-Of-File on file
descriptor 23. MPI process died?

[superbeast381:mpispawn_37][handle_mt_peer] Error while reading PMI socket.
MPI process died?



We also noticed that the file below gets created, but it is empty:

mpispawn.80s-6523,superbeast020.btr



We have a few questions:

   1. Are MV2's debug and backtrace environment variables honored when
   running with QLogic PSM, or are they targeted more at OFED?
   2. If they do work with QLogic PSM, what are we doing wrong here, since
   we're not getting any backtraces?
   3. Are there any other options in MV2 that we're not aware of that could
   help us with debugging at the MPI layer?


Thank you for your help, we really appreciate it.



Mohamad Sindi

EXPEC Advanced Research Center

Saudi Aramco