[mvapich-discuss] Not Getting MV2 Debug Backtraces when Application Crashes on Large Number of Cores

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Dec 12 09:51:16 EST 2013


Thank you for your posting.  The environment variables you mentioned
appear to be applicable only to the gen2/mrail channel.  We plan to
extend this support to other channels in the upcoming release.

To help rule out an issue that has already been resolved in the library,
you can try one of the latest versions (1.9 or 2.0b).  You should also
add --disable-fast to your configure options so that your debug builds
retain more debugging information.
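
For example, a rebuilt debug installation could be configured roughly
like this (a sketch based on your current configure line; the 2.0b
prefix is just a placeholder, adjust the paths for your site):

  ./configure --prefix=/usr/local/mpi/mvapich2/intel12/2.0b \
      --enable-g=dbg --enable-debuginfo --disable-fast \
      --enable-romio --enable-shared --enable-sharedlibs=gcc \
      --with-file-system=panfs+nfs+ufs --with-device=ch3:psm \
      --with-psm-include=/usr/include --with-psm=/usr/lib64
  make && make install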

On Thu, Dec 12, 2013 at 5:27 AM, sindimo at gmail.com <sindimo at gmail.com> wrote:
> Dear MV2 Support,
>
> We are currently running RedHat Linux 6.2 64-bit with MVAPICH2 1.8.1,
> compiled with the Intel 12.1.0 compiler, over QLogic InfiniBand QDR (PSM).
> We are trying to debug an MPI code problem that occurs when running on a
> large number of cores (>6000). The application is a mix of Fortran, C, and
> C++, with most MPI calls in Fortran, and we compile it with debug and
> traceback flags (e.g. -g -traceback).
>
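> For illustration, a typical Fortran compile line in our build looks
> roughly like the following (the source file name is just an example):
>
>     mpif90 -g -traceback -O2 -c reservoir_solver.f90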
>
> MVAPICH2 was compiled with the options below, which include debug support,
> since we usually use TotalView and DDT for debugging:
>
> [sindimo at superbeast ~]$ mpiname -a
>
> MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:psm
>
> Compilation
>
> CC:  /usr/local/intel/ics12/ics12/bin/icc    -g -DNDEBUG -DNVALGRIND -O2
>
> CXX:  /usr/local/intel/ics12/ics12/bin/icpc   -g -DNDEBUG -DNVALGRIND -O2
>
> F77:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>
> FC:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>
> Configuration
>
> --prefix=/usr/local/mpi/mvapich2/intel12/1.8.1 --enable-g=dbg --enable-romio
> --enable-sharedlibs=gcc --enable-shared --enable-debuginfo
> --with-file-system=panfs+nfs+ufs --with-device=ch3:psm
> --with-psm-include=/usr/include --with-psm=/usr/lib64 --enable-shared
>
> Currently some of the processes crash, and we are not getting any backtraces,
> either from the MPI layer or from the Intel debug and traceback flags.
>
> We went through the MVAPICH2 user guide and are using the environment
> variables below when launching the job in order to get some sort of debug
> backtrace:
>
> mpirun_rsh  -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1
> MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_FORK_VERBOSE=2 myapp.exe
>
> The only information we get when the crash occurs is the typical error below,
> as some of the processes have died:
>
> [superbeast381:mpispawn_37][read_size] Unexpected End-Of-File on file
> descriptor 23. MPI process died?
>
> [superbeast381:mpispawn_37][handle_mt_peer] Error while reading PMI socket.
> MPI process died?
>
> We also noticed that the file below gets created, but it's empty:
>
> mpispawn.80s-6523,superbeast020.btr
>
> We have a few questions please:
>
> 1. Are the debug and backtrace environment variables of MV2 honored when
>    running with QLogic PSM, or are they more targeted at OFED?
> 2. If they do work with QLogic PSM, what are we doing wrong here, since we
>    are not getting any backtraces?
> 3. Are there any other options in MV2 that we're not aware of that could
>    help us with debugging at the MPI layer?
>
>
> Thank you for your help, we really appreciate it.
>
> Mohamad Sindi
>
> EXPEC Advanced Research Center
>
> Saudi Aramco
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


