[mvapich-discuss] Not Getting MV2 Debug Backtraces when Application Crashes on Large Number of Cores

Walid walid.shaari at gmail.com
Thu Dec 12 08:48:44 EST 2013


The latest stable release is 1.9. Are you using the MVAPICH2 supplied by
QLogic, or one downloaded from the site?


On 12 December 2013 13:27, sindimo at gmail.com <sindimo at gmail.com> wrote:

>  Dear MV2 Support,
>
> We are currently running on Red Hat Linux 6.2 64-bit with MVAPICH2 1.8.1
> compiled with the Intel compiler 12.1.0 over QLogic InfiniBand QDR (PSM).
> We are trying to debug an MPI code problem when running on a large number
> of cores (>6000). The application is a mix of Fortran, C, and C++; most
> MPI calls are in Fortran, and we compile the application with debug and
> traceback flags (e.g. -g -traceback).
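>
> For reference, a compile line for one of the Fortran files would look
> roughly like this (the wrapper invocation and file name are placeholders):
>
>     mpif90 -g -traceback -O2 -c kernel.f90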
>
>
> MVAPICH2 was compiled with the options below, which include debug support,
> as we usually use TotalView and DDT for debugging:
>
>  [sindimo at superbeast ~]$ mpiname -a
>
> MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:psm
>
> Compilation
>
> CC:  /usr/local/intel/ics12/ics12/bin/icc    -g -DNDEBUG -DNVALGRIND -O2
>
> CXX:  /usr/local/intel/ics12/ics12/bin/icpc   -g -DNDEBUG -DNVALGRIND -O2
>
> F77:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>
> FC:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>
> Configuration
>
> --prefix=/usr/local/mpi/mvapich2/intel12/1.8.1 --enable-g=dbg
> --enable-romio --enable-sharedlibs=gcc --enable-shared --enable-debuginfo
> --with-file-system=panfs+nfs+ufs --with-device=ch3:psm
> --with-psm-include=/usr/include --with-psm=/usr/lib64 --enable-shared
>
> Currently, some of the processes crash and we're not getting any
> backtraces, either from the MPI layer or from the Intel debug and
> traceback flags.
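>
> As a stopgap, we are also considering installing a plain glibc signal
> handler in the C layer of the application to dump a raw backtrace from
> any rank that faults. A minimal sketch (the function names below are our
> own, not MVAPICH2 API):
>
>     #include <execinfo.h>
>     #include <signal.h>
>     #include <unistd.h>
>
>     /* Dump a raw glibc backtrace for the faulting rank, then re-raise
>        the signal so a core file can still be written if the core size
>        limit permits one. */
>     static void dump_backtrace(int sig)
>     {
>         void *frames[64];
>         int depth = backtrace(frames, 64);
>         /* backtrace_symbols_fd() avoids malloc(), unlike backtrace_symbols() */
>         backtrace_symbols_fd(frames, depth, STDERR_FILENO);
>         signal(sig, SIG_DFL);
>         raise(sig);
>     }
>
>     /* Called once per rank, right after MPI_Init. */
>     void install_crash_handler(void)
>     {
>         signal(SIGSEGV, dump_backtrace);
>         signal(SIGABRT, dump_backtrace);
>     }
>
> (Linking with -rdynamic makes the function names visible in the glibc
> backtrace output.)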
>
> We went through the MVAPICH2 user guide, and we're using the environment
> variables below when launching the job in order to get some sort of debug
> backtrace:
>
> mpirun_rsh  -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1
> MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_FORK_VERBOSE=2 myapp.exe
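>
> (Our understanding from the user guide is that MV2_DEBUG_CORESIZE=unlimited
> has the same effect on each MPI process as raising the shell core-file
> limit by hand on every node, i.e.:
>
>     ulimit -c unlimited
>
> so please correct us if that shell limit still needs to be raised
> separately.)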
>
> The only information we get when the crash occurs is the typical error
> below, as some of the processes have died:
>
> [superbeast381:mpispawn_37][read_size] Unexpected End-Of-File on file
> descriptor 23. MPI process died?
>
> [superbeast381:mpispawn_37][handle_mt_peer] Error while reading PMI
> socket. MPI process died?
>
> We also noticed that the file below gets created, but it's empty:
>
> mpispawn.80s-6523,superbeast020.btr
>
> We have a few questions:
>
>    1. Are the MV2 debug and backtrace environment variables honored when
>    running with QLogic PSM, or are they targeted more at OFED?
>    2. If they do work with QLogic PSM, what are we doing wrong here, given
>    that we're not getting any backtraces?
>    3. Are there any other options in MV2 that we're not aware of that
>    could help us debug at the MPI layer?
>
>
> Thank you for your help, we really appreciate it.
>
> Mohamad Sindi
>
> EXPEC Advanced Research Center
>
> Saudi Aramco
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>