[mvapich-discuss] Not Getting MV2 Debug Backtraces when Application Crashes on Large Number of Cores

sindimo sindimo at gmail.com
Wed Jan 22 03:24:59 EST 2014


Dear MV2 Support,



We rebuilt MV2 with the suggested option (--disable-fast). When the crash
happens we now see the error below. It is repeated 16 times, which matches
the number of cores on a single Sandy Bridge node:


[][set_coresize_limit] setrlimit: Invalid argument (22)



The name “set_coresize_limit” appears to come from the MV2 source code.
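
For context, setrlimit(2) on Linux fails with EINVAL (errno 22) when the
requested soft limit is greater than the hard limit. The sketch below only
illustrates that rule. It is written in Python, not taken from the MV2
source, and it assumes (without confirmation) that the message above comes
from an attempt to raise the core size limit past the hard limit. The helper
name effective_core_limit is hypothetical.

```python
import resource


def effective_core_limit(requested, hard):
    """Return the largest soft limit, not above `requested`, that `hard` allows.

    Hypothetical helper: it mirrors the kernel rule that a soft limit may
    never exceed the hard limit. Values are in bytes, and
    resource.RLIM_INFINITY stands for "unlimited".
    """
    inf = resource.RLIM_INFINITY
    if hard == inf:
        # An unlimited hard limit accepts any request.
        return requested
    if requested == inf or requested > hard:
        # Asking for more than the hard limit is what setrlimit rejects
        # with EINVAL, so fall back to the hard limit itself.
        return hard
    return requested


def describe(value):
    """Format a limit value for printing."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    wanted = resource.RLIM_INFINITY  # what a request for "unlimited" asks for
    print("core soft limit:", describe(soft))
    print("core hard limit:", describe(hard))
    print("largest soft limit allowed:", describe(effective_core_limit(wanted, hard)))
```

If the hard limit printed here is 0, no process without extra privileges can
raise its core size, whatever value is requested.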



We only see this when running on 6000+ cores, where the jobs crash. Runs on
fewer cores (e.g. 3000) do not show this error and complete fine.



For the record, we use UGE as the job scheduler.



These are the system limits we have set:


[sindimo at superbeast064 ~]$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  16384
memorylocked unlimited
maxproc      202752
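
The listing above comes from an interactive shell, and the csh "limit"
built-in without -h reports soft limits only. Processes started through a
job scheduler may inherit different values, so it can help to print both
soft and hard limits from inside a job on a compute node. The following is a
minimal sketch of such a check. The mapping from csh names to RLIMIT_*
constants and the function names are assumptions made for illustration, not
part of MV2 or UGE.

```python
import resource

# Assumed mapping from the csh `limit` names shown above to RLIMIT_* constants.
CSH_TO_RLIMIT = {
    "cputime": "RLIMIT_CPU",
    "filesize": "RLIMIT_FSIZE",
    "datasize": "RLIMIT_DATA",
    "stacksize": "RLIMIT_STACK",
    "coredumpsize": "RLIMIT_CORE",
    "memoryuse": "RLIMIT_RSS",
    "vmemoryuse": "RLIMIT_AS",
    "descriptors": "RLIMIT_NOFILE",
    "memorylocked": "RLIMIT_MEMLOCK",
    "maxproc": "RLIMIT_NPROC",
}


def format_limit(value):
    """Render a limit the way the shell does for unlimited values."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return {csh_name: (soft, hard)} for every resource this platform has."""
    rows = {}
    for name, const in CSH_TO_RLIMIT.items():
        res = getattr(resource, const, None)
        if res is None:
            continue  # constant not available on this platform
        rows[name] = resource.getrlimit(res)
    return rows


if __name__ == "__main__":
    # Sizes are reported in bytes here, whereas csh prints kbytes.
    for name, (soft, hard) in collect_limits().items():
        print(f"{name:13s} soft={format_limit(soft):>12s} hard={format_limit(hard):>12s}")
```

Comparing the output from a login shell with the output from inside a batch
job shows whether the two environments differ.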



We’re not sure whether this error is due to a limit setting in MV2 or to
something else. We would truly appreciate your help with this.



Thank you.


Sincerely,

Mohamad Sindi

EXPEC Advanced Research Center

Saudi Aramco


On Thu, Dec 12, 2013 at 5:51 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> Thank you for your posting.  The environment variables you mentioned
> seem to be only applicable to gen2/mrail. We plan to expand it to other
> channels in the upcoming release.
>
> To help rule out an issue resolved in the library you can try using
> the latest versions (1.9 or 2.0b).  You should also be sure to add
> --disable-fast to configure to have more debugging info for your debug
> builds.
>
> On Thu, Dec 12, 2013 at 5:27 AM, sindimo at gmail.com <sindimo at gmail.com>
> wrote:
> > Dear MV2 Support,
> >
> > We are currently running on RedHat Linux 6.2 64-bit with MVAPICH2 1.8.1
> > compiled with Intel compiler 12.1.0 over Qlogic Infiniband QDR (PSM). We
> > are trying to debug an MPI code problem when running on a large number of
> > cores (>6000). The application is mixed FORTRAN, C, and C++; most MPI
> > calls are in FORTRAN, and we compile the application with debug and
> > traceback flags (e.g. -g -traceback).
> >
> >
> > MVAPICH2 was compiled with the below options, which include debug
> > support as we usually use Totalview and DDT for debugging:
> >
> >
> >
> >
> > [sindimo at superbeast ~]$ mpiname -a
> >
> > MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:psm
> >
> >
> >
> > Compilation
> >
> > CC:  /usr/local/intel/ics12/ics12/bin/icc    -g -DNDEBUG -DNVALGRIND -O2
> >
> > CXX:  /usr/local/intel/ics12/ics12/bin/icpc   -g -DNDEBUG -DNVALGRIND -O2
> >
> > F77:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
> >
> > FC:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
> >
> >
> >
> > Configuration
> >
> > --prefix=/usr/local/mpi/mvapich2/intel12/1.8.1 --enable-g=dbg
> > --enable-romio --enable-sharedlibs=gcc --enable-shared --enable-debuginfo
> > --with-file-system=panfs+nfs+ufs --with-device=ch3:psm
> > --with-psm-include=/usr/include --with-psm=/usr/lib64 --enable-shared
> >
> >
> >
> >
> >
> > Currently some of the processes crash and we’re not getting any
> > backtraces, either from the MPI layer or from the Intel debug and
> > traceback flags.
> >
> >
> >
> > We went through the MVAPICH2 user guide and we’re using the below
> > environment variables when launching the job to get some sort of debug
> > backtraces:
> >
> > mpirun_rsh  -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1
> > MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_FORK_VERBOSE=2 myapp.exe
> >
> >
> >
> > The only info we get when the crash occurs is the typical error below as
> > some of the processes have died:
> >
> > [superbeast381:mpispawn_37][read_size] Unexpected End-Of-File on file
> > descriptor 23. MPI process died?
> >
> > [superbeast381:mpispawn_37][handle_mt_peer] Error while reading PMI
> > socket. MPI process died?
> >
> >
> >
> > We also noticed that the below file gets created but it’s empty:
> >
> > mpispawn.80s-6523,superbeast020.btr
> >
> >
> >
> > We have a few questions please:
> >
> > Are the debug and back trace environment variables of MV2 honored when
> > running with Qlogic PSM or is this more targeted for OFED?
> > If it works with Qlogic PSM, what are we doing wrong here since we’re not
> > getting any back traces?
> > Are there any other options in MV2 that we're not aware of that could
> > help us with debugging on the MPI layer?
> >
> >
> > Thank you for your help, we really appreciate it.
> >
> >
> >
> > Mohamad Sindi
> >
> > EXPEC Advanced Research Center
> >
> > Saudi Aramco
> >
> >
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>