[mvapich-discuss] Not Getting MV2 Debug Backtraces when Application Crashes on Large Number of Cores

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Jan 22 12:16:25 EST 2014


Hello.  You're getting this output because the application (actually
the user running the application) does not have permission to set the
coredumpsize to the requested value.  Since you're only seeing this
output 16 times, I believe that one of the machines in the larger run
(6000 cores but not 3000) is not set up correctly.
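
For reference, errno 22 is EINVAL, which setrlimit() returns when, among
other things, the requested soft limit exceeds the hard limit.  A quick way
to check the condition by hand on a suspect node (a rough sketch, assuming
a bash shell):

    # On a node whose hard core-file limit is 0, raising the soft limit
    # fails; this is presumably what set_coresize_limit runs into when
    # MV2_DEBUG_CORESIZE=unlimited is requested.
    ulimit -H -c            # hard limit; "0" means core files cannot be enabled
    ulimit -S -c unlimited  # fails, since the soft limit may not exceed the hard limit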

Please check that users can create core files on all of the machines
used in the run.  Once this is corrected you should be able to obtain a
backtrace, which will help us debug your issue further.
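
One way to check this across all of the job's machines (a sketch, assuming
password-less ssh and the hostfile name from your mpirun_rsh command;
substitute pdsh or clush if available):

    # Print the hard core-file limit on every host in the hostfile; any
    # host reporting 0 cannot produce core files for ordinary users.  The
    # "sh -c" wrapper keeps this working even if the remote login shell is
    # csh/tcsh (which the "limit" output below suggests), and "ssh -n"
    # stops ssh from swallowing the rest of the hostfile on stdin.
    while read host; do
        printf '%s: ' "$host"
        ssh -n "$host" "sh -c 'ulimit -H -c'"
    done < myhosts

Hosts that report 0 usually need a core-size entry in
/etc/security/limits.conf (or your site's equivalent) before users can
raise the limit themselves.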

P.S. Have you set the MV2_DEBUG_SHOW_BACKTRACE parameter?  This should
help even if a core file is not created.
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.9.html#x1-15800010.5
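
A minimal launch line, adapted from the command quoted further down in this
thread (the application name and hostfile are the ones from the original
report):

    # MV2_DEBUG_SHOW_BACKTRACE=1 asks MVAPICH2 to print a backtrace when a
    # process fails, independently of whether a core file can be written.
    mpirun_rsh -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1 myapp.exe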

On Wed, Jan 22, 2014 at 3:24 AM, sindimo <sindimo at gmail.com> wrote:
> Dear MV2 Support,
>
>
>
> We’ve rebuilt MV2 with the suggested option (--disable-fast). We now see
> the error below when the crash happens; it is repeated 16 times, which is
> the same number of cores we have on a single Sandy Bridge node:
>
>
> [][set_coresize_limit] setrlimit: Invalid argument (22)
>
>
>
> The “set_coresize_limit” seems to be in the MV2 source code.
>
>
>
> We only see this when running on 6000+ cores, where the jobs crash; runs on
> fewer cores (e.g. 3000) don’t show this error and run fine.
>
>
>
> For the record, we use UGE as the job scheduler.
>
>
>
> These are the system limits we have set:
>
>
> [sindimo at superbeast064 ~]$ limit
> cputime      unlimited
> filesize     unlimited
> datasize     unlimited
> stacksize    unlimited
> coredumpsize 0 kbytes
> memoryuse    unlimited
> vmemoryuse   unlimited
> descriptors  16384
> memorylocked unlimited
> maxproc      202752
>
>
>
> We’re not sure whether this error is due to a limit setting in MV2 or
> something else; we would truly appreciate your help with this.
>
>
>
> Thank you.
>
>
>
> Sincerely,
>
> Mohamad Sindi
>
> EXPEC Advanced Research Center
>
> Saudi Aramco
>
>
>
> On Thu, Dec 12, 2013 at 5:51 PM, Jonathan Perkins
> <perkinjo at cse.ohio-state.edu> wrote:
>>
>> Thank you for your posting.  The environment variables you mentioned
>> seem to be applicable only to gen2/mrail.  We plan to expand them to
>> other channels in the upcoming release.
>>
>> To help rule out an issue already resolved in the library, you can try
>> using the latest versions (1.9 or 2.0b).  You should also be sure to add
>> --disable-fast to configure so that your debug builds carry more
>> debugging info.
>>
>> On Thu, Dec 12, 2013 at 5:27 AM, sindimo at gmail.com <sindimo at gmail.com>
>> wrote:
>> > Dear MV2 Support,
>> >
>> > We are currently running on RedHat Linux 6.2 64-bit with MVAPICH2 1.8.1
>> > compiled with Intel compiler 12.1.0 over QLogic InfiniBand QDR (PSM).
>> > We are trying to debug an MPI code problem when running on a large
>> > number of cores (>6000). The application is mixed FORTRAN, C, and C++,
>> > most MPI calls are in FORTRAN, and we compile the application with debug
>> > and traceback flags (e.g. -g -traceback).
>> >
>> >
>> > MVAPICH2 was compiled with the options below, which include debug
>> > support, since we usually use TotalView and DDT for debugging:
>> >
>> >
>> >
>> >
>> > [sindimo at superbeast ~]$ mpiname -a
>> > MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:psm
>> >
>> > Compilation
>> > CC:  /usr/local/intel/ics12/ics12/bin/icc    -g -DNDEBUG -DNVALGRIND -O2
>> > CXX:  /usr/local/intel/ics12/ics12/bin/icpc   -g -DNDEBUG -DNVALGRIND -O2
>> > F77:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>> > FC:  /usr/local/intel/ics12/ics12/bin/ifort   -g -O2
>> >
>> > Configuration
>> > --prefix=/usr/local/mpi/mvapich2/intel12/1.8.1 --enable-g=dbg --enable-romio
>> > --enable-sharedlibs=gcc --enable-shared --enable-debuginfo
>> > --with-file-system=panfs+nfs+ufs --with-device=ch3:psm
>> > --with-psm-include=/usr/include --with-psm=/usr/lib64 --enable-shared
>> >
>> >
>> >
>> >
>> >
>> > Currently some of the processes crash and we’re not getting any
>> > backtraces, either from the MPI layer or from the Intel debug and
>> > traceback flags.
>> >
>> >
>> >
>> > We went through the MVAPICH2 user guide and we’re using the environment
>> > variables below when launching the job to get some sort of debug
>> > backtraces:
>> >
>> > mpirun_rsh  -np 6020 -hostfile myhosts MV2_DEBUG_SHOW_BACKTRACE=1
>> > MV2_DEBUG_CORESIZE=unlimited MV2_DEBUG_FORK_VERBOSE=2 myapp.exe
>> >
>> >
>> >
>> > The only info we get when the crash occurs is the typical error below,
>> > as some of the processes have died:
>> >
>> > [superbeast381:mpispawn_37][read_size] Unexpected End-Of-File on file descriptor 23. MPI process died?
>> >
>> > [superbeast381:mpispawn_37][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> >
>> >
>> >
>> > We also noticed that the file below gets created but it’s empty:
>> >
>> > mpispawn.80s-6523,superbeast020.btr
>> >
>> >
>> >
>> > We have a few questions please:
>> >
>> > 1. Are the debug and backtrace environment variables of MV2 honored when
>> >    running with QLogic PSM, or are they more targeted at OFED?
>> > 2. If they do work with QLogic PSM, what are we doing wrong here, since
>> >    we’re not getting any backtraces?
>> > 3. Are there any other options in MV2 that we're not aware of that could
>> >    help us debug at the MPI layer?
>> >
>> >
>> > Thank you for your help, we really appreciate it.
>> >
>> >
>> >
>> > Mohamad Sindi
>> >
>> > EXPEC Advanced Research Center
>> >
>> > Saudi Aramco
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > mvapich-discuss mailing list
>> > mvapich-discuss at cse.ohio-state.edu
>> > http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>> >
>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


