[mvapich-discuss] Application failing when run with more than 64 processors

Wesley Emeneker Wesley.Emeneker at oit.gatech.edu
Thu Apr 18 11:49:39 EDT 2013


Configure options: "--with-hwloc --enable-romio --with-file-system=ufs+nfs
--enable-shared --enable-sharedlibs=gcc --disable-registration-cache
--enable-g=all --enable-error-messages=all --disable-fast"
Environment variables set: "MPICH2LIB_CFLAGS=\"-O3
-I/usr/local/packages/panfs/sdk/include\"" "MPICH2LIB_FFLAGS=-O3"
"MPICH2LIB_CXXFLAGS=-O3"
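
A side note on the memcpy assertion in the quoted log below: the reported
addresses are only 0xb0 bytes apart while the copy length is 262144 bytes, so
the two ranges do overlap. The following is a minimal sketch of that
arithmetic, not the check MVAPICH2 itself performs. The helper names
overlap_bytes and ranges_overlap are hypothetical, and only the three numbers
come from the log.

```python
def overlap_bytes(dst: int, src: int, length: int) -> int:
    """Bytes shared by [dst, dst+length) and [src, src+length).

    Hypothetical helper for illustration; not part of MVAPICH2.
    """
    return max(0, length - abs(dst - src))


def ranges_overlap(dst: int, src: int, length: int) -> bool:
    """True when a copy of `length` bytes from src to dst would overlap."""
    return overlap_bytes(dst, src, length) > 0


# Values taken from the assertion message in the log.
dst, src, length = 0x2c42c98, 0x2c42be8, 262144

print(hex(dst - src))                    # 0xb0
print(ranges_overlap(dst, src, length))  # True
print(overlap_bytes(dst, src, length))   # 261968
```

With a distance of 176 bytes, nearly the whole buffer is shared between source
and destination, which is consistent with the memory-corruption suspicion
rather than an off-by-a-few-bytes error.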


On Thu, Apr 18, 2013 at 9:35 AM, Devendar Bureddy <
bureddy at cse.ohio-state.edu> wrote:

> Hi Wesley
>
> It looks like memory corruption is happening somewhere. Can you give more
> details on the configure flags used to build MVAPICH2?
>
> -Devendar
>
>
> On Thu, Apr 18, 2013 at 5:44 AM, Wesley Emeneker <
> Wesley.Emeneker at oit.gatech.edu> wrote:
>
>> Dear MVAPICH devs,
>>   I have an application that fails when more than 64 processors are
>> assigned to the job.
>> I can split the job between multiple nodes and it will run as long as I
>> don't use more than 64 cores.
>>
>> Any ideas what might be causing this?
>>
>> OS: RHEL6.3
>> Compiler: gcc-4.4.5
>> MVAPICH versions affected: 1.6-1.9b
>> CPU: AMD Interlagos (64-core nodes)
>>
>> $ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
>> Surface tensionLB: 0.001
>> deltaRho_g= 8e-05
>> Bond number = 12.5
>> relaxation time fluid 1= 0.6
>> relaxation time fluid 2 = 1.5
>> viscosityRatio = 10
>> contactAngle = 180
>> Size of the domain is 100 x 100 x 200
>> Warning! Rndv Receiver is expecting 550000 Bytes But, is receiving 286000
>> Bytes
>> Assertion failed in file
>> src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 1229: FALSE
>> memcpy argument memory ranges overlap, dst_=0x2c42c98 src_=0x2c42be8
>> len_=262144
>>
>> [cli_127]: aborting job:
>> internal ABORT - process 127
>> terminate called after throwing an instance of 'std::bad_alloc'
>>   what():  std::bad_alloc
>> [iw-k30-29.pace.gatech.edu:mpi_rank_74][error_sighandler] Caught error:
>> Aborted (signal 6)
>>
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 134
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> ===================================================================================
>> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
>> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at iw-k30-30.pace.gatech.edu] main (./pm/pmiserv/pmip.c:206):
>> demux engine error waiting for event
>> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
>> badly; aborting
>> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
>> completion
>> [mpiexec at iw-k30-30.pace.gatech.edu] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>> completion
>> [mpiexec at iw-k30-30.pace.gatech.edu] main (./ui/mpich/mpiexec.c:330):
>> process manager error waiting for completion
>>
>>
>>
>> Thanks,
>> Wesley
>>
>> --
>> Wesley Emeneker, Research Scientist
>> The Partnership for an Advanced Computing Environment
>> Georgia Institute of Technology
>>
>> 404.385.2303
>> Wesley.Emeneker at oit.gatech.edu
>> http://pace.gatech.edu
>>
>>
>>
>
>
> --
> Devendar
>



-- 
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology

404.385.2303
Wesley.Emeneker at oit.gatech.edu
http://pace.gatech.edu