[mvapich-discuss] Application failing when run with more than 64 processors

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu Apr 18 09:35:40 EDT 2013


Hi Wesley,

It looks like memory corruption has happened somewhere. Can you give more
details on the configure flags used to build MVAPICH2?
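One way to collect those details (assuming the MVAPICH2 `mpiname` utility is on the PATH of the installation that built the application; adjust the path if it is not):

```shell
# mpiname ships with MVAPICH2; -a prints the library version, the
# channel/device it was built for, and the full configure command line.
mpiname -a
```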

-Devendar


On Thu, Apr 18, 2013 at 5:44 AM, Wesley Emeneker <
Wesley.Emeneker at oit.gatech.edu> wrote:

> Dear MVAPICH devs,
>   I have an application that fails when more than 64 processors are
> assigned to the job.
> I can split the job between multiple nodes and it will run as long as I
> don't use more than 64 cores.
>
> Any ideas what might be causing this?
>
> OS: RHEL6.3
> Compiler: gcc-4.4.5
> MVAPICH versions affected: 1.6-1.9b
> CPU: AMD Interlagos (64-core nodes)
>
> $ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
> Surface tensionLB: 0.001
> deltaRho_g= 8e-05
> Bond number = 12.5
> relaxation time fluid 1= 0.6
> relaxation time fluid 2 = 1.5
> viscosityRatio = 10
> contactAngle = 180
> Size of the domain is 100 x 100 x 200
> Warning! Rndv Receiver is expecting 550000 Bytes But, is receiving 286000
> Bytes
> Assertion failed in file
> src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 1229: FALSE
> memcpy argument memory ranges overlap, dst_=0x2c42c98 src_=0x2c42be8
> len_=262144
>
> [cli_127]: aborting job:
> internal ABORT - process 127
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
> [iw-k30-29.pace.gatech.edu:mpi_rank_74][error_sighandler] Caught error:
> Aborted (signal 6)
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 134
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] main (./pm/pmiserv/pmip.c:206):
> demux engine error waiting for event
> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at iw-k30-30.pace.gatech.edu] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec at iw-k30-30.pace.gatech.edu] main (./ui/mpich/mpiexec.c:330):
> process manager error waiting for completion
>
>
>
> Thanks,
> Wesley
>
> --
> Wesley Emeneker, Research Scientist
> The Partnership for an Advanced Computing Environment
> Georgia Institute of Technology
>
> 404.385.2303
> Wesley.Emeneker at oit.gatech.edu
> http://pace.gatech.edu
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>


-- 
Devendar

