[mvapich-discuss] Application failing when run with more than 64 processors

Wesley Emeneker Wesley.Emeneker at oit.gatech.edu
Thu Apr 18 06:44:25 EDT 2013


Dear MVAPICH devs,
  I have an application that fails when more than 64 cores are assigned to
the job. If I split the job across multiple nodes, it still runs as long as
the total does not exceed 64 cores.

Any ideas what might be causing this?

OS: RHEL6.3
Compiler: gcc-4.4.5
MVAPICH versions affected: 1.6-1.9b
CPU: AMD Interlagos (64-core nodes)

$ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
Surface tensionLB: 0.001
deltaRho_g= 8e-05
Bond number = 12.5
relaxation time fluid 1= 0.6
relaxation time fluid 2 = 1.5
viscosityRatio = 10
contactAngle = 180
Size of the domain is 100 x 100 x 200
Warning! Rndv Receiver is expecting 550000 Bytes But, is receiving 286000 Bytes
Assertion failed in file src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 1229: FALSE
memcpy argument memory ranges overlap, dst_=0x2c42c98 src_=0x2c42be8 len_=262144

[cli_127]: aborting job:
internal ABORT - process 127
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[iw-k30-29.pace.gatech.edu:mpi_rank_74][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@iw-k30-30.pace.gatech.edu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0@iw-k30-30.pace.gatech.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@iw-k30-30.pace.gatech.edu] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@iw-k30-30.pace.gatech.edu] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@iw-k30-30.pace.gatech.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@iw-k30-30.pace.gatech.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@iw-k30-30.pace.gatech.edu] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion



Thanks,
Wesley

-- 
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology

404.385.2303
Wesley.Emeneker at oit.gatech.edu
http://pace.gatech.edu