[mvapich-discuss] Application failing when run with more than 64 processors

Ed Wahl ewahl at osc.edu
Thu Apr 18 10:38:45 EDT 2013


I've seen this over and over in the past.  For us it's down to just one app that still triggers it with MVAPICH: OpenFOAM.


To work around this, try adding:

export MV2_ON_DEMAND_THRESHOLD=[>= number of cores needed]

The default is 64, and it is set at compile time, if I recall correctly.
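A minimal sketch of the workaround for a 128-rank job like the one reported below (the threshold value and the launch line are illustrative; set the variable to at least your total rank count):

```shell
# Raise MVAPICH's on-demand connection threshold to at least the
# total number of MPI ranks in the job (128 in the reported run).
export MV2_ON_DEMAND_THRESHOLD=128

# Then launch as usual, e.g. (from the original report):
#   mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
```

With mpirun_rsh the variable can also be passed on the command line as a NAME=VALUE argument instead of exporting it first.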


For a while I tried to track this down, but gave up: I was always a version or more behind on MVAPICH, and getting further help would require me to be caught up.  If I ever get some spare time, I'll try to figure this out again.

Ed Wahl
OSC


________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu [mvapich-discuss-bounces at cse.ohio-state.edu] on behalf of Wesley Emeneker [Wesley.Emeneker at oit.gatech.edu]
Sent: Thursday, April 18, 2013 6:44 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Application failing when run with more than 64 processors

Dear MVAPICH devs,
  I have an application that fails when more than 64 processors are assigned to the job.
I can split the job between multiple nodes and it will run as long as I don't use more than 64 cores.

Any ideas what might be causing this?

OS: RHEL6.3
Compiler: gcc-4.4.5
MVAPICH versions affected: 1.6-1.9b
CPU: AMD Interlagos (64-core nodes)

$ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
Surface tensionLB: 0.001
deltaRho_g= 8e-05
Bond number = 12.5
relaxation time fluid 1= 0.6
relaxation time fluid 2 = 1.5
viscosityRatio = 10
contactAngle = 180
Size of the domain is 100 x 100 x 200
Warning! Rndv Receiver is expecting 550000 Bytes But, is receiving 286000 Bytes
Assertion failed in file src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 1229: FALSE
memcpy argument memory ranges overlap, dst_=0x2c42c98 src_=0x2c42be8 len_=262144

[cli_127]: aborting job:
internal ABORT - process 127
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[iw-k30-29.pace.gatech.edu:mpi_rank_74][error_sighandler] Caught error: Aborted (signal 6)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at iw-k30-30.pace.gatech.edu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:0 at iw-k30-30.pace.gatech.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at iw-k30-30.pace.gatech.edu] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at iw-k30-30.pace.gatech.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec at iw-k30-30.pace.gatech.edu] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion



Thanks,
Wesley

--
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology

404.385.2303
Wesley.Emeneker at oit.gatech.edu
http://pace.gatech.edu
