[mvapich-discuss] Application failing when run with more than 64 processors

Wesley Emeneker <Wesley.Emeneker at oit.gatech.edu>
Thu Apr 18 12:57:26 EDT 2013


Ed,
  Setting MV2_ON_DEMAND_THRESHOLD at runtime has taken the application from
broken to working.
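
For anyone who finds this thread later, here is a sketch of what the fix looks
like at runtime (the value 128 below simply matches our 128-rank job; any value
at or above the total process count should behave the same). mpirun_rsh accepts
VAR=value pairs on the command line just before the executable, so no recompile
is needed:

  # raise the on-demand connection threshold to cover all 128 ranks
  $ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE \
        MV2_ON_DEMAND_THRESHOLD=128 ./pinch_off_PM100 <args>

With the threshold at or above the rank count, the job runs to completion.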

Thank you for the tip.

Wesley


On Thu, Apr 18, 2013 at 10:38 AM, Ed Wahl <ewahl at osc.edu> wrote:

> I've seen this over and over in the past.  We're down to just one app that
> still triggers it with mvapich now: OpenFOAM.
>
>
> To work around this, try adding:
>
> export MV2_ON_DEMAND_THRESHOLD=[>= cores needed]
>
> The default is 64, and it is set at compile time if I recall correctly.
>
>
> For a while I tried to track this down, but gave up because I was always a
> version or more behind on mvapich, and getting further help requires being
> caught up.  If I ever get some spare time I'll try to figure it out again.
>
> Ed Wahl
> OSC
>
>
> ________________________________________
> From: mvapich-discuss-bounces at cse.ohio-state.edu
> [mvapich-discuss-bounces at cse.ohio-state.edu] on behalf of Wesley Emeneker
> [Wesley.Emeneker at oit.gatech.edu]
> Sent: Thursday, April 18, 2013 6:44 AM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] Application failing when run with more than 64 processors
>
> Dear MVAPICH devs,
>   I have an application that fails when more than 64 processors are
> assigned to the job.
> I can split the job across multiple nodes, and it runs fine as long as I
> don't use more than 64 cores in total.
>
> Any ideas what might be causing this?
>
> OS: RHEL6.3
> Compiler: gcc-4.4.5
> MVAPICH versions affected: 1.6-1.9b
> CPU: AMD Interlagos (64-core nodes)
>
> $ mpirun_rsh -ssh -np 128 -hostfile $PBS_NODEFILE ./pinch_off_PM100 <args>
> Surface tensionLB: 0.001
> deltaRho_g= 8e-05
> Bond number = 12.5
> relaxation time fluid 1= 0.6
> relaxation time fluid 2 = 1.5
> viscosityRatio = 10
> contactAngle = 180
> Size of the domain is 100 x 100 x 200
> Warning! Rndv Receiver is expecting 550000 Bytes But, is receiving 286000
> Bytes
> Assertion failed in file
> src/mpid/ch3/channels/mrail/src/rdma/ch3_rndvtransfer.c at line 1229: FALSE
> memcpy argument memory ranges overlap, dst_=0x2c42c98 src_=0x2c42be8
> len_=262144
>
> [cli_127]: aborting job:
> internal ABORT - process 127
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
> [iw-k30-29.pace.gatech.edu:mpi_rank_74][error_sighandler] Caught error:
> Aborted (signal 6)
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 134
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at iw-k30-30.pace.gatech.edu] main (./pm/pmiserv/pmip.c:206):
> demux engine error waiting for event
> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at iw-k30-30.pace.gatech.edu] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at iw-k30-30.pace.gatech.edu] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec at iw-k30-30.pace.gatech.edu] main (./ui/mpich/mpiexec.c:330):
> process manager error waiting for completion
>
>
>
> Thanks,
> Wesley
>
> --
> Wesley Emeneker, Research Scientist
> The Partnership for an Advanced Computing Environment
> Georgia Institute of Technology
>
> 404.385.2303
> Wesley.Emeneker at oit.gatech.edu
> http://pace.gatech.edu
>



-- 
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology

404.385.2303
Wesley.Emeneker at oit.gatech.edu
http://pace.gatech.edu