[mvapich-discuss] Can't run jobs on multiple nodes

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Aug 16 11:51:30 EDT 2012


Hello, let's try seeing if a simple case works.

Does something basic like osu_latency work between two nodes?  What does
ulimit -l show when run on the two nodes?
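For example, something along these lines (the osu_latency path is an assumption; the hostnames and MPI install path are taken from your message):

```shell
# Run the OSU latency benchmark across the two compute nodes.
# Adjust the path to osu_latency to wherever the OSU
# micro-benchmarks were built on your system.
MPI_HOME=/share/apps/mvapich2/1.8/intel_Composer_XE_12.2.137/bin
printf "compute-0-3\ncompute-0-4\n" > hosts
$MPI_HOME/mpiexec -f hosts -n 2 ./osu_latency

# Check the locked-memory limit on each node; MVAPICH2 needs a
# large (ideally "unlimited") value to register memory with the
# InfiniBand HCA, and a small limit often causes MPI_Init failures.
ssh compute-0-3 'ulimit -l'
ssh compute-0-4 'ulimit -l'
```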

Also, a debug build of mvapich2 should provide more information in this
error case.  In addition to --enable-g=dbg, I suggest adding
--disable-fast.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1120009.1.10
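A rebuild along these lines should work (the install prefix here is just an example; only the two configure flags are the actual suggestion):

```shell
# Reconfigure and rebuild MVAPICH2 with debug symbols and with the
# internal error checking that --enable-fast normally compiles out.
./configure --prefix=/share/apps/mvapich2/1.8-dbg \
    --enable-g=dbg --disable-fast
make -j4 && make install
```

Then point MPI_HOME in the job script at the debug install and re-run the failing two-node job; the error output should be more descriptive.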

On Thu, Aug 16, 2012 at 10:12:53AM -0500, Xing Wang wrote:
> Hi All,
> 
> Thanks for reading this email. I'm currently working on a new 44-node cluster. My question may be a silly one, but since I'm new to Linux and MVAPICH2, your help and comments would be very helpful and sincerely appreciated.
> 
> 
> Problem situation:
> 
> 
> We want to run LAMMPS (a parallel molecular dynamics code) on the new cluster. The MPI implementation is MVAPICH2-1.8 and the batch-queuing system is Oracle Grid Engine (GE) 6.2u5. I've set up a queue and assigned 2 compute nodes to it (compute-0-3 and compute-0-4, each with 24 processors). Before running LAMMPS, I tested MVAPICH2 and Grid Engine by submitting a simple parallel script (free -m, to query the memory on multiple nodes), and it worked very well.
> 
> 
> Then I installed and ran LAMMPS as a cluster user. If I run a job on multiple processors within a single node, it works very well. However, if I expand the job across two nodes (i.e. I request more than 24 slots in the parallel submission script), it gets stuck and an error message appears as follows:
> 
> 
> --------------------------------------------------------------------
> [cli_35]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error
> [proxy:0:0 at compute-0-4.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:955): assert (!closed) failed
> [proxy:0:0 at compute-0-4.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at compute-0-4.local] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at compute-0-4.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:69): one of the processes terminated badly; aborting
> [mpiexec at compute-0-4.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at compute-0-4.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec at compute-0-4.local] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion 
> --------------------------------------------------------------------
>  
> Does anyone have similar experience with this? Your comments/help/suggestions would be really appreciated.
> 
> 
> Here is more information in case of need:
> 
> 
> 
> 
> 1.The parallel pe:
> pe_name mvapich2_test
> slots 9999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /opt/gridengine/mpi/startmpi.sh $pe_hostfile
> stop_proc_args NONE
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
> 
> 
> 2. The queue set up:
> qname Ltest.q
> hostlist @LAMMPShosts
> seq_no 0
> load_thresholds np_load_avg=3.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list make mpich mpi orte mvapich2_test
> rerun FALSE
> slots 6,[compute-0-3.local=24],[compute-0-4.local=24]
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> 
> 
> 
> 
> 
> 3. The host file @LAMMPShosts:
> 
> 
> # qconf -shgrp @LAMMPShosts
> group_name @LAMMPShosts
> hostlist compute-0-3.local compute-0-4.local
> 
> 
> 
> 
> 
> 
> 
> 4. The submitting script:
> #!/bin/bash
> #$ -N Lammps_test
> 
> 
> # request the queue for this job
> # for VASP test, replace <queue_name> with Vtest.q
> # for LAMMPS test, replace <queue_name> with Ltest.q
> #$ -q Ltest.q
> 
> 
> # request computational resources for this job as follows
> # replace <num> below with the number of CPUs for the job
> # For Vtest.q, <num>=0~48; for Ltest.q, <num>=0~48
> #$ -pe mvapich2_test 36
> 
> 
> # request wall time (max is 96:00:00)
> #$ -l h_rt=48:00:00
> 
> 
> # run the job from the directory of submission. Uncomment only if you don't want the defaults.
> #$ -cwd
> # combine SGE standard output and error files
> #$ -o $JOB_NAME.o$JOB_ID
> #$ -e $JOB_NAME.e$JOB_ID
> # transfer all your environment variables. Uncomment only if you don't want the defaults
> #$ -V
> 
> 
> # Use full pathname to make sure we are using the right mpi
> MPI_HOME=/share/apps/mvapich2/1.8/intel_Composer_XE_12.2.137/bin
> ## $MPI_HOME/mpiexec -n $NSLOTS lammps-20Aug12/src/lmp_linux < in.poly > out.poly
> $MPI_HOME/mpiexec -n $NSLOTS lammps-20Aug12/src/lmp_linux < lammps-20Aug12/examples/crack/in.crack > out.crack
> 
> 
> 
> --
> Sincerely, 
> Xing Wang
> 
> Graduate Student 
> Department of Engineering Physics 
> UW-Madison
> Madison, WI, 53706
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

