[mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
Bhupender Thakur
bthakur at lsu.edu
Mon Nov 19 20:29:59 EST 2012
Dear Jonathan,
The walltime limit was set to 60 minutes. This can be verified from the timestamps of the
prologue and epilogue scripts. From the output:
...
Done clearing all the allocated nodes
------------------------------------------------------
Concluding PBS prologue script - 19-Nov-2012 13:23:02
------------------------------------------------------
mike005
...
------------------------------------------------------
Running PBS epilogue script - 19-Nov-2012 14:24:55
------------------------------------------------------
Checking node mike005
Thanks for taking the time and effort to look into it.
Please let me know if you need further information.
Regards,
Bhupender.
Bhupender Thakur.
IT- Analyst,
High Performance Computing, LSU.
Ph (225)578-5934
________________________________________
From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
Sent: Monday, November 19, 2012 6:11 PM
To: Bhupender Thakur
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
Thanks for the additional information. I have an additional question about
the wall time limit. Is it set to 1 hr or 1 min? According to your process
tree output, it looks like mpirun_rsh and the other processes keep running
until pbs kills them. The time command seems to show that mpirun_rsh runs
for about 1 minute.
P.S. I've forwarded this message to an internal developer list while we
debug this further.
On Mon, Nov 19, 2012 at 09:52:40PM +0000, Bhupender Thakur wrote:
> Dear Jonathan,
>
> mpiexec without any parameters seems to run fine.
> With the same parameters it seems to clean up better. For this run:
>
> for mpi_pn in 1 2 4 8 16
> do
> let sum_mpi=$mpi_pn*$mpi_nodes
> let OMP_NUM_THREADS=$mpi_width/$sum_mpi
>
> for param in "MV2_USE_XRC=1" "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> do
> echo " $param"
> echo " nodes:$mpi_nodes mpi-per-node:$mpi_pn omp:$OMP_NUM_THREADS"
> #time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
> time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
> done
> done
>
> I see the following error message when running with mpiexec:
> ...
> MV2_USE_RoCE=1
> nodes:16 mpi-per-node:4 omp:4
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 256
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> MV2_USE_XRC=1
> nodes:16 mpi-per-node:8 omp:2
> System date:
> Mon Nov 19 12:50:03 CST 2012
> I am 0 of 128 on mike005
> I am 10 of 128 on mike005
> ...
> With another code, with more numerics built in (it should still finish in a few minutes),
> mpirun_rsh still hangs after the last error, and I can still see processes, including mpispawn,
> lingering on the mother node as well as the other nodes until they are killed by the epilogue.
> The process tree on the mother node looks like this:
>
> |-ntpd,4066 -u ntp:ntp -p /var/run/ntpd.pid -g
> |-pbs_mom,4097
> | |-bash,79154
> | | |-208.mike3.SC,79860 -x /var/spool/torque/mom_priv/jobs/208.mike3.SC
> | | | `-mpirun_rsh,80150 -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
> | | | |-bash,80152 -c...
> | | | | `-mpispawn,80169 0
> | | | | |-a.out,80171
> | | | | | |-{a.out},80180
> | | | | | |-{a.out},80193
> | | | | | `-{a.out},80194
> ......
> | | | | |-a.out,80178
> | | | | | |-{a.out},80182
> | | | | | |-{a.out},80189
> | | | | | `-{a.out},80190
> | | | | `-{mpispawn},80170
> | | | |-ssh,80153 -q mike006...
> | | | |-ssh,80154 -q mike007...
> ........
> | | | |-ssh,80167 -q mike020...
> | | | |-{mpirun_rsh},80151
> | | | `-{mpirun_rsh},80168
> | | `-pbs_demux,79763
> | |-{pbs_mom},4104
> | `-{pbs_mom},4106
> |-portreserve,3573
>
> Netstat shows the connections being maintained
> Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
> tcp 0 0 mike005:54949 mike005:44480 ESTABLISHED 80172/./a.out
> tcp 0 0 mike005:36529 mike010:ssh ESTABLISHED 80157/ssh
> tcp 0 0 mike005:45949 mike008:33736 ESTABLISHED 80169/mpispawn
> tcp 0 0 mike005:42308 mike006:58824 ESTABLISHED 80169/mpispawn
> tcp 0 0 mike005:ssh mike3:53763 ESTABLISHED -
> ...
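>
> A quick way to survey these leftovers across the allocated nodes is a
> loop like the following (a minimal sketch; it assumes passwordless ssh
> and uses this job's "hosts" file and my username as examples):
>
> for node in $(cat hosts); do
>     echo "== $node =="
>     # list any surviving launcher/application processes of mine
>     ssh -q $node "pgrep -l -u bthakur 'mpispawn|mpirun_rsh|a.out'"
>     # and any TCP connections they are still holding open
>     ssh -q $node "netstat -tp 2>/dev/null | grep -E 'mpispawn|a.out'"
> done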
>
> From the output file:
> ...
> MV2_USE_RDMA_CM=1
> nodes:16 mpi-per-node:8 omp:2
> MV2_USE_RoCE=1
> nodes:16 mpi-per-node:8 omp:2
> ------------------------------------------------------
> Running PBS epilogue script - 19-Nov-2012 14:24:55
> ------------------------------------------------------
> Checking node mike005 (MS)
> Checking node mike020 ok
> -> Killing process of bthakur: ./a.out
> -> Killing process of bthakur: ./a.out
> ...
> Job Name: run.sh
> Session Id: 79154
> Resource Limits: ncpus=1,neednodes=16:ppn=16,nodes=16:ppn=16,walltime=01:00:00
> Resources Used: cput=08:03:39,mem=240952kb,vmem=2429576kb,walltime=01:01:52
>
> From the error file:
> ...
> real 0m1.692s
> user 0m0.099s
> sys 0m0.387s
> + for param in '"MV2_USE_ONLY_UD=1"' '"MV2_USE_XRC=1"' '"MV2_USE_RDMA_CM=1"' '"MV2_USE_RoCE=1"'
> + echo MV2_USE_RoCE=1
> + echo 'nodes:16 mpi-per-node:8 omp:2'
> + mpirun_rsh -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
> =>> PBS: job killed: walltime 3635 exceeded limit 3600
> [mike005:mpirun_rsh][signal_processor] Caught signal 15, killing job
> [mike008:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 13. MPI process died?
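>
> If I understand it right, the cleanup that signal_processor attempts
> amounts to relaying the signal to its children; a minimal shell sketch
> of the idea (illustrative only, not mvapich2's actual implementation):
>
> #!/bin/bash
> # start the worker in the background and remember its pid
> ./a.out & pid=$!
> # on SIGTERM (what PBS sends at the walltime limit), relay the signal
> # to the child and reap it; if the launcher is killed outright before
> # the trap runs, the child is left behind, which resembles what we see
> trap 'kill -TERM $pid; wait $pid' TERM
> wait $pid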
>
> I was starting to believe that this might have something to do with multi-threaded
> pbs_moms being unable to clean up, as was suggested in this thread:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> but I honestly don't know the cause. For your information, we are running RHEL-6.2.
> Let me know if you might know of a possible reason.
>
> Best,
> Bhupender.
>
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934
>
> ________________________________________
> From: Bhupender Thakur
> Sent: Monday, November 19, 2012 1:09 PM
> To: Jonathan Perkins
> Subject: RE: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
>
> Dear Jonathan,
>
> Thank you for your prompt response. mpiexec without any parameters seems to run fine.
> With the same parameters it seems to clean up better. For this run:
>
> for mpi_pn in 1 2 4 8 16
> do
> let sum_mpi=$mpi_pn*$mpi_nodes
> let OMP_NUM_THREADS=$mpi_width/$sum_mpi
>
> for param in "MV2_USE_XRC=1" "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> do
> echo " $param"
> echo " nodes:$mpi_nodes mpi-per-node:$mpi_pn omp:$OMP_NUM_THREADS"
> #time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
> time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
> done
> done
>
> I see the following error message when running with mpiexec:
>
> MV2_USE_RoCE=1
> nodes:16 mpi-per-node:4 omp:4
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 256
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> MV2_USE_RDMA_CM=1
> nodes:16 mpi-per-node:4 omp:4
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 65280
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> MV2_USE_XRC=1
> nodes:16 mpi-per-node:8 omp:2
> System date:
> Mon Nov 19 12:50:03 CST 2012
> I am 0 of 128 on mike005
> I am 10 of 128 on mike005
> ...
>
> Usually when jobs crash, processes are left behind and they continue to use a lot of CPU time. I will have to
> check with a production job to see if mpispawn is left behind. This was a small instance of the cleanup failure I
> was able to generate. The ssh connections seem to persist, though, as is shown by netstat -tp.
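>
> In the meantime the leftovers can be cleared by hand with something
> along these lines, mimicking what the epilogue eventually does (a
> sketch; the user and process names are from this run):
>
> for node in $(cat hosts); do
>     # kill my surviving application and spawner processes on each node
>     ssh -q $node "pkill -u bthakur -f './a.out'; pkill -u bthakur mpispawn"
> done
>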
> I was starting to believe that this might have something to do with multi-threaded pbs_moms being unable to
> clean up, as was suggested in this thread:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> but I honestly don't know the cause. For your information, we are running RHEL-6.2.
>
> Best,
> Bhupender.
>
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934
>
> ________________________________________
> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> Sent: Monday, November 19, 2012 12:27 PM
> To: Bhupender Thakur
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
>
> Bhupender:
Thanks for your report. This could be a problem with mvapich2. Can you
tell us a bit more about the problem you're facing? Which processes in
particular are being left behind (mpispawn and/or other processes)?
Also, does this happen when using mpiexec as well?
>
> On Mon, Nov 19, 2012 at 05:05:26PM +0000, Bhupender Thakur wrote:
> > Hi,
> >
> > We are working on bringing up mvapich2 on our new cluster but have run into some issues
> > with mvapich2 frequently being unable to clean up when jobs fail.
> >
> > We are using Mellanox InfiniBand:
> > $ ibv_devinfo
> > hca_id: mlx4_0
> > transport: InfiniBand (0)
> > fw_ver: 2.10.4492
> > node_guid: 0002:c903:00ff:25b0
> > sys_image_guid: 0002:c903:00ff:25b3
> > vendor_id: 0x02c9
> > vendor_part_id: 4099
> > hw_ver: 0x0
> > board_id: DEL0A30000019
> > phys_port_cnt: 1
> > port: 1
> > state: PORT_ACTIVE (4)
> > max_mtu: 2048 (4)
> > active_mtu: 2048 (4)
> > sm_lid: 1
> > port_lid: 300
> > port_lmc: 0x00
> > link_layer: IB
> >
> >
> > mvapich2 1.9a
> > $ mpiname -a
> > MVAPICH2 1.9a Sat Sep 8 15:01:35 EDT 2012 ch3:mrail
> >
> > Compilation
> > CC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc -O2 -fPIC -g -DNDEBUG -DNVALGRIND -O2
> > CXX: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc -O2 -fPIC -g -DNDEBUG -DNVALGRIND -O2
> > F77: /usr/local/compilers/Intel/composer_xe_2013/bin/ifort -g -O2 -L/usr/lib64
> > FC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort -O2 -fPIC -g -O2
> >
> > Configuration
> > --prefix=/usr/local/packages/mvapich2/1.9a/Intel-13.0.0 \
> > FC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort \
> > CC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc \
> > CXX=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc \
> > CFLAGS=-O2 -fPIC FCFLAGS=-O2 -fPIC CXXFLAGS=-O2 -fPIC \
> > LDFLAGS=-L/usr/local/compilers/Intel/composer_xe_2013/lib -L/usr/local/compilers/Intel/composer_xe_2013/lib/intel64 \
> > LIBS= CPPFLAGS= \
> > --enable-rdma-cm --enable-g=dbg --enable-romio --with-file-system=lustre+nfs \
> > --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
> > --enable-threads=runtime --enable-mpe --enable-smpcoll --enable-shared --enable-xrc --with-hwloc
> >
> > $ pbs_mom --version
> > version: 3.0.6
> > The pbs_moms are threaded, with the default of 3 threads.
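> >
> > The thread count can be confirmed on a node with something like:
> >
> > # NLWP is the number of threads in each pbs_mom process
> > ps -o pid,nlwp,cmd -p $(pgrep -x pbs_mom)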
> >
> > This does not happen with Open MPI. A sample hello world program is run with different parameters:
> >
> > program dummy
> > use mpi
> > character*10 name
> > ! Init MPI
> > call MPI_Init(mpierr)
> > ! Get Rank Size
> > call MPI_COMM_Rank(MPI_COMM_WORLD, nrank, mpierr)
> > call MPI_COMM_Size(MPI_COMM_WORLD, nproc, mpierr)
> > ! Get Date
> > if (nrank==0) then
> > write(*,*)'System date: Running mpirun_rsh'
> > call system('date')
> > end if
> > ! Print rank
> > call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> > !
> > call MPI_Get_processor_name(name, nlen, mpierr)
> > write(*,*)" I am ", nrank, " of " ,nproc, " on ", name
> > !
> > call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> > ! Finalize
> > call MPI_Finalize(mpierr)
> > end
> >
> > ===========
> > #
> > cat $PBS_NODEFILE > hostfile
> > cat $PBS_NODEFILE | uniq > hosts
> > mpi_width=`cat hostfile | wc -l`
> > mpi_nodes=`cat hosts | wc -l`
> >
> > for mpi_pn in 8 16
> > do
> > let sum_mpi=$mpi_pn*$mpi_nodes
> > let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> >
> > for param in "MV2_USE_XRC=1" "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> > do
> > echo " $param"
> > echo " nodes:$mpi_nodes mpi-per-node:$mpi_pn omp:$OMP_NUM_THREADS"
> > time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
> > done
> > done
> > ===============
> > Runs with the parameters "MV2_USE_RoCE=1" or "MV2_USE_RDMA_CM=1" should fail, as these have
> > not been configured yet (openib.conf); nevertheless, the program does not exit cleanly.
> > We are seeing this with some other applications where the process seems to have crashed
> > and is not producing any useful output, but there are threads lingering on long after the
> > program has crashed.
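> >
> > A quick sanity check on a node before enabling those parameters might
> > look like this (a sketch; module and interface names can vary with the
> > OFED install):
> >
> > # the RDMA connection-manager modules must be loaded for
> > # MV2_USE_RDMA_CM / MV2_USE_RoCE to work
> > lsmod | grep -E 'rdma_cm|rdma_ucm'
> > # and RDMA CM needs an IP address on the IB/RoCE interface (here ib0)
> > ip addr show ib0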
> >
> > At this stage we are not sure if this is an InfiniBand, Torque, or mvapich2 issue. Please let us know if
> > you have seen this behaviour and if there is a way to resolve it.
> >
> > Best,
> > Bhupender.
> >
> >
> > Bhupender Thakur.
> > IT- Analyst,
> > High Performance Computing, LSU.
> > Ph (225)578-5934
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>
>
>
>
>
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo