[mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Nov 19 19:11:56 EST 2012


Thanks for the additional information.  I have an additional question about
the wall time limit.  Is it set to 1 hour or 1 minute?  According to your
process tree output, mpirun_rsh and the other processes are still running
when PBS kills them.  The time command seems to show that mpirun_rsh runs
for about 1 minute.
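
For example, from inside the job something like the following should show the
requested limit alongside the walltime actually used (these are standard
TORQUE qstat -f fields):

    # Check the job's requested walltime limit and the walltime used so far.
    qstat -f $PBS_JOBID | egrep 'Resource_List.walltime|resources_used.walltime'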

P.S. I've forwarded this message to an internal developer list while we
debug this further.

On Mon, Nov 19, 2012 at 09:52:40PM +0000, Bhupender Thakur wrote:
> Dear Jonathan,
> 
> mpiexec without any parameters seems to run fine.
> With the same parameters it seems to clean up better. For this run:
> 
> for mpi_pn in 1 2 4 8 16
>   do
>     let sum_mpi=$mpi_pn*$mpi_nodes
>     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> 
>     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
>      do
>        echo "    $param"
>        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
>        #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
>        time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
>     done
> done
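> 
> (For context, mpi_nodes and mpi_width above come from the PBS node file; the
> lines below are lifted from the original job script quoted at the bottom of
> this thread.)
> 
>   cat $PBS_NODEFILE > hostfile      # one line per allocated processor slot
>   cat $PBS_NODEFILE | uniq > hosts  # one line per node
>   mpi_width=`cat hostfile | wc -l`  # total number of slots
>   mpi_nodes=`cat hosts | wc -l`     # number of nodes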
> 
> I see the following error message when running with mpiexec:
> ...
>     MV2_USE_RoCE=1
>     nodes:16  mpi-per-node:4  omp:4
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
>     MV2_USE_XRC=1
>     nodes:16  mpi-per-node:8  omp:2
>  System date: 
> Mon Nov 19 12:50:03 CST 2012
>      I am            0  of          128  on mike005   
>      I am           10  of          128  on mike005
> ...
> With another code that has more numerics built in (it should still finish in a few minutes),
> mpirun_rsh still hangs after the last error, and I can still see processes, including mpispawn,
> lingering on the mother node as well as the other nodes until they are killed by the epilogue.
> The process tree on the mother node looks like this:
> 
>   |-ntpd,4066 -u ntp:ntp -p /var/run/ntpd.pid -g
>   |-pbs_mom,4097
>   |   |-bash,79154
>   |   |   |-208.mike3.SC,79860 -x /var/spool/torque/mom_priv/jobs/208.mike3.SC
>   |   |   |   `-mpirun_rsh,80150 -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
>   |   |   |       |-bash,80152 -c...
>   |   |   |       |   `-mpispawn,80169 0
>   |   |   |       |       |-a.out,80171
>   |   |   |       |       |   |-{a.out},80180
>   |   |   |       |       |   |-{a.out},80193
>   |   |   |       |       |   `-{a.out},80194
>   ......
>   |   |   |       |       |-a.out,80178
>   |   |   |       |       |   |-{a.out},80182
>   |   |   |       |       |   |-{a.out},80189
>   |   |   |       |       |   `-{a.out},80190
>   |   |   |       |       `-{mpispawn},80170
>   |   |   |       |-ssh,80153 -q mike006...
>   |   |   |       |-ssh,80154 -q mike007...
>  ........
>   |   |   |       |-ssh,80167 -q mike020...
>   |   |   |       |-{mpirun_rsh},80151
>   |   |   |       `-{mpirun_rsh},80168
>   |   |   `-pbs_demux,79763
>   |   |-{pbs_mom},4104
>   |   `-{pbs_mom},4106
>   |-portreserve,3573
> 
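> (While this is being debugged, a rough manual cleanup along the lines of what
> the epilogue ends up doing might look like the sketch below; the user name and
> process names are taken from this particular test case and would need to be
> adapted.)
> 
>   # Kill leftover launcher/application processes on every node of the job.
>   # Assumes passwordless ssh and the 'hosts' file from the job script.
>   for node in `cat hosts`; do
>       ssh -q $node "pkill -u bthakur mpispawn; pkill -u bthakur a.out" &
>   done
>   pkill -u bthakur mpirun_rsh
>   wait
> 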
> Netstat shows the connections being maintained
> Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
> tcp        0      0 mike005:54949               mike005:44480               ESTABLISHED 80172/./a.out       
> tcp        0      0 mike005:36529               mike010:ssh                 ESTABLISHED 80157/ssh           
> tcp        0      0 mike005:45949               mike008:33736               ESTABLISHED 80169/mpispawn      
> tcp        0      0 mike005:42308               mike006:58824               ESTABLISHED 80169/mpispawn      
> tcp        0      0 mike005:ssh                 mike3:53763                 ESTABLISHED -
> ...
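> 
> (The listing above can be reproduced with something along the lines of the
> command below; only netstat -tp itself is mentioned later in this message,
> the filter is added for illustration.)
> 
>   # Show established TCP connections with the owning PID/program name,
>   # restricted to the launcher and application processes of this job.
>   netstat -tp 2>/dev/null | egrep 'mpirun_rsh|mpispawn|a\.out|ssh'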
> 
> From the output file:
> ...
> MV2_USE_RDMA_CM=1
> nodes:16  mpi-per-node:8  omp:2
> MV2_USE_RoCE=1
> nodes:16  mpi-per-node:8  omp:2
> ------------------------------------------------------
> Running PBS epilogue script    - 19-Nov-2012 14:24:55
> ------------------------------------------------------
> Checking node mike005 (MS)
> Checking node mike020 ok
> -> Killing process of bthakur: ./a.out
> -> Killing process of bthakur: ./a.out
> ...
> Job Name:        run.sh
> Session Id:      79154
> Resource Limits: ncpus=1,neednodes=16:ppn=16,nodes=16:ppn=16,walltime=01:00:00
> Resources Used:  cput=08:03:39,mem=240952kb,vmem=2429576kb,walltime=01:01:52
> 
> From the error file:
> ...
> real	0m1.692s
> user	0m0.099s
> sys	0m0.387s
> + for param in '"MV2_USE_ONLY_UD=1"' '"MV2_USE_XRC=1"' '"MV2_USE_RDMA_CM=1"' '"MV2_USE_RoCE=1"'
> + echo MV2_USE_RoCE=1
> + echo 'nodes:16  mpi-per-node:8  omp:2'
> + mpirun_rsh -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
> =>> PBS: job killed: walltime 3635 exceeded limit 3600
> [mike005:mpirun_rsh][signal_processor] Caught signal 15, killing job
> [mike008:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 13. MPI process died?
> 
> I was starting to believe that this might have something to do with multi-threaded
> pbs_moms being unable to clean up, as was suggested in this thread:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> but I honestly don't know the cause. For your information, we are running RHEL-6.2.
> Let me know if you might know of a possible reason.
> 
> Best,
> Bhupender.
> 
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934
> 
> ________________________________________
> From: Bhupender Thakur
> Sent: Monday, November 19, 2012 1:09 PM
> To: Jonathan Perkins
> Subject: RE: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
> 
> Dear Jonathan,
> 
> Thank you for your prompt response. mpiexec without any parameters seems to run fine.
> With the same parameters it seems to clean up better. For this run:
> 
> for mpi_pn in 1 2 4 8 16
>   do
>     let sum_mpi=$mpi_pn*$mpi_nodes
>     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> 
>     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
>      do
>        echo "    $param"
>        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
>        #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
>        time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
>     done
> done
> 
> I see the following error message when running with mpiexec:
> 
>     MV2_USE_RoCE=1
>     nodes:16  mpi-per-node:4  omp:4
> 
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 256
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
>     MV2_USE_RDMA_CM=1
>     nodes:16  mpi-per-node:4  omp:4
> 
> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 65280
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
>     MV2_USE_XRC=1
>     nodes:16  mpi-per-node:8  omp:2
>  System date:
> Mon Nov 19 12:50:03 CST 2012
>      I am            0  of          128  on mike005
>      I am           10  of          128  on mike005
> ...
> 
> Usually when jobs crash, processes are left behind and they continue to use a lot of CPU time. I will have to
> check with a production job to see if mpispawn is left behind; this was a small instance of the cleanup failure I
> was able to generate. The ssh connections seem to persist, though, as shown by netstat -tp.
> I was starting to believe that this might have something to do with multi-threaded pbs_moms being unable to clean
> up, as was suggested in this thread:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> but I honestly don't know the cause. For your information, we are running RHEL-6.2.
> 
> Best,
> Bhupender.
> 
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934
> 
> ________________________________________
> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> Sent: Monday, November 19, 2012 12:27 PM
> To: Bhupender Thakur
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
> 
> Bhupender:
> Thanks for your report.  This could be a problem with mvapich2.  Can you
> tell us a bit more about the problem you're facing?  Which processes in
> particular are being left behind (mpispawn and/or other processes)?
> Also, does this happen when using mpiexec as well?
> 
> On Mon, Nov 19, 2012 at 05:05:26PM +0000, Bhupender Thakur wrote:
> > Hi,
> >
> > We are working on implementing mvapich2 on our new cluster but have run into some issues
> > with mvapich2 frequently being unable to clean up when jobs fail.
> >
> > We are using Mellanox InfiniBand:
> > $ ibv_devinfo
> > hca_id:    mlx4_0
> >     transport:            InfiniBand (0)
> >     fw_ver:                2.10.4492
> >     node_guid:            0002:c903:00ff:25b0
> >     sys_image_guid:            0002:c903:00ff:25b3
> >     vendor_id:            0x02c9
> >     vendor_part_id:            4099
> >     hw_ver:                0x0
> >     board_id:            DEL0A30000019
> >     phys_port_cnt:            1
> >         port:    1
> >             state:            PORT_ACTIVE (4)
> >             max_mtu:        2048 (4)
> >             active_mtu:        2048 (4)
> >             sm_lid:            1
> >             port_lid:        300
> >             port_lmc:        0x00
> >             link_layer:        IB
> >
> >
> > mvapich2 1.9a
> > $ mpiname -a
> > MVAPICH2 1.9a Sat Sep  8 15:01:35 EDT 2012 ch3:mrail
> >
> > Compilation
> > CC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc -O2 -fPIC   -g -DNDEBUG -DNVALGRIND -O2
> > CXX: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc -O2 -fPIC  -g -DNDEBUG -DNVALGRIND -O2
> > F77: /usr/local/compilers/Intel/composer_xe_2013/bin/ifort   -g -O2 -L/usr/lib64
> > FC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort -O2 -fPIC  -g -O2
> >
> > Configuration
> > --prefix=/usr/local/packages/mvapich2/1.9a/Intel-13.0.0 \
> > FC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort \
> > CC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc \
> > CXX=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc \
> > CFLAGS=-O2 -fPIC FCFLAGS=-O2 -fPIC CXXFLAGS=-O2 -fPIC \
> > LDFLAGS=-L/usr/local/compilers/Intel/composer_xe_2013/lib -L/usr/local/compilers/Intel/composer_xe_2013/lib/intel64 \
> > LIBS= CPPFLAGS= \
> > --enable-rdma-cm --enable-g=dbg --enable-romio --with-file-system=lustre+nfs \
> > --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
> > --enable-threads=runtime --enable-mpe --enable-smpcoll --enable-shared --enable-xrc --with-hwloc
> >
> > $ pbs_mom --version
> > version: 3.0.6
> > The pbs_moms are threaded with the default of 3 threads.
> >
> > This does not happen with OpenMPI. A sample hello-world program is run with different parameters:
> >
> > program dummy
> >   use mpi
> >   character*10 name
> > ! Init MPI
> >     call MPI_Init(mpierr)
> > ! Get Rank Size
> >     call MPI_COMM_Rank(MPI_COMM_WORLD, nrank, mpierr)
> >     call MPI_COMM_Size(MPI_COMM_WORLD, nproc, mpierr)
> > ! Get Date
> >     if (nrank==0) then
> >     write(*,*)'System date: Running mpirun_rsh'
> >     call system('date')
> >     end if
> > ! Print rank
> >     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> >     !
> >     call MPI_Get_processor_name(name, nlen, mpierr)
> >     write(*,*)"    I am ", nrank, " of " ,nproc, " on ", name
> >     !
> >     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> > ! Finalize
> >     call MPI_Finalize(mpierr)
> > end
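> >
> > (For completeness, the test program would presumably be built with the
> > MVAPICH2 compiler wrapper along these lines; the source file name is a
> > placeholder.)
> >
> >   # Build the hello-world test with the MVAPICH2 Fortran wrapper.
> >   $MPI/bin/mpif90 -o dummy dummy.f90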
> >
> > ===========
> > #
> >   cat $PBS_NODEFILE > hostfile
> >   cat $PBS_NODEFILE | uniq > hosts
> >   mpi_width=`cat hostfile | wc -l`
> >   mpi_nodes=`cat hosts | wc -l`
> >
> > for mpi_pn in 8 16
> >   do
> >     let sum_mpi=$mpi_pn*$mpi_nodes
> >     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> >
> >     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> >      do
> >        echo "    $param"
> >        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
> >        time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
> >     done
> > done
> > ===============
> > Using the parameters "MV2_USE_RoCE=1" and "MV2_USE_RDMA_CM=1" should fail, as they have
> > not been configured yet (openib.conf); nevertheless, the program does not exit cleanly.
> > We are seeing this with some other applications where the process seems to have crashed
> > and is not producing any useful output, but there are threads lingering on long after the program has
> > crashed.
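> >
> > (One stop-gap while this is debugged would be to wrap each test in a
> > timeout so a hung launcher does not sit there until the walltime limit;
> > coreutils timeout and the 120-second value below are only an illustration,
> > not something from the original script.)
> >
> >   # Send SIGTERM to mpirun_rsh if a single test case has not finished within
> >   # two minutes, instead of letting it run into the PBS walltime limit.
> >   timeout -s TERM 120 mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy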
> >
> > At this stage we are not sure if this is an InfiniBand, Torque, or mvapich2 issue. Please let us know if
> > you have seen this behaviour and if there is a way to resolve this.
> >
> > Best,
> > Bhupender.
> >
> >
> > Bhupender Thakur.
> > IT- Analyst,
> > High Performance Computing, LSU.
> > Ph (225)578-5934
> 
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 
> 
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> 
> 
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


