[mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Nov 20 13:55:00 EST 2012


Thanks for your note.  After some investigation we've found that you are
running your tests with MV2_USE_RoCE=1 set on a cluster which does not
have RoCE-enabled HCAs.  In this case, our library will either hang
(though not indefinitely) or abort.  As you will see in the MVAPICH2 user
guide, MV2_USE_RoCE=1 should only be used on a RoCE cluster.

Our library relies on different mechanisms for bootstrap communication
depending on the launcher used: either PMI (TCP-based) or an IB ring.
In the case of mpirun_rsh we use PMI, as this has been highly optimized,
whereas we use the IB ring with hydra (mpiexec).  The reason hydra fails
quickly is that this IB ring cannot be established due to the
MV2_USE_RoCE=1 setting.  You can force the same behavior with mpirun_rsh
by setting MV2_USE_RING_STARTUP=1.
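
For example, a sketch reusing the hostfile and process count from your
own script (mpirun_rsh accepts VAR=VALUE assignments before the
executable):

    # default mpirun_rsh launch, bootstrapped over PMI:
    mpirun_rsh -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out

    # force the IB-ring bootstrap to reproduce hydra's quick failure:
    mpirun_rsh -np 128 -hostfile hosts MV2_USE_RING_STARTUP=1 \
        MV2_USE_RoCE=1 ./a.out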

We suggest that you do not set MV2_USE_RoCE unless your HCAs are set in
RoCE mode.  Please do let us know if there are still issues using this
environment variable with a RoCE cluster.
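
A quick way to check whether your HCAs are in RoCE mode is the
link_layer field in the ibv_devinfo output you sent earlier: a
RoCE-capable port reports "Ethernet" there, whereas your ports report
"IB".

    $ ibv_devinfo | grep link_layer
                link_layer:        IB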

On Tue, Nov 20, 2012 at 01:29:59AM +0000, Bhupender Thakur wrote:
> Dear Jonathan,
> 
> The walltime limit was set to 60 minutes. This can be verified from the timestamps of the
> prologue and epilogue scripts. From the output:
> ...
> 
> Done clearing all the allocated nodes
> ------------------------------------------------------
> Concluding PBS prologue script - 19-Nov-2012 13:23:02
> ------------------------------------------------------
> mike005
> ...
> ------------------------------------------------------
> Running PBS epilogue script    - 19-Nov-2012 14:24:55
> ------------------------------------------------------
> Checking node mike005
> 
> Thanks for taking the time and effort to look into it.
> Please let me know if you need further information.
> 
> Regards,
> Bhupender.
> 
> 
> 
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934
> 
> ________________________________________
> From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> Sent: Monday, November 19, 2012 6:11 PM
> To: Bhupender Thakur
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
> 
> Thanks for the additional information.  I have an additional question about
> the wall time limit.  Is it set to 1 hr or 1 min?  According to your process
> tree output, it looks like mpirun_rsh and the other processes are still
> running until pbs kills them.  The time command seems to show that
> mpirun_rsh runs for about 1 minute.
> 
> P.S. I've forwarded this message to an internal developer list while we
> debug this further.
> 
> On Mon, Nov 19, 2012 at 09:52:40PM +0000, Bhupender Thakur wrote:
> > Dear Jonathan,
> >
> > mpiexec without any parameters seems to run fine.
> > With the same parameters it seems to clean up better. For this run:
> >
> > for mpi_pn in 1 2 4 8 16
> >   do
> >     let sum_mpi=$mpi_pn*$mpi_nodes
> >     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> >
> >     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> >      do
> >        echo "    $param"
> >        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
> >        #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
> >        time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
> >     done
> > done
> >
> > I see the following error message when running with mpiexec:
> > ...
> >     MV2_USE_RoCE=1
> >     nodes:16  mpi-per-node:4  omp:4
> > =====================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 256
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================================
> >     MV2_USE_XRC=1
> >     nodes:16  mpi-per-node:8  omp:2
> >  System date:
> > Mon Nov 19 12:50:03 CST 2012
> >      I am            0  of          128  on mike005
> >      I am           10  of          128  on mike005
> > ...
> > With another code, but with more numerics built in (it should still finish in a few minutes),
> > mpirun_rsh still hangs after the last error and I can still see processes, including mpispawn,
> > lingering on the mother node as well as the other nodes until they are killed by the epilogue.
> > The process tree on the mother node looks like this:
> >
> >   |-ntpd,4066 -u ntp:ntp -p /var/run/ntpd.pid -g
> >   |-pbs_mom,4097
> >   |   |-bash,79154
> >   |   |   |-208.mike3.SC,79860 -x /var/spool/torque/mom_priv/jobs/208.mike3.SC
> >   |   |   |   `-mpirun_rsh,80150 -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
> >   |   |   |       |-bash,80152 -c...
> >   |   |   |       |   `-mpispawn,80169 0
> >   |   |   |       |       |-a.out,80171
> >   |   |   |       |       |   |-{a.out},80180
> >   |   |   |       |       |   |-{a.out},80193
> >   |   |   |       |       |   `-{a.out},80194
> >   ......
> >   |   |   |       |       |-a.out,80178
> >   |   |   |       |       |   |-{a.out},80182
> >   |   |   |       |       |   |-{a.out},80189
> >   |   |   |       |       |   `-{a.out},80190
> >   |   |   |       |       `-{mpispawn},80170
> >   |   |   |       |-ssh,80153 -q mike006...
> >   |   |   |       |-ssh,80154 -q mike007...
> >  ........
> >   |   |   |       |-ssh,80167 -q mike020...
> >   |   |   |       |-{mpirun_rsh},80151
> >   |   |   |       `-{mpirun_rsh},80168
> >   |   |   `-pbs_demux,79763
> >   |   |-{pbs_mom},4104
> >   |   `-{pbs_mom},4106
> >   |-portreserve,3573
> >
> > Netstat shows the connections being maintained
> > Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
> > tcp        0      0 mike005:54949               mike005:44480               ESTABLISHED 80172/./a.out
> > tcp        0      0 mike005:36529               mike010:ssh                 ESTABLISHED 80157/ssh
> > tcp        0      0 mike005:45949               mike008:33736               ESTABLISHED 80169/mpispawn
> > tcp        0      0 mike005:42308               mike006:58824               ESTABLISHED 80169/mpispawn
> > tcp        0      0 mike005:ssh                 mike3:53763                 ESTABLISHED -
> > ...
> >
> > From the output file:
> > ...
> > MV2_USE_RDMA_CM=1
> > nodes:16  mpi-per-node:8  omp:2
> > MV2_USE_RoCE=1
> > nodes:16  mpi-per-node:8  omp:2
> > ------------------------------------------------------
> > Running PBS epilogue script    - 19-Nov-2012 14:24:55
> > ------------------------------------------------------
> > Checking node mike005 (MS)
> > Checking node mike020 ok
> > -> Killing process of bthakur: ./a.out
> > -> Killing process of bthakur: ./a.out
> > ...
> > Job Name:        run.sh
> > Session Id:      79154
> > Resource Limits: ncpus=1,neednodes=16:ppn=16,nodes=16:ppn=16,walltime=01:00:00
> > Resources Used:  cput=08:03:39,mem=240952kb,vmem=2429576kb,walltime=01:01:52
> >
> > From the error file:
> > ...
> > real  0m1.692s
> > user  0m0.099s
> > sys   0m0.387s
> > + for param in '"MV2_USE_ONLY_UD=1"' '"MV2_USE_XRC=1"' '"MV2_USE_RDMA_CM=1"' '"MV2_USE_RoCE=1"'
> > + echo MV2_USE_RoCE=1
> > + echo 'nodes:16  mpi-per-node:8  omp:2'
> > + mpirun_rsh -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
> > =>> PBS: job killed: walltime 3635 exceeded limit 3600
> > [mike005:mpirun_rsh][signal_processor] Caught signal 15, killing job
> > [mike008:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 13. MPI process died?
> >
> > I was starting to believe that this might have something to do with multi-threaded
> > pbs_moms being unable to clean up, as was suggested in this thread:
> > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> > but I honestly don't know the cause. For your information, we are running RHEL-6.2.
> > Let me know if you might know of a possible reason.
> >
> > Best,
> > Bhupender.
> >
> > Bhupender Thakur.
> > IT- Analyst,
> > High Performance Computing, LSU.
> > Ph (225)578-5934
> >
> > ________________________________________
> > From: Bhupender Thakur
> > Sent: Monday, November 19, 2012 1:09 PM
> > To: Jonathan Perkins
> > Subject: RE: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
> >
> > Dear Jonathan,
> >
> > Thank you for your prompt response. mpiexec without any parameters seems to run fine.
> > With the same parameters it seems to clean up better. For this run:
> >
> > for mpi_pn in 1 2 4 8 16
> >   do
> >     let sum_mpi=$mpi_pn*$mpi_nodes
> >     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> >
> >     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> >      do
> >        echo "    $param"
> >        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
> >        #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
> >        time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
> >     done
> > done
> >
> > I see the following error message when running with mpiexec:
> >
> >     MV2_USE_RoCE=1
> >     nodes:16  mpi-per-node:4  omp:4
> >
> > =====================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 256
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================================
> >     MV2_USE_RDMA_CM=1
> >     nodes:16  mpi-per-node:4  omp:4
> >
> > =====================================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 65280
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================================
> >     MV2_USE_XRC=1
> >     nodes:16  mpi-per-node:8  omp:2
> >  System date:
> > Mon Nov 19 12:50:03 CST 2012
> >      I am            0  of          128  on mike005
> >      I am           10  of          128  on mike005
> > ...
> >
> > Usually when jobs crash, processes are left behind and they continue to use a lot of CPU time.
> > I will have to check with a production job to see if mpispawn is left behind. This was a small
> > instance of the cleanup failure I was able to generate. The ssh connections seem to persist,
> > though, as shown by netstat -tp.
> > I was starting to believe that this might have something to do with multi-threaded pbs_moms
> > being unable to clean up, as was suggested in this thread:
> > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
> > but I honestly don't know the cause. For your information, we are running RHEL-6.2.
> >
> > Best,
> > Bhupender.
> >
> > Bhupender Thakur.
> > IT- Analyst,
> > High Performance Computing, LSU.
> > Ph (225)578-5934
> >
> > ________________________________________
> > From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
> > Sent: Monday, November 19, 2012 12:27 PM
> > To: Bhupender Thakur
> > Cc: mvapich-discuss at cse.ohio-state.edu
> > Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs
> >
> > Bhupender:
> > Thanks for your report.  This could be a problem with mvapich2.  Can you
> > tell us a bit more about the problem you're facing?  Which processes in
> > particular are being left behind (mpispawn and/or other processes)?
> > Also, does this happen when using mpiexec as well?
> >
> > On Mon, Nov 19, 2012 at 05:05:26PM +0000, Bhupender Thakur wrote:
> > > Hi,
> > >
> > > We are working on implementing mvapich2 on our new cluster but have run into some issues
> > > with mvapich2 frequently being unable to clean up when jobs fail.
> > >
> > > We are using Mellanox InfiniBand:
> > > $ ibv_devinfo
> > > hca_id:    mlx4_0
> > >     transport:            InfiniBand (0)
> > >     fw_ver:                2.10.4492
> > >     node_guid:            0002:c903:00ff:25b0
> > >     sys_image_guid:            0002:c903:00ff:25b3
> > >     vendor_id:            0x02c9
> > >     vendor_part_id:            4099
> > >     hw_ver:                0x0
> > >     board_id:            DEL0A30000019
> > >     phys_port_cnt:            1
> > >         port:    1
> > >             state:            PORT_ACTIVE (4)
> > >             max_mtu:        2048 (4)
> > >             active_mtu:        2048 (4)
> > >             sm_lid:            1
> > >             port_lid:        300
> > >             port_lmc:        0x00
> > >             link_layer:        IB
> > >
> > >
> > > mvapich2 1.9a
> > > $ mpiname -a
> > > MVAPICH2 1.9a Sat Sep  8 15:01:35 EDT 2012 ch3:mrail
> > >
> > > Compilation
> > > CC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc -O2 -fPIC   -g -DNDEBUG -DNVALGRIND -O2
> > > CXX: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc -O2 -fPIC  -g -DNDEBUG -DNVALGRIND -O2
> > > F77: /usr/local/compilers/Intel/composer_xe_2013/bin/ifort   -g -O2 -L/usr/lib64
> > > FC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort -O2 -fPIC  -g -O2
> > >
> > > Configuration
> > > --prefix=/usr/local/packages/mvapich2/1.9a/Intel-13.0.0 \
> > > FC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort \
> > > CC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc \
> > > CXX=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc \
> > > CFLAGS=-O2 -fPIC FCFLAGS=-O2 -fPIC CXXFLAGS=-O2 -fPIC \
> > > LDFLAGS=-L/usr/local/compilers/Intel/composer_xe_2013/lib -L/usr/local/compilers/Intel/composer_xe_2013/lib/intel64 \
> > > LIBS= CPPFLAGS= \
> > > --enable-rdma-cm --enable-g=dbg --enable-romio --with-file-system=lustre+nfs \
> > > --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
> > > --enable-threads=runtime --enable-mpe --enable-smpcoll --enable-shared --enable-xrc --with-hwloc
> > >
> > > $ pbs_mom --version
> > > version: 3.0.6
> > > pbs_moms are threaded with the default of 3 threads.
> > >
> > > This does not happen with OpenMPI. A sample hello world program is run with different parameters:
> > >
> > > program dummy
> > >   use mpi
> > >   character*10 name
> > > ! Init MPI
> > >     call MPI_Init(mpierr)
> > > ! Get Rank Size
> > >     call MPI_COMM_Rank(MPI_COMM_WORLD, nrank, mpierr)
> > >     call MPI_COMM_Size(MPI_COMM_WORLD, nproc, mpierr)
> > > ! Get Date
> > >     if (nrank==0) then
> > >     write(*,*)'System date: Running mpirun_rsh'
> > >     call system('date')
> > >     end if
> > > ! Print rank
> > >     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> > >     !
> > >     call MPI_Get_processor_name(name, nlen, mpierr)
> > >     write(*,*)"    I am ", nrank, " of " ,nproc, " on ", name
> > >     !
> > >     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> > > ! Finalize
> > >     call MPI_Finalize(mpierr)
> > > end
> > >
> > > ===========
> > > #
> > >   cat $PBS_NODEFILE > hostfile
> > >   cat $PBS_NODEFILE | uniq > hosts
> > >   mpi_width=`cat hostfile | wc -l`
> > >   mpi_nodes=`cat hosts | wc -l`
> > >
> > > for mpi_pn in 8 16
> > >   do
> > >     let sum_mpi=$mpi_pn*$mpi_nodes
> > >     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
> > >
> > >     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
> > >      do
> > >        echo "    $param"
> > >        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
> > >        time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
> > >     done
> > > done
> > > ===============
> > > Using the parameters "MV2_USE_RoCE=1" and "MV2_USE_RDMA_CM=1" should fail, as these have
> > > not been configured yet (openib.conf); nevertheless, the program does not exit cleanly.
> > > We are seeing this with some other applications where the process seems to have crashed
> > > and is not producing any useful output, but there are threads lingering on long after the
> > > program has crashed.
> > >
> > > At this stage we are not sure if this is an InfiniBand, Torque, or mvapich2 issue. Please let
> > > us know if you have seen this behaviour and if there is a way to resolve this.
> > >
> > > Best,
> > > Bhupender.
> > >
> > >
> > > Bhupender Thakur.
> > > IT- Analyst,
> > > High Performance Computing, LSU.
> > > Ph (225)578-5934
> >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss at cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
> >
> >
> >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> 
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> 
> 
> 
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

