[mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs

Bhupender Thakur bthakur at lsu.edu
Mon Nov 19 16:52:40 EST 2012


Dear Jonathan,

mpiexec without any parameters seems to run fine.
With the same parameters, it seems to clean up better. For this run:

for mpi_pn in 1 2 4 8 16
  do
    let sum_mpi=$mpi_pn*$mpi_nodes
    let OMP_NUM_THREADS=$mpi_width/$sum_mpi

    for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
     do
       echo "    $param"
       echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
       #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
       time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
    done
done
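
(As a worked example of one iteration: with 16 nodes and what I believe is a 256-line hostfile
from ppn=16, mpi_pn=4 gives sum_mpi=64 and OMP_NUM_THREADS=4, so the command actually executed
is roughly:)

    # mpi_pn=4, mpi_nodes=16, mpi_width=256 (assuming 16 nodes x ppn=16)
    #   sum_mpi = 4*16 = 64,  OMP_NUM_THREADS = 256/64 = 4
    time $MPI/bin/mpiexec -np 64 -env MV2_USE_XRC=1 ./a.out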

I see the following error message when running with mpiexec:
...
    MV2_USE_RoCE=1
    nodes:16  mpi-per-node:4  omp:4
=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
    MV2_USE_XRC=1
    nodes:16  mpi-per-node:8  omp:2
 System date: 
Mon Nov 19 12:50:03 CST 2012
     I am            0  of          128  on mike005   
     I am           10  of          128  on mike005
...
With another code that has more numerics built in (it should still finish in a few minutes),
mpirun_rsh still hangs after the last error, and I can still see processes, including mpispawn,
lingering on the mother node as well as the other nodes until they are killed by the epilogue.
The process tree on the mother node looks like this:

  |-ntpd,4066 -u ntp:ntp -p /var/run/ntpd.pid -g
  |-pbs_mom,4097
  |   |-bash,79154
  |   |   |-208.mike3.SC,79860 -x /var/spool/torque/mom_priv/jobs/208.mike3.SC
  |   |   |   `-mpirun_rsh,80150 -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
  |   |   |       |-bash,80152 -c...
  |   |   |       |   `-mpispawn,80169 0
  |   |   |       |       |-a.out,80171
  |   |   |       |       |   |-{a.out},80180
  |   |   |       |       |   |-{a.out},80193
  |   |   |       |       |   `-{a.out},80194
  ......
  |   |   |       |       |-a.out,80178
  |   |   |       |       |   |-{a.out},80182
  |   |   |       |       |   |-{a.out},80189
  |   |   |       |       |   `-{a.out},80190
  |   |   |       |       `-{mpispawn},80170
  |   |   |       |-ssh,80153 -q mike006...
  |   |   |       |-ssh,80154 -q mike007...
 ........
  |   |   |       |-ssh,80167 -q mike020...
  |   |   |       |-{mpirun_rsh},80151
  |   |   |       `-{mpirun_rsh},80168
  |   |   `-pbs_demux,79763
  |   |-{pbs_mom},4104
  |   `-{pbs_mom},4106
  |-portreserve,3573
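
For reference, the epilogue essentially has to hunt these stragglers down by hand. A rough
sketch of that kind of per-node cleanup (the user name and pkill patterns are only an
illustration, not our actual epilogue script):

    #!/bin/bash
    # Illustrative cleanup of stray MPI processes left behind when mpirun_rsh
    # fails to tear a job down on a node.
    USER_TO_CLEAN=bthakur   # job owner; a real epilogue gets this from its arguments
    # Ask leftover launcher, spawner and application processes to exit.
    pkill -TERM -u "$USER_TO_CLEAN" -f 'mpirun_rsh|mpispawn|a\.out'
    sleep 5
    # Force-kill anything that ignored SIGTERM.
    pkill -KILL -u "$USER_TO_CLEAN" -f 'mpirun_rsh|mpispawn|a\.out'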

Netstat shows the connections being maintained
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 mike005:54949               mike005:44480               ESTABLISHED 80172/./a.out       
tcp        0      0 mike005:36529               mike010:ssh                 ESTABLISHED 80157/ssh           
tcp        0      0 mike005:45949               mike008:33736               ESTABLISHED 80169/mpispawn      
tcp        0      0 mike005:42308               mike006:58824               ESTABLISHED 80169/mpispawn      
tcp        0      0 mike005:ssh                 mike3:53763                 ESTABLISHED -
...

From the output file:
...
MV2_USE_RDMA_CM=1
nodes:16  mpi-per-node:8  omp:2
MV2_USE_RoCE=1
nodes:16  mpi-per-node:8  omp:2
------------------------------------------------------
Running PBS epilogue script    - 19-Nov-2012 14:24:55
------------------------------------------------------
Checking node mike005 (MS)
Checking node mike020 ok
-> Killing process of bthakur: ./a.out
-> Killing process of bthakur: ./a.out
...
Job Name:        run.sh
Session Id:      79154
Resource Limits: ncpus=1,neednodes=16:ppn=16,nodes=16:ppn=16,walltime=01:00:00
Resources Used:  cput=08:03:39,mem=240952kb,vmem=2429576kb,walltime=01:01:52

From the error file:
...
real	0m1.692s
user	0m0.099s
sys	0m0.387s
+ for param in '"MV2_USE_ONLY_UD=1"' '"MV2_USE_XRC=1"' '"MV2_USE_RDMA_CM=1"' '"MV2_USE_RoCE=1"'
+ echo MV2_USE_RoCE=1
+ echo 'nodes:16  mpi-per-node:8  omp:2'
+ mpirun_rsh -np 128 -hostfile hosts MV2_USE_RoCE=1 ./a.out
=>> PBS: job killed: walltime 3635 exceeded limit 3600
[mike005:mpirun_rsh][signal_processor] Caught signal 15, killing job
[mike008:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 13. MPI process died?

I was starting to believe that this might have something to do with multi-threaded
pbs_moms being unable to clean up, as was suggested in this thread:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
but I honestly don't know the cause. For your information, we are running RHEL 6.2.
Let me know if you can think of a possible reason.

Best,
Bhupender.

Bhupender Thakur.
IT- Analyst,
High Performance Computing, LSU.
Ph (225)578-5934

________________________________________
From: Bhupender Thakur
Sent: Monday, November 19, 2012 1:09 PM
To: Jonathan Perkins
Subject: RE: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs

Dear Jonathan,

Thank you for your prompt response. mpiexec without any parameters seems to run fine.
With the same parameters, it seems to clean up better. For this run:

for mpi_pn in 1 2 4 8 16
  do
    let sum_mpi=$mpi_pn*$mpi_nodes
    let OMP_NUM_THREADS=$mpi_width/$sum_mpi

    for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
     do
       echo "    $param"
       echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
       #time mpirun_rsh -np $sum_mpi -hostfile hosts $param  ./dummy
       time $MPI/bin/mpiexec -np $sum_mpi -env $param ./a.out
    done
done

I see the following error message when running with mpiexec:

    MV2_USE_RoCE=1
    nodes:16  mpi-per-node:4  omp:4

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
    MV2_USE_RDMA_CM=1
    nodes:16  mpi-per-node:4  omp:4

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 65280
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
    MV2_USE_XRC=1
    nodes:16  mpi-per-node:8  omp:2
 System date:
Mon Nov 19 12:50:03 CST 2012
     I am            0  of          128  on mike005
     I am           10  of          128  on mike005
...

Usually when jobs crash, processes are left behind and continue to use a lot of CPU time. I will have to
check with a production job to see if mpispawn is left behind; this was only a small instance of a cleanup
failure that I was able to generate. The ssh connections do seem to persist, though, as shown by netstat -tp.
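
When I do check a production job, the checks I have in mind on the mother node are roughly the
following (generic commands run by hand, with our job owner and binary name only as examples):

    # any launcher/spawner/application processes still owned by the job user?
    pgrep -l -u bthakur -f 'mpirun_rsh|mpispawn|a.out'
    # full process tree for that user, with PIDs
    pstree -p bthakur
    # which TCP connections (ssh, mpispawn, a.out) are still established?
    netstat -tp | grep -E 'ssh|mpispawn|a\.out'
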
I was starting to believe that this might have something to do with multi-threaded pbs_moms being unable
to clean up, as was suggested in this thread:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=149
but I honestly don't know the cause. For your information, we are running RHEL 6.2.

Best,
Bhupender.

Bhupender Thakur.
IT- Analyst,
High Performance Computing, LSU.
Ph (225)578-5934

________________________________________
From: Jonathan Perkins [perkinjo at cse.ohio-state.edu]
Sent: Monday, November 19, 2012 12:27 PM
To: Bhupender Thakur
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] mvapich2 1.9a fails to cleanup after failed jobs

Bhupender:
Thanks for your report.  This could be a problem with mvapich2.  Can you
tell us a bit more about the problem you're facing?  Which processes in
particular are being left behind (mpispawn and/or other processes)?
Also, does this happen when using mpiexec?

On Mon, Nov 19, 2012 at 05:05:26PM +0000, Bhupender Thakur wrote:
> Hi,
>
> We are working on implementing mvapich2 on our new cluster but have run into some issues
> with mvapich2 frequently being unable to clean up when jobs fail.
>
> We are using Mellanox InfiniBand:
> $ ibv_devinfo
> hca_id:    mlx4_0
>     transport:            InfiniBand (0)
>     fw_ver:                2.10.4492
>     node_guid:            0002:c903:00ff:25b0
>     sys_image_guid:            0002:c903:00ff:25b3
>     vendor_id:            0x02c9
>     vendor_part_id:            4099
>     hw_ver:                0x0
>     board_id:            DEL0A30000019
>     phys_port_cnt:            1
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        2048 (4)
>             sm_lid:            1
>             port_lid:        300
>             port_lmc:        0x00
>             link_layer:        IB
>
>
> mvapich2 1.9a
> $ mpiname -a
> MVAPICH2 1.9a Sat Sep  8 15:01:35 EDT 2012 ch3:mrail
>
> Compilation
> CC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc -O2 -fPIC   -g -DNDEBUG -DNVALGRIND -O2
> CXX: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc -O2 -fPIC  -g -DNDEBUG -DNVALGRIND -O2
> F77: /usr/local/compilers/Intel/composer_xe_2013/bin/ifort   -g -O2 -L/usr/lib64
> FC: /usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort -O2 -fPIC  -g -O2
>
> Configuration
> --prefix=/usr/local/packages/mvapich2/1.9a/Intel-13.0.0 \
> FC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/ifort \
> CC=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icc \
> CXX=/usr/local/compilers/Intel/composer_xe_2013.0.079/bin/intel64/icpc \
> CFLAGS=-O2 -fPIC FCFLAGS=-O2 -fPIC CXXFLAGS=-O2 -fPIC \
> LDFLAGS=-L/usr/local/compilers/Intel/composer_xe_2013/lib -L/usr/local/compilers/Intel/composer_xe_2013/lib/intel64 \
> LIBS= CPPFLAGS= \
> --enable-rdma-cm --enable-g=dbg --enable-romio --with-file-system=lustre+nfs \
> --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 \
> --enable-threads=runtime --enable-mpe --enable-smpcoll --enable-shared --enable-xrc --with-hwloc
>
> $ pbs_mom --version
> version: 3.0.6
> The pbs_moms are threaded with the default 3 threads.
>
> This does not happen with Open MPI. A sample hello world is being run with different parameters:
>
> program dummy
>   use mpi
>   character*10 name
> ! Init MPI
>     call MPI_Init(mpierr)
> ! Get Rank Size
>     call MPI_COMM_Rank(MPI_COMM_WORLD, nrank, mpierr)
>     call MPI_COMM_Size(MPI_COMM_WORLD, nproc, mpierr)
> ! Get Date
>     if (nrank==0) then
>     write(*,*)'System date: Running mpirun_rsh'
>     call system('date')
>     end if
> ! Print rank
>     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
>     !
>     call MPI_Get_processor_name(name, nlen, mpierr)
>     write(*,*)"    I am ", nrank, " of " ,nproc, " on ", name
>     !
>     call MPI_Barrier(MPI_COMM_WORLD, mpierr)
> ! Finalize
>     call MPI_Finalize(mpierr)
> end
>
> ===========
> #
>   cat $PBS_NODEFILE > hostfile
>   cat $PBS_NODEFILE | uniq > hosts
>   mpi_width=`cat hostfile | wc -l`
>   mpi_nodes=`cat hosts | wc -l`
>
> for mpi_pn in 8 16
>   do
>     let sum_mpi=$mpi_pn*$mpi_nodes
>     let OMP_NUM_THREADS=$mpi_width/$sum_mpi
>
>     for param in "MV2_USE_XRC=1"  "MV2_USE_RoCE=1" "MV2_USE_RDMA_CM=1"
>      do
>        echo "    $param"
>        echo "    nodes:$mpi_nodes  mpi-per-node:$mpi_pn  omp:$OMP_NUM_THREADS"
>        time mpirun_rsh -np $sum_mpi -hostfile hosts $param ./dummy
>     done
> done
> ===============
> Using the parameters "MV2_USE_RoCE=1" and "MV2_USE_RDMA_CM=1" should fail, as they have
> not been configured yet (openib.conf); nevertheless, the program does not exit cleanly.
> We are seeing this with some other applications where the process seems to have crashed
> and is not producing any useful output, but there are threads lingering on long after the program has
> crashed.
>
> At this stage we are not sure if this is an InfiniBand, Torque, or mvapich2 issue. Please let us know if
> you have seen this behaviour and if there is a way to resolve it.
>
> Best,
> Bhupender.
>
>
> Bhupender Thakur.
> IT- Analyst,
> High Performance Computing, LSU.
> Ph (225)578-5934

> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo





