[mvapich-discuss] mvapich2-1.0.3 bug?

Acero Fernandez Alicia alicia.acero at ciemat.es
Fri Mar 23 09:42:41 EDT 2012


Hello,

I have a problem running a parallel program on my cluster: sometimes it runs successfully and other times it fails. I have reduced the job to the minimum, and it still fails intermittently. I am the system administrator of the cluster and have checked the disks, network, etc., and I cannot find any problem with the cluster itself. I therefore suspect a bug in this version of MVAPICH2, or perhaps a timeout or something else related to this MPI implementation. Do you have any idea what could be happening?

Below are the PBS script (it only runs the mpdboot command) and the outputs of a successful run and a failed run:

 

PBS Script:


#PBS -l nodes=eul0202.ciemat.es:ppn=8+eul0203.ciemat.es:ppn=8+eul0204.ciemat.es:ppn=8
#PBS -l walltime=00:10:00
#PBS -o /nfs/blanco/temp01
#PBS -e /nfs/blanco/temp01
#
NUMPROC=`wc -l < $PBS_NODEFILE`
NUMNODES=`uniq $PBS_NODEFILE | wc -l`
#
/opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot -v -n $NUMNODES -f ${PBS_NODEFILE}
Status=$?
if [ $Status -eq 0 ]
  then
    echo "  #########  SUCCESS  ############# "
  else
    echo "  #########  FAILED ############# "
fi
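
One way to check whether the failure is transient would be to retry mpdboot a few times before giving up. The sketch below is only an illustration under stated assumptions: the `retry` function is a hypothetical helper that is not part of the original script, and the attempt count (3) and pause (5 seconds) are arbitrary values. It relies on mpdboot running mpdallexit at startup, as the job output shows, so a repeated call should start from a clean state.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted,
# then returns the exit status of the last attempt.
retry() {
    local attempts=$1 delay=$2 n=1 status=1
    shift 2
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        echo "attempt $n of $attempts failed with status $status: $*" >&2
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return "$status"
}

# Only attempt the boot inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NUMNODES=$(uniq "$PBS_NODEFILE" | wc -l)
    # Three attempts and a 5-second pause are assumed, arbitrary values.
    retry 3 5 /opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot \
        -v -n "$NUMNODES" -f "$PBS_NODEFILE"
    Status=$?
fi
```

If a second attempt on the same nodes succeeds, that would point toward a timing issue during the handshake rather than a permanently misconfigured node.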


Output of the successful job:

--------------------------------------------
Prologue Args:

Job ID: 4963382.eulmgr.ciemat.es
User ID: blanco
Group ID: ceca
--------------------------------------------
running mpdallexit on eul0509.ciemat.es
LAUNCHED mpd on eul0509.ciemat.es  via
RUNNING: mpd on eul0509.ciemat.es
LAUNCHED mpd on eul0510.ciemat.es  via  eul0509.ciemat.es
LAUNCHED mpd on eul0511.ciemat.es  via  eul0509.ciemat.es
RUNNING: mpd on eul0510.ciemat.es
RUNNING: mpd on eul0511.ciemat.es
  #########  SUCCESS  #############

--------------------------------------------
Epilogue Args:

Job Name :  caton.pbs.IB
Host/s:           eul0509.ciemat.es eul0510.ciemat.es eul0511.ciemat.es
Elapsed(Wall)time:00:00:01
Memory:           5352kb
Virtual memory:   42628kb
Job submitted at: Fri Mar 23 10:37:15
Job started at:   Fri Mar 23 10:37:18
Job ended at:     Fri Mar 23 10:37:19
--------------------------------------------



Output of the failed job:


--------------------------------------------
Prologue Args:

Job ID: 4963381.eulmgr.ciemat.es
User ID: blanco
Group ID: ceca
--------------------------------------------
running mpdallexit on eul0512.ciemat.es
LAUNCHED mpd on eul0512.ciemat.es  via
RUNNING: mpd on eul0512.ciemat.es
LAUNCHED mpd on eul0513.ciemat.es  via  eul0512.ciemat.es
LAUNCHED mpd on eul0514.ciemat.es  via  eul0512.ciemat.es
mpdboot_eul0512.ciemat.es (handle_mpd_output 382): failed to handshake with mpd on eul0513.ciemat.es; recvd output={}

  #########  FAILED  #############

--------------------------------------------
Epilogue Args:

Job Name :  caton.pbs.IB
Host/s:           eul0512.ciemat.es eul0513.ciemat.es eul0514.ciemat.es
Elapsed(Wall)time:00:00:01
Memory:           688kb
Virtual memory:   5068kb
Job submitted at: Fri Mar 23 10:37:14
Job started at:   Fri Mar 23 10:37:16
Job ended at:     Fri Mar 23 10:37:17
--------------------------------------------
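
Because the failure is intermittent, it may help to tally which hosts are named in the handshake error across many job outputs. The following is a minimal sketch: `failed_handshake_hosts` is a hypothetical helper name, and it assumes the error line always has the form shown in the failed job above.

```shell
#!/bin/bash
# Hypothetical helper: read job output on stdin and print, for each host
# named in a "failed to handshake with mpd on HOST;" message, how many
# times it appears, most frequent first.
failed_handshake_hosts() {
    sed -n 's/.*failed to handshake with mpd on \([^;]*\);.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Example use (the file glob is an assumption about how the job output
# files in the -o directory are named):
#   cat /nfs/blanco/temp01/*.o* | failed_handshake_hosts
```

If the same few hosts dominate the tally, the problem is more likely node-specific; an even spread would be more consistent with a timing or software issue.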


Regards,

Alicia Acero



