[mvapich-discuss] mvapich2-1.0.3 bug?
Acero Fernandez Alicia
alicia.acero at ciemat.es
Fri Mar 23 09:42:41 EDT 2012
Hello,
I have a problem when I try to run a parallel program on my cluster: sometimes it runs successfully and other times it fails. I have reduced the job to the minimum, and it still fails intermittently. I am the system administrator of the cluster and I have checked the disks, network, etc., and I cannot find any problem with the cluster itself. So I think this may be a bug in this version of mvapich, or perhaps a timeout or something else related to this MPI implementation. Do you have any idea what could be happening?
Below are the PBS script (it only runs the mpdboot command) and the output of one successful run and one failed run:
PBS Script:
#PBS -l nodes=eul0202.ciemat.es:ppn=8+eul0203.ciemat.es:ppn=8+eul0204.ciemat.es:ppn=8
#PBS -l walltime=00:10:00
#PBS -o /nfs/blanco/temp01
#PBS -e /nfs/blanco/temp01
#
NUMPROC=`wc -l < $PBS_NODEFILE`
NUMNODES=`uniq $PBS_NODEFILE | wc -l`
#
/opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot -v -n $NUMNODES -f ${PBS_NODEFILE}
Status=$?
if [ $Status -eq 0 ]
then
    echo " ######### SUCCESS ############# "
else
    echo " ######### FAILED ############# "
fi
Output of the successful job:
--------------------------------------------
Prologue Args:
Job ID: 4963382.eulmgr.ciemat.es
User ID: blanco
Group ID: ceca
--------------------------------------------
running mpdallexit on eul0509.ciemat.es
LAUNCHED mpd on eul0509.ciemat.es via
RUNNING: mpd on eul0509.ciemat.es
LAUNCHED mpd on eul0510.ciemat.es via eul0509.ciemat.es
LAUNCHED mpd on eul0511.ciemat.es via eul0509.ciemat.es
RUNNING: mpd on eul0510.ciemat.es
RUNNING: mpd on eul0511.ciemat.es
######### SUCCESS #############
--------------------------------------------
Epilogue Args:
Job Name : caton.pbs.IB
Host/s: eul0509.ciemat.es eul0510.ciemat.es eul0511.ciemat.es
Elapsed(Wall)time:00:00:01
Memory: 5352kb
Virtual memory: 42628kb
Job submitted at: Fri Mar 23 10:37:15
Job started at: Fri Mar 23 10:37:18
Job ended at: Fri Mar 23 10:37:19
--------------------------------------------
Output of the failed job:
--------------------------------------------
Prologue Args:
Job ID: 4963381.eulmgr.ciemat.es
User ID: blanco
Group ID: ceca
--------------------------------------------
running mpdallexit on eul0512.ciemat.es
LAUNCHED mpd on eul0512.ciemat.es via
RUNNING: mpd on eul0512.ciemat.es
LAUNCHED mpd on eul0513.ciemat.es via eul0512.ciemat.es
LAUNCHED mpd on eul0514.ciemat.es via eul0512.ciemat.es
mpdboot_eul0512.ciemat.es (handle_mpd_output 382): failed to handshake with mpd on eul0513.ciemat.es; recvd output={}
######### FAILED #############
--------------------------------------------
Epilogue Args:
Job Name : caton.pbs.IB
Host/s: eul0512.ciemat.es eul0513.ciemat.es eul0514.ciemat.es
Elapsed(Wall)time:00:00:01
Memory: 688kb
Virtual memory: 5068kb
Job submitted at: Fri Mar 23 10:37:14
Job started at: Fri Mar 23 10:37:16
Job ended at: Fri Mar 23 10:37:17
--------------------------------------------
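Since the failure is intermittent, one possible stopgap while the cause is unknown would be to retry the mpdboot call a few times before giving up. The following is only a minimal sketch, not a fix for the handshake problem: the retry function, the attempt count and the delay are hypothetical and not part of mvapich2 or PBS. It also assumes that a fresh mpdboot cleans up any partially started ring, which the "running mpdallexit" line in the logs above suggests but which I have not verified.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]   (hypothetical helper, not part of mvapich2)
# Runs CMD up to MAX times, sleeping DELAY seconds after each failed attempt.
# Returns 0 on the first success, or the last exit status if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        status=$?
        echo "attempt $attempt of $max failed (exit $status): $*" >&2
        [ "$attempt" -ge "$max" ] && return "$status"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Possible use in the PBS script above (3 attempts and 5 seconds are arbitrary):
#   retry 3 5 /opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot \
#       -v -n "$NUMNODES" -f "$PBS_NODEFILE"
#   Status=$?
```

If the job then succeeds on a second or third attempt, that would point towards a timing issue during mpd startup rather than a permanent fault on one node, but it would not explain the cause.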
Regards,
Alicia Acero