[mvapich-discuss] mvapich2-1.0.3 bug?

Dhabaleswar Panda panda at cse.ohio-state.edu
Fri Mar 23 10:00:25 EDT 2012


Hi,

Thanks for your note. MVAPICH2-1.0.3 is an ancient version.
Unfortunately, we will not be able to provide help and support for this
version.

Please use the latest MVAPICH2 1.7-branch version or MVAPICH2 1.8-RC1
(released yesterday). If you encounter any issues with these versions, we
will be happy to extend help.
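
If you first want to confirm whether the handshake failure is transient, one
option is to retry mpdboot a few times from the job script. The following is
only a sketch under stated assumptions: the retry function, the three
attempts, and the five-second pause are illustrative and are not part of MPD
or MVAPICH2. It also assumes mpdboot returns a nonzero exit status on
failure, as the status check in your script implies.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPD or MVAPICH2): run a command up to
# MAX_TRIES times, sleeping DELAY seconds between failed attempts.
# Usage: retry MAX_TRIES DELAY command [args...]
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    status=1
    while [ "$attempt" -le "$max_tries" ]; do
        "$@"
        status=$?
        if [ "$status" -eq 0 ]; then
            echo "attempt $attempt: success"
            return 0
        fi
        echo "attempt $attempt: failed with exit status $status" >&2
        attempt=$((attempt + 1))
        # Pause only if another attempt will follow.
        if [ "$attempt" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    # All attempts failed: report the last exit status to the caller.
    return "$status"
}

# Example use inside the PBS script quoted below (assumed values: 3 tries, 5 s pause):
# retry 3 5 /opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot -v -n $NUMNODES -f "$PBS_NODEFILE"
```

If every attempt fails on the same node, the problem is more likely persistent
(for example, in that node's configuration) than a timing issue.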

Thanks,

DK

On Fri, 23 Mar 2012, Acero Fernandez Alicia wrote:

> Hello,
>
> I have a problem running a parallel program on my cluster: sometimes it runs successfully and other times it fails. I have reduced the job to the minimum, but it still fails intermittently. I am the system administrator of the cluster, and I have checked the disks, network, etc. without finding any problem. I therefore suspect a bug in this version of MVAPICH2, or perhaps a timeout or something else related to this MPI implementation. Do you have any idea what could be happening?
>
> Below are the PBS script (it only executes the mpdboot command) and the outputs of a successful run and a failed run:
>
>
>
> PBS Script:
>
>
> #PBS -l nodes=eul0202.ciemat.es:ppn=8+eul0203.ciemat.es:ppn=8+eul0204.ciemat.es:ppn=8
> #PBS -l walltime=00:10:00
> #PBS -o /nfs/blanco/temp01
> #PBS -e /nfs/blanco/temp01
> #
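> # Total MPI slots: one line per slot in the node file
> NUMPROC=$(wc -l < "$PBS_NODEFILE")
> # Distinct hosts; sort -u also handles non-adjacent duplicates
> NUMNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
> #
> # Start the mpd ring on all allocated nodes and report the result
> /opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot -v -n $NUMNODES -f "$PBS_NODEFILE"
> Status=$?
> if [ "$Status" -eq 0 ]
>   then
>     echo "  #########  SUCCESS  ############# "
>   else
>     echo "  #########  FAILED  ############# "
> fi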
>
>
> Output of the successful job:
>
> --------------------------------------------
> Prologue Args:
>
> Job ID: 4963382.eulmgr.ciemat.es
> User ID: blanco
> Group ID: ceca
> --------------------------------------------
> running mpdallexit on eul0509.ciemat.es
> LAUNCHED mpd on eul0509.ciemat.es  via
> RUNNING: mpd on eul0509.ciemat.es
> LAUNCHED mpd on eul0510.ciemat.es  via  eul0509.ciemat.es
> LAUNCHED mpd on eul0511.ciemat.es  via  eul0509.ciemat.es
> RUNNING: mpd on eul0510.ciemat.es
> RUNNING: mpd on eul0511.ciemat.es
>   #########  SUCCESS  #############
>
> --------------------------------------------
> Epilogue Args:
>
> Job Name :  caton.pbs.IB
> Host/s:           eul0509.ciemat.es eul0510.ciemat.es eul0511.ciemat.es
> Elapsed(Wall)time:00:00:01
> Memory:           5352kb
> Virtual memory:   42628kb
> Job submitted at: Fri Mar 23 10:37:15
> Job started at:   Fri Mar 23 10:37:18
> Job ended at:     Fri Mar 23 10:37:19
> --------------------------------------------
>
>
>
> Output of the failed job:
>
>
> --------------------------------------------
> Prologue Args:
>
> Job ID: 4963381.eulmgr.ciemat.es
> User ID: blanco
> Group ID: ceca
> --------------------------------------------
> running mpdallexit on eul0512.ciemat.es
> LAUNCHED mpd on eul0512.ciemat.es  via
> RUNNING: mpd on eul0512.ciemat.es
> LAUNCHED mpd on eul0513.ciemat.es  via  eul0512.ciemat.es
> LAUNCHED mpd on eul0514.ciemat.es  via  eul0512.ciemat.es
> mpdboot_eul0512.ciemat.es (handle_mpd_output 382): failed to handshake with mpd on eul0513.ciemat.es; recvd output={}
>
>   #########  FAILED  #############
>
> --------------------------------------------
> Epilogue Args:
>
> Job Name :  caton.pbs.IB
> Host/s:           eul0512.ciemat.es eul0513.ciemat.es eul0514.ciemat.es
> Elapsed(Wall)time:00:00:01
> Memory:           688kb
> Virtual memory:   5068kb
> Job submitted at: Fri Mar 23 10:37:14
> Job started at:   Fri Mar 23 10:37:16
> Job ended at:     Fri Mar 23 10:37:17
> --------------------------------------------
>
>
> Regards,
>
> Alicia Acero
>
>
>



