[mvapich-discuss] mvapich2-1.0.3 bug?

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Mar 23 10:18:21 EDT 2012


Thanks for the note.  The version of MVAPICH2 you're using is very old.
We've developed a process manager to replace mpd called mpirun_rsh that
has been available for several releases now.

I suggest updating to our latest stable release, MVAPICH2 1.7, or our
newly released MVAPICH2 1.8rc1.  You will find many other new features
and performance enhancements in these releases compared to
MVAPICH2 1.0.3.
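
As a rough sketch of what the job script could look like with mpirun_rsh
in place of mpdboot (assuming a newer MVAPICH2 install with mpirun_rsh on
the PATH; MPI_APP and the count_slots helper are hypothetical names used
only for illustration, so please check the user guide for your release
before relying on the exact options):

```shell
#!/bin/bash
#PBS -l walltime=00:10:00

# Hypothetical application path; replace with your own MPI program.
MPI_APP="${MPI_APP:-./my_mpi_app}"

# Count process slots in a PBS-style node file (one line per slot).
count_slots() {
    local n
    n=$(wc -l < "$1")
    echo $((n))   # arithmetic expansion drops any padding from wc
}

# Launch only inside a PBS job and only if mpirun_rsh is installed.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpirun_rsh >/dev/null 2>&1; then
    NUMPROC=$(count_slots "$PBS_NODEFILE")
    # No mpd ring is needed: mpirun_rsh starts the processes itself.
    mpirun_rsh -np "$NUMPROC" -hostfile "$PBS_NODEFILE" "$MPI_APP"
    Status=$?
    if [ "$Status" -eq 0 ]; then
        echo "  #########  SUCCESS  ############# "
    else
        echo "  #########  FAILED  ############# "
    fi
fi
```

Since there is no separate mpdboot step, the handshake between mpd
daemons is no longer part of the job startup.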

Download:
http://mvapich.cse.ohio-state.edu/download/mvapich2/

Userguide:
http://mvapich.cse.ohio-state.edu/support/

On Fri, Mar 23, 2012 at 02:42:41PM +0100, Acero Fernandez Alicia wrote:
> Hello,
> 
> I have a problem running a parallel program on my cluster: sometimes it runs successfully and other times it fails. I have reduced the test case to the minimum, and it still fails intermittently. I am the system administrator of the cluster, and I have checked the disks, network, etc. without finding any problem. I therefore suspect a bug in this version of mvapich, or perhaps a timeout or something else related to this MPI implementation. Do you have any idea what might be happening?
> 
> I am sending the PBS script (it only runs the mpdboot command) and the output of a successful run and a failed run:
> 
>  
> 
> PBS Script:
> 
> 
> #PBS -l nodes=eul0202.ciemat.es:ppn=8+eul0203.ciemat.es:ppn=8+eul0204.ciemat.es:ppn=8
> #PBS -l walltime=00:10:00
> #PBS -o /nfs/blanco/temp01
> #PBS -e /nfs/blanco/temp01
> #
> NUMPROC=`wc -l < $PBS_NODEFILE`
> NUMNODES=`uniq $PBS_NODEFILE | wc -l`
> #
> /opt/ofed_1.3.1/mpi/intel/mvapich2-1.0.3/bin/mpdboot -v -n $NUMNODES -f ${PBS_NODEFILE}
> Status=$?
> if [ $Status -eq 0 ]
>   then
>     echo "  #########  SUCCESS  ############# "
>   else
>     echo "  #########  FAILED ############# "
> fi
> 
> 
> Output of the successful job:
> 
> --------------------------------------------
> Prologue Args:
> 
> Job ID: 4963382.eulmgr.ciemat.es
> User ID: blanco
> Group ID: ceca
> --------------------------------------------
> running mpdallexit on eul0509.ciemat.es
> LAUNCHED mpd on eul0509.ciemat.es  via
> RUNNING: mpd on eul0509.ciemat.es
> LAUNCHED mpd on eul0510.ciemat.es  via  eul0509.ciemat.es
> LAUNCHED mpd on eul0511.ciemat.es  via  eul0509.ciemat.es
> RUNNING: mpd on eul0510.ciemat.es
> RUNNING: mpd on eul0511.ciemat.es
>   #########  SUCCESS  #############
> 
> --------------------------------------------
> Epilogue Args:
> 
> Job Name :  caton.pbs.IB
> Host/s:           eul0509.ciemat.es eul0510.ciemat.es eul0511.ciemat.es
> Elapsed(Wall)time:00:00:01
> Memory:           5352kb
> Virtual memory:   42628kb
> Job submitted at: Fri Mar 23 10:37:15
> Job started at:   Fri Mar 23 10:37:18
> Job ended at:     Fri Mar 23 10:37:19
> --------------------------------------------
> 
> 
> 
> Output of the failed job:
> 
> 
> --------------------------------------------
> Prologue Args:
> 
> Job ID: 4963381.eulmgr.ciemat.es
> User ID: blanco
> Group ID: ceca
> --------------------------------------------
> running mpdallexit on eul0512.ciemat.es
> LAUNCHED mpd on eul0512.ciemat.es  via
> RUNNING: mpd on eul0512.ciemat.es
> LAUNCHED mpd on eul0513.ciemat.es  via  eul0512.ciemat.es
> LAUNCHED mpd on eul0514.ciemat.es  via  eul0512.ciemat.es
> mpdboot_eul0512.ciemat.es (handle_mpd_output 382): failed to handshake with mpd on eul0513.ciemat.es; recvd output={}
> 
>   #########  FAILED  #############
> 
> --------------------------------------------
> Epilogue Args:
> 
> Job Name :  caton.pbs.IB
> Host/s:           eul0512.ciemat.es eul0513.ciemat.es eul0514.ciemat.es
> Elapsed(Wall)time:00:00:01
> Memory:           688kb
> Virtual memory:   5068kb
> Job submitted at: Fri Mar 23 10:37:14
> Job started at:   Fri Mar 23 10:37:16
> Job ended at:     Fri Mar 23 10:37:17
> --------------------------------------------
> 
> 
> Regards,
> 
> Alicia Acero
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

