[mvapich-discuss] Running more than 72 tasks with mvapich 0.9.5

Pavel Shamis (Pasha) pasha at mellanox.co.il
Tue May 9 02:49:33 EDT 2006


Hello,
Can you provide more information about your cluster configuration:
1. HCA type
2. Number of switches and cluster topology
3. Number of nodes

Do you have network monitoring tools (ibadm/ibmon) running on the cluster?
I suspect that you have some bad cables in the system. If you tune and
run ibadm, you should find the problem immediately.

Regards,
Pasha

Otheus wrote:
> Greetings,
> 
> I think I found my answer at: 
> https://docs.mellanox.com/dm/ibgold/docs/Troubleshooting.txt
> 
>     Problem: Running MPI on a big cluster (>200 nodes) fails.
> 
>     Suggestion:
>     Try to increase the VAPI driver timeout parameter, VIADEV_DEFAULT_TIME_OUT,
>     for the MPI stack. To achieve this, use the '-paramfile filename' option with
>     mpirun_rsh. For example, you can run:
> 
>          /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 -paramfile ./perfparams -hostfile /root/cluster /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1
> 
>     where the file perfparams includes the following line:
>
>          VIADEV_DEFAULT_TIME_OUT = 12
> 
> In my case, I had to set the value to 31. Values larger than this
> resulted in a different error.
> 
> 
> 
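A possible reason why 31 is the ceiling, offered as a sketch rather than a statement about MVAPICH internals: it assumes VIADEV_DEFAULT_TIME_OUT is handed through unchanged as the InfiniBand Local ACK Timeout of the queue pair. That field is 5 bits wide, so it holds 0..31, and a nonzero value n means roughly 4.096 us * 2^n. The helper names below are hypothetical and exist only for illustration.

```python
import math

# Assumption: VIADEV_DEFAULT_TIME_OUT maps directly onto the InfiniBand
# Local ACK Timeout exponent, a 5-bit field (valid range 0..31).
MAX_TIMEOUT_EXPONENT = 31


def ack_timeout_seconds(exponent: int) -> float:
    """Approximate ACK timeout in seconds for a given exponent."""
    if not 0 <= exponent <= MAX_TIMEOUT_EXPONENT:
        raise ValueError(
            f"exponent must be in 0..{MAX_TIMEOUT_EXPONENT}, got {exponent}"
        )
    if exponent == 0:
        # In the InfiniBand encoding, 0 means "wait indefinitely".
        return math.inf
    return 4.096e-6 * 2 ** exponent


def paramfile_line(exponent: int) -> str:
    """Format the line for the file passed to mpirun_rsh -paramfile."""
    ack_timeout_seconds(exponent)  # reuse the range check
    return f"VIADEV_DEFAULT_TIME_OUT = {exponent}"


if __name__ == "__main__":
    for value in (12, 31):
        print(paramfile_line(value), "->", ack_timeout_seconds(value), "s")
```

Under that assumption, 12 corresponds to a timeout of about 17 ms and 31 to roughly 2.4 hours, so a value of 31 largely hides lost packets rather than fixing their cause. That is consistent with the advice above to check the fabric for bad cables.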
