[mvapich-discuss] Running more than 72 tasks with mvapich 0.9.5
Pavel Shamis (Pasha)
pasha at mellanox.co.il
Tue May 9 02:49:33 EDT 2006
Hello,
Can you provide more information about your cluster configuration:
1. HCA type
2. Number of switches and cluster topology
3. Number of nodes
Do you have network monitoring tools (ibadm/ibmon) running on the cluster?
I suspect that you have some bad cables in the system. If you configure
and run ibadm, you should find the problem quickly.
Regards,
Pasha
Otheus wrote:
> Greetings,
>
> I think I found my answer at:
> https://docs.mellanox.com/dm/ibgold/docs/Troubleshooting.txt
>
> Problem: Running MPI on a big cluster (>200 nodes) fails.
>
> Suggestion:
> Try to increase the VAPI driver timeout parameter, VIADEV_DEFAULT_TIME_OUT,
> for the MPI stack. To achieve this, use the '-paramfile filename' option with
> mpirun_rsh. For example, you can run:
>
> /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 -paramfile ./perfparams -hostfile /root/cluster /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1
>
> where the file perfparams includes the following line:
> VIADEV_DEFAULT_TIME_OUT = 12
>
> In my case, I had to set the default to 31. Numbers bigger than this
> resulted in another error.
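
For reference, the parameter-file approach quoted above can be sketched as a
small shell helper. This is only an illustration: write_paramfile and
ack_timeout_ns are hypothetical names, not part of MVAPICH or IBGD. The 0..31
range check assumes that VIADEV_DEFAULT_TIME_OUT is passed through to the
5-bit InfiniBand Local ACK Timeout field, which would explain why values above
31 produce a different error, and the duration formula (4.096 us * 2^value) is
the one the InfiniBand specification gives for that field. Neither assumption
is confirmed by the quoted troubleshooting text.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH): write a one-line parameter file
# for mpirun_rsh after checking that the timeout fits the assumed 0..31 range.
write_paramfile() {
    value=$1
    file=$2
    case $value in
        ''|*[!0-9]*)
            echo "timeout must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ "$value" -gt 31 ]; then
        echo "timeout $value exceeds 31" >&2
        return 1
    fi
    printf 'VIADEV_DEFAULT_TIME_OUT = %s\n' "$value" > "$file"
}

# Assumed formula: timeout = 4.096 us * 2^value, printed here in nanoseconds.
ack_timeout_ns() {
    echo $(( 4096 * (1 << $1) ))
}

# Example use (paths copied from the quoted message, adjust for your site):
#   write_paramfile 31 ./perfparams
#   /usr/local/ibgd/mpi/osu/gcc/mvapich-0.9.5/bin/mpirun_rsh -np 2 \
#       -paramfile ./perfparams -hostfile /root/cluster \
#       /usr/local/ibgd/mpi/osu/gcc/tests/PMB2.2.1/PMB-MPI1
```

Under these assumptions, the value 12 from the troubleshooting text corresponds
to roughly 16.8 ms per retry, and the maximum of 31 to roughly 8800 seconds, so
a setting that high mostly hides link problems rather than fixing them.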