[mvapich-discuss] Re: mvapich2-1.6 problems with np ~ >= 1000

Johnny Devaprasad johnnydevaprasad at gmail.com
Tue Apr 5 09:54:33 EDT 2011


Hi Jonathan,

The MVAPICH2 version is 1.6 (it is in the subject line, so I did not repeat
it in the description). I downloaded it a couple of weeks ago.

I have 112 nodes (48 cores each, Magny-Cours CPUs).

On the command line I specify -np 2000 (which is approximately 42 nodes).
My machine file has more entries than that.
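
As a rough check of the "approximately 42 nodes" figure, here is a minimal
Python sketch. It is not part of the original message: the function and
constant names are hypothetical, and it assumes ranks are packed one per core
with every core of a node in use.

```python
import math

CORES_PER_NODE = 48   # from the message: 48 cores per node
TOTAL_NODES = 112     # from the message: 112 nodes in the cluster


def nodes_needed(np_ranks, cores_per_node=CORES_PER_NODE):
    """Smallest number of nodes that holds np_ranks at one rank per core (assumed packing)."""
    return math.ceil(np_ranks / cores_per_node)


if __name__ == "__main__":
    # The two process counts that appear in the failing runs.
    for np_ranks in (1000, 2000):
        print(np_ranks, "ranks ->", nodes_needed(np_ranks), "of", TOTAL_NODES, "nodes")
```

Under that assumption both runs fit well inside the 112 available nodes, so
the failures are unlikely to come from running out of machine file entries.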

Infiniband information:
-------------------------------
[root@node112 ~]# lspci | grep Infi
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0
5GT/s - IB QDR / 10GigE] (rev b0)

ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.7.626
Hardware version: b0


Regards,
Johnny

On Tue, Apr 5, 2011 at 12:01 PM, Johnny Devaprasad <
johnnydevaprasad at gmail.com> wrote:

> Hi all,
>
> I am running a simple MPI program (only calls MPI_Get_processor_name).
>
> This sometimes works, but most of the time it does not:
>
> mpirun_rsh -np 2000 -hostfile
> /home/jd/working/simple/mvapich2/machinefile_large
> /home/jd/working/simple/mvapich2/mvapich2_pgi
> Exit code -5 signaled from node015
> MPI process (rank: 315) terminated unexpectedly on node027
> MPI process (rank: 214) terminated unexpectedly on node014
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
>
> mpirun_rsh -np 1000 -hostfile
> /home/jd/working/simple/mvapich2/machinefile_large
> /home/jd/working/simple/mvapich2/mvapich2_pgi
> MPI process (rank: 435) terminated unexpectedly on node044
> Exit code -5 signaled from node041
> MPI process (rank: 777) terminated unexpectedly on node048
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
> handle_mt_peer: fail to read...: Success
>
>
> Regards,
> Johnny
>
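
To see which nodes and ranks are involved across repeated runs, the launcher
output quoted above could be summarized with a small script. This is only a
sketch, not part of the original message: the regular expressions are derived
solely from the two message formats shown in this post, and the function name
is hypothetical.

```python
import re

# Patterns inferred from the output quoted in this post (an assumption, not a documented format).
RANK_LINE = re.compile(r"MPI process \(rank: (\d+)\) terminated unexpectedly on (\S+)")
EXIT_LINE = re.compile(r"Exit code (-?\d+) signaled from (\S+)")


def summarize(output):
    """Return ({rank: node}, [(exit_code, node), ...]) found in the launcher output."""
    ranks = {}
    exits = []
    for line in output.splitlines():
        m = RANK_LINE.search(line)
        if m:
            ranks[int(m.group(1))] = m.group(2)
            continue
        m = EXIT_LINE.search(line)
        if m:
            exits.append((int(m.group(1)), m.group(2)))
    return ranks, exits


if __name__ == "__main__":
    sample = """\
Exit code -5 signaled from node015
MPI process (rank: 315) terminated unexpectedly on node027
MPI process (rank: 214) terminated unexpectedly on node014
handle_mt_peer: fail to read...: Success
"""
    ranks, exits = summarize(sample)
    print("terminated ranks:", ranks)
    print("exit codes:", exits)
```

Collecting these summaries over several runs would show whether the same
nodes fail each time or whether the failures move around the cluster.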