[mvapich-discuss] Potential Bug(s) with mvapich2-1.8

Devendar Bureddy bureddy at cse.ohio-state.edu
Fri Oct 26 09:16:02 EDT 2012


Thanks David.

I am cc'ing this note to MVAPICH-discuss so that we can close this report.
 For everyone's information, we had some offline discussion and the issue
turned out to be a setup issue.

-Devendar

---------- Forwarded message ----------
From: David M. Race <dr.david.race at gmail.com>
Date: Thu, Oct 25, 2012 at 10:46 PM
Subject: Re: [mvapich-discuss] Potential Bug(s) with mvapich2-1.8
To: Devendar Bureddy <bureddy at cse.ohio-state.edu>

Dear Devendar,

Thanks for this information.  We will put both HCAs onto the same switch in
the near future and run some tests.

Thanks for your quick response.

Hi David

The issue without runtime parameters arises because the dual rails on these
two machines are connected back-to-back (without a switch).  In this setup,
IB communication is not possible between the 1st HCA (mlx4_0) on node1 and
the 2nd HCA (mlx4_1) on node2.

By default, without any runtime parameters, MVAPICH2 tries to use both
rails by binding HCAs to MPI processes (i.e., 16 of the 32 processes bind
to one HCA and the other 16 to the other HCA).  In this case, processes
bound to mlx4_0 on one node are not able to communicate with processes
bound to mlx4_1 on the other node.  This would not be an issue if these
nodes were connected through a switch.
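As a rough illustration of the default binding scheme described above, the following sketch shows the rank-to-HCA assignment (this is only an illustration, not actual MVAPICH2 code; the 32-processes-per-node count and the mlx4_* names come from the setup in this thread):

```shell
#!/bin/sh
# Sketch of the default HCA binding scheme with 2 rails and 32 local
# MPI processes: each local rank is pinned to a single HCA.
# (Illustrative only -- the real policy lives inside MVAPICH2.)
PPN=32        # processes per node in this setup
NUM_HCAS=2    # mlx4_0 and mlx4_1
rank=0
while [ "$rank" -lt "$PPN" ]; do
    # ranks 0-15 map to mlx4_0, ranks 16-31 map to mlx4_1
    hca=$(( rank * NUM_HCAS / PPN ))
    echo "local rank $rank -> mlx4_$hca"
    rank=$(( rank + 1 ))
done
```

With such a binding and no switch, a rank pinned to mlx4_0 on node1 has no path to a rank pinned to mlx4_1 on node2, which matches the failure seen without runtime parameters.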

When MV2_IBA_HCA is specified, MVAPICH2 internally switches to the rail
SHARING scheme (i.e., all processes use all HCAs).  Hence things work fine
with the runtime parameters.

I think you can continue your experiments with these runtime parameters:

Single Rail  : MV2_NUM_HCAS=1
Dual Rail    : MV2_NUM_HCAS=2
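For example, the launch lines would look like this (hostfile and benchmark names taken from the commands earlier in this thread; adjust to your environment):

```shell
# Single rail: restrict MVAPICH2 to one HCA (here mlx4_0)
mpiexec.hydra -machinefile ./hostfile -np 64 \
    -genv MV2_IBA_HCA mlx4_0 -genv MV2_NUM_HCAS 1 ./mpibench

# Dual rail: let all processes use both HCAs
mpiexec.hydra -machinefile ./hostfile -np 64 \
    -genv MV2_NUM_HCAS 2 ./mpibench
```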

-Devendar


On Wed, Oct 24, 2012 at 1:02 PM, David M. Race <dr.david.race at gmail.com> wrote:

>  I ran with the following configurations:
>
> mpiexec.hydra -machinefile ./hostfile -np 64 ./mpibench        FAILED
> mpiexec.hydra -machinefile ./hostfile -np 64 -genv MV2_IBA_HCA mlx4_1
> -genv MV2_NUM_HCAS 1 ./mpibench         PASSED
> mpiexec.hydra -machinefile ./hostfile -np 64 -genv MV2_IBA_HCA mlx4_0
> -genv MV2_NUM_HCAS 1 ./mpibench         PASSED
>
> Regards
>
>  *Dr. David Race*
> Appro International, Inc.
> 4200 Research Forest Drive, Suite 400
> The Woodlands, TX 77381
> On 10/23/2012 6:42 PM, Devendar Bureddy wrote:
>
> Hi David
>
> Can you please try the failed single-rail case with the following
> options and let us know how it runs?
>
> MV2_NUM_HCAS=1 and MV2_IBA_HCA=mlx4_0
>
> -Devendar
>
> On Tue, Oct 23, 2012 at 5:22 PM, David M. Race <dr.david.race at gmail.com> wrote:
>
>  Hello,
>
> We are using mvapich2-1.8 and mvapich2-1.9a to study the performance on
> Intel 32 core systems.  These systems appear to have some different
> performance characteristics that require different collective algorithms.
> During the study we have run across the two runtime failures described in
> the attached document.
>
> Please let me know if you need any additional information to resolve these
> issues.
>
> Regards
>
> --
> Dr. David Race
> Appro International, Inc.
> 4200 Research Forest Drive, Suite 400
> The Woodlands, TX 77381
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>


-- 
Devendar