[mvapich-discuss] problems with MVAPICH2 over 10GbE

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Aug 17 12:25:17 EDT 2011


Thanks for your report, I'm checking with some of the other developers
to verify the way this should work.  I believe that you do not need to
use the IP addresses of the RDMAoE ports; instead, specify the HCA name
using MV2_IBA_HCA in addition to the MV2_USE_RDMAOE=1 option.

The name of the HCA can be found by using the ibstat command and
should look something like mlx4_...
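For example (an untested sketch -- the HCA name mlx4_0, the hostfile
name "hosts", the process count, and ./a.out are placeholders you would
adjust for your setup), the launch could look something like:

    $ ibstat -l
    mlx4_0
    $ mpirun_rsh -np 2 -hostfile hosts MV2_USE_RDMAOE=1 \
          MV2_IBA_HCA=mlx4_0 ./a.out

With this the hostfile should be able to keep the usual hostnames; the
RDMAoE port would be selected through the two environment variables
rather than through the IP addresses.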

On Wed, Aug 17, 2011 at 11:17 AM, Konz, Jeffrey (SSA Solution Centers)
<jeffrey.konz at hp.com> wrote:
> I am running on a cluster with a Mellanox LOM that supports both IB and 10 GbE.
> Both ports on the interface are active; one is on the IB network, the other on the 10 GbE network.
>
> I built mvapich2-1.7rc1 with these options : "--with-device=ch3:mrail --with-rdma=gen2"
>
> Running over IB works fine.
>
> When I try to run over the 10GbE network with the "MV2_USE_RDMAOE=1" option I get this error:
>
> Fatal error in MPI_Init:
> Internal MPI error!
>
> [atl3-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [atl3-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [atl3-13:mpispawn_0][child_handler] MPI process (rank: 0, pid: 23500) exited with status 1
> [atl3-13:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 10.10.0.149 aborted: Error while reading a PMI socket (4)
>
> In the hostfile I specified the IP addresses of the 10 GbE ports.
>
> Am I running incorrectly, or have I not built mvapich2 with the correct options?
>
> Thanks,
>
> -Jeff
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

