[mvapich-discuss] MVAPICH 2.1 with Mellanox 40GbE NIC

Hari Subramoni subramoni.1 at osu.edu
Fri Nov 13 15:34:35 EST 2015


Hello Dr. Vanzo,

Glad to know things are working for you now :-). Please let us know if you
face any other issues and we'll be glad to work with you to resolve them.

Regards,
Hari.

On Fri, Nov 13, 2015 at 3:29 PM, Davide Vanzo <vanzod at accre.vanderbilt.edu>
wrote:

> Hari,
> that was it. Thank you for pointing me in the right direction!
>
> Have a great weekend,
> Davide
>
>
> On Fri, 2015-11-13 at 15:27 -0500, Hari Subramoni wrote:
>
> Hello Dr. Vanzo,
>
> Could you please try to run after setting "MV2_USE_RoCE=1"?
>
> eg: mpirun_rsh -ssh -hostfile forty_hosts -np 2 MV2_USE_RoCE=1 ./osu_bw
>
> Please refer to the following section of the MVAPICH2 userguide for more
> details on how to run MVAPICH2 with RoCE support:
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2b-userguide.html#x1-380005.2.7
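>
> If the nodes have more than one RDMA device, the adapter to use can also be
> pinned explicitly with MV2_IBA_HCA. A minimal sketch, assuming the 40GbE port
> is the verbs device "mlx4_0" (check ibv_devinfo for the actual name):
>
> eg: mpirun_rsh -ssh -hostfile forty_hosts -np 2 MV2_USE_RoCE=1
> MV2_IBA_HCA=mlx4_0 ./osu_bw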
>
> Regards,
> Hari.
>
> On Fri, Nov 13, 2015 at 3:00 PM, Davide Vanzo <vanzod at accre.vanderbilt.edu>
> wrote:
>
> Hi all,
> I'm having issues running MVAPICH 2.1 with Mellanox 40Gbps Ethernet NICs.
> I correctly installed the Mellanox OFED 3.1-1.0.3 drivers and IB tools and
> libraries necessary to allow RDMA communication through the NIC. As a
> matter of fact, if I run the osu_bw benchmark with the OpenMPI provided
> with OFED, I get a bandwidth of 3.35 GB/s.
> I configured MVAPICH as follows:
>
> ./configure --with-device=ch3:mrail --with-rdma=gen2
> --with-ib-include=/usr/include/infiniband --with-ib-libpath=/usr/lib64
> --enable-hwloc
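>
> For reference, the resulting build can be inspected with the mpiname utility
> that ships with MVAPICH2 (a sketch; the path below just assumes the install
> prefix used here):
>
> $ /usr/local/mvapich_2.1_gcc493/RDMA/bin/mpiname -a
>
> which prints the MVAPICH2 version, the device (ch3:mrail) and the configure
> options that were used.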
>
> If I try to run osu_bw (this one compiled with the MVAPICH mpicc), I get
> the following error:
>
> $ mpirun_rsh -ssh -hostfile forty_hosts -np 2 ./osu_bw
> [vmp817:mpi_rank_1][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [vmp817:mpi_rank_1][print_backtrace]   0:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(print_backtrace+0x1e)
> [0x7f8e1a3ea60e]
> [vmp817:mpi_rank_1][print_backtrace]   1:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7f8e1a3ea719]
> [vmp817:mpi_rank_1][print_backtrace]   2: /lib64/libc.so.6(+0x326a0)
> [0x7f8e19d356a0]
> [vmp817:mpi_rank_1][print_backtrace]   3:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x34af1a)
> [0x7f8e1a3e1f1a]
> [vmp817:mpi_rank_1][print_backtrace]   4:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(_int_malloc+0x119)
> [0x7f8e1a3e2dd9]
> [vmp817:mpi_rank_1][print_backtrace]   5:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(malloc+0x90)
> [0x7f8e1a3e4030]
> [vmp817:mpi_rank_1][print_backtrace]   6: /lib64/libc.so.6(+0x66ecb)
> [0x7f8e19d69ecb]
> [vmp817:mpi_rank_1][print_backtrace]   7: /lib64/libc.so.6(+0x9e4ff)
> [0x7f8e19da14ff]
> [vmp817:mpi_rank_1][print_backtrace]   8: /lib64/libc.so.6(+0x9d954)
> [0x7f8e19da0954]
> [vmp817:mpi_rank_1][print_backtrace]   9: /lib64/libc.so.6(+0x9dab9)
> [0x7f8e19da0ab9]
> [vmp817:mpi_rank_1][print_backtrace]  10:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPID_Abort+0xad)
> [0x7f8e1a38dd0d]
> [vmp817:mpi_rank_1][print_backtrace]  11:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x2c23fe)
> [0x7f8e1a3593fe]
> [vmp817:mpi_rank_1][print_backtrace]  12:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPIR_Err_return_comm+0x100)
> [0x7f8e1a359510]
> [vmp817:mpi_rank_1][print_backtrace]  13:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPI_Init+0xa6)
> [0x7f8e1a30cd06]
> [vmp817:mpi_rank_1][print_backtrace]  14: ./osu_bw() [0x40127e]
> [vmp817:mpi_rank_1][print_backtrace]  15:
> /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f8e19d21d5d]
> [vmp817:mpi_rank_1][print_backtrace]  16: ./osu_bw() [0x4010e9]
> [vmp816:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [vmp816:mpi_rank_0][print_backtrace]   0:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(print_backtrace+0x1e)
> [0x7f664552b60e]
> [vmp816:mpi_rank_0][print_backtrace]   1:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(error_sighandler+0x59)
> [0x7f664552b719]
> [vmp816:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6(+0x326a0)
> [0x7f6644e766a0]
> [vmp816:mpi_rank_0][print_backtrace]   3:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x34af1a)
> [0x7f6645522f1a]
> [vmp816:mpi_rank_0][print_backtrace]   4:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(_int_malloc+0x119)
> [0x7f6645523dd9]
> [vmp816:mpi_rank_0][print_backtrace]   5:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(malloc+0x90)
> [0x7f6645525030]
> [vmp816:mpi_rank_0][print_backtrace]   6: /lib64/libc.so.6(+0x66ecb)
> [0x7f6644eaaecb]
> [vmp816:mpi_rank_0][print_backtrace]   7: /lib64/libc.so.6(+0x9e4ff)
> [0x7f6644ee24ff]
> [vmp816:mpi_rank_0][print_backtrace]   8: /lib64/libc.so.6(+0x9d954)
> [0x7f6644ee1954]
> [vmp816:mpi_rank_0][print_backtrace]   9: /lib64/libc.so.6(+0x9dab9)
> [0x7f6644ee1ab9]
> [vmp816:mpi_rank_0][print_backtrace]  10:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPID_Abort+0xad)
> [0x7f66454ced0d]
> [vmp816:mpi_rank_0][print_backtrace]  11:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x2c23fe)
> [0x7f664549a3fe]
> [vmp816:mpi_rank_0][print_backtrace]  12:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPIR_Err_return_comm+0x100)
> [0x7f664549a510]
> [vmp816:mpi_rank_0][print_backtrace]  13:
> /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPI_Init+0xa6)
> [0x7f664544dd06]
> [vmp816:mpi_rank_0][print_backtrace]  14: ./osu_bw() [0x40127e]
> [vmp816:mpi_rank_0][print_backtrace]  15:
> /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f6644e62d5d]
> [vmp816:mpi_rank_0][print_backtrace]  16: ./osu_bw() [0x4010e9]
> [vmp817:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5.
> MPI process died?
> [vmp817:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [vmp817:mpispawn_1][child_handler] MPI process (rank: 1, pid: 4631)
> terminated with signal 11 -> abort job
> [vmp816:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5.
> MPI process died?
> [vmp816:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [vmp816:mpispawn_0][child_handler] MPI process (rank: 0, pid: 15476)
> terminated with signal 11 -> abort job
> [vmp816:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node
> mlx02 aborted: Error while reading a PMI socket (4)
>
> Where the hostfile is:
>
> $ cat forty_hosts
> mlx01:1
> mlx02:1
>
> and the /etc/hosts reads:
>
> $ cat /etc/hosts
> 127.0.0.1   localhost
> 192.168.4.1 mlx01
> 192.168.4.2 mlx02
>
> The two IP addresses correctly map to the 40GbE interfaces:
>
> # ip addr ls eth4
> 8: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen
> 1000
>     link/ether e4:1d:2d:2e:09:a0 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.4.1/24 brd 192.168.4.255 scope global eth4
>     inet6 fe80::e61d:2dff:fe2e:9a0/64 scope link
>        valid_lft forever preferred_lft forever
>
> # ethtool eth4
> Settings for eth4:
> Supported ports: [ FIBRE ]
> Supported link modes:   1000baseKX/Full
>                         10000baseKX4/Full
>                         10000baseKR/Full
>                         40000baseCR4/Full
>                         40000baseSR4/Full
> Supported pause frame use: Symmetric Receive-only
> Supports auto-negotiation: Yes
> Advertised link modes:  1000baseKX/Full
>                         10000baseKX4/Full
>                         10000baseKR/Full
>                         40000baseCR4/Full
>                         40000baseSR4/Full
> Advertised pause frame use: Symmetric
> Advertised auto-negotiation: Yes
> Link partner advertised link modes:  40000baseCR4/Full
> Link partner advertised pause frame use: No
> Link partner advertised auto-negotiation: Yes
> Speed: 40000Mb/s
> Duplex: Full
> Port: Direct Attach Copper
> PHYAD: 0
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: d
> Wake-on: d
> Current message level: 0x00000014 (20)
>        link ifdown
> Link detected: yes
>
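> For completeness, the verbs-level view of the adapter can be checked as well.
> This is only a sketch: ibv_devinfo ships with libibverbs, ibdev2netdev is a
> Mellanox OFED helper, and the device name reported on these hosts may differ:
>
> $ ibv_devinfo | grep -E 'hca_id|link_layer'
> $ ibdev2netdev
>
> The link_layer field should read "Ethernet" for a RoCE-capable port, and
> ibdev2netdev should map the verbs device to eth4.
>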
> Any suggestion is welcome!
>
> Davide
>
> --
> Davide Vanzo, PhD
> Application Developer
> Advanced Computing Center for Research and Education (ACCRE)
> Vanderbilt University - Hill Center 201
> www.accre.vanderbilt.edu
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>