[mvapich-discuss] MVAPICH 2.1 with Mellanox 40GbE NIC

Davide Vanzo vanzod at accre.vanderbilt.edu
Fri Nov 13 15:00:46 EST 2015


Hi all,
I'm having trouble running MVAPICH2 2.1 over Mellanox 40 GbE NICs.
I installed the Mellanox OFED 3.1-1.0.3 drivers together with the IB
tools and libraries needed for RDMA communication through the NIC, and
that part works: running the osu_bw benchmark with the OpenMPI shipped
with OFED gives 3.35 GB/s of bandwidth.
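For reference, that baseline was run with the OFED-provided OpenMPI
roughly like this (from memory, so the exact mpirun options may differ
slightly):

$ mpirun -np 2 -host mlx01,mlx02 --mca btl openib,self ./osu_bw
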
I configured MVAPICH2 as follows:

./configure --with-device=ch3:mrail --with-rdma=gen2 \
            --with-ib-include=/usr/include/infiniband \
            --with-ib-libpath=/usr/lib64 --enable-hwloc
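
This was followed by the usual make && make install (the install
prefix, /usr/local/mvapich_2.1_gcc493/RDMA, is visible in the library
paths of the backtrace below), and the OSU micro-benchmarks were then
rebuilt against that installation, roughly:

$ ./configure CC=/usr/local/mvapich_2.1_gcc493/RDMA/bin/mpicc
$ make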

If I then try to run the MVAPICH2-compiled osu_bw, however, I get the
following error:

$ mpirun_rsh -ssh -hostfile forty_hosts -np 2 ./osu_bw
[vmp817:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[vmp817:mpi_rank_1][print_backtrace]   0: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f8e1a3ea60e]
[vmp817:mpi_rank_1][print_backtrace]   1: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(error_sighandler+0x59) [0x7f8e1a3ea719]
[vmp817:mpi_rank_1][print_backtrace]   2: /lib64/libc.so.6(+0x326a0) [0x7f8e19d356a0]
[vmp817:mpi_rank_1][print_backtrace]   3: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x34af1a) [0x7f8e1a3e1f1a]
[vmp817:mpi_rank_1][print_backtrace]   4: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(_int_malloc+0x119) [0x7f8e1a3e2dd9]
[vmp817:mpi_rank_1][print_backtrace]   5: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(malloc+0x90) [0x7f8e1a3e4030]
[vmp817:mpi_rank_1][print_backtrace]   6: /lib64/libc.so.6(+0x66ecb) [0x7f8e19d69ecb]
[vmp817:mpi_rank_1][print_backtrace]   7: /lib64/libc.so.6(+0x9e4ff) [0x7f8e19da14ff]
[vmp817:mpi_rank_1][print_backtrace]   8: /lib64/libc.so.6(+0x9d954) [0x7f8e19da0954]
[vmp817:mpi_rank_1][print_backtrace]   9: /lib64/libc.so.6(+0x9dab9) [0x7f8e19da0ab9]
[vmp817:mpi_rank_1][print_backtrace]  10: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPID_Abort+0xad) [0x7f8e1a38dd0d]
[vmp817:mpi_rank_1][print_backtrace]  11: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x2c23fe) [0x7f8e1a3593fe]
[vmp817:mpi_rank_1][print_backtrace]  12: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPIR_Err_return_comm+0x100) [0x7f8e1a359510]
[vmp817:mpi_rank_1][print_backtrace]  13: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPI_Init+0xa6) [0x7f8e1a30cd06]
[vmp817:mpi_rank_1][print_backtrace]  14: ./osu_bw() [0x40127e]
[vmp817:mpi_rank_1][print_backtrace]  15: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f8e19d21d5d]
[vmp817:mpi_rank_1][print_backtrace]  16: ./osu_bw() [0x4010e9]
[vmp816:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[vmp816:mpi_rank_0][print_backtrace]   0: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(print_backtrace+0x1e) [0x7f664552b60e]
[vmp816:mpi_rank_0][print_backtrace]   1: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(error_sighandler+0x59) [0x7f664552b719]
[vmp816:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6(+0x326a0) [0x7f6644e766a0]
[vmp816:mpi_rank_0][print_backtrace]   3: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x34af1a) [0x7f6645522f1a]
[vmp816:mpi_rank_0][print_backtrace]   4: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(_int_malloc+0x119) [0x7f6645523dd9]
[vmp816:mpi_rank_0][print_backtrace]   5: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(malloc+0x90) [0x7f6645525030]
[vmp816:mpi_rank_0][print_backtrace]   6: /lib64/libc.so.6(+0x66ecb) [0x7f6644eaaecb]
[vmp816:mpi_rank_0][print_backtrace]   7: /lib64/libc.so.6(+0x9e4ff) [0x7f6644ee24ff]
[vmp816:mpi_rank_0][print_backtrace]   8: /lib64/libc.so.6(+0x9d954) [0x7f6644ee1954]
[vmp816:mpi_rank_0][print_backtrace]   9: /lib64/libc.so.6(+0x9dab9) [0x7f6644ee1ab9]
[vmp816:mpi_rank_0][print_backtrace]  10: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPID_Abort+0xad) [0x7f66454ced0d]
[vmp816:mpi_rank_0][print_backtrace]  11: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(+0x2c23fe) [0x7f664549a3fe]
[vmp816:mpi_rank_0][print_backtrace]  12: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPIR_Err_return_comm+0x100) [0x7f664549a510]
[vmp816:mpi_rank_0][print_backtrace]  13: /usr/local/mvapich_2.1_gcc493/RDMA/lib/libmpi.so.12(MPI_Init+0xa6) [0x7f664544dd06]
[vmp816:mpi_rank_0][print_backtrace]  14: ./osu_bw() [0x40127e]
[vmp816:mpi_rank_0][print_backtrace]  15: /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f6644e62d5d]
[vmp816:mpi_rank_0][print_backtrace]  16: ./osu_bw() [0x4010e9]
[vmp817:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vmp817:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[vmp817:mpispawn_1][child_handler] MPI process (rank: 1, pid: 4631) terminated with signal 11 -> abort job
[vmp816:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[vmp816:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[vmp816:mpispawn_0][child_handler] MPI process (rank: 0, pid: 15476) terminated with signal 11 -> abort job
[vmp816:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node mlx02 aborted: Error while reading a PMI socket (4)

Where the hostfile is:

$ cat forty_hosts
mlx01:1
mlx02:1

and the /etc/hosts reads:

$ cat /etc/hosts
127.0.0.1   localhost
192.168.4.1 mlx01
192.168.4.2 mlx02

The two IP addresses map to the 40GbE interface (eth4) on each node:

# ip addr ls eth4
8: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether e4:1d:2d:2e:09:a0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.4.1/24 brd 192.168.4.255 scope global eth4
    inet6 fe80::e61d:2dff:fe2e:9a0/64 scope link
       valid_lft forever preferred_lft forever

# ethtool eth4
Settings for eth4:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseKX/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        40000baseCR4/Full
	                        40000baseSR4/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Advertised link modes:  1000baseKX/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        40000baseCR4/Full
	                        40000baseSR4/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  40000baseCR4/Full
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Speed: 40000Mb/s
	Duplex: Full
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000014 (20)
			       link ifdown
	Link detected: yes

Any suggestion is welcome!

Davide

-- 
Davide Vanzo, PhD
Application Developer
Advanced Computing Center for Research and Education (ACCRE)
Vanderbilt University - Hill Center 201
www.accre.vanderbilt.edu