[mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
Subramoni, Hari
subramoni.1 at osu.edu
Mon Jun 22 21:26:36 EDT 2020
Hi, Shaleen.
With the default configuration, MVAPICH2 only uses the IB interfaces. If you set MV2_USE_RoCE=1, the traffic will go over the IB interfaces configured in RoCE mode. The traffic will not go over pure IP-only interfaces like GigE or 10GigE.
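One way to check which mode the adapter ports are in is to read the link layer the kernel reports for each port. The sketch below assumes the standard Linux sysfs layout under /sys/class/infiniband; the function names list_link_layers and needs_roce are made up for illustration and are not MVAPICH2 tools. A port that reports Ethernet is the case where MV2_USE_RoCE=1 applies.

```shell
#!/bin/sh
# List every RDMA port with its link layer, as reported by sysfs.
# "InfiniBand" means native IB; "Ethernet" means the port runs in RoCE mode.
# The optional argument is a hypothetical override of the sysfs root so the
# function can be pointed at a test directory.
list_link_layers() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue           # glob did not match: no RDMA devices
        port_dir=${f%/link_layer}         # .../<dev>/ports/<port>
        port=${port_dir##*/}
        dev_dir=${port_dir%/ports/*}      # .../<dev>
        dev=${dev_dir##*/}
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}

# Print 1 if any port reports an Ethernet link layer, else 0.
# Simplification: a single Ethernet port is enough to suggest RoCE.
needs_roce() {
    if list_link_layers "$1" | grep -q ': Ethernet$'; then
        echo 1
    else
        echo 0
    fi
}

list_link_layers
echo "suggested MV2_USE_RoCE=$(needs_roce)"
```

If the ibverbs-utils package is installed, ibv_devinfo reports the same link_layer value for each port.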
I am not sure I understand your second question. Can you please clarify?
Thx,
Hari.
From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 7:32 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
Hi
This worked! But how do I make sure that IB is getting used?
Also, I work on the Linux kernel and compile my own kernel. Can you tell me which config parameters should be enabled when compiling it?
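For the kernel configuration question, a rough starting point is to check whether the RDMA core and the mlx4 driver options are enabled in the running kernel. The sketch below is only that: the option list is an assumption for a ConnectX-3 (mlx4) adapter, not an official MVAPICH2 requirement list, and check_kconfig is a made-up helper name.

```shell
#!/bin/sh
# check_kconfig FILE OPTION...
# For each option, report whether FILE has it built in (y), built as a
# module (m), or not enabled at all.
check_kconfig() {
    cfg=$1; shift
    for opt in "$@"; do
        val=$(sed -n "s/^${opt}=\([ym]\)\$/\1/p" "$cfg")
        printf '%s=%s\n' "$opt" "${val:-not set}"
    done
}

# Assumed location of the running kernel's config; a self-built kernel may
# keep it elsewhere (for example the .config in the build tree).
cfg=/boot/config-$(uname -r)
if [ -r "$cfg" ]; then
    # Assumed option list for a ConnectX-3 (mlx4) adapter.
    check_kconfig "$cfg" CONFIG_INFINIBAND CONFIG_INFINIBAND_USER_ACCESS \
        CONFIG_INFINIBAND_USER_MAD CONFIG_MLX4_CORE CONFIG_MLX4_EN \
        CONFIG_MLX4_INFINIBAND
else
    echo "kernel config not found at $cfg" >&2
fi
```

Any option reported as "not set" is a candidate to enable (as y or m) before rebuilding the kernel.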
Regards,
From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 7:23 PM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
Hi, Shaleen.
I am assuming this is with MVAPICH2 2.3.4 GA release.
Is this a RoCE system? If so, can you please use MV2_USE_RoCE=1 as a runtime environment variable?
Thx,
Hari.
From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 6:44 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
The following is the error I get:
shaleen@node-0:~$ mpirun --hostfile HOSTS -env MV2_DEBUG_SHOW_BACKTRACE=2 -env MV2_SMP_USE_CMA=0 -np 2 ./a.out
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR
This is a single socket system. For testing I am using 2 nodes. They don’t have a shared disk.
Is there an issue with the kernel config? Which kernel config options should be enabled for IB to work properly?
Regards,
Shaleen Garg
From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 11:08 AM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
Hi, Shaleen.
Is this a single socket system?
We recently released a newer version of MVAPICH2 (2.3.4). Can you please try that? It fixes some issues similar to this one.
If you observe a similar issue with MVAPICH2 2.3.4, can you please do the following?
1. Reconfigure MVAPICH2 with "./configure --with-device=ch3:mrail --with-rdma=gen2 --enable-g=all --enable-fast=none"
2. Add MV2_DEBUG_SHOW_BACKTRACE=2 when running it
That will tell us where the seg fault occurs.
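The two steps above can be sketched as a small script. By default it only prints the commands (a dry run); DRY_RUN and run are illustrative names, not part of MVAPICH2, and the script assumes it is started from the MVAPICH2 source directory. The -env VAR=VALUE form and the ./a.out program name mirror the commands used elsewhere in this thread.

```shell
#!/bin/sh
# Dry-run helper: print each command instead of executing it,
# unless the caller sets DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: reconfigure with debug symbols and without optimizations,
# then rebuild and reinstall.
run ./configure --with-device=ch3:mrail --with-rdma=gen2 --enable-g=all --enable-fast=none
run make -j
run sudo make install

# Step 2: rerun the failing program with backtraces enabled.
run mpirun -env MV2_DEBUG_SHOW_BACKTRACE=2 -env MV2_SMP_USE_CMA=0 -np 2 ./a.out
```

Running the script with DRY_RUN=0 in the environment executes the commands instead of printing them.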
Thx,
Hari.
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Shaleen Garg
Sent: Monday, June 22, 2020 9:48 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)
Hi All,
I am trying to install mvapich on a machine with Mellanox IB:
$ lspci | grep "Mellanox"
Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
To install, I followed the user guide. Since this is a new machine, I have installed the following packages (on Ubuntu 18.04 with Linux version 4.15.0-55-generic): libibmad-dev libibumad-dev libibumad3 libibverbs-dev gfortran infiniband-diags rdma-core.
Installation Method:
$ ./configure --with-device=ch3:mrail --with-rdma=gen2
$ make -j
$ sudo make install
Now this installs fine. But, when I run a hello world program:
$ mpirun -env MV2_SMP_USE_CMA=0 -np 10 ./a.out
I get the following error:
[apt140:mpi_rank_2][error_sighandler] Caught error: Floating point exception (signal 8)
…
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 13854 RUNNING AT apt140
= EXIT CODE: 8
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Is there something I am missing? I don't know why the MPI hello world is not working even within a single node. The code I am testing comes from https://mpitutorial.com/tutorials/mpi-hello-world/
Regards,
Shaleen Garg