[mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Subramoni, Hari subramoni.1 at osu.edu
Mon Jun 22 21:26:36 EDT 2020


Hi, Shaleen.

With the default configuration, MVAPICH2 uses only the IB interfaces. If you set MV2_USE_RoCE=1, the traffic will go over the IB interfaces configured in RoCE mode. The traffic will not go over pure IP-only interfaces such as GigE or 10GigE.
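One rough way to confirm that traffic really goes through the HCA is to compare its transmit counters before and after a run. The sketch below assumes a Linux system whose RDMA driver exposes per-port counters under /sys/class/infiniband/<device>/ports/<port>/counters/; the function name sum_xmit_data and its optional root-directory argument are made up for illustration and are not part of MVAPICH2.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH2): add up the port_xmit_data
# counter of every RDMA port found under the given sysfs root.
# Assumption: the driver exposes these counters in sysfs.
sum_xmit_data() {
    root="${1:-/sys/class/infiniband}"
    total=0
    for f in "$root"/*/ports/*/counters/port_xmit_data; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

# Print the current total for this machine (0 if no counters are found).
sum_xmit_data
```

Run it once before and once after the mpirun job on the same node. If the total grows by roughly the amount of data the job sent off-node, the HCA is carrying the traffic; if it stays flat, the job is probably using another path.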

I am not sure I understand your second question. Can you please clarify?

Thx,
Hari.

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 7:32 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi

This worked! But how do I make sure that IB is getting used?

Also, I work on the Linux kernel, so I compile my own kernel. Can you tell me which config parameters should be enabled when compiling it?

Regards,

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 7:23 PM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi, Shaleen.

I am assuming this is with MVAPICH2 2.3.4 GA release.

Is this a RoCE system? If so, can you please use MV2_USE_RoCE=1 as a runtime environment variable?
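One way to answer the RoCE question is to look at the link layer each HCA port reports. The sketch below assumes a Linux system that exposes RDMA devices under /sys/class/infiniband, where each port's link_layer file reads "InfiniBand" for native IB or "Ethernet" for RoCE. The function name list_link_layers and its optional root-directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the link layer reported by each RDMA port.
# Assumption: ports appear as <root>/<device>/ports/<n>/link_layer.
list_link_layers() {
    root="${1:-/sys/class/infiniband}"
    for f in "$root"/*/ports/*/link_layer; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -r "$f" ] || continue
        port_dir=${f%/link_layer}      # <root>/<device>/ports/<n>
        port=${port_dir##*/}           # <n>
        dev_dir=${port_dir%/ports/*}   # <root>/<device>
        dev=${dev_dir##*/}             # <device>
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}

# Print the link layer of every port on this machine (nothing if none found).
list_link_layers
```

A port that reports "Ethernet" is running in RoCE mode, which is the case where MV2_USE_RoCE=1 applies.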

Thx,
Hari.

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 6:44 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

The following is the error I get:

shaleen at node-0:~$ mpirun --hostfile HOSTS -env MV2_DEBUG_SHOW_BACKTRACE=2 -env MV2_SMP_USE_CMA=0 -np 2 ./a.out
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR

This is a single socket system. For testing I am using 2 nodes. They don’t have a shared disk.

Is there an issue with the kernel config? Which kernel config options should be enabled for IB to work properly?


Regards,
Shaleen Garg

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 11:08 AM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi, Shaleen.

Is this a single socket system?

We recently released a newer version of MVAPICH2 (2.3.4). Can you please try that? It fixes some issues similar to this one.

If you observe a similar issue with MVAPICH2 2.3.4, can you please do the following:


  1.  Reconfigure MVAPICH2 with "./configure --with-device=ch3:mrail --with-rdma=gen2 --enable-g=all --enable-fast=none"
  2.  Add MV2_DEBUG_SHOW_BACKTRACE=2 when running it

That will tell us where the crash occurs.

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Shaleen Garg
Sent: Monday, June 22, 2020 9:48 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi All,

I am trying to install mvapich on a machine with Mellanox IB:


$ lspci | grep "Mellanox"

Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

To install, I followed the user guide. Since this is a new machine, I installed the following packages (on Ubuntu 18.04 with Linux kernel 4.15.0-55-generic): libibmad-dev libibumad-dev libibumad3 libibverbs-dev gfortran infiniband-diags rdma-core.

Installation Method:

$ ./configure --with-device=ch3:mrail --with-rdma=gen2

$ make -j

$ sudo make install


This installs fine, but when I run a hello world program:


$ mpirun -env MV2_SMP_USE_CMA=0 -np 10 ./a.out


I get the following error:

[apt140:mpi_rank_2][error_sighandler] Caught error: Floating point exception (signal 8)
…

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 13854 RUNNING AT apt140
=   EXIT CODE: 8
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Is there something I am missing? I don't know why the MPI hello world fails even within a single node. The code I am testing comes from https://mpitutorial.com/tutorials/mpi-hello-world/


Regards,
Shaleen Garg