[mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Subramoni, Hari subramoni.1 at osu.edu
Wed Jul 8 13:40:55 EDT 2020


Hi, Shaleen.

It looks like you are hitting the limit on the amount of memory that can be pinned (locked).

Can you send the output of ulimit -l?
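
A minimal sketch of that check, assuming a bash-like shell. The function name memlock_status is only illustrative, and the limits.conf lines in the trailing comment assume a system that uses pam_limits. Since the ranks are started on every host in the hostfile, the limit would need to be checked on each node, not only on the one where mpirun is launched.

```shell
# Print the current locked-memory (pinned memory) limit and flag finite values.
memlock_status() {
    limit=$(ulimit -l)   # bash reports this in KiB, or the word "unlimited"
    if [ "$limit" = "unlimited" ]; then
        echo "memlock limit: unlimited"
    else
        echo "memlock limit: ${limit} KiB (finite; large RDMA registrations may fail)"
    fi
}

memlock_status

# Assumption: where pam_limits is in use, the limit is commonly raised with
# lines like these in /etc/security/limits.conf, followed by a fresh login:
#   * soft memlock unlimited
#   * hard memlock unlimited
```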

Thx,
Hari.

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Wednesday, July 8, 2020 1:19 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi Hari,

Sorry for restarting the thread. I was trying to run osu_benchmarks on my system over RoCE and this is what I got:

$>mpirun -env MV2_SMP_USE_CMA=0 -env MV2_USE_RoCE=1 -np 8 --hostfile ~/HOSTS ./osu_ialltoall

# OSU MPI Non-blocking All-to-All Latency Test v5.6.3
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
1                      34.81             19.05             17.08              7.75
2                      34.01             18.13             16.10              1.36
4                      33.85             18.03             15.22              0.00
8                      33.71             18.11             15.42              0.00
16                     34.60             18.27             16.22              0.00
32                     32.10             17.69             15.60              7.59
64                     27.71             14.62             13.03              0.00
128                    25.86             13.96             12.27              2.95
256                    26.04             14.01             12.65              4.89
512                    27.38             14.67             13.22              3.92
1024                   32.15             19.28             17.00             24.30
2048                   40.64             26.21             23.68             39.08
4096                   61.68             40.26             37.87             43.42
8192                   95.73             69.26             65.76             59.75
16384                 265.17            165.72            161.38             38.37
32768                 492.18            305.00            297.69             37.12
65536                 914.56            563.99            553.33             36.64
131072               1674.04           1057.87           1039.65             40.73
262144               3112.99           1905.70           1875.16             35.62

[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 462] Cannot register vbuf region
[node-6.firsestart.lsm-pg0.utah.cloudlab.us:mpi_rank_6][MRAILI_Get_Vbuf] src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c:1208: vbuf pool allocation failed: Cannot allocate memory (12)

Are there any environment variables I am missing?

Also, are there environment variables that select which network hardware MVAPICH uses, for example only Ethernet or only IB?

Regards,
Shaleen

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Date: Monday, June 22, 2020 at 9:52 PM
To: "Subramoni, Hari" <subramoni.1 at osu.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

When you compile the Linux kernel, you set configuration parameters (the .config file). Which parameters should be turned on to support IB for MVAPICH?
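
To make the question concrete: the thread does not list the required options, so the sketch below only checks a kernel config file for the options usually associated with the RDMA core and the mlx4 driver family used by ConnectX-3. The option list is an assumption for illustration, not something confirmed by the MVAPICH team, and check_rdma_kconfig is a hypothetical helper name.

```shell
# Report whether each option is built in (=y) or built as a module (=m).
# Assumption: this option list is a plausible starting point for ConnectX-3,
# not an authoritative requirement list for MVAPICH2.
check_rdma_kconfig() {
    cfg=$1
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }
    for opt in CONFIG_INFINIBAND CONFIG_INFINIBAND_USER_ACCESS \
               CONFIG_MLX4_CORE CONFIG_MLX4_EN CONFIG_MLX4_INFINIBAND; do
        if grep -Eq "^${opt}=(y|m)$" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: MISSING"
        fi
    done
}
```

It could be pointed at the .config in the kernel source tree, or at /boot/config-$(uname -r) on distributions that install one, for example: check_rdma_kconfig .config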

Regards,
Shaleen

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 9:26 PM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi, Shaleen.

With the default configuration, MVAPICH2 only uses the IB interfaces. If you set MV2_USE_RoCE=1, the traffic will go over the IB interfaces configured in RoCE mode. The traffic will not go over pure IP-only interfaces like GigE or 10GigE.
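
One way to see which mode each adapter port is in is to read the link_layer file that the Linux RDMA stack exposes in sysfs. The sketch below assumes the standard /sys/class/infiniband layout; list_rdma_ports is an illustrative name. A value of "Ethernet" corresponds to a port in RoCE mode and "InfiniBand" to native IB. This only shows how the ports are configured; by itself it does not prove which path a given MPI run used.

```shell
# List every RDMA device port with its link layer and state.
# The optional argument overrides the sysfs root (useful for testing).
list_rdma_ports() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue          # glob did not match: no RDMA devices
        port_dir=$(dirname "$f")
        port=$(basename "$port_dir")
        dev=$(basename "$(dirname "$(dirname "$port_dir")")")
        state=unknown
        [ -r "$port_dir/state" ] && state=$(cat "$port_dir/state")
        printf '%s port %s: link_layer=%s state=%s\n' \
            "$dev" "$port" "$(cat "$f")" "$state"
    done
}

list_rdma_ports
```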

I am not sure I understand your second question. Can you please clarify?

Thx,
Hari.

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 7:32 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi

This worked! But how do I make sure that IB is actually being used?

Also, I work on the Linux kernel and compile my own kernel. Can you tell me which config parameters should be enabled when compiling it?

Regards,

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 7:23 PM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi, Shaleen.

I am assuming this is with MVAPICH2 2.3.4 GA release.

Is this a RoCE system? If so, can you please use MV2_USE_RoCE=1 as a runtime environment variable?
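
For reference, runtime variables like this are passed per run with mpirun's -env flag, the same way MV2_SMP_USE_CMA=0 is passed elsewhere in this thread. The helper below is only a sketch that assembles and prints such a command line without running it; roce_cmd and its argument order are illustrative.

```shell
# Assemble (but do not run) an mpirun command line with RoCE mode enabled.
# Arguments: number of processes, hostfile, program to launch.
roce_cmd() {
    np=$1
    hostfile=$2
    prog=$3
    printf 'mpirun -env MV2_USE_RoCE=1 -np %s --hostfile %s %s\n' \
        "$np" "$hostfile" "$prog"
}

roce_cmd 2 HOSTS ./a.out
```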

Thx,
Hari.

From: Shaleen Garg <shaleen.garg at rutgers.edu>
Sent: Monday, June 22, 2020 6:44 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

The following is the error I get:

shaleen at node-0:~$ mpirun --hostfile HOSTS -env MV2_DEBUG_SHOW_BACKTRACE=2 -env MV2_SMP_USE_CMA=0 -np 2 ./a.out
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1697] Could not modify qpto RTR

This is a single socket system. For testing I am using 2 nodes. They don’t have a shared disk.

Is there an issue with the kernel config? Which kernel config options should be enabled for IB to work properly?


Regards,
Shaleen Garg

From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Monday, June 22, 2020 at 11:08 AM
To: Shaleen Garg <shaleen.garg at rutgers.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi, Shaleen.

Is this a single socket system?

We recently released a newer version of MVAPICH2 (2.3.4). Can you please try that? It fixes some issues similar to this one.

If you observe a similar issue with MVAPICH2 2.3.4, can you do the following?


  1.  Reconfigure MVAPICH2 with "./configure --with-device=ch3:mrail --with-rdma=gen2 --enable-g=all --enable-fast=none"
  2.  Add MV2_DEBUG_SHOW_BACKTRACE=2 when running it

That will tell us where the seg fault occurs.
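
The two steps above, put together as a sketch. The configure flags and MV2_DEBUG_SHOW_BACKTRACE=2 are the ones named in this message; the make and mpirun lines are taken from the original report further down the thread. The run and debug_rebuild helpers and the DRY_RUN variable are illustrative. By default the script only prints the commands, and it executes them only when DRY_RUN=0 is set from inside the MVAPICH2 source tree.

```shell
# Echo a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Rebuild with debug symbols and no fast-path optimizations, then rerun the
# failing program with backtraces enabled. Stops at the first failing step.
debug_rebuild() {
    run ./configure --with-device=ch3:mrail --with-rdma=gen2 \
        --enable-g=all --enable-fast=none &&
    run make -j &&
    run sudo make install &&
    run mpirun -env MV2_DEBUG_SHOW_BACKTRACE=2 -env MV2_SMP_USE_CMA=0 -np 10 ./a.out
}

debug_rebuild
```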

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Shaleen Garg
Sent: Monday, June 22, 2020 9:48 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] MVAPICH2-2.3.3 giving me floating point error (signal 8)

Hi All,

I am trying to install mvapich on a machine with Mellanox IB:


$ lspci | grep "Mellanox"

Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

To install, I followed the user guide. Since this is a new machine, I installed the following packages (on Ubuntu 18.04 with Linux version 4.15.0-55-generic): libibmad-dev libibumad-dev libibumad3 libibverbs-dev gfortran infiniband-diags rdma-core.

Installation Method:

$ ./configure --with-device=ch3:mrail --with-rdma=gen2

$ make -j

$ sudo make install


Now this installs fine. But, when I run a hello world program:


$ mpirun -env MV2_SMP_USE_CMA=0 -np 10 ./a.out


I get the following error:

[apt140:mpi_rank_2][error_sighandler] Caught error: Floating point exception (signal 8)
…

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 13854 RUNNING AT apt140
=   EXIT CODE: 8
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Is there something I am missing? I don’t know why the MPI hello world does not work even within a single node. The code I am testing comes from https://mpitutorial.com/tutorials/mpi-hello-world/


Regards,
Shaleen Garg