[mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync

Subramoni, Hari subramoni.1 at osu.edu
Tue Dec 8 15:22:59 EST 2020


Thanks Lana. That does give me some ideas.

Let me take a look at it and get back to you.

Best,
Hari.

From: Lana Deere <lana.deere at gmail.com>
Sent: Tuesday, December 8, 2020 3:14 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync

I will try -O0 to see what difference that makes.

The MPI_Intercomm_merge segmentation violation has this stack traceback, in case it suggests anything.
libmpi.so.12 MPIDI_CH3I_CM_Connect
libmpi.so.12 MPIDI_CH3_iSendv
libmpi.so.12 MPIDI_CH3_EagerContigIsend
libmpi.so.12 MPID_Isend
libmpi.so.12 MPIR_Intercomm_merge_impl
libmpi.so.12 MPI_Intercomm_merge
In the few runs I've tried so far, the seg faults have occurred in both rank-0 processes, i.e. in both the parent and the child of the MPI_Comm_spawn.

This one reproduces in our application and also in a relatively small example, so I can try stripping the example down to something which will compile straightforwardly.  If I'm successful, I should be able to send it to you.  I'm going to try -O0 first, though.
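
For reference, the failing pattern is essentially the standard spawn-then-merge sequence.  Here is a rough sketch of what the stripped-down reproducer would look like (placeholder code, not the actual source):

    /* Hypothetical sketch of a spawn + merge reproducer; the crash we see
     * is in MPI_Intercomm_merge on the rank-0 parent and the rank-0 child. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, inter, merged;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* Parent: spawn one copy of this same executable. */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
            MPI_Intercomm_merge(inter, 0, &merged);   /* parent ranks "low" */
        } else {
            /* Child: merge with the parent's intercommunicator. */
            inter = parent;
            MPI_Intercomm_merge(inter, 1, &merged);   /* child ranks "high" */
        }

        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&inter);
        MPI_Finalize();
        return 0;
    }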

.. Lana (lana.deere at gmail.com)



On Tue, Dec 8, 2020 at 3:00 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:
Hi, Lana.

Thanks for confirming that. I was just about to get back to you with the same recommendation.

May I assume that this is still with your internal application? Would it be possible to get a stripped-down reproducer that we could try out?

Best,
Hari.

From: Lana Deere <lana.deere at gmail.com>
Sent: Tuesday, December 8, 2020 2:32 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync

I tried configuring with --enable-fast=O2,ndebug instead of O3,ndebug.  Now I do not get the malloc warning and my simple send/recv test works.  I'm getting segmentation violations in MPI_Intercomm_merge following MPI_Comm_spawn, though.  I'll see what I can figure out about that.

.. Lana (lana.deere at gmail.com)


On Mon, Dec 7, 2020 at 3:05 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:
Hi, Lana.

Let us take a look at this and get back to you.

Best,
Hari.

From: Lana Deere <lana.deere at gmail.com>
Sent: Monday, December 7, 2020 2:23 PM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] bus error in MPIDI_CH3I_CM_SHMEM_Sync

On Fri, Dec 4, 2020 at 5:29 PM Subramoni, Hari <subramoni.1 at osu.edu> wrote:
Sorry to hear that you’re facing issues. If possible, could you please try out your program with the new MVAPICH2 2.3.5 release we made a few days ago and see if it resolves your issues?

When I compile version 2.3.5, I get this warning message:
In file included from tools/topo/hwloc/topo_hwloc.c:8:0:
In function ‘handle_rr_binding’,
    inlined from ‘HYDT_topo_hwloc_init’ at tools/topo/hwloc/topo_hwloc.c:408:16:
./include/hydra.h:639:21: warning: argument 1 value ‘18446744073709551608’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
 #define HYDU_malloc malloc
                     ^
./include/hydra.h:651:22: note: in expansion of macro ‘HYDU_malloc’
         (p) = (type) HYDU_malloc((size));                               \
                      ^~~~~~~~~~~
tools/topo/hwloc/topo_hwloc.c:108:5: note: in expansion of macro ‘HYDU_MALLOC’
     HYDU_MALLOC(HYDT_topo_hwloc_info.bitmap, hwloc_bitmap_t *,
     ^~~~~~~~~~~
In file included from ./mpl/include/mpl.h:13:0,
                 from ./include/hydra.h:17,
                 from tools/topo/hwloc/topo_hwloc.c:8:
tools/topo/hwloc/topo_hwloc.c: In function ‘HYDT_topo_hwloc_init’:
/usr/include/stdlib.h:465:14: note: in a call to allocation function ‘malloc’ declared here
 extern void *malloc (size_t __size) __THROW __attribute_malloc__ __wur;
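
Note that 18446744073709551608 is 2**64 - 8, i.e. a size that has wrapped around after something negative was converted to size_t, which is presumably what the compiler is flagging about the HYDU_MALLOC call at topo_hwloc.c:108.  A tiny standalone illustration of that wraparound (not MVAPICH code):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int count = -1;                         /* e.g. a count that can go negative */
        size_t bytes = count * sizeof(void *);  /* -1 converts to SIZE_MAX, and the
                                                   multiply wraps to 2**64 - 8 on x86_64 */
        printf("%zu\n", bytes);                 /* prints 18446744073709551608 */
        void *p = malloc(bytes);                /* with optimization, GCC's
                                                   -Walloc-size-larger-than= can flag
                                                   calls like this */
        free(p);
        return 0;
    }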

When I subsequently run a simple send/receive test, I get segmentation faults in MPI_Send and MPI_Recv, with either
  libmpi.so.12 MPIDI_CH3_Rendezvous_rget_recv_finish
  libmpi.so.12 MRAILI_RDMA_Get_finish
  libmpi.so.12 MRAILI_Process_send
  libmpi.so.12
  libmpi.so.12 MPIDI_CH3I_MRAILI_Cq_poll_ib
  libmpi.so.12 MPIDI_CH3I_read_progress
  libmpi.so.12 MPIDI_CH3I_Progress
  libmpi.so.12 MPI_Recv
or
  libmpi.so.12 MPIDI_CH3_Rendezvous_rget_send_finish
  libmpi.so.12 handle_read
  libmpi.so.12 MPIDI_CH3I_Progress
  libmpi.so.12 MPI_Send
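
The test itself is nothing exotic; it is essentially the textbook two-rank exchange.  A rough sketch (placeholder code, not the exact source; the large payload is a guess, chosen because the traces show the rendezvous/rget path rather than the eager path):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define COUNT (1 << 22)   /* 4 Mi ints, large enough to take the rendezvous path */

    int main(int argc, char **argv)
    {
        int rank;
        int *buf = malloc(COUNT * sizeof(int));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, COUNT * sizeof(int));

        if (rank == 0) {
            MPI_Send(buf, COUNT, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* seg fault seen in MPI_Send */
        } else if (rank == 1) {
            MPI_Recv(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                            /* or in MPI_Recv */
        }

        MPI_Finalize();
        free(buf);
        return 0;
    }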

This version was configured using the same options as my 2.3.4 installation, namely,
    --prefix="${INSTALL}"
    --enable-fast=O3,ndebug
    --enable-shared
    --with-pic
    --disable-fortran
    --disable-cxx
    --disable-mcast
    --enable-threads=multiple
    --enable-error-messages=all
    --with-pm=hydra

.. Lana (lana.deere at gmail.com)
