[mvapich-discuss] Multiple PMI exchange for hostname, wrong node_id

Nenad Vukicevic nenad at intrepid.com
Wed Feb 24 11:58:28 EST 2016


I am running MVAPICH 2.2b on a Fedora FC23 InfiniBand cluster.  While running
some accumulate tests I started getting segmentation faults in
'MPIDI_Fetch_and_op()' that trace back to this code:

338         (win_ptr->create_flavor == MPI_WIN_FLAVOR_SHARED ||
339          (win_ptr->shm_allocated == TRUE && orig_vc->node_id == target_vc->node_id)))))
340     {
341         mpi_errno = MPIDI_CH3I_Shm_fop_op(origin_addr, result_addr, datatype,
342                                           target_rank, target_disp, op, win_ptr);
343         if (mpi_errno) MPIU_ERR_POP(mpi_errno);
344     }

It turned out that the node_id for orig_vc and target_vc were the same, even
though the two ranks were running on separate nodes.  I have 4 nodes and 5
ranks; rank 4 was placed on node 0, yet in the code above its node_id came out
as 3, which made it try to access the target's window via shared memory.
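
If it helps, here is a minimal sketch of the kind of accumulate test that hits
this path for me.  It is a hypothetical, stripped-down version rather than my
actual test: every rank does a single MPI_Fetch_and_op on a window owned by
rank 0, and I launch it with 5 ranks over the 4-node hostfile (e.g.
"mpirun_rsh -np 5 -hostfile hosts ./fetch_op_test") so that rank 4 lands on
node 0 next to rank 0:

    /* fetch_op_test.c -- hypothetical minimal reproducer, not my actual test. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        long counter = 0, one = 1, result = -1;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Every rank exposes one long; rank 0 is the only target used. */
        MPI_Win_create(&counter, sizeof(long), (int) sizeof(long),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        /* This lands in MPIDI_Fetch_and_op(); with the wrong node_id the
         * shared-memory branch is taken even from a remote node. */
        MPI_Fetch_and_op(&one, &result, MPI_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);

        printf("rank %d of %d fetched %ld\n", rank, nranks, result);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }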

While debugging the issue further I noticed that there are two places in the
initialization code where the hostname is exchanged and 'node_id' is set:

mpid_vc.c:
1499 int MPIDI_Populate_vc_node_ids(MPIDI_PG_t *pg, int our_pg_rank)

mpid_vc.c:
1363 int MPIDI_Get_local_host(MPIDI_PG_t *pg, int our_pg_rank)

The first procedure above sets node_id correctly, the second one does not.  If
I comment out line 1444 in the same file, the test passes:

1444         // pg->vct[i].node_id = g_num_nodes - 1;

Note that the second procedure is called only under "#if
defined(CHANNEL_MRAIL) || defined(CHANNEL_PSM)", so I am now wondering whether
I have the right configuration.  When configuring MVAPICH I did not pass any
channel-specific options, assuming it would pick the best channel out of the
box.
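
For completeness, this is roughly what I did, and what I now suspect I might
have been expected to do instead.  The explicit --with-device lines are an
assumption on my part based on my reading of the user guide, so the exact
spellings may be off:

    # what I did: defaults only
    ./configure --prefix=$HOME/mvapich2-2.2b
    make && make install

    # what I suspect may be needed for an explicit channel selection
    # (assumption on my part, please correct me):
    #   ./configure --with-device=ch3:mrail --with-rdma=gen2    # OFA-IB-CH3
    #   ./configure --with-device=ch3:psm                       # PSM hardware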

I can provide some mpirun logs that show multiple hostname exchanges.

Nenad
