[mvapich-discuss] (no subject)

Mingzhe Li li.2192 at osu.edu
Wed Feb 24 12:07:14 EST 2016


Hi Nenad,

Thanks for the note. Which launcher did you use? Is it possible to send us
a reproducer for this issue?

Thanks,
Mingzhe

On Wed, Feb 24, 2016 at 11:58 AM, Nenad Vukicevic <nenad at intrepid.com>
wrote:

> I am running MVAPICH 2.2b on the Fedora FC23 IB cluster.  While running
> some accumulate tests I started getting segmentation faults in
> 'MPIDI_Fetch_and_op()' that trace back to these lines:
>
> 338         (win_ptr->create_flavor == MPI_WIN_FLAVOR_SHARED |
> 339         (win_ptr->shm_allocated == TRUE && orig_vc->node_id == target_vc->node_id)))))
> 340     {
> 341         mpi_errno = MPIDI_CH3I_Shm_fop_op(origin_addr, result_addr, datatype,
> 342                                           target_rank, target_disp, op, win_ptr);
> 343         if (mpi_errno) MPIU_ERR_POP(mpi_errno);
> 344     }
>
> It turned out that the node_id values for orig_vc and target_vc are the
> same, even though the two ranks are running on separate nodes.  I have 4
> nodes and 5 ranks; rank 4 was instantiated on node 0, yet the code above
> saw node_id equal to 3 for it, so the operation tried to reach the
> target's window through shared memory.
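>
> For reference, a stripped-down version of the failing pattern would look
> roughly like this (only a sketch: it assumes an ordinary MPI_Win_create
> window and an MPI_Fetch_and_op from every rank to rank 0, and is not the
> exact test I was running):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, size;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>         /* One counter per rank, exposed through an ordinary
>          * (non-shared-memory) window. */
>         long counter = 0, one = 1, result = 0;
>         MPI_Win win;
>         MPI_Win_create(&counter, sizeof(long), sizeof(long),
>                        MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>         /* Every rank atomically increments rank 0's counter.  With the
>          * node_id mix-up described above, a rank on a different node can
>          * be routed into MPIDI_CH3I_Shm_fop_op() and fault. */
>         MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
>         MPI_Fetch_and_op(&one, &result, MPI_LONG, 0, 0, MPI_SUM, win);
>         MPI_Win_unlock(0, win);
>
>         printf("rank %d of %d fetched %ld\n", rank, size, result);
>
>         MPI_Win_free(&win);
>         MPI_Finalize();
>         return 0;
>     }
>
> Run with 5 ranks over 4 nodes, this matches the failing case described
> above.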
>
> While further debugging the issue I noticed that there are two places in
> the initialization code where hostnames are exchanged and 'node_id' is set:
>
> mpid_vc.c:
> 1499 int MPIDI_Populate_vc_node_ids(MPIDI_PG_t *pg, int our_pg_rank)
>
> mpid_vc.c:
> 1363 int MPIDI_Get_local_host(MPIDI_PG_t *pg, int our_pg_rank)
>
> The first procedure above sets the node_id correctly, the second one does
> not.  If I comment out line 1444 in the same file, I am able to pass the
> test:
>
> 1444         // pg->vct[i].node_id = g_num_nodes - 1;
>
> Note that the second procedure is only compiled in under "#if
> defined(CHANNEL_MRAIL) || defined(CHANNEL_PSM)", so I am now wondering
> whether I have the right configuration.  When configuring mvapich I did
> not pass any channel-specific options, assuming it would pick the best
> configuration out of the box.
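>
> One quick cross-check from user code would be to compare each rank's
> hostname against the shared-memory grouping the library computes (a rough
> sketch; I am assuming MPI_Comm_split_type with MPI_COMM_TYPE_SHARED relies
> on the same node detection, which may not be exactly the node_id path
> above):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, shm_rank, shm_size, len;
>         char host[MPI_MAX_PROCESSOR_NAME];
>         MPI_Comm shm_comm;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Get_processor_name(host, &len);
>
>         /* Ranks the library believes share a node land in the same
>          * shared-memory communicator. */
>         MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>                             MPI_INFO_NULL, &shm_comm);
>         MPI_Comm_rank(shm_comm, &shm_rank);
>         MPI_Comm_size(shm_comm, &shm_size);
>
>         /* If the node-local group does not match the hostname grouping,
>          * the internal node detection is off. */
>         printf("world rank %d on %s: node-local rank %d of %d\n",
>                rank, host, shm_rank, shm_size);
>
>         MPI_Comm_free(&shm_comm);
>         MPI_Finalize();
>         return 0;
>     }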
>
> I can provide some mpirun logs that show multiple hostname exchanges.
>
> Nenad
>

