[mvapich-discuss] (no subject)

Hari Subramoni subramoni.1 at osu.edu
Wed Oct 28 14:09:07 EDT 2015


Hello,

The error is specific to OpenMPI, so we will not be able to give you exact
guidance. However, could you check whether the steps in the following link
resolve the failure to create QPs?

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-1150009.1.4
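
One thing that may be worth ruling out first is the locked-memory limit that
the MPI processes actually run under, since a low limit is a common reason
for queue pair creation to fail. This is an assumption about your setup, not
a diagnosis. Here is a minimal Python sketch that prints the limit; the
helper names format_limit and memlock_limits are only illustrative. It should
be launched the same way as the failing job (for example through mpirun),
because limits inherited from a resource manager daemon can differ from those
of a login shell.

```python
import resource


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # RLIMIT_MEMLOCK is expressed in bytes; report it in KiB like `ulimit -l`.
    return "%d KiB" % (value // 1024)


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory: soft=%s hard=%s" % (soft, hard))
```

If the value printed on the compute nodes is small rather than unlimited,
that would be the first thing to look at.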

Regards,
Hari.

On Wed, Oct 28, 2015 at 1:56 PM, Sasso, John (GE Power & Water, Non-GE) <
John1.Sasso at ge.com> wrote:

> Pardon if this has been addressed already, but I could not find the answer
> after doing Google searches.  I tried posing this question on the OpenMPI
> and OpenFabrics mailing lists, but it was recommended I post to the MVAPICH
> list given their focus on IB.
>
> We are in the process of analyzing and troubleshooting MPI jobs of
> increasingly large scale (OpenMPI 1.6.5) which communicate over a
> Mellanox-based IB fabric.  At a sufficiently large scale (# cores) a job
> will end up failing with errors similar to:
>
> [yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
>
> So I know we are running into some memory limitation (educated guess) when
> queue pairs are being created to support such a huge mesh.  We are now
> investigating using the XRC transport to decrease memory consumption.
>
> Anyways, my questions are:
>
>
> 1. How do we determine HOW MUCH memory is being pinned by an MPI job on a
>    node?  (If pmap, what exactly are we looking for?)
>
> 2. How do we determine WHERE these pinned memory regions are?
>
> We are running RedHat 6.x
>
> --john
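
On the "how much" question, one place to look is /proc/<pid>/status, which
reports a VmLck field (locked memory) and, on newer kernels, a separate
VmPin field (pinned memory). I believe kernels older than 3.2, which would
include the 2.6.32 base of RedHat 6.x, have no VmPin and count registered
memory under VmLck, but please treat that as an assumption to verify on your
nodes. The Python sketch below only reads those two fields; the function
names are illustrative, and it does not answer the "where" question.

```python
def parse_locked_fields(status_text):
    """Extract VmLck and VmPin (in kB) from the text of /proc/<pid>/status.

    Fields that the running kernel does not report are left out of the result.
    """
    wanted = ("VmLck", "VmPin")
    found = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            # Lines look like "VmLck:       128 kB".
            found[name] = int(rest.split()[0])
    return found


def read_locked_fields(pid):
    """Read the locked/pinned counters for one process ID (or 'self')."""
    with open("/proc/%s/status" % pid) as handle:
        return parse_locked_fields(handle.read())
```

Calling read_locked_fields(pid) for each MPI rank on a node and summing the
values gives a rough per-node figure while the job is running.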