[mvapich-discuss] (no subject)

Michael Knox mikeknox at lcse.umn.edu
Fri Oct 30 11:21:28 EDT 2015


John,

I believe it is passed into Fluent as an environment variable.  For
example, in my job submission script I set the following:

export MPIRUN_OPTIONS=" -v -prot -IBV -e MPI_IBV_XRC=1 -e MPI_PIN_PERCENTAGE=60"
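A minimal sketch of how that line might sit in a submission script (the scheduler directive and the Fluent command shown in comments are hypothetical; only the export line is from this thread):

```shell
#!/bin/sh
# Hypothetical job-script fragment. Platform MPI reads extra mpirun
# options from the MPIRUN_OPTIONS environment variable:
#   -v -prot                  verbose output plus a protocol report
#   -IBV                      select the InfiniBand verbs interconnect
#   -e MPI_IBV_XRC=1          enable the XRC transport
#   -e MPI_PIN_PERCENTAGE=60  limit how much memory MPI may pin
export MPIRUN_OPTIONS=" -v -prot -IBV -e MPI_IBV_XRC=1 -e MPI_PIN_PERCENTAGE=60"

# Fluent then builds its own mpirun command and appends these options,
# e.g. something like:  fluent 3ddp -t128 -pib ...
# Print the variable so the setting can be verified in the job's stdout.
echo "MPIRUN_OPTIONS:$MPIRUN_OPTIONS"
```

Since Fluent picks the options up from the environment, nothing in the FLUENT installation itself needs to be edited.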

Mike


On Fri, Oct 30, 2015 at 10:16 AM, Sasso, John (GE Power & Water, Non-GE) <
John1.Sasso at ge.com> wrote:

> Mike,
>
>
>
> Thank you for the info.  How did you pass that flag to FLUENT?  Did you
> have to modify some script in the FLUENT installation?
>
>
>
> --john
>
>
>
>
>
> *From:* Michael Knox [mailto:mikeknox at lcse.umn.edu]
> *Sent:* Thursday, October 29, 2015 9:15 AM
> *To:* Sasso, John (GE Power & Water, Non-GE)
> *Cc:* mvapich-discuss at cse.ohio-state.edu
> *Subject:* Re:
>
>
>
> Hi John,
>
>
>
> I ran into very similar problems with some larger Fluent runs on a
> Mellanox Connect-IB InfiniBand cluster.  We were using Platform MPI, and
> XRC fixed the issue.  I can't answer either of your questions, but I would
> also be curious to know the answer.
>
>
>
> The flag we needed to use was:
>
> -e MPI_IBV_XRC=1
>
> Mike
>
>
>
> On Wed, Oct 28, 2015 at 12:56 PM, Sasso, John (GE Power & Water, Non-GE) <
> John1.Sasso at ge.com> wrote:
>
>
> Pardon if this has been addressed already, but I could not find the answer
> after doing Google searches.  I tried posing this question on the OpenMPI
> and OpenFabrics mailing lists, but it was recommended I post to the MVAPICH
> list given their focus on IB.
>
> We are in the process of analyzing and troubleshooting MPI jobs of
> increasingly large scale (OpenMPI 1.6.5) which communicate over a
> Mellanox-based IB fabric.  At a sufficiently large scale (# cores) a job
> will end up failing with errors similar to:
>
> [yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb]
> error in endpoint reply start connect
> [xxxxx:29318] 853 more processes have sent help message
> help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
>
> So I know we are running into some memory limitation (educated guess) when
> queue pairs are being created to support such a huge mesh.  We are now
> investigating using the XRC transport to decrease memory consumption.
>
> Anyway, my questions are:
>
> 1. How do we determine HOW MUCH memory is being pinned by an MPI job on a
>    node?  (If pmap, what exactly are we looking for?)
>
> 2. How do we determine WHERE these pinned memory regions are?
>
> We are running RedHat 6.x.
>
> --john
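On the pinned-memory questions, one rough starting point on RHEL 6 is the kernel's own accounting, sketched here against the current shell's PID as a stand-in for an MPI rank. Note this is an approximation: depending on the kernel, memory registered through ibv_reg_mr may or may not show up in the locked-memory counter.

```shell
# Use this shell's PID as a stand-in for one MPI rank's PID.
pid=$$

# VmLck in /proc/<pid>/status reports locked memory in kB.  Whether
# verbs-registered (pinned) memory is counted here varies by kernel,
# so treat it as a lower bound rather than an exact answer.
grep VmLck /proc/$pid/status

# pmap -x lists each mapping with its address, size, and RSS.  Large
# anonymous mappings that grow with job scale are the usual suspects
# for the buffers backing queue pairs and registered regions; sort by
# the size column to surface the biggest ones.
pmap -x $pid | sort -k2 -n | tail -n 5
```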
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

