[mvapich-discuss] (no subject)

Sasso, John (GE Power & Water, Non-GE) John1.Sasso at ge.com
Fri Oct 30 11:16:05 EDT 2015


Mike,

Thank you for the info.  How did you pass that flag to FLUENT?  Did you have to modify some script in the FLUENT installation?

--john


From: Michael Knox [mailto:mikeknox at lcse.umn.edu]
Sent: Thursday, October 29, 2015 9:15 AM
To: Sasso, John (GE Power & Water, Non-GE)
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re:

Hi John,

I ran into very similar problems with some larger Fluent runs on a Mellanox Connect-IB InfiniBand cluster.  We were using Platform MPI, and enabling XRC fixed the issue.  I can't answer either of your questions, but I would also be curious to know the answers.

The flag we needed to use was:
-e MPI_IBV_XRC=1

Mike
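
For reference: with Platform MPI, "-e VAR=VALUE" on the mpirun command line exports an environment variable to every rank, and Fluent can usually forward extra options to the underlying mpirun from its own launch line.  A rough sketch of what that can look like (the -mpiopt pass-through, solver mode, core count, and host-file name below are assumptions; check "fluent -help" for the exact options in your release):

  # Illustrative only: select Platform MPI and forward the XRC flag to its mpirun.
  # The name of the pass-through option varies by Fluent release.
  fluent 3ddp -g -t256 -cnf=hostfile.txt -mpi=pcmpi -mpiopt="-e MPI_IBV_XRC=1" -i run.jou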

On Wed, Oct 28, 2015 at 12:56 PM, Sasso, John (GE Power & Water, Non-GE) <John1.Sasso at ge.com> wrote:
Pardon if this has been addressed already, but I could not find the answer after doing Google searches.  I tried posing this question on the OpenMPI and OpenFabrics mailing lists, but it was recommended I post to the MVAPICH list given their focus on IB.

We are in the process of analyzing and troubleshooting MPI jobs of increasingly large scale (OpenMPI 1.6.5) which communicate over a Mellanox-based IB fabric.  At a sufficiently large scale (# cores) a job will end up failing with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

So I suspect we are running into some memory limitation when queue pairs are being created to support such a huge mesh.  We are now investigating using the XRC transport to decrease memory consumption.
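
For what it's worth, since the failing runs are OpenMPI 1.6.5: the openib BTL can be switched to XRC-based receive queues via the btl_openib_receive_queues MCA parameter (when XRC is used, all queues in that list must be of type "X").  A sketch with purely illustrative queue sizes; verify the exact syntax, and whether your OFED/OpenMPI build supports XRC, with "ompi_info --param btl openib":

  # Illustrative only: ask the openib BTL for XRC receive queues (sizes are examples)
  mpirun --mca btl openib,self,sm \
         --mca btl_openib_receive_queues X,4096,256:X,12288,256:X,65536,256 \
         ./my_mpi_app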

Anyway, my questions are:


1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?

We are running Red Hat 6.x.
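
On the two questions above: one starting point on a RHEL 6 node is /proc/<pid>/smaps, which, on kernels that report a per-mapping "Locked:" field, lets you both total the locked memory and see which address ranges (and backing files) it lives in.  One caveat: memory registered with ibv_reg_mr is pinned by the IB driver and counted against the memlock limit, but it is not always reported as Locked/VmLck, so also keep an eye on "ulimit -l" and the HCA driver's registration limits.  A rough sketch (field availability on your kernel is an assumption; <PID> is one MPI rank):

  # HOW MUCH: total the Locked fields across all mappings of the process (kB)
  grep '^Locked:' /proc/<PID>/smaps | awk '{sum += $2} END {print sum " kB locked"}'

  # WHERE: print each mapping header whose Locked value is non-zero
  awk '/^[0-9a-f]+-[0-9a-f]+ / {map=$0} /^Locked:/ && $2 > 0 {print map "  (" $2 " kB)"}' /proc/<PID>/smaps

  # Quick per-process summary and the memlock limit the IB stack checks against
  grep VmLck /proc/<PID>/status
  ulimit -l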

--john


