[mvapich-discuss] (no subject)

Sasso, John (GE Power & Water, Non-GE) John1.Sasso at ge.com
Wed Oct 28 14:20:35 EDT 2015


But my questions are:

1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?


Does MVAPICH do pinning of memory regions as well?  If so, my question would still hold even for MVAPICH.  Thanks

--john


From: hari.subramoni at gmail.com [mailto:hari.subramoni at gmail.com] On Behalf Of Hari Subramoni
Sent: Wednesday, October 28, 2015 2:09 PM
To: Sasso, John (GE Power & Water, Non-GE)
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re:

Hello,

The error is OpenMPI-specific, so we will not be able to give you exact guidance. However, could you please check whether the steps at the following link resolve the failure to create QPs?

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2a-userguide.html#x1-1150009.1.4

Regards,
Hari.

On Wed, Oct 28, 2015 at 1:56 PM, Sasso, John (GE Power & Water, Non-GE) <John1.Sasso at ge.com> wrote:

Pardon if this has been addressed already, but I could not find the answer after doing Google searches.  I tried posing this question on the OpenMPI and OpenFabrics mailing lists, but it was recommended I post to the MVAPICH list given their focus on IB.

We are in the process of analyzing and troubleshooting MPI jobs of increasingly large scale (OpenMPI 1.6.5) which communicate over a Mellanox-based IB fabric.  At a sufficiently large scale (# cores) a job will end up failing with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

So I know we are running into some memory limitation (educated guess) when queue pairs are being created to support such a huge mesh.  We are now investigating using the XRC transport to decrease memory consumption.

Anyways, my questions are:

1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?

We are running RedHat 6.x

--john
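
For question 1, one possible starting point (an assumption, not something confirmed anywhere in this thread) is the VmLck and VmPin lines of /proc/<pid>/status, which the Linux kernel reports in kB for each process. Whether IB-registered memory is counted under VmLck or VmPin depends on the kernel version, and VmPin may not exist at all on a RedHat 6.x kernel, so treat the numbers as a rough indication only. The helper names below (parse_vm_fields, report) are hypothetical and chosen just for this sketch, which assumes Python 3.

```python
import re
import sys

# Fields of /proc/<pid>/status that report locked/pinned memory, in kB.
# Assumption: VmPin is only present on newer kernels and may be missing
# on RedHat 6.x, in which case only VmLck (if any) is returned.
FIELDS = ("VmLck", "VmPin")


def parse_vm_fields(status_text):
    """Return {field: kB} for the locked/pinned fields found in status_text."""
    found = {}
    for line in status_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if m and m.group(1) in FIELDS:
            found[m.group(1)] = int(m.group(2))
    return found


def report(pid):
    """Read /proc/<pid>/status for one process and extract the fields above."""
    with open("/proc/%s/status" % pid) as f:
        return parse_vm_fields(f.read())


if __name__ == "__main__":
    # Usage: python pinned.py <pid> [<pid> ...], one MPI rank per pid.
    for pid in sys.argv[1:]:
        print(pid, report(pid))
```

Summing the per-rank values on a node would give a per-node estimate. This sketch does not address question 2 (where the regions are).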



