[mvapich-discuss] Finding amount of pinned memory and regions

Sasso, John (GE Power & Water, Non-GE) John1.Sasso at ge.com
Wed Oct 28 13:56:59 EDT 2015


Pardon me if this has already been addressed, but I could not find an answer through Google searches.  I posed this question on the Open MPI and OpenFabrics mailing lists, and it was recommended that I post to the MVAPICH list given its focus on InfiniBand (IB).

We are analyzing and troubleshooting MPI jobs of increasingly large scale (Open MPI 1.6.5) that communicate over a Mellanox-based IB fabric.  At a sufficiently large scale (number of cores), a job fails with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

My educated guess is that we are running into a memory limitation when queue pairs are created to support such a large mesh of connections.  We are now investigating the XRC transport to reduce memory consumption.
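
To illustrate why the mesh size matters, here is a rough back-of-the-envelope sketch in Python.  It rests on simplifying assumptions that are mine, not measurements: every process connects to every off-node peer, one queue pair per peer connection with the regular connected transport, and an idealized XRC where one send-side connection per remote node suffices.  The job size (2048 processes, 16 per node) and the function names are hypothetical.

```python
def rc_qps_per_process(total_procs, procs_per_node):
    """QPs one process opens if it connects to every off-node peer
    (assumption: exactly one QP per peer connection)."""
    return total_procs - procs_per_node


def xrc_qps_per_process(total_procs, procs_per_node):
    """Send-side QPs per process if one connection per remote node
    is enough (idealized XRC; receive-side resources are ignored)."""
    nodes = total_procs // procs_per_node
    return nodes - 1


if __name__ == "__main__":
    total, ppn = 2048, 16  # hypothetical job size, not taken from our runs
    rc = rc_qps_per_process(total, ppn)
    xrc = xrc_qps_per_process(total, ppn)
    print("per process: RC=%d XRC=%d" % (rc, xrc))
    print("per node:    RC=%d XRC=%d" % (rc * ppn, xrc * ppn))
```

Under these assumptions the per-node count drops by roughly a factor equal to the number of processes per node, which is the reduction we hope to get from XRC.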

Anyway, my questions are:


1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?

We are running Red Hat 6.x.
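
For question 1, one possible starting point is the per-process counters in /proc/<pid>/status.  The sketch below (Python, helper names are my own) pulls out the VmLck, VmPin and VmRSS lines.  Which of these actually reflects IB-registered memory is an assumption to verify: as far as I understand, VmPin only exists on newer kernels (3.2 and later, I believe), and on an older kernel such as the 2.6.32 one in Red Hat 6.x registered memory may show up under VmLck instead.  This gives a total per process only; it does not answer question 2.

```python
import re

# Fields of interest; VmPin may be absent on older kernels (assumption noted above).
FIELDS = ("VmLck", "VmPin", "VmRSS")


def parse_vm_fields(status_text, fields=FIELDS):
    """Return {field: size_in_kB} for the requested lines of /proc/<pid>/status."""
    found = {}
    for line in status_text.splitlines():
        m = re.match(r"^(\w+):\s+(\d+)\s+kB$", line.strip())
        if m and m.group(1) in fields:
            found[m.group(1)] = int(m.group(2))
    return found


def read_vm_fields(pid):
    """Read and parse /proc/<pid>/status for one process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vm_fields(f.read())


if __name__ == "__main__":
    import os
    # Demo on the current process; for an MPI job, loop over the PIDs of its ranks.
    print(read_vm_fields(os.getpid()))
```

Summing the chosen field over all ranks on a node would give a per-node figure, assuming the field is the right one for the kernel in use.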

--john



