[mvapich-discuss] Finding amount of pinned memory and regions
Sasso, John (GE Power & Water, Non-GE)
John1.Sasso at ge.com
Wed Oct 28 13:56:59 EDT 2015
Pardon me if this has already been addressed, but I could not find an answer through Google searches. I posed this question on the Open MPI and OpenFabrics mailing lists, and it was recommended that I post to the MVAPICH list given its focus on InfiniBand (IB).
We are analyzing and troubleshooting MPI jobs of increasingly large scale (Open MPI 1.6.5) that communicate over a Mellanox-based IB fabric. At a sufficiently large scale (number of cores), a job fails with errors similar to:
[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
My educated guess is that we are running into a memory limitation when queue pairs are created to support such a large mesh of connections. We are now investigating the XRC transport to reduce memory consumption.
Anyway, my questions are:
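To give a feel for why queue-pair count grows so quickly with job size, here is a rough back-of-the-envelope sketch. It assumes a fully connected job in which every rank holds exactly one RC queue pair to every other rank, and it models XRC as one send-side queue pair per remote node per rank. Both are simplifications: the actual count depends on how many queue pairs the MPI library opens per peer, on whether connections are set up lazily, and on receive-side XRC resources, none of which are modeled here. The function names and the 128-node, 16-ranks-per-node job are hypothetical.

```python
def rc_qps_per_node(nodes: int, ppn: int) -> int:
    """Estimate QPs on one node with one RC QP per (local rank, other rank) pair.

    Assumption: fully connected job, exactly one QP per peer.
    """
    total_ranks = nodes * ppn
    return ppn * (total_ranks - 1)


def xrc_qps_per_node(nodes: int, ppn: int) -> int:
    """Rough XRC estimate: one send-side QP per (local rank, remote node) pair.

    Assumption: receive-side XRC resources are ignored.
    """
    return ppn * (nodes - 1)


if __name__ == "__main__":
    nodes, ppn = 128, 16  # hypothetical job shape: 2048 ranks
    print("RC  QPs per node:", rc_qps_per_node(nodes, ppn))
    print("XRC QPs per node:", xrc_qps_per_node(nodes, ppn))
```

Under these assumptions the RC estimate scales with the total number of ranks, while the XRC estimate scales with the number of nodes.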
1. How do we determine HOW MUCH memory is being pinned by an MPI job on a node? (If pmap, what exactly are we looking for?)
2. How do we determine WHERE these pinned memory regions are?
We are running Red Hat Enterprise Linux 6.x.
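As one possible starting point for the "how much" question, the sketch below reads the VmLck and VmPin fields from /proc/<pid>/status. It is offered under assumptions that should be verified on the target system: on kernels from 3.2 onward, pages pinned through the RDMA stack are, to my understanding, reported as VmPin, while on older kernels such as the 2.6.32 series used by RHEL 6 they appear to be accounted under VmLck together with mlock()ed memory. The helper names are hypothetical. This does not address where the pinned regions are located.

```python
import re
import sys
from pathlib import Path

# Matches lines such as "VmLck:\t   65536 kB" in /proc/<pid>/status.
_FIELD = re.compile(r"^(VmLck|VmPin):\s+(\d+)\s+kB$", re.MULTILINE)


def parse_locked_fields(status_text: str) -> dict:
    """Return the VmLck/VmPin values (in kB) found in a status file's text.

    A field missing from the text (e.g. VmPin on an older kernel) is
    simply absent from the result.
    """
    return {name: int(kb) for name, kb in _FIELD.findall(status_text)}


def locked_kb_for_pid(pid: int) -> dict:
    """Read /proc/<pid>/status for one process and extract the fields."""
    return parse_locked_fields(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # Usage: python pinned.py <pid> [<pid> ...]
    for arg in sys.argv[1:]:
        print(arg, locked_kb_for_pid(int(arg)))
```

Running this against each MPI rank's PID on a node and summing the values would give a per-node figure, subject to the accounting caveats above.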
--john