[mvapich-discuss] Occasional failure initializing

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Jul 28 11:18:43 EDT 2015


Setting MV2_USE_MPIRUN_MAPPING=0 causes our library to do a collective
over PMI to determine which ranks are local to each other.  This is in
contrast to an optimization in which mpirun_rsh passes that mapping to
the library directly.
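
For reference, the PMI fallback is conceptually similar to the sketch
below: each rank publishes its hostname into the PMI key-value space,
then reads every other rank's entry to work out which ranks share a
host.  This is only an illustration of the idea using the standard
PMI-1 calls (the "hostkey-<rank>" key name and buffer sizes are made
up); it is not the actual MVAPICH2 implementation.

/* Conceptual sketch only -- not MVAPICH2 source.  Shows how rank
 * locality can be discovered with a PMI-1 KVS exchange when the
 * launcher does not provide the mapping directly. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pmi.h>

int main(void)
{
    int spawned, rank, size;
    char kvsname[256], key[64], val[256], myhost[256];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* Publish this rank's hostname under a per-rank key
     * ("hostkey-<rank>" is a hypothetical key name). */
    gethostname(myhost, sizeof(myhost));
    snprintf(key, sizeof(key), "hostkey-%d", rank);
    PMI_KVS_Put(kvsname, key, myhost);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();              /* all entries are now visible */

    /* Read every rank's hostname and count the ones on this host. */
    int local = 0;
    for (int r = 0; r < size; r++) {
        snprintf(key, sizeof(key), "hostkey-%d", r);
        PMI_KVS_Get(kvsname, key, val, sizeof(val));
        if (strcmp(val, myhost) == 0)
            local++;
    }
    printf("rank %d: %d rank(s) on host %s\n", rank, local, myhost);

    PMI_Finalize();
    return 0;
}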

Can you tell us a little more about the other errors that you were facing?
Was there some sort of backtrace or error stack that you can share?  There
might be some unexpected interaction taking place.

On Tue, Jul 28, 2015 at 10:30 AM Martin Pokorny <mpokorny at nrao.edu> wrote:

> On 07/27/2015 06:00 PM, Jonathan Perkins wrote:
> > Hello Martin, can you try setting MV2_USE_MPIRUN_MAPPING=0 to see if
> > this resolves the issue?
>
> That seems to resolve the problem, at least in my test code. I will try
> the same setting with my "real" application code later today.
>
> What is the meaning of the MV2_USE_MPIRUN_MAPPING variable?
> Interestingly, one way I can increase the frequency of the error that I
> reported is to set MV2_USE_RDMA_CM=1; other errors consistently occur
> with that setting in all programs that I've tried on our cluster, and it
> also triggers my reported error more frequently in the test program.
> However, with your suggested setting, not only has my reported error
> apparently gone away, but the other errors that I normally see when
> using RDMA CM have also disappeared.
>
> --
> Martin
>

