[mvapich-discuss] Patch to retry psm_ep_open

Adam T. Moody moody20 at llnl.gov
Tue Mar 10 18:54:51 EDT 2015


Hello MVAPICH team,
We have some people who run a sequence of up to 1000 independent MPI 
jobs within a single SLURM allocation as a suite of application 
regression tests.  All job steps are submitted to SLURM at once, and 
the users rely on SLURM to start each job step in turn once earlier 
job steps finish and free up resources.  It seems that some of these job 
steps start before the previous job steps have fully released their PSM 
contexts, which then leads to a failure in psm_ep_open() in the new job 
step.  It's not clear whether the problem lies with our (old) version of 
SLURM in starting the next job too early or whether the node / network 
card driver is just slow to free up contexts.

Anyway, as a workaround for such cases, the attached patch retries 
psm_ep_open() when it fails, sleeping between attempts.  The user can 
tune the maximum number of retries and the delay between retries with 
environment variables.  This workaround is rather hacky, but it helps 
on our machines.  I thought I'd send it your way in case it's useful 
to others with PSM.
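
For readers who cannot open the attachment, the retry logic described 
above can be sketched roughly as follows.  This is a minimal 
illustration under stated assumptions, not the patch itself: the 
environment variable names EXAMPLE_OPEN_RETRY_COUNT and 
EXAMPLE_OPEN_RETRY_SECS, their defaults, the helper names, and the 
warning format are all hypothetical, and a callback stands in for the 
real psm_ep_open() call so that the sketch compiles without the PSM 
headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical callback type: returns 0 on success, nonzero on failure.
 * In a real build, the wrapped call would be psm_ep_open(). */
typedef int (*open_fn)(void *ctx);

/* Read a non-negative integer from the environment, or use a default
 * when the variable is unset, empty, or not a sane number. */
static int env_int(const char *name, int fallback)
{
    const char *s = getenv(name);
    if (s == NULL || *s == '\0')
        return fallback;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v < 0 || v > 1000000)
        return fallback;
    return (int)v;
}

/* Call try_open once, then retry up to the configured number of times,
 * sleeping between attempts. Returns the status of the last attempt. */
static int open_with_retry(open_fn try_open, void *ctx)
{
    assert(try_open != NULL);
    /* Variable names and defaults are assumptions for this sketch. */
    int retries = env_int("EXAMPLE_OPEN_RETRY_COUNT", 10);
    int delay   = env_int("EXAMPLE_OPEN_RETRY_SECS", 1);

    int rc = try_open(ctx);
    for (int attempt = 1; rc != 0 && attempt <= retries; attempt++) {
        fprintf(stderr,
                "warning: endpoint open failed (rc=%d); "
                "retry %d of %d in %d s\n",
                rc, attempt, retries, delay);
        sleep((unsigned)delay);
        rc = try_open(ctx);
    }
    return rc;
}

/* Demo stand-in for the open call: fails while *remaining > 0. */
static int fail_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return -1;
    }
    return 0;
}
```

Calling open_with_retry(fail_n_times, &n) with n set to 2 and a retry 
count of at least 2 succeeds on the third attempt.  If the retry budget 
is smaller than the number of failures, the last error code is returned 
so the caller can still abort as before.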

My original patch was for MVAPICH-1.2, and I've ported it to 
MVAPICH2-2.0.1.  I have checked that it compiles, but if you want to 
include it, please verify that it does what you'd expect.  In 
particular, please look at the warning it prints in case you have a 
better format for that.
Thanks,
-Adam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: psm_ep_open_retry.patch
Type: text/x-patch
Size: 5455 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150310/31228b12/attachment.bin>