[mvapich-discuss] Patch to retry psm_ep_open
Adam T. Moody
moody20 at llnl.gov
Tue Mar 10 18:54:51 EDT 2015
Hello MVAPICH team,
We have some people who run a sequence of up to 1000 independent MPI
jobs within a single SLURM allocation as a suite of application
regression tests. All job steps are submitted to SLURM at once, and
they rely on SLURM to schedule the job steps to run in turn once earlier
job steps finish and free up resources. It seems that some of these job
steps start before the previous job steps have fully released their PSM
contexts, which then leads to a failure in psm_ep_open() in the new job
step. It's not clear whether the problem lies with our (old) version of
SLURM in starting the next job too early or whether the node / network
card driver is just slow to free up contexts.
Anyway, as a workaround for such cases, the attached patch retries
psm_ep_open() several times, sleeping between attempts. The user can
tune the maximum number of retries and the sleep time between retries
with environment variables. This workaround is rather hacky, but it
helps on our machines. I thought I'd send it your way in case it's
useful to others with PSM.
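
For readers without the attachment, the retry-and-sleep idea can be
sketched roughly as below. This is a minimal illustration, not the
patch itself: the environment variable names, the default values, the
warning text, and the open_fn_t callback (a stand-in for the real
psm_ep_open() call) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical variable names; the names used by the actual patch
 * are not given in this message. */
#define RETRY_COUNT_ENV "EXAMPLE_PSM_OPEN_RETRY_COUNT"
#define RETRY_SECS_ENV  "EXAMPLE_PSM_OPEN_RETRY_SECS"

/* Stand-in for a call to psm_ep_open(): returns 0 on success and a
 * nonzero value on failure. */
typedef int (*open_fn_t)(void *arg);

/* Read a non-negative integer from the environment, or return the
 * fallback if the variable is unset, empty, or malformed. */
static int env_int(const char *name, int fallback)
{
    const char *s = getenv(name);
    if (s == NULL || *s == '\0')
        return fallback;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v < 0 || v > 1000000)
        return fallback;
    return (int)v;
}

/* Try open_fn once, then retry up to the configured number of times,
 * sleeping between attempts. Returns the result of the last attempt
 * and, if attempts_out is non-NULL, stores how many attempts ran. */
static int open_with_retry(open_fn_t open_fn, void *arg, int *attempts_out)
{
    assert(open_fn != NULL);
    int retries = env_int(RETRY_COUNT_ENV, 10); /* assumed default */
    int secs = env_int(RETRY_SECS_ENV, 1);      /* assumed default */
    int attempts = 0;
    int rc;

    for (;;) {
        rc = open_fn(arg);
        attempts++;
        if (rc == 0 || attempts > retries)
            break;
        fprintf(stderr,
                "warning: endpoint open failed (rc=%d), "
                "retry %d of %d in %d s\n",
                rc, attempts, retries, secs);
        sleep((unsigned)secs);
    }

    if (attempts_out != NULL)
        *attempts_out = attempts;
    return rc;
}

/* Demo stub: fails until the counter it points at reaches zero. */
static int fail_n_times(void *arg)
{
    int *remaining = arg;
    if (*remaining > 0) {
        (*remaining)--;
        return 1;
    }
    return 0;
}
```

In a real port, the callback would wrap the existing psm_ep_open() call
and translate its return code, so the retry logic stays separate from
the PSM-specific arguments. With the retry count set to 0, the sketch
makes a single attempt and returns.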
My original patch was for MVAPICH-1.2, and I've ported it to
MVAPICH2-2.0.1. I checked that it compiles; however, if you want to
include it, please verify that it does what you'd expect. In
particular, please look at the warning it prints in case you have a
better format for that.
Thanks,
-Adam
-------------- next part --------------
A non-text attachment was scrubbed...
Name: psm_ep_open_retry.patch
Type: text/x-patch
Size: 5455 bytes
Desc: not available
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20150310/31228b12/attachment.bin>