[mvapich-discuss] Multi-node jobs hang with mpirun

Subramoni, Hari subramoni.1 at osu.edu
Wed Dec 11 20:50:50 EST 2019


Hi, Chris.

Sorry to hear that you’re facing issues. Typically, MVAPICH2 should run fine with SLURM.


  1.  Does the system have multiple IB HCAs?
  2.  Can you please send me the output of ibv_devinfo executed on a compute node?
  3.  Can you please send the output of mpiname -a?
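
In case it helps with collecting the items above, here is a minimal sketch, not an official MVAPICH2 tool. It assumes ibv_devinfo and mpiname are on the PATH of the compute node and that each HCA is introduced by an "hca_id:" line in the ibv_devinfo output. The helper names hca_names and run are hypothetical, chosen only for illustration.

```python
import subprocess


def hca_names(devinfo_text):
    """Return the device names found on 'hca_id:' lines of ibv_devinfo output.

    Assumption: each HCA section starts with a line such as 'hca_id: mlx5_0'.
    """
    names = []
    for line in devinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("hca_id:"):
            names.append(stripped.split(":", 1)[1].strip())
    return names


def run(cmd):
    """Run a command and return its stdout, or None if it is not installed."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return None
    return result.stdout


if __name__ == "__main__":
    # Run this on a compute node, not on the login node.
    devinfo = run(["ibv_devinfo"])
    if devinfo is None:
        print("ibv_devinfo not found on this node")
    else:
        names = hca_names(devinfo)
        print(f"{len(names)} HCA(s) found: {', '.join(names)}")
        print(devinfo)

    build_info = run(["mpiname", "-a"])
    print(build_info if build_info is not None else "mpiname not found")
```

More than one name in the first line of output would answer question 1; the rest of the output covers questions 2 and 3.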

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Chris Woelkers - NOAA Federal
Sent: Wednesday, December 11, 2019 7:22 PM
To: Carlson, Timothy S <Timothy.Carlson at pnnl.gov>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Multi-node jobs hang with mpirun

I originally asked them, and after some time they placed the blame on Slurm. Some more research with that group gave me the hint to check mvapich. I've now put the issue back in Bright's hands and am asking here as much out of curiosity as anything.

On Wed, Dec 11, 2019, 17:49 Carlson, Timothy S <Timothy.Carlson at pnnl.gov> wrote:
I would offer that this should be addressed by the Bright folks as it is software that was bundled with their cluster management tools.

From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Chris Woelkers - NOAA Federal
Sent: Wednesday, December 11, 2019 2:36 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Multi-node jobs hang with mpirun

I'm using mvapich 2.3 as provided by the repository for Bright Cluster Manager. All jobs are submitted via Slurm.
When I run a job on a single node, it completes with no problem.
When I run that same job across multiple nodes, it hangs and eventually times out with no output or errors.
I found the following thread describing almost the same issue with mvapich 2.3a: http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2017-June/006402.html
I am wondering whether this issue was found and fixed in the final 2.3 release, assuming 2.3a is an alpha or other early release.

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446