[mvapich-discuss] Unable to get MVAPICH2 1.7 to spawn all requested ranks in a specific node

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Sep 3 20:16:44 EDT 2012


On Mon, Sep 03, 2012 at 10:51:53PM +0000, Fernandez, Alvaro wrote:
> Hello,

Hi, my reply is inline.

> This is my first post to this list, so please bear with me...
> 
> I have a cluster with four boxes. Each box is a 2P node, and each P
> has 16 cores, so I have a total of 32 cores per node, or 32 x 4 = 128
> cores in the full cluster (Scientific Linux 6.2 "Carbon"). I've
> verified that InfiniBand works between the 2P nodes.
> 
> I am running a simple MPI code hello_v2.f  (listed at the end of this
> email), which outputs a rank number and hostname for every MPI rank it
> spawns. I compile and run like this:
> 
> sm2utsq01 # mpif90 -o hello_v2.exe hello_v2.f -f90=openf90
> -I/usr/mpi/gcc/mvapich2-1.7/include
> sm2utsq01 #  mpirun_rsh -hostfile hosts_all -np 128 hello_v2.exe  >&
> mpirun_rsh_all.out
> 
> The contents of the host file are:
> # cat hosts_all
> sm2utsq01:32
> sm2utsq02:32
> sm2utsq03:32
> sm2utsq04:32
> 
> So the output file looks like this:
> node           1 : Hello world from host sm2utsq01
> node           2 : Hello world from host sm2utsq01
> node          11 : Hello world from host sm2utsq01
> [...]
> 
> 
> I used grep on this file to verify I have 32 responding ranks in each
> node  like I should. But here's what I see:
> [root@sm2utsq01]# grep sm2utsq01 mpirun_rsh_all.out | wc -l    NOT OK.
> 2
> [root@sm2utsq01]# grep sm2utsq02 mpirun_rsh_all.out | wc -l    OK.
> 32
> [root@sm2utsq01]# grep sm2utsq03 mpirun_rsh_all.out | wc -l    OK.
> 32
> [root@sm2utsq01]# grep sm2utsq04 mpirun_rsh_all.out | wc -l    OK.
> 32
> 
> In fact, for the node I'm launching from (sm2utsq01), I always get a
> variable count of MPI ranks, and never all 32 that I expect. The
> other nodes all output the correct count of 32.

What is the output for this node?  There may be a clue there such as
garbled output or an error message.
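To automate the per-host counts, and to make sure any error messages on
stderr end up in the log as well, you can merge stdout and stderr into one
file and loop over the hosts.  A minimal sketch, with a synthetic log
standing in for your mpirun_rsh_all.out:

```shell
# In practice, capture both streams from the real run:
#   mpirun_rsh -hostfile hosts_all -np 128 hello_v2.exe > mpirun_rsh_all.out 2>&1
# Synthetic stand-in log for illustration:
cat > sample.out <<'EOF'
node 1 : Hello world from host sm2utsq01
node 2 : Hello world from host sm2utsq02
node 3 : Hello world from host sm2utsq02
EOF

# Count how many ranks reported in from each host.
for host in sm2utsq01 sm2utsq02 sm2utsq03 sm2utsq04; do
    printf '%s: %s ranks\n' "$host" "$(grep -c "$host" sample.out)"
done
```

With your real log in place of sample.out, any host showing fewer than 32
ranks is the one to inspect for errors.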

> Why? What is so special about the node I'm running on? I've tried
> playing with the host file syntax, and listing the hosts in reverse
> order.

Do you see the same behavior when you launch from another node?

> 
> But regardless,  sm2utsq01 is not running all 32 MPI ranks it ought to.
> 
> I'm really perplexed. Any ideas on how to debug this?

I think the answers to the above questions should help.  You can also
try running the OSU Micro-Benchmarks to see whether the same behavior
occurs there.

Please note, the stable release of MVAPICH2 is 1.8.  I highly suggest
that you update to the latest 1.8 nightly tarball to get the latest bug
fixes and performance enhancements.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

