[mvapich-discuss] Unable to get MVAPICH2 1.7 to spawn all requested ranks in a specific node

Fernandez, Alvaro Alvaro.Fernandez at amd.com
Mon Sep 3 18:51:53 EDT 2012


Hello,


This is my first post to this list, so please bear with me...

I have a cluster of four boxes running Scientific Linux 6.2 (Carbon). Each box is a two-socket (2P) node, and each socket has 16 cores, so each node has 32 cores, for 32 x 4 = 128 cores in the full cluster. I've verified that InfiniBand works from one 2P node to another.

I am running a simple MPI code hello_v2.f  (listed at the end of this email), which outputs a rank number and hostname for every MPI rank it spawns. I compile and run like this:

sm2utsq01 # mpif90 -o hello_v2.exe hello_v2.f -f90=openf90  -I/usr/mpi/gcc/mvapich2-1.7/include
sm2utsq01 #  mpirun_rsh -hostfile hosts_all -np 128 hello_v2.exe  >& mpirun_rsh_all.out

The contents of the host file are:
# cat hosts_all
sm2utsq01:32
sm2utsq02:32
sm2utsq03:32
sm2utsq04:32

The output file looks like this:
node           1 : Hello world from host sm2utsq01
node           2 : Hello world from host sm2utsq01
node          11 : Hello world from host sm2utsq01
[...]


I used grep on this file to verify that 32 ranks respond on each node, as they should. But here's what I see:
[root at sm2utsq01]# grep sm2utsq01 mpirun_rsh_all.out | wc -l    NOT OK.
2
[root at sm2utsq01]# grep sm2utsq02 mpirun_rsh_all.out | wc -l    OK.
32
[root at sm2utsq01]# grep sm2utsq03 mpirun_rsh_all.out | wc -l    OK.
32
[root at sm2utsq01]# grep sm2utsq04 mpirun_rsh_all.out | wc -l    OK.
32

In fact, the count of MPI ranks reported for the node I'm launching from (sm2utsq01) varies from run to run, and it never reaches the 32 I expect. The other nodes all report the correct count of 32.

Why? What is so special about the node I'm launching from? I've tried playing with the host file syntax and listing the hosts in reverse order, but regardless, sm2utsq01 never runs all 32 MPI ranks it ought to.

I'm really perplexed. Any ideas on how to debug this?
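
One experiment I may try (quick sketch below, untested): have every rank write its greeting to its own file instead of stdout. All 32 local ranks funnel their stdout through mpirun_rsh and a single shell redirection, so if lines from concurrent ranks on the launch node are getting interleaved or dropped, grep could undercount even though every rank actually ran; per-rank files would sidestep that. The rank_NNN.txt naming is just something I made up:

c     per-rank output files to rule out stdout interleaving/loss
      program hello_perrank
         implicit none
         include 'mpif.h'

         integer rank, l, ierror
         character*(MPI_MAX_PROCESSOR_NAME) hostname
         character*16 fname

         call MPI_INIT(ierror)
         call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
         call mpi_get_processor_name(hostname, l, ierror)

c        one file per rank, e.g. rank_007.txt
         write(fname, '(a,i3.3,a)') 'rank_', rank, '.txt'
         open(unit=11, file=trim(fname), status='replace')
         write(11, *) 'node', rank, ': Hello world from host ',
     &                trim(hostname)
         close(11)

         call MPI_FINALIZE(ierror)
      end

If 128 files show up after a run (assuming the nodes share the working directory), the ranks are starting fine and the problem is in how their stdout reaches the output file; if files for sm2utsq01 are missing, the ranks themselves never ran there.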

***************

The Fortran listing for the test code:

[root at sm2utsq01]# cat hello_v2.f
c234567
      program hello
         implicit none
         include 'mpif.h'

c        l receives the hostname length from mpi_get_processor_name
         integer rank, size, ierror, l
         character*(MPI_MAX_PROCESSOR_NAME) hostname

         call MPI_INIT(ierror)
         call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
         call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

         call mpi_get_processor_name(hostname, l, ierror)

         if (ierror /= MPI_SUCCESS) then
            print *, 'Error in get_processor_name, ierror =', ierror
         else
            print *, 'node', rank, ': Hello world from host ',
     &               trim(hostname)
         end if

         call MPI_FINALIZE(ierror)
         if (ierror /= MPI_SUCCESS) print *, 'ierror =', ierror
      end


Alvaro



