[mvapich-discuss] Unable to get MVAPICH2 1.7 to spawn all requested
ranks in a specific node
Fernandez, Alvaro
Alvaro.Fernandez at amd.com
Mon Sep 3 18:51:53 EDT 2012
Hello,
This is my first post to this list, so please bear with me...
I have a cluster with four boxes. Each box is a 2P node, and each P has 16 cores, so I have 32 cores per node, or 32 x 4 = 128 cores in the full cluster (Scientific Linux 6.2, "Carbon"). I've verified that InfiniBand works between the 2P nodes.
I am running a simple MPI code hello_v2.f (listed at the end of this email), which outputs a rank number and hostname for every MPI rank it spawns. I compile and run like this:
sm2utsq01 # mpif90 -o hello_v2.exe hello_v2.f -f90=openf90 -I/usr/mpi/gcc/mvapich2-1.7/include
sm2utsq01 # mpirun_rsh -hostfile hosts_all -np 128 hello_v2.exe >& mpirun_rsh_all.out
The contents of the host file are:
# cat hosts_all
sm2utsq01:32
sm2utsq02:32
sm2utsq03:32
sm2utsq04:32
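(As a quick sanity check on the hostfile itself, the per-host slot counts can be summed to confirm they match the -np 128 argument. This is just a sketch; the heredoc reproduces the hosts_all contents shown above.)

```shell
# Recreate hosts_all as shown above (host:slots, one per line).
cat > hosts_all <<'EOF'
sm2utsq01:32
sm2utsq02:32
sm2utsq03:32
sm2utsq04:32
EOF
# Sum the slot counts (the field after the colon); the total
# should match the -np argument passed to mpirun_rsh.
awk -F: '{ total += $2 } END { print total }' hosts_all
```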
The output file looks like this:
node 1 : Hello world from host sm2utsq01
node 2 : Hello world from host sm2utsq01
node 11 : Hello world from host sm2utsq01
[...]
I used grep on this file to verify that I have 32 responding ranks on each node, as I should. But here's what I see:
[root at sm2utsq01]# grep sm2utsq01 mpirun_rsh_all.out | wc -l NOT OK.
2
[root at sm2utsq01]# grep sm2utsq02 mpirun_rsh_all.out | wc -l OK.
32
[root at sm2utsq01]# grep sm2utsq03 mpirun_rsh_all.out | wc -l OK.
32
[root at sm2utsq01]# grep sm2utsq04 mpirun_rsh_all.out | wc -l OK.
32
In fact, for the node I'm launching from (sm2utsq01), I always get a variable count of MPI ranks, never the 32 I expect. The other nodes all report the correct count of 32.
Why? What is so special about the node I'm launching from? I've tried playing with the host file syntax and listing the hosts in reverse order, but regardless, sm2utsq01 never runs all 32 MPI ranks it should.
I'm really perplexed. Any ideas on how to debug this?
***************
The Fortran listing for the test code:
[root at sm2utsq01]# cat hello_v2.f
c234567
      program hello
      implicit none
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), l
      character*(MPI_MAX_PROCESSOR_NAME) hostname
!     integer gethostname !$pragma C(gethostname)
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      call MPI_GET_PROCESSOR_NAME(hostname, l, ierror)
      if (ierror /= 0) then
         print *, 'Error in get_processor_name ', trim(hostname)
      else
         print *, 'node', rank, ': Hello world from host ',
     &        trim(hostname)
      end if
      call MPI_FINALIZE(ierror)
      if (ierror /= 0) print *, 'ierror = ', ierror
      end
Alvaro