[mvapich-discuss] Issues with 4 socket AMD 6200 system

Kyle Sheumaker ksheumaker at advancedclustering.com
Thu Dec 1 11:59:06 EST 2011


We have encountered a very strange problem when using mvapich2 (1.8a1p1) on a 4-socket AMD 6276 machine with 16 cores per socket (64 cores total).  At this point we are only testing a single box, not multiple nodes.  When we try to run even the simplest of programs compiled with PGI 11.10, we get this result:

[act at quad ~]$ mpirun -np 64 ./hello
[proxy:0:0 at quad] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:79): assert (!closed) failed
[proxy:0:0 at quad] fn_get (./pm/pmiserv/pmip_pmi_v1.c:348): error sending PMI response
[proxy:0:0 at quad] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0 at quad] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at quad] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
[mpiexec at quad] control_cb (./pm/pmiserv/pmiserv_cb.c:154): assert (!closed) failed
[mpiexec at quad] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at quad] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:189): error waiting for event
[mpiexec at quad] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion

If I reduce the number of processes with the -np argument, it still fails, even with -np 1.

The same code runs correctly when mvapich2 is built with gfortran instead of pgfortran.  It also works fine on the same box with openmpi and mpich2, both built with pgfortran.
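
For reference, the failing mvapich2 build points configure at the PGI compilers in the standard way; the exact line below is approximate (the prefix and any extra options are illustrative):

  ./configure --prefix=/opt/mvapich2-pgi CC=pgcc FC=pgfortran F77=pgfortran
  make && make install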

If I reduce the number of cores in the box by setting maxcpus on the kernel command line, my sample program runs fine without recompiling.  I kept increasing the maxcpus value until it failed, and the magic number appears to be 41: with maxcpus=1 through maxcpus=40 the program runs, and with anything greater than 40 it fails.
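
For anyone who wants to reproduce the core-count experiment, the procedure is roughly: append maxcpus=N to the kernel line in the bootloader configuration (the file location depends on the distro, so that edit isn't shown here), reboot, and confirm how many cores the OS actually brought up, e.g.:

  grep -c ^processor /proc/cpuinfo
  nproc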

The code I'm testing with is very simple:

program hello
  use mpi
  implicit none
  integer :: ierr, nprocs, mype, n

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)

  ! each rank prints its greeting when the loop index matches its rank
  do n = 0, nprocs-1
    if (mype == n) write(6,*) 'hi from process ', n
  enddo

  call MPI_FINALIZE(ierr)
end program hello
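
It is compiled and launched with the normal mvapich2 wrappers (the source file name here is just illustrative):

  mpif90 hello.f90 -o hello
  mpirun -np 64 ./hello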

Any ideas?  I can provide remote access to the box if needed.

Thanks,
-- Kyle

---
Kyle Sheumaker
Advanced Clustering
phone: 913-643-0305
skype: ksheumaker
email/xmpp: ksheumaker at advancedclustering.com
