[mvapich-discuss] Issues with 4 socket AMD 6200 system

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Dec 1 12:25:50 EST 2011


Thanks for the report, Kyle.  Remote access to this machine will be
helpful.  I'll send you a follow-up message with my public ssh key.

On Thu, Dec 1, 2011 at 11:59 AM, Kyle Sheumaker
<ksheumaker at advancedclustering.com> wrote:
> We have encountered a very strange problem when using mvapich2 (1.8a1p1) on
> a 4-socket, 16-cores-per-socket AMD 6276 machine (64 cores total).  At this
> point we are not trying multiple boxes, just a single system.  When we try
> to run even the simplest of programs compiled with PGI 11.10, we get this
> result:
>
> [act at quad ~]$ mpirun -np 64 ./hello
> [proxy:0:0 at quad] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:79): assert
> (!closed) failed
> [proxy:0:0 at quad] fn_get (./pm/pmiserv/pmip_pmi_v1.c:348): error sending PMI
> response
> [proxy:0:0 at quad] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned
> error
> [proxy:0:0 at quad] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at quad] main (./pm/pmiserv/pmip.c:214): demux engine error waiting
> for event
> [mpiexec at quad] control_cb (./pm/pmiserv/pmiserv_cb.c:154): assert (!closed)
> failed
> [mpiexec at quad] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at quad] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:189): error waiting for event
> [mpiexec at quad] main (./ui/mpich/mpiexec.c:385): process manager error
> waiting for completion
>
> If I reduce the number of processes with the -np argument, it still fails,
> even with -np 1.
>
> When using mvapich2 built with gfortran instead of pgfortran, the same code
> runs correctly.  The same code on the same box also works fine with openmpi
> and mpich2 built with pgfortran.
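>
> For reference, the two mvapich2 builds differ only in which Fortran compiler
> is handed to configure, roughly along these lines (prefixes and options here
> are illustrative, not our exact build commands):
>
>   # build against the PGI Fortran compiler (this is the one that fails)
>   ./configure FC=pgfortran F77=pgfortran --prefix=/opt/mvapich2-pgi
>   make && make install
>
>   # build against the GNU Fortran compiler (this one runs fine)
>   ./configure FC=gfortran F77=gfortran --prefix=/opt/mvapich2-gnu
>   make && make install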
>
> If I reduce the number of cores in the box by setting maxcpus on the kernel
> command line, my sample program runs fine without recompiling.  I kept
> incrementing the maxcpus argument until it failed, and the magic number
> appears to be 41: with maxcpus=1 through maxcpus=40 the program runs, and
> with anything greater than 40 it fails.
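>
> Concretely, that just means appending maxcpus=N to the kernel line of the
> entry we boot (the line below is a sketch, not our exact boot loader entry)
> and rebooting, then counting how many cores actually came online:
>
>   # in the boot loader config, on the kernel line of the booted entry:
>   #   kernel /vmlinuz-<version> ro root=<root-device> maxcpus=40
>   # after the reboot, count the online cores:
>   grep -c ^processor /proc/cpuinfo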
>
> The code I'm testing with is very simple:
>
> program hello
>  use mpi
>  implicit none
>  integer :: ierr, nprocs, mype, n
>
>  CALL MPI_INIT(ierr)
>  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
>  CALL MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
>
>  ! each rank prints exactly one greeting line
>  do n=0,nprocs-1
>   if (mype.EQ.n) write(6,*) 'hi from process ', n
>  enddo
>
>  CALL MPI_FINALIZE(ierr)
>  stop
> end program
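>
> (Built and run in the obvious way; the wrapper name here assumes the mvapich2
> compiler wrappers are first in PATH:)
>
>   mpif90 -o hello hello.f90
>   mpirun -np 64 ./hello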
>
> Any ideas?  I can provide remote access to the box if needed.
>
> Thanks,
> -- Kyle
>
> ---
> Kyle Sheumaker
> Advanced Clustering
> phone: 913-643-0305
> skype: ksheumaker
> email/xmpp: ksheumaker at advancedclustering.com
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


