[mvapich-discuss] Issues with 4 socket AMD 6200 system

Mehmet mehmet.belgin at oit.gatech.edu
Thu Dec 1 13:26:03 EST 2011


Not sure if this helps, but I gave it a shot on one of our 64-core
Interlagos nodes using PGI 11.8/mvapich2-1.6, and it worked fine for 64
cores. I used mpirun_rsh. I know this is not an apples-to-apples
comparison, but still...
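
For reference, a single-node mpirun_rsh launch looks roughly like the
sketch below (the hostfile name and binary are placeholders, not the
exact ones from our run):

  # hypothetical hostfile ./hosts lists the one 64-core node;
  # mpirun_rsh reuses it for all 64 ranks
  mpirun_rsh -np 64 -hostfile ./hosts ./hello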

We recently had a problem where mvapich2 did not work with > 64 procs
unless we disabled ptmalloc (the registration cache). That was with v1.7.
I was wondering whether 1.8 addresses this issue. Devendar, sorry to
constantly bug you with this, but I am curious...
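
For anyone hitting the same thing: disabling the registration cache at
build time looks roughly like this (the install prefix is a placeholder,
and I am quoting the configure flag from memory):

  # rebuild mvapich2 without ptmalloc/registration cache
  ./configure --prefix=/opt/mvapich2-noregcache --disable-registration-cache
  make && make install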

-Mehmet


On Thu, Dec 1, 2011 at 12:25 PM, Jonathan Perkins <
perkinjo at cse.ohio-state.edu> wrote:

> Thanks for the report, Kyle.  Remote access to this machine will be
> helpful.  I'll send you a followup message with my public ssh key.
>
> On Thu, Dec 1, 2011 at 11:59 AM, Kyle Sheumaker
> <ksheumaker at advancedclustering.com> wrote:
> > We have encountered a very strange problem when using mvapich2 (1.8a1p1)
> > with a 4-socket, 16-core-per-socket AMD 6276 machine (64 total cores).
> > At this point we are not trying multiple boxes, just one system.  When
> > we try to run even the simplest of programs compiled with PGI 11.10 we
> > get this result:
> >
> > [act at quad ~]$ mpirun -np 64 ./hello
> > [proxy:0:0 at quad] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:79): assert (!closed) failed
> > [proxy:0:0 at quad] fn_get (./pm/pmiserv/pmip_pmi_v1.c:348): error sending PMI response
> > [proxy:0:0 at quad] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> > [proxy:0:0 at quad] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> > [proxy:0:0 at quad] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
> > [mpiexec at quad] control_cb (./pm/pmiserv/pmiserv_cb.c:154): assert (!closed) failed
> > [mpiexec at quad] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec at quad] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:189): error waiting for event
> > [mpiexec at quad] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion
> >
> > If I reduce the number of processors with the -np argument, it still
> > fails, even with -np 1.
> >
> > When I try the same code using mvapich2 built with gfortran instead of
> > pgfortran, it runs correctly.  Using the same code on the same box with
> > openmpi and mpich2 built with pgfortran, it also works fine.
> >
> > If I reduce the number of cores in the box by setting maxcpus on the
> > kernel command line, my sample program runs fine without recompiling.
> > I kept incrementing the maxcpus argument until it failed, and the magic
> > number appears to be 41.  With maxcpus=1 through maxcpus=40 the program
> > runs; with anything greater than 40 it fails.
> >
> > The code I'm testing with is very simple:
> >
> > program hello
> >  use mpi
> >  implicit none
> >  integer :: ierr ,nprocs,mype,n
> >
> >  CALL MPI_INIT(ierr)
> >  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
> >  CALL MPI_COMM_RANK(MPI_COMM_WORLD,mype,ierr)
> >  do n=0,nprocs-1
> >   if (mype.EQ.n) write(6,*) 'hi from process ', n
> >  enddo
> >
> >  CALL MPI_FINALIZE(ierr)
> >  stop
> > end program
> >
> > Any ideas?  I can provide remote access to the box if needed.
> >
> > Thanks,
> > -- Kyle
> >
> > ---
> > Kyle Sheumaker
> > Advanced Clustering
> > phone: 913-643-0305
> > skype: ksheumaker
> > email/xmpp: ksheumaker at advancedclustering.com
> >
> >
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>



-- 
=========================================
Mehmet Belgin, Ph.D. (mehmet.belgin at oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 Fourth Street, Rich Building, Room 326
Atlanta, GA  30332-0700
Office: (404) 385-0665