[mvapich-discuss] Issues with 4 socket AMD 6200 system

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Dec 19 20:07:46 EST 2011


Hi all.  I'm updating the list to let everyone know that the failure Kyle
reported was caused by the size of one of our internal data structures.
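
For anyone curious, below is a simplified sketch of the class of bug
involved.  The names (MAX_LOCAL_PROCS, register_local_rank) are made up
for illustration and do not match the actual MVAPICH2 source:

/* Illustrative sketch only, not the actual MVAPICH2 code.  It assumes
 * an internal table sized by a compile-time constant; once the number
 * of local processes exceeds that size, writes run past the end of the
 * array (or, with a bounds check, registration fails). */
#include <stdio.h>

#define MAX_LOCAL_PROCS 40            /* hypothetical fixed limit */

static int local_rank_table[MAX_LOCAL_PROCS];

int register_local_rank(int local_rank)
{
    if (local_rank >= MAX_LOCAL_PROCS) {
        fprintf(stderr, "local rank %d exceeds table size %d\n",
                local_rank, MAX_LOCAL_PROCS);
        return -1;
    }
    local_rank_table[local_rank] = local_rank;
    return 0;
}

int main(void)
{
    /* 64 local processes, as on the quad-socket box in the report */
    for (int r = 0; r < 64; r++)
        if (register_local_rank(r) != 0)
            return 1;
    return 0;
}

In this sketch 64 local processes exceed the hypothetical limit of 40,
which lines up with the 40/41 core boundary Kyle observed.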

We have provided a fix for this problem in the 1.7 branch.  The latest
tarball http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.7/mvapich2-latest.tar.gz
contains this fix along with a couple of other recent fixes and enhancements.

Kyle, thanks again for reporting this issue to us.

On Thu, Dec 1, 2011 at 12:25 PM, Jonathan Perkins
<perkinjo at cse.ohio-state.edu> wrote:
> Thanks for the report, Kyle.  Remote access to this machine will be
> helpful.  I'll send you a follow-up message with my public ssh key.
>
> On Thu, Dec 1, 2011 at 11:59 AM, Kyle Sheumaker
> <ksheumaker at advancedclustering.com> wrote:
>> We have encountered a very strange problem when using mvapich2 (1.8a1p1)
>> on a 4-socket AMD 6276 machine with 16 cores per socket (64 cores total).
>> At this point we are not trying multiple boxes, just one system.  When we
>> try to run even the simplest of programs compiled with PGI 11.10 we get
>> this result:
>>
>> [act@quad ~]$ mpirun -np 64 ./hello
>> [proxy:0:0@quad] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:79): assert
>> (!closed) failed
>> [proxy:0:0@quad] fn_get (./pm/pmiserv/pmip_pmi_v1.c:348): error sending PMI
>> response
>> [proxy:0:0@quad] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned
>> error
>> [proxy:0:0@quad] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0@quad] main (./pm/pmiserv/pmip.c:214): demux engine error waiting
>> for event
>> [mpiexec@quad] control_cb (./pm/pmiserv/pmiserv_cb.c:154): assert (!closed)
>> failed
>> [mpiexec@quad] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec@quad] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:189): error waiting for event
>> [mpiexec@quad] main (./ui/mpich/mpiexec.c:385): process manager error
>> waiting for completion
>>
>> If I reduce the number of processors with the -np argument, it still fails,
>> even with -np 1.
>>
>> The same code runs correctly when mvapich2 is built with gfortran instead
>> of pgfortran.  The same code on the same box also works fine with openmpi
>> and mpich2 built with pgfortran.
>>
>> If I reduce the number of cores in the box by setting maxcpus on the
>> kernel command line, my sample program runs fine without recompiling.  I
>> kept incrementing the maxcpus argument until it failed, and it appears the
>> magic number is 41.  With maxcpus=1 through maxcpus=40 the program runs;
>> anything greater than 40 fails.
>>
>> The code I'm testing with is very simple:
>>
>> program hello
>>  use mpi
>>  implicit none
>>  integer :: ierr, nprocs, mype, n
>>
>>  CALL MPI_INIT(ierr)
>>  CALL MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
>>  CALL MPI_COMM_RANK(MPI_COMM_WORLD,mype,ierr)
>>  do n=0,nprocs-1
>>   if (mype.EQ.n) write(6,*) 'hi from process ', n
>>  enddo
>>
>>  CALL MPI_FINALIZE(ierr)
>>  stop
>> end program
>>
>> Any ideas?  I can provide remote access to the box if needed.
>>
>> Thanks,
>> -- Kyle
>>
>> ---
>> Kyle Sheumaker
>> Advanced Clustering
>> phone: 913-643-0305
>> skype: ksheumaker
>> email/xmpp: ksheumaker at advancedclustering.com
>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


