[mvapich-discuss] Help with polled desc error

Le Yan lyan1 at cct.lsu.edu
Wed Feb 20 17:08:32 EST 2008


Hi,

Thank you for the suggestions. I apologize that I wasn't able to work on
this for the past week.

I'm not sure whether this matches other people's experience with the
same problem, but it looks like "-env MV2_USE_RING_STARTUP 0" did the
trick for us: I've been running 10+ jobs with 256 procs on the same set
of nodes, and every job launched with that option on the command line
ran just fine, while jobs launched with the default environment
settings kept failing.
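
For the record, the full launch command looks like this ("./a.out"
stands in for the actual application binary):

mpiexec -n 256 -env MV2_USE_RING_STARTUP 0 ./a.out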

Hope this is helpful information.

Cheers,
Le 

On Tue, 2008-02-12 at 16:41 -0500, wei huang wrote:
> Hi,
> 
> We do not see anything abnormal in our local testing. To help us
> locate the problem, could you please try the following:
> 
> 1) Check that you have enough space in the /tmp directory.
> 
> 2) Disable the ring-based startup using:
> 
> mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
> 
> 3) If this fails, disable shared memory support using the runtime
> variable MV2_USE_SHARED_MEM=0:
> 
> mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
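> 
> If both settings need to be disabled, the two -env options can be
> combined on one command line (assuming your mpiexec accepts repeated
> -env options, which the mpd-based launcher should):
> 
> mpiexec -n N -env MV2_USE_RING_STARTUP 0 -env MV2_USE_SHARED_MEM 0 ./a.out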
> 
> Thanks.
> 
> Regards,
> Wei Huang
> 
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
> 
> 
> On Tue, 12 Feb 2008, Le Yan wrote:
> 
> > Hi,
> >
> > We have the same problem here with MVAPICH2 1.0.1 on a Dell
> > InfiniBand cluster. It has 8 cores per node and runs RHEL 4.5
> > (kernel 2.6.9-55). The OFED library version is 1.2.
> >
> > At first it seemed that any code compiled with MVAPICH2 1.0.1 failed
> > at the MPI_Init stage when running with more than 128 procs. But
> > later we found that a code runs only if it does not use all 8
> > processors on a node (which explains why mpiGraph never fails: it
> > uses only 1 processor per node). For example, a job running on 16
> > nodes with 8 procs per node will fail, but one on 32 nodes with 4
> > procs per node will not.
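> >
> > To illustrate with the launch commands (machinefile layout assumed:
> > hosts8 lists each node 8 times, hosts4 lists each node 4 times):
> >
> > mpiexec -machinefile hosts8 -n 128 ./a.out   # 16 nodes x 8 procs: fails
> > mpiexec -machinefile hosts4 -n 128 ./a.out   # 32 nodes x 4 procs: works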
> >
> > In addition, if the MALLOC_CHECK_ environment variable is set to 1,
> > a bunch of errors like this appear on standard error:
> >
> > 61: malloc: using debugging hooks
> > 61: free(): invalid pointer 0x707000!
> > 61: Fatal error in MPI_Init:
> > 61: Other MPI error, error stack:
> > 61: MPIR_Init_thread(259)..: Initialization failed
> > 61: MPID_Init(102).........: channel initialization failed
> > 61: MPIDI_CH3_Init(178)....:
> > 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library
> >
> > I'm not quite sure what these messages mean, but it certainly looks
> > like a memory issue?
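> >
> > (For reference, we enabled the glibc malloc debugging the same way
> > as the other runtime variables, roughly:
> >
> > mpiexec -n 128 -env MALLOC_CHECK_ 1 ./a.out
> >
> > With MALLOC_CHECK_=1, glibc prints a diagnostic when it detects heap
> > corruption instead of silently ignoring it.)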
> >
> > Both MVAPICH2 0.9.8 and MVAPICH 1.0beta are fine on the same system.
> >
> > Cheers,
> > Le
> >
> >
> > On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
> > > Hi
> > >
> > > No failures occurred in these mpiGraph runs. It's just that there
> > > is significant variation among the entries of the matrices,
> > > compared to another IB cluster of ours.
> > >
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
> > >
> > > Thanks.
> > >
> > > Shao-Ching
> > >
> > >
> > > On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote:
> > > > Hi,
> > > >
> > > > How often do you observe the failures when running the mpiGraph
> > > > test? Do all the failures happen at startup, as with your simple
> > > > program?
> > > >
> > > > Thanks.
> > > >
> > > > Regards,
> > > > Wei Huang
> > > >
> > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > Dept. of Computer Science and Engineering
> > > > Ohio State University
> > > > OH 43210
> > > > Tel: (614)292-8501
> > > >
> > > >
> > > > On Fri, 1 Feb 2008, Shao-Ching Huang wrote:
> > > >
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > We cleaned up a few things and re-ran the mpiGraph tests. The updated
> > > > > results are posted here:
> > > > >
> > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html
> > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html
> > > > >
> > > > > Please ignore the results in my previous email. Thank you.
> > > > >
> > > > > Regards,
> > > > > Shao-Ching
> > > > >
> > > > >
> > > > > On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote:
> > > > > >
> > > > > > Hi Wei,
> > > > > >
> > > > > > We did 2 runs of the mpiGraph test you suggested on 48
> > > > > > nodes, with one (1) MPI process per node:
> > > > > >
> > > > > > mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out
> > > > > >
> > > > > > The results from the two runs are posted here:
> > > > > >
> > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/
> > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/
> > > > > >
> > > > > > During the tests, some other users were also running jobs on
> > > > > > some of these 48 nodes.
> > > > > >
> > > > > > Could you please help us interpret these results, if possible?
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Shao-Ching Huang
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote:
> > > > > > > Hi Scott,
> > > > > > >
> > > > > > > We went up to 256 processes (32 nodes) and did not see the
> > > > > > > problem in a few hundred runs (cpi). Thus, to narrow down
> > > > > > > the problem, we want to make sure the fabric and system
> > > > > > > setup are OK. To diagnose this, we suggest running the
> > > > > > > mpiGraph program from
> > > > > > > http://sourceforge.net/projects/mpigraph. This test
> > > > > > > stresses the interconnect. It should fail at a much higher
> > > > > > > frequency than the simple cpi program if there is a problem
> > > > > > > with your system setup.
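> > > > > > >
> > > > > > > If I remember its usage correctly, mpiGraph takes the
> > > > > > > message size, iteration count, and window size as
> > > > > > > arguments, e.g.:
> > > > > > >
> > > > > > > mpiexec -n 48 ./mpiGraph 4096 10 10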
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Wei Huang
> > > > > > >
> > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > Dept. of Computer Science and Engineering
> > > > > > > Ohio State University
> > > > > > > OH 43210
> > > > > > > Tel: (614)292-8501
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > > > > >
> > > > > > > > My co-worker passed this along...
> > > > > > > >
> > > > > > > > Yes, the error happens with the cpi.c program too.  It
> > > > > > > > happened in 2 of the 9 cases I ran.
> > > > > > > >
> > > > > > > > I was using 128 processes (on 32 4-core nodes).
> > > > > > > >
> > > > > > > > ---
> > > > > > > >
> > > > > > > > and another...
> > > > > > > >
> > > > > > > >    It happens for a simple MPI program which just does
> > > > > > > > MPI_Init and MPI_Finalize and prints out the number of
> > > > > > > > processors.  It happened for anything from 4 nodes (16
> > > > > > > > processors) and up.
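> > > > > > > >
> > > > > > > > (The test is essentially the following; this is a sketch
> > > > > > > > reconstructed from the description, not the exact
> > > > > > > > source:)
> > > > > > > >
> > > > > > > > #include <stdio.h>
> > > > > > > > #include <mpi.h>
> > > > > > > >
> > > > > > > > int main(int argc, char **argv)
> > > > > > > > {
> > > > > > > >     int size;
> > > > > > > >     MPI_Init(&argc, &argv);   /* this is where it fails */
> > > > > > > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > > > > > > >     printf("Number of processors: %d\n", size);
> > > > > > > >     MPI_Finalize();
> > > > > > > >     return 0;
> > > > > > > > }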
> > > > > > > >
> > > > > > > > What environment variables should we look for?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Scott
> > > > > > > >
> > > > > > > > wei huang wrote:
> > > > > > > > > Hi Scott,
> > > > > > > > >
> > > > > > > > > On how many processes (and how many nodes) did you run
> > > > > > > > > your program? Do you have any environment variables set
> > > > > > > > > when running the program? Does the error happen on a
> > > > > > > > > simple test like cpi?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Wei Huang
> > > > > > > > >
> > > > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > > > Dept. of Computer Science and Engineering
> > > > > > > > > Ohio State University
> > > > > > > > > OH 43210
> > > > > > > > > Tel: (614)292-8501
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > > > > > > >
> > > > > > > > >> The low level ibv tests work fine.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > >
> > --
> > Le Yan
> > User support
> > Louisiana Optical Network Initiative (LONI)
> > Office: 225-578-7524
> > Fax: 225-578-6400
> >
> >
> 


