[mvapich-discuss] Help with polled desc error

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Feb 20 22:26:14 EST 2008


> Hi,
>
> Thank you for the suggestions. I apologize that I wasn't able to work on
> this for the past week.
>
> I'm not sure whether this matches other people's experience with the
> same problem, but it looks like "-env MV2_USE_RING_STARTUP 0" did the
> trick for us: I've been running 10+ jobs with 256 procs on the same set
> of nodes, and every job that had it passed on the command line ran just
> fine, while the ones launched with the default environment settings
> failed.
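>
> For reference, the invocation is along these lines (the executable name
> is a placeholder):
>
> mpiexec -n 256 -env MV2_USE_RING_STARTUP 0 ./a.out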

Glad to know that you are able to run jobs successfully with the above
option.

> Hope this is helpful information.

Yes, it is very helpful. We will take a look at this issue further.

Thanks,

DK

> Cheers,
> Le
>
> On Tue, 2008-02-12 at 16:41 -0500, wei huang wrote:
> > Hi,
> >
> > We do not see anything abnormal in our local testing. To help us locate
> > the problem, could you please try the following:
> >
> > 1) Check whether you have enough space in the /tmp directory (see the
> > command sketch after step 3)
> >
> > 2) Disable the ring-based startup using:
> >
> > mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
> >
> > 3) If that still fails, disable shared memory support using the runtime
> > variable MV2_USE_SHARED_MEM=0:
> >
> > mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
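> >
> > For step 1, free space can be checked quickly with a generic command
> > (not MVAPICH2-specific):
> >
> > df -h /tmp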
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Tue, 12 Feb 2008, Le Yan wrote:
> >
> > > Hi,
> > >
> > > We have the same problem here with MVAPICH2 1.0.1 on a Dell InfiniBand
> > > cluster. It has 8 cores per node and runs RHEL 4.5 (kernel
> > > 2.6.9-55). The OFED library version is 1.2.
> > >
> > > At first it seemed that any code compiled with MVAPICH2 1.0.1 failed at
> > > the MPI_Init stage when running with more than 128 procs. But later on
> > > we found that a code could run only if it didn't use all 8 cores on a
> > > node (which explains why mpiGraph never fails: it uses only 1 processor
> > > per node). For example, a job running on 16 nodes with 8 procs per node
> > > will fail, but one on 32 nodes with 4 procs per node will not.
> > >
> > > In addition, if the MALLOC_CHECK_ environment variable is set to 1,
> > > errors like the following appear on standard error:
> > >
> > > 61: malloc: using debugging hooks
> > > 61: free(): invalid pointer 0x707000!
> > > 61: Fatal error in MPI_Init:
> > > 61: Other MPI error, error stack:
> > > 61: MPIR_Init_thread(259)..: Initialization failed
> > > 61: MPID_Init(102).........: channel initialization failed
> > > 61: MPIDI_CH3_Init(178)....:
> > > 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library
> > >
> > > I'm not quite sure what these messages mean, but it certainly looks
> > > like a memory issue.
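> > >
> > > (A sketch of one way to pass the variable through the launcher, using
> > > the same -env syntax as above; the process count is illustrative:)
> > >
> > > mpiexec -n 128 -env MALLOC_CHECK_ 1 ./a.out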
> > >
> > > Both MVAPICH2 0.9.8 and MVAPICH 1.0beta work fine on the same system.
> > >
> > > Cheers,
> > > Le
> > >
> > >
> > > On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
> > > > Hi
> > > >
> > > > No failures occurred in these mpiGraph runs. It's just that there is
> > > > significant variation among the entries of the result matrices,
> > > > compared to another IB cluster of ours.
> > > >
> > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
> > > >
> > > > Thanks.
> > > >
> > > > Shao-Ching
> > > >
> > > >
> > > > On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote:
> > > > > Hi,
> > > > >
> > > > > How often do you observe the failures when running the mpiGraph test?
> > > > > Do all the failures happen at startup, as with your simple program?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Regards,
> > > > > Wei Huang
> > > > >
> > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > Dept. of Computer Science and Engineering
> > > > > Ohio State University
> > > > > OH 43210
> > > > > Tel: (614)292-8501
> > > > >
> > > > >
> > > > > On Fri, 1 Feb 2008, Shao-Ching Huang wrote:
> > > > >
> > > > > >
> > > > > > Hi Wei,
> > > > > >
> > > > > > We cleaned up a few things and re-ran the mpiGraph tests. The updated
> > > > > > results are posted here:
> > > > > >
> > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html
> > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html
> > > > > >
> > > > > > Please ignore the results in my previous email. Thank you.
> > > > > >
> > > > > > Regards,
> > > > > > Shao-Ching
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote:
> > > > > > >
> > > > > > > Hi Wei,
> > > > > > >
> > > > > > > We did two runs of the mpiGraph test you suggested on 48 nodes,
> > > > > > > with one (1) MPI process per node:
> > > > > > >
> > > > > > > mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out
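> > > > > > >
> > > > > > > (Our reading of the mpiGraph usage, which may be off, is that the
> > > > > > > three arguments are the message size in bytes, the iteration
> > > > > > > count, and the send/receive window depth.)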
> > > > > > >
> > > > > > > The results from the two runs are posted here:
> > > > > > >
> > > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/
> > > > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/
> > > > > > >
> > > > > > > During the tests, some other users were also running jobs on some
> > > > > > > of these 48 nodes.
> > > > > > >
> > > > > > > Could you please help us interpret these results, if possible?
> > > > > > >
> > > > > > > Thank you.
> > > > > > >
> > > > > > > Shao-Ching Huang
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote:
> > > > > > > > Hi Scott,
> > > > > > > >
> > > > > > > > We went up to 256 processes (32 nodes) and did not see the problem in a
> > > > > > > > few hundred runs of cpi. Thus, to narrow down the problem, we want to
> > > > > > > > make sure the fabric and system setup are OK. To diagnose this, we
> > > > > > > > suggest running the mpiGraph program from
> > > > > > > > http://sourceforge.net/projects/mpigraph.
> > > > > > > > This test stresses the interconnect. If there is a problem with your
> > > > > > > > system setup, it should fail at a much higher frequency than the simple
> > > > > > > > cpi program.
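> > > > > > > >
> > > > > > > > A typical build-and-run sequence might look as follows (a sketch:
> > > > > > > > the process count and arguments are illustrative, and we assume
> > > > > > > > the package's crunch_mpigraph script for generating the HTML
> > > > > > > > result pages):
> > > > > > > >
> > > > > > > > make
> > > > > > > > mpiexec -np 48 ./mpiGraph 4096 10 10 > mpiGraph.out
> > > > > > > > crunch_mpigraph mpiGraph.out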
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Wei Huang
> > > > > > > >
> > > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > > Dept. of Computer Science and Engineering
> > > > > > > > Ohio State University
> > > > > > > > OH 43210
> > > > > > > > Tel: (614)292-8501
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > > > > > >
> > > > > > > > > My co-worker passed this along...
> > > > > > > > >
> > > > > > > > > Yes, the error happens with the cpi.c program too. It happened
> > > > > > > > > in 2 of the 9 cases I ran.
> > > > > > > > >
> > > > > > > > > I was using 128 processes (on 32 4-core nodes).
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > and another...
> > > > > > > > >
> > > > > > > > >    It happens with a simple MPI program that just calls MPI_Init
> > > > > > > > > and MPI_Finalize and prints out the number of processors. It
> > > > > > > > > happened with anything from 4 nodes (16 processors) on up.
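> > > > > > > > >
> > > > > > > > > A minimal reproducer along those lines (our sketch of such a
> > > > > > > > > test, not the exact source):
> > > > > > > > >
> > > > > > > > > #include <stdio.h>
> > > > > > > > > #include <mpi.h>
> > > > > > > > >
> > > > > > > > > int main(int argc, char **argv)
> > > > > > > > > {
> > > > > > > > >     int size;
> > > > > > > > >     MPI_Init(&argc, &argv);  /* the reported failure point */
> > > > > > > > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > > > > > > > >     printf("number of processors: %d\n", size);
> > > > > > > > >     MPI_Finalize();
> > > > > > > > >     return 0;
> > > > > > > > > }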
> > > > > > > > >
> > > > > > > > > What environment variables should we look for?
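> > > > > > > > >
> > > > > > > > > A generic way to list anything MVAPICH2-related that is already
> > > > > > > > > set in the session, for reference:
> > > > > > > > >
> > > > > > > > > env | grep MV2_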
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Scott
> > > > > > > > >
> > > > > > > > > wei huang wrote:
> > > > > > > > > > Hi Scott,
> > > > > > > > > >
> > > > > > > > > > On how many processes (and how many nodes) did you run your
> > > > > > > > > > program? Do you set any environment variables when running the
> > > > > > > > > > program? Does the error happen with a simple test like cpi?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Wei Huang
> > > > > > > > > >
> > > > > > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > > > > > Dept. of Computer Science and Engineering
> > > > > > > > > > Ohio State University
> > > > > > > > > > OH 43210
> > > > > > > > > > Tel: (614)292-8501
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > > > > > > > >
> > > > > > > > > >> The low-level ibv tests work fine.
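> > > > > > > > > >>
> > > > > > > > > >> (By these we mean checks along the lines of the standard
> > > > > > > > > >> libibverbs examples, e.g. ibv_devinfo on each node and
> > > > > > > > > >> ibv_rc_pingpong between node pairs.)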
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > > --
> > > Le Yan
> > > User support
> > > Louisiana Optical Network Initiative (LONI)
> > > Office: 225-578-7524
> > > Fax: 225-578-6400
> > >
> > >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


