[mvapich-discuss] Help with polled desc error

Shao-Ching Huang schuang at ats.ucla.edu
Sat Feb 9 01:02:25 EST 2008


Hi

No failures were found in these mpiGraph runs. It's just that there is
significant variation among the entries of the matrices, compared to
another IB cluster of ours.

http://reynolds.turb.ucla.edu/~schuang/mpiGraph/

Thanks.

Shao-Ching
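[For reference, the "simple MPI program" described further down this
thread -- just MPI_Init and MPI_Finalize plus a printout of the process
count -- might look like the following minimal sketch (file name and
message text are illustrative, not from the original reports):]

```c
/* Minimal sketch of the "simple MPI program" mentioned in the thread:
 * initialize MPI, print the number of processes from rank 0, finalize. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("Number of processes: %d\n", size);

    MPI_Finalize();
    return 0;
}
```

[Compiled with mpicc and launched with, e.g., "mpiexec -np 16" across
4 nodes, this reproduces the smallest failing configuration reported
below.]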


On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote:
> Hi,
> 
> How often do you observe the failures when running the mpiGraph test? Do
> all the failures happen at startup, as with your simple program?
> 
> Thanks.
> 
> Regards,
> Wei Huang
> 
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
> 
> 
> On Fri, 1 Feb 2008, Shao-Ching Huang wrote:
> 
> >
> > Hi Wei,
> >
> > We cleaned up a few things and re-ran the mpiGraph tests. The updated
> > results are posted here:
> >
> > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html
> > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html
> >
> > Please ignore results in my previous email. Thank you.
> >
> > Regards,
> > Shao-Ching
> >
> >
> > On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote:
> > >
> > > Hi Wei,
> > >
> > > We did two runs of the mpiGraph test you suggested on 48 nodes, with
> > > one (1) MPI process per node:
> > >
> > > mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out
> > >
> > > The results from the two runs are posted here:
> > >
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/
> > >
> > > During the tests, some other users were also running jobs on some of
> > > these 48 nodes.
> > >
> > > Could you please help us interpret these results, if possible?
> > >
> > > Thank you.
> > >
> > > Shao-Ching Huang
> > >
> > >
> > > On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote:
> > > > Hi Scott,
> > > >
> > > > We went up to 256 processes (32 nodes) and did not see the problem in a
> > > > few hundred runs (cpi). Thus, to narrow down the problem, we want to
> > > > make sure the fabric and system setup are OK. To diagnose this, we
> > > > suggest running the mpiGraph program from
> > > > http://sourceforge.net/projects/mpigraph. This test stresses the
> > > > interconnect. It should fail at a much higher frequency than the
> > > > simple cpi program if there is a problem with your system setup.
> > > >
> > > > Thanks.
> > > >
> > > > Regards,
> > > > Wei Huang
> > > >
> > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > Dept. of Computer Science and Engineering
> > > > Ohio State University
> > > > OH 43210
> > > > Tel: (614)292-8501
> > > >
> > > >
> > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > >
> > > > > My co-worker passed this along...
> > > > >
> > > > > Yes, the error happens on the cpi.c program too.  It happened 2 times
> > > > > among the 9 cases I ran.
> > > > >
> > > > > I was using 128 processes (on 32 4-core nodes).
> > > > >
> > > > > ---
> > > > >
> > > > > and another...
> > > > >
> > > > >    It happens for a simple MPI program which just does MPI_Init and
> > > > > MPI_Finalize and prints out the number of processors.  It happened
> > > > > for anything from 4 nodes (16 processors) and up.
> > > > >
> > > > > What environment variables should we look for?
> > > > >
> > > > > Thanks,
> > > > > Scott
> > > > >
> > > > > wei huang wrote:
> > > > > > Hi Scott,
> > > > > >
> > > > > > On how many processes (and how many nodes) did you run your
> > > > > > program? Do you have any environment variables set when running
> > > > > > the program? Does the error happen on a simple test like cpi?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Regards,
> > > > > > Wei Huang
> > > > > >
> > > > > > 774 Dreese Lab, 2015 Neil Ave,
> > > > > > Dept. of Computer Science and Engineering
> > > > > > Ohio State University
> > > > > > OH 43210
> > > > > > Tel: (614)292-8501
> > > > > >
> > > > > >
> > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote:
> > > > > >
> > > > > >> The low level ibv tests work fine.
> > > > > >
> > > > > > _______________________________________________
> > > > > > mvapich-discuss mailing list
> > > > > > mvapich-discuss at cse.ohio-state.edu
> > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > >
> > > >
> > > >
> >
