[mvapich-discuss] Hard to diagnose errors

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Aug 23 13:25:48 EDT 2012


Hi all, this issue was debugged off list.  Although we were unable to
determine the original cause of the problems Matt was facing, we were
able to get things working after upgrading to mvapich2-1.8 and
resolving a permission issue related to his OFED install.
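For readers hitting similar symptoms, the permission issue mentioned above is typically visible on the InfiniBand device nodes; a quick check (a sketch; exact paths depend on your OFED install) is:

```shell
# Sketch: inspect device-node permissions for the InfiniBand verbs
# devices (the usual Linux paths; adjust for your OFED install).
# Unprivileged MPI processes need read/write access to the uverbs
# devices, e.g. /dev/infiniband/uverbs0.
ls -l /dev/infiniband/ 2>/dev/null || echo "no /dev/infiniband - is the OFED stack loaded?"
```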

On Mon, Jul 30, 2012 at 12:42:29PM -0400, Matthew Russell wrote:
> Hi Jonathan, thanks for your response.
> 
> The nodes in my node file exist, and I am able to run simple executables on
> them with the same machine file:
> 
> [matt at dena]~% salloc mpirun_rsh -hostfile machines -np 14 hello_mvapich2
> salloc: Granted job allocation 87
> Hello world from process 012 out of 014, processor name Dena1
> ....
> Hello world from process 004 out of 014, processor name Dena2
> Hello world from process 008 out of 014, processor name Dena4
> salloc: Relinquishing job allocation 87
> 
> I tried what you recommended below, but received the same results.
> 
> Could my login scripts be affecting it?  For instance,
> 
> [matt at dena]~/models/cmaq/trunk/bld/dena-451/scripts/cctm% salloc -N2
> salloc: Granted job allocation 92
> Loading PGI 12.5
> Loading AURAMS
> 
> Those last two lines of text are printed upon login; could that output be
> causing the crash?
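Stray login-banner output is a classic way to break rsh/ssh-launched MPI jobs, since the launcher's protocol stream gets polluted. The usual fix is to print such messages only in interactive shells; a minimal sketch, assuming a Bourne-style startup file such as ~/.bashrc (the actual file emitting "Loading PGI 12.5" is an assumption):

```shell
# Sketch (for a Bourne-style rc file such as ~/.bashrc): print login
# banners only in interactive shells, so the non-interactive sessions
# spawned by mpirun_rsh stay silent.
case "$-" in
  *i*) echo "Loading PGI 12.5" ;;   # interactive shell: banner is fine
  *)   : ;;                         # non-interactive: produce no output
esac
```

A non-interactive shell's `$-` contains no `i`, so the banner is skipped there.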
> 
> I'm just grasping at straws here because these errors are very cryptic,
> when they contain any information at all.
> 
> Thanks
> 
> On Fri, Jul 27, 2012 at 5:26 PM, Jonathan Perkins <
> perkinjo at cse.ohio-state.edu> wrote:
> 
> > On Fri, Jul 27, 2012 at 02:28:26PM -0400, Matthew Russell wrote:
> > > Hi,
> > >
> > > I'm trying to run CMAQ, an air quality model, on a cluster with mvapich
> > > using slurm.  I don't understand this error though:
> > >
> > > $ salloc mpirun_rsh -hostfile machines8 -np 2 /home/matt/models/cmaq/trunk/bld/dena-451/scripts/cctm/CCTM_e2a_Linux2_x86_64pgi
> >
> > Is salloc giving you the machines that are mentioned in machines8?  It
> > may be a simple issue where you're attempting to run on nodes that you
> > don't have an allocation for.
> >
> 
> > Perhaps you can try the following...
> > $ salloc -N2
> > $ scontrol show hostnames > hosts
> > $ mpirun_rsh -hostfile hosts -np 2 /path/to/program
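To rule out a mismatch between a hand-maintained hostfile and what Slurm actually granted, the two lists can be compared inside the allocation. A sketch, using the standard SLURM_JOB_NODELIST variable (the hostfile name machines8 comes from the thread):

```shell
# Sketch: inside the salloc shell, compare the hand-written hostfile
# against the nodes Slurm actually allocated for this job.
scontrol show hostnames "$SLURM_JOB_NODELIST" | sort > allocated
sort machines8 > requested
if diff -q allocated requested >/dev/null; then
  echo "hostfile matches the allocation"
else
  echo "hostfile does NOT match the allocation" >&2
fi
```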
> >
> > > salloc: Granted job allocation 85
> > > [cli_1]: readline failed
> > > [cli_1]: readline failed
> > > [cli_0]: readline failed
> > > [cli_0]: readline failed
> > > Fatal error in MPI_Init: Other MPI error, error stack:
> > > MPIR_Init_thread(388)...........:
> > > MPID_Init(125)..................:
> > > MPIDI_Populate_vc_node_ids(1222):
> > > MPID_Get_max_node_id(822).......: PMI_KVS_Put returned -1
> > > salloc: Relinquishing job allocation 85
> > >
> > > What does this mean?
> >
> > There was an error during the initialization process of our MPI library.
> >
> > >
> > > The executable was compiled with mpf90 and mpcc, using the mvapich
> > > binaries, etc.
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo
> >

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

