[mvapich-discuss] Hard to diagnose errors

Matthew Russell matthew.g.russell at gmail.com
Mon Jul 30 12:42:29 EDT 2012


Hi Jonathan, thanks for your response.

The nodes in my node file exist, and I am able to run simple executables on
them with the same machine file:

[matt at dena]~% salloc mpirun_rsh -hostfile machines -np 14 hello_mvapich2
salloc: Granted job allocation 87
Hello world from process 012 out of 014, processor name Dena1
....
Hello world from process 004 out of 014, processor name Dena2
Hello world from process 008 out of 014, processor name Dena4
salloc: Relinquishing job allocation 87

I tried what you recommended below, but received the same results.

Could my login scripts be affecting it?  For instance,

[matt at dena]~/models/cmaq/trunk/bld/dena-451/scripts/cctm% salloc -N2
salloc: Granted job allocation 92
Loading PGI 12.5
Loading AURAMS

Those last two lines of text are printed by my login scripts.  Could that
output be causing the crash?
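
One way to rule this out would be to print those messages only in
interactive shells.  The following is a minimal sketch, assuming a
Bourne-style login shell (sh, bash, or zsh) and assuming mpirun_rsh starts
the remote processes through a non-interactive ssh or rsh session.
login_msg is a hypothetical helper name of my own, not something provided
by PGI, SLURM, or MVAPICH2.

```shell
# Hypothetical fragment for a login script such as ~/.bashrc or ~/.profile.
# login_msg is an illustrative helper, not part of any of the tools above.
login_msg() {
    # "$-" holds the current shell option flags; it contains "i" only
    # when the shell is interactive.
    case "$-" in
        *i*) echo "$@" ;;
    esac
}

# Replace the bare echo statements with the guarded helper, so that
# non-interactive remote shells produce no extra output.
login_msg "Loading PGI 12.5"
login_msg "Loading AURAMS"
```

With this in place, running something like "ssh Dena1 true" should print
nothing at all.  If it still prints text, some other startup file is
producing the output.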

I'm just grasping at straws here, because these errors are very cryptic, if
they contain any information at all.

Thanks

On Fri, Jul 27, 2012 at 5:26 PM, Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:

> On Fri, Jul 27, 2012 at 02:28:26PM -0400, Matthew Russell wrote:
> > Hi,
> >
> > I'm trying to run CMAQ, an air quality model, on a cluster with mvapich
> > using slurm.  I don't understand this error though:
> >
> > $ salloc mpirun_rsh -hostfile machines8 -np 2 /home/matt/models/cmaq/trunk/bld/dena-451/scripts/cctm/CCTM_e2a_Linux2_x86_64pgi
>
> Is salloc giving you the machines that are mentioned in machines8?  It
> may be a simple issue where you're attempting to run on nodes that you
> don't have an allocation for.
>
> Perhaps you can try the following...
> $ salloc -N2
> $ scontrol show hostnames > hosts
> $ mpirun_rsh -hostfile hosts -np 2 /path/to/program
>
> > salloc: Granted job allocation 85
> > [cli_1]: readline failed
> > [cli_1]: readline failed
> > [cli_0]: readline failed
> > [cli_0]: readline failed
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread(388)...........:
> > MPID_Init(125)..................:
> > MPIDI_Populate_vc_node_ids(1222):
> > MPID_Get_max_node_id(822).......: PMI_KVS_Put returned -1
> > salloc: Relinquishing job allocation 85
> >
> > What does this mean?
>
> There was an error during the initialization process of our MPI library.
>
> >
> > The executable was compiled with mpif90 and mpicc, using the mvapich
> > binaries, etc.
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
>