[mvapich-discuss] Hard to diagnose errors

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Jul 27 17:26:30 EDT 2012


On Fri, Jul 27, 2012 at 02:28:26PM -0400, Matthew Russell wrote:
> Hi,
> 
> I'm trying to run CMAQ, an air quality model, on a cluster with mvapich
> using slurm.  I don't understand this error though:
> 
> $ salloc mpirun_rsh -hostfile machines8 -np 2
> /home/matt/models/cmaq/trunk/bld/dena-451/scripts/cctm/CCTM_e2a_Linux2_x86_64pgi

Is salloc giving you the machines that are mentioned in machines8?  It
may be a simple issue where you're attempting to run on nodes that you
don't have an allocation for.

Perhaps you can try the following...
$ salloc -N2
$ scontrol show hostnames > hosts
$ mpirun_rsh -hostfile hosts -np 2 /path/to/program
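
If you would rather keep using machines8, a quick way to check it against
the allocation is sketched below. This is only a minimal sketch: the
function name missing_hosts is made up for illustration, and it assumes
the hostfile lists one plain hostname per line (no "host:n" multipliers,
comments, or blank lines).

```sh
#!/bin/sh
# Print every host named in a hostfile that is NOT in the allocation list.
# missing_hosts is a hypothetical helper, not part of MVAPICH or Slurm.
#   $1: hostfile given to mpirun_rsh (assumed: one plain hostname per line)
#   $2: file holding the allocated hostnames, one per line
missing_hosts() {
    # -f: read patterns from $2, -F: treat them as fixed strings,
    # -x: match whole lines only, -v: print the lines that do not match
    grep -v -x -F -f "$2" "$1"
}

# Intended use, from inside the shell that "salloc -N2" starts:
#   scontrol show hostnames > hosts
#   missing_hosts machines8 hosts
```

Any hostname it prints is in machines8 but outside the allocation; no
output means every host in the file was granted to the job.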

> salloc: Granted job allocation 85
> [cli_1]: readline failed
> [cli_1]: readline failed
> [cli_0]: readline failed
> [cli_0]: readline failed
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(388)...........:
> MPID_Init(125)..................:
> MPIDI_Populate_vc_node_ids(1222):
> MPID_Get_max_node_id(822).......: PMI_KVS_Put returned -1
> salloc: Relinquishing job allocation 85
> 
> What does this mean?

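There was an error during the initialization of our MPI library: the
stack shows MPI_Init failing because PMI_KVS_Put returned -1, which
usually means the processes could not exchange their startup information
with the launcher (mpirun_rsh).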

> 
> The executable was compiled with mpif90 and mpicc, using the mvapich
> binaries, etc.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

