[mvapich-discuss] mvapich2/torque problem

Zhiwei Liu z.liu at usciences.edu
Fri Apr 8 11:43:16 EDT 2016


Thanks for the quick reply. Below are the further details requested:

Yes, both internode and single-node jobs run without problems outside the Torque environment.

The output from MVAPICH2 with MV2_SHOW_ENV_INFO=2 set on one node is attached; this run completes fine since it is a single-node run. The output.err file contains the environment settings.
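
For reference, the attached single-node run was launched inside the job roughly like this (the process count and the a.out binary are just stand-ins for my actual test/Amber runs):

    MV2_SHOW_ENV_INFO=2 mpirun -np 16 ./a.out > output.log 2> output.err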

My CPUs are Opteron 6128s; each machine has two 8-core CPUs, so 16 cores per node.
The InfiniBand HCAs are Mellanox ConnectX MT25408 (MT26248 on the master node), using the mlx4 driver, rdma (InfiniBand/iWARP), ib_umad, ib_ipoib, etc.

With version 2.2b, mpirun_rsh complains about a syntax error in my hostfile, so it does not work.

With 2.1, however, mpirun_rsh does not complain about a syntax error, but the same behavior occurs: a single-node job under Torque runs, an internode job under Torque does not, and both single-node and internode jobs run outside Torque.
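
For what it's worth, the mpirun_rsh invocation I tried inside the job script was along these lines (the executable and process count are placeholders), with $PBS_NODEFILE containing one hostname per allocated core as Torque writes it:

    mpirun_rsh -np 32 -hostfile $PBS_NODEFILE ./a.out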

Many thanks.

zhiwei

From: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
Date: Friday, April 8, 2016 at 10:54 AM
To: Zhiwei Liu <z.liu at usciences.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] mvapich2/torque problem

Hello.  In order to debug this further, can you give us some more details about your system?  Can you give us the output from MVAPICH2 when you run a job (on one node) with MV2_SHOW_ENV_INFO=2 set?  Also, can you let us know the CPU and core count as well as the IB HCA in use?

Are you able to run any internode jobs outside of the Torque environment?  Can you also try using mpirun_rsh instead of mpirun (give mpirun_rsh the '-hostfile $PBS_NODEFILE' option)?  I suggest focusing on MVAPICH2 v2.2b, as we aren't actively developing the older versions.

Thanks in advance for the additional information and trying mpirun_rsh.

On Fri, Apr 8, 2016 at 10:44 AM Zhiwei Liu <z.liu at usciences.edu> wrote:
Dear all,

I have a rather strange problem that I need some help with. Here are some details:

System: Ubuntu 14.04 on both the master and compute nodes, InfiniBand interconnect
MVAPICH2 installed (versions 2.2b, 2.1, and 1.9b)
Torque resource manager 6.0.1 installed (also tried with an older version).

The problem is that mpirun (mpiexec.hydra) works fine for any parallel program (either a simple a.out test program or the Amber 14 simulation program) when run interactively.

However, if I submit an mpirun job to the PBS queue (Torque 6.0.1), it works only on a single node (which can use more than one processor within that node). If the number of nodes requested is more than one, the job dies.
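
For concreteness, the submission script is essentially of this form (the job name, walltime, and executable are placeholders; the single-node case simply requests nodes=1:ppn=16 instead):

    #!/bin/bash
    #PBS -N mv2_test
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=01:00:00

    cd $PBS_O_WORKDIR
    mpirun -np 32 ./a.out > output.log 2> output.err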

Attached is a log file. I don't see any obvious problem: it looks like the appropriate number of nodes and processors have been allocated and some MPI activity is going on, and then ... it dies. (There are some messages along the lines of "we don't understand ...", but they also appear when I run the job interactively.)

The problem persists across all three installed versions of MVAPICH2 and both versions of Torque.

Any help would be very much appreciated.

Zhiwei Liu
University of the Sciences in Philadelphia
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.log1nodeMV2
Type: application/octet-stream
Size: 20582 bytes
Desc: output.log1nodeMV2
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20160408/2736fd4b/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.err1nodeMV2
Type: application/octet-stream
Size: 6459 bytes
Desc: output.err1nodeMV2
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20160408/2736fd4b/attachment-0003.obj>

