[mvapich-discuss] mvapich2/torque problem

Zhiwei Liu z.liu at usciences.edu
Fri Apr 8 01:19:22 EDT 2016


Dear all,

I have a rather strange problem that I need some help with. Here are some details:

System: Ubuntu 14.04 on both the master and compute nodes, InfiniBand interconnect
MVAPICH2 (2.2b, 2.1, and 1.9b) installed
Torque resource manager 6.0.1 installed (also tried with an older version)

mpirun (mpiexec_hydra) works fine for any parallel program (either a simple a.out test program or the Amber 14 simulation program) when it is run interactively.

However, if I submit an mpirun job to the PBS queue (Torque 6.0.1), it works only on a single node (using one or more processors within that node). If more than one node is requested, the job dies.

Attached is a log file. I don’t see any problem in it: the appropriate number of nodes and processors appears to have been allocated and some MPI activity takes place, then the job dies. (There are some messages about "we don’t understand …..", but they also appear when I run the job interactively.)

The problem persists with all three installed versions of MVAPICH2 and with both versions of Torque.

Any help would be very much appreciated.

Zhiwei Liu
University of the Sciences in Philadelphia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output.log
Type: application/octet-stream
Size: 28176 bytes
Desc: output.log
URL: <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20160408/4696e0e3/attachment-0001.obj>

