[mvapich-discuss] mvapich2/torque problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Apr 8 10:54:56 EDT 2016


Hello.  In order to debug this further, can you give us some more details
about your system?  Can you give us the output from MVAPICH2 when you run a
job (on one node) with MV2_SHOW_ENV_INFO=2 set?  Also, can you let us know
the CPU and core count, as well as the IB HCA in use?

Are you able to run any internode jobs outside of the Torque environment?
Can you also try using mpirun_rsh instead of mpirun (give mpirun_rsh the
'-hostfile $PBS_NODEFILE' option)?  I suggest focusing on trying with
MVAPICH2 v2.2b, as we aren't actively developing the older versions.

Thanks in advance for the additional information and trying mpirun_rsh.
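For reference, the suggested diagnostics could look something like the
following. This is only a sketch: the executable name (./a.out), process
counts, and the assumption that the MVAPICH2 bin directory is on PATH are
illustrative, not taken from the original report.

```shell
# Single-node run with verbose environment reporting, as suggested above.
# MV2_SHOW_ENV_INFO=2 makes MVAPICH2 print its detected system and tuning
# parameters at startup.
MV2_SHOW_ENV_INFO=2 mpirun -np 4 ./a.out

# Inside a Torque job script: launch with mpirun_rsh instead of mpirun,
# passing the node list that Torque provides via $PBS_NODEFILE.
mpirun_rsh -np 8 -hostfile $PBS_NODEFILE ./a.out
```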

On Fri, Apr 8, 2016 at 10:44 AM Zhiwei Liu <z.liu at usciences.edu> wrote:

> Dear all,
>
> I have a rather strange problem that I needed some help with. Here are
> some details:
>
> system: ubuntu 14.04 on both master and computing nodes, infiniband
> interconnect
> mvapich2 (2.2b 2.1 and 1.9b) installed
> Torque resource management 6.0.1 installed (also tried with an older
> version).
>
> The problem is that mpirun (mpiexec.hydra) works fine for any parallel
> program (either a simple test a.out program or the Amber 14 simulation
> program) when run interactively.
>
> However, if I submit an mpirun job to the PBS queue (Torque 6.0.1), it
> works only on a single node (which can use more than one processor within
> that node). If the number of nodes requested is more than one, the job dies.
>
> Attached is a log file. I don't see any problem in it; it looks like the
> appropriate number of nodes and processors have been allocated and some MPI
> activity is going on, and then the job dies. (There are some "we don't
> understand ..." messages, but they also appear when I run the job
> interactively.)
>
> The problem persists with all three installed versions of MVAPICH2 and
> both versions of Torque.
>
> Any help would be very much appreciated.
>
> Zhiwei Liu
> at the University of the Sciences in Philadelphia
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

