[mvapich-discuss] mvapich2/torque problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Apr 8 15:31:41 EDT 2016


Thanks for the information.  So far I don't see anything that sticks out as
an issue.  Can you rebuild MVAPICH2 with the debug settings so that the
library may output more information?

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-1270009.1.14
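
For reference, a minimal sketch of such a rebuild, assuming the debug
configure flags described in the linked section (the install prefix is only a
placeholder):

    ./configure --prefix=$HOME/mvapich2-2.2rc1-debug --enable-g=dbg --enable-fast=none
    make -j4 && make install

Then point your job's PATH/LD_LIBRARY_PATH at the debug install (or relink
your test program against it) and rerun the failing internode case.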

I would also suggest disabling the debugging that you've enabled with
mpiexec, as I don't believe that the problem is related to the launcher.

On Fri, Apr 8, 2016 at 11:43 AM Zhiwei Liu <z.liu at usciences.edu> wrote:

> Thanks for the quick reply. Below are the further details requested:
>
> Yes, both internode and single-node jobs run without problems outside the
> Torque environment.
>
> Output from MVAPICH2 with MV2_SHOW_ENV_INFO=2 on one node is attached;
> this run completes OK since it is a single-node run. The output.err file
> contains the environment settings.
>
> My CPUs are Opteron 6128; each machine has two 8-core CPUs, so 16 cores
> per node.
> The InfiniBand HCA is a Mellanox ConnectX MT25408 (MT26248 on the master
> node), using the mlx4 driver with rdma (InfiniBand/iWARP), ib_umad,
> ib_ipoib, etc.
>
> mpirun_rsh with version 2.2b complains about a syntax error in my hostfile,
> so it does not work.
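>
> For reference, the hostfile I pass is simply the $PBS_NODEFILE, which lists
> one hostname per allocated slot; a hypothetical two-node, two-slots-per-node
> allocation would look like this (node names are made up):
>
>     node01
>     node01
>     node02
>     node02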
>
> However, when I use 2.1, mpirun_rsh does not complain about a syntax error.
> The same thing happens, though: a single-node job in Torque runs, an
> internode job in Torque does not, and both single-node and internode jobs
> run outside Torque.
>
> Many thanks.
>
> zhiwei
>
> From: Jonathan Perkins <perkinjo at cse.ohio-state.edu>
> Date: Friday, April 8, 2016 at 10:54 AM
> To: Zhiwei Liu <z.liu at usciences.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] mvapich2/torque problem
>
> Hello.  In order to debug this further, can you give us some more details
> about your system?  Can you give us the output from MVAPICH2 when you run a
> job (on one node) with MV2_SHOW_ENV_INFO=2 set?  Also, can you let us know
> the CPU and core count as well as the IB HCA in use?
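>
> For example, something along these lines in the single-node job (the process
> count and executable name are just placeholders):
>
>     MV2_SHOW_ENV_INFO=2 mpirun -np 16 ./a.out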
>
> Are you able to run any internode jobs outside of the Torque environment?
> Can you also try using mpirun_rsh instead of mpirun (give mpirun_rsh the
> '-hostfile $PBS_NODEFILE' option)?  I suggest focusing on trying with
> MVAPICH2 v2.2b, as we aren't actively developing the older versions.
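>
> For example, a rough sketch of the mpirun_rsh invocation inside the job
> script (the executable name is a placeholder):
>
>     NP=$(wc -l < $PBS_NODEFILE)
>     mpirun_rsh -np $NP -hostfile $PBS_NODEFILE ./a.out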
>
> Thanks in advance for the additional information and trying mpirun_rsh.
>
> On Fri, Apr 8, 2016 at 10:44 AM Zhiwei Liu <z.liu at usciences.edu> wrote:
> Dear all,
>
> I have a rather strange problem that I need some help with. Here are
> some details:
>
> System: Ubuntu 14.04 on both the master and the compute nodes, InfiniBand
> interconnect
> MVAPICH2 (2.2b, 2.1, and 1.9b) installed
> Torque resource manager 6.0.1 installed (also tried with an older version).
>
> The problem is that mpirun (mpiexec.hydra) works fine for any parallel
> program (either a simple test a.out program or the Amber 14 simulation
> program) when it is run interactively.
>
> However, if I submit an mpirun job to the PBS queue (Torque 6.0.1), it
> works only within a single node (it can use more than one processor on that
> node). If the number of nodes requested is more than one, the job dies.
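>
> For what it's worth, a minimal sketch of the kind of job script I submit
> (the resource request, walltime, and executable are placeholders):
>
>     #!/bin/bash
>     #PBS -l nodes=2:ppn=16
>     #PBS -l walltime=01:00:00
>     #PBS -j oe
>     cd $PBS_O_WORKDIR
>     mpirun -np 32 ./a.out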
>
> Attached is a log file. I don't see any problem in it: it looks like the
> appropriate number of nodes and processors have been allocated and some MPI
> activity is going on, and then ... the job dies. (There are some messages
> about "we don't understand ....." but they also appear when I run the job
> interactively.)
>
> The problem persists with all three installed versions of MVAPICH2 and with
> both versions of Torque.
>
> Any help would be very much appreciated.
>
> Zhiwei Liu
> at the University of the Sciences in Philadelphia
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>