[mvapich-discuss] Getting Started Help

Hari Subramoni subramoni.1 at osu.edu
Tue Jun 28 07:59:31 EDT 2016


Hi All,

This issue has been resolved through discussions off the list. The fixes
will be available with the next release of MVAPICH2.

Thx,
Hari.

On Tue, Jun 7, 2016 at 6:36 PM, Galloway, Michael D. <gallowaymd at ornl.gov>
wrote:

> Hari,
>
>
>
> I am getting this now:
>
>
>
> [mgx at mod-condo-login02 output]$ cat mod_mv2_hello.6424.mod-condo-pbs01.err
>
> [cli_0]: aborting job:
>
> Fatal error in MPI_Init:
>
> Other MPI error, error stack:
>
> MPIR_Init_thread(514)..........:
>
> MPID_Init(365).................: channel initialization failed
>
> MPIDI_CH3_Init(505)............:
>
> MPIDI_CH3I_SHMEM_Helper_fn(926): write: Success
>
>
>
> My script is:
>
>
>
> [mgx at mod-condo-login02 mv2]$ more mod_mv2_hello.pbs
>
> #PBS -N mod_mv2_hello
>
> #PBS -l nodes=1:ppn=32
>
> #PBS -l walltime=24:00:00
>
> #PBS -V
>
> #PBS -q batch
>
> ##PBS -o ctest2
>
> ##PBS -j oe
>
> #PBS -o /home/mgx/output/$PBS_JOBNAME.$PBS_JOBID.out
>
> #PBS -e /home/mgx/output/$PBS_JOBNAME.$PBS_JOBID.err
>
> #sleep 60
>
> #hostname
>
> mpiexec  -n 32  -hostfile $PBS_NODEFILE -env MV2_USE_SHMEM_COLL 0
> /home/mgx/testing/mv2/hellow
>
> #sleep 60
>
>
>
> and my mpichversion:
>
>
>
> [mgx at mod-condo-login02 mv2]$ mpichversion
>
> MVAPICH2 Version:           2.2rc1
>
> MVAPICH2 Release date: Tue Mar 29 22:00:00 EST 2016
>
> MVAPICH2 Device:            ch3:mrail
>
> MVAPICH2 configure:       --prefix=/software/tools/apps/mvapich/gnu/2.2rc1
> --with-hwloc --with-pbs=/opt/torque --with-device=ch3:mrail --with-rdma=gen2
>
> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
>
> MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
>
> MVAPICH2 F77: gfortran -L/lib -L/lib   -O2
>
> MVAPICH2 FC:     gfortran   -O2
>
>
>
>
>
> From: <hari.subramoni at gmail.com> on behalf of Hari Subramoni <
> subramoni.1 at osu.edu>
> Date: Tuesday, June 7, 2016 at 5:40 PM
>
> To: Michael Galloway <gallowaymd at ornl.gov>
> Cc: "mvapich-discuss at cse.ohio-state.edu" <
> mvapich-discuss at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] Getting Started Help
>
>
>
> Hello Michael,
>
>
>
> Good to know that MVAPICH2-2.2rc1 works fine for you out of the box. May I
> assume that you are not interested in debugging the issue you were seeing
> with 2.1?
>
>
>
> mpirun_rsh will give you very good startup performance. However, if you
> would like to use PBS, the following section of the userguide has more
> information on how you can configure MVAPICH2 to run with PBS.
>
>
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-360005.2.4
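>
> For illustration only, a minimal PBS batch script along these lines (the
> node count and walltime are placeholders; the hellow path is the one used
> elsewhere in this thread) could launch the job with mpirun_rsh using the
> hostfile that PBS provides:
>
> #!/bin/bash
> #PBS -N mv2_hello
> #PBS -l nodes=2:ppn=32
> #PBS -l walltime=01:00:00
> # One MPI rank per slot listed by PBS in the nodefile
> NP=$(wc -l < $PBS_NODEFILE)
> mpirun_rsh -np $NP -hostfile $PBS_NODEFILE /home/mgx/testing/mv2/hellow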
>
>
>
> Please let me know if you face any issues with this.
>
>
>
> Regards,
>
> Hari.
>
>
>
> On Tue, Jun 7, 2016 at 2:35 PM, Galloway, Michael D. <gallowaymd at ornl.gov>
> wrote:
>
> Hari, thanks!
>
>
>
> If I use MV2_USE_SHMEM_COLL=0, 2.1 does indeed run.
>
>
>
> [mgx at mod-condo-login02 mv2]$ mpirun_rsh -np 2 mod-condo-c01 mod-condo-c02
> ./hellow
>
> Hello world from process 0 of 2
>
> Hello world from process 1 of 2
>
>
>
> I built 2.2rc1 but there is no mpirun_rsh:
>
>
>
> [mgx at mod-condo-login02 mv2]$ ls -l
> /software/tools/apps/mvapich/gnu/2.2rc1/bin/
>
> total 10176
>
> -rwxr-xr-x 1 root root 1403306 Jun  7 14:00 hydra_nameserver
>
> -rwxr-xr-x 1 root root 1400230 Jun  7 14:00 hydra_persist
>
> -rwxr-xr-x 1 root root 1652880 Jun  7 14:00 hydra_pmi_proxy
>
> lrwxrwxrwx 1 root root       6 Jun  7 14:01 mpic++ -> mpicxx
>
> -rwxr-xr-x 1 root root   10201 Jun  7 14:01 mpicc
>
> -rwxr-xr-x 1 root root   13231 Jun  7 14:01 mpichversion
>
> -rwxr-xr-x 1 root root    9762 Jun  7 14:01 mpicxx
>
> lrwxrwxrwx 1 root root      13 Jun  7 14:00 mpiexec -> mpiexec.hydra
>
> -rwxr-xr-x 1 root root 1918904 Jun  7 14:00 mpiexec.hydra
>
> lrwxrwxrwx 1 root root       7 Jun  7 14:01 mpif77 -> mpifort
>
> lrwxrwxrwx 1 root root       7 Jun  7 14:01 mpif90 -> mpifort
>
> -rwxr-xr-x 1 root root   13516 Jun  7 14:01 mpifort
>
> -rwxr-xr-x 1 root root   13191 Jun  7 14:01 mpiname
>
> lrwxrwxrwx 1 root root      13 Jun  7 14:00 mpirun -> mpiexec.hydra
>
> -rwxr-xr-x 1 root root 3956771 Jun  7 14:01 mpivars
>
> -rwxr-xr-x 1 root root    3426 Jun  7 14:01 parkill
>
>
>
>
>
>
>
>
>
> From: <hari.subramoni at gmail.com> on behalf of Hari Subramoni <
> subramoni.1 at osu.edu>
> Date: Tuesday, June 7, 2016 at 12:35 PM
> To: Michael Galloway <gallowaymd at ornl.gov>
> Cc: "mvapich-discuss at cse.ohio-state.edu" <
> mvapich-discuss at cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] Getting Started Help
>
>
>
> Hello Michael,
>
>
>
> Are you running on an OpenPower system by any chance? If so, I would like
> to note that we introduced support for it in our latest release (please
> refer to point #3 below).
>
>
>
> As a workaround, can you please try running after
> setting MV2_USE_SHMEM_COLL=0 and see if things pass?
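>
> As a sketch only (reusing the node names and hellow binary that appear
> elsewhere in this thread), the variable can be passed on the mpirun_rsh
> command line as a KEY=VALUE pair placed before the executable:
>
> mpirun_rsh -np 2 mod-condo-c01 mod-condo-c02 MV2_USE_SHMEM_COLL=0 ./hellow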
>
>
>
> There are a few things I would like to note. I would highly recommend you
> follow these.
>
>
>
> 1. We have a quick start guide available at the following location; it
> explains how to get up and running quickly.
>
>
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-quickstart.html
>
>
>
> 2. You seem to be using the Nemesis interface
> (--with-device=ch3:nemesis:ib). We recommend the OFA-IB-CH3 interface for
> the best performance and latest functionality. Please refer to the
> following section of the userguide for more details on how to build
> MVAPICH2 for the OFA-IB-CH3 interface (a sample configure line is sketched
> after the link below).
>
>
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-120004.4
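>
> As a rough sketch only, a configure invocation for the OFA-IB-CH3
> interface might look like the following (the install prefix and Torque
> path are copied from the 2.2rc1 build shown elsewhere in this thread;
> adjust them for your site):
>
> ./configure --prefix=/software/tools/apps/mvapich/gnu/2.2rc1 \
>             --with-device=ch3:mrail --with-rdma=gen2 \
>             --with-pbs=/opt/torque
> make -j 8 && make install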
>
>
>
> 3. You seem to be using an older version of MVAPICH2. Given that you are
> starting out, I would recommend using the latest version, MVAPICH2-2.2rc1,
> so that you get the latest performance and feature enhancements. You can
> get the source tarball from the following site:
>
>
>
> http://mvapich.cse.ohio-state.edu/downloads/
>
>
>
> Regards,
>
> Hari.
>
>
>
> On Tue, Jun 7, 2016 at 9:05 AM, Galloway, Michael D. <gallowaymd at ornl.gov>
> wrote:
>
> Alright, I will confess to being a n00b with mpich/mvapich2; I’m trying to
> understand how to build and run apps on our clusters. My build is this:
>
>
>
> [mgx at mod-condo-login01 mv2]$ mpichversion
>
> MVAPICH2 Version:           2.1
>
> MVAPICH2 Release date: Fri Apr 03 20:00:00 EDT 2015
>
> MVAPICH2 Device:            ch3:nemesis
>
> MVAPICH2 configure:       --with-device=ch3:nemesis:ib
> --with-pbs=/opt/torque --enable-hwlock
> --prefix=/software/tools/apps/mvapich2/gcc4/2.1
>
> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
>
> MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
>
> MVAPICH2 F77: gfortran   -O2
>
> MVAPICH2 FC:     gfortran   -O2
>
>
>
> [mgx at mod-condo-login01 mv2]$ mpicc -v
>
> mpicc for MVAPICH2 version 2.1
>
> Using built-in specs.
>
> COLLECT_GCC=gcc
>
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
>
> Target: x86_64-redhat-linux
>
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=
> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
> --enable-threads=posix --enable-checking=release --with-system-zlib
> --enable-__cxa_atexit --disable-libunwind-exceptions
> --enable-gnu-unique-object --enable-linker-build-id
> --with-linker-hash-style=gnu
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto
> --enable-plugin --enable-initfini-array --disable-libgcj
> --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install
> --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install
> --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64
> --build=x86_64-redhat-linux
>
> Thread model: posix
>
> gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
>
>
>
>
>
> Our cluster is IB fabric like:
>
>
>
> [mgx at mod-condo-login01 mv2]$ ibv_devinfo
>
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.34.5000
>         node_guid:                      e41d:2d03:007b:eff0
>         sys_image_guid:                 e41d:2d03:007b:eff3
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x0
>         board_id:                       MT_1090120019
>         phys_port_cnt:                  2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         1
>                         port_lid:       170
>                         port_lmc:       0x00
>                         link_layer:     InfiniBand
>
>                 port:   2
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        4096 (5)
>                         active_mtu:     4096 (5)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>                         link_layer:     Ethernet
>
>
>
> I build the simple hellow.c code thus:
>
>
>
> [mgx at mod-condo-login01 mv2]$ mpicc hellow.c -o hellow
>
> [mgx at mod-condo-login01 mv2]$ ldd hellow
>
>         linux-vdso.so.1 =>  (0x00007ffee85e7000)
>         libmpi.so.12 => /software/tools/apps/mvapich2/gcc4/2.1/lib/libmpi.so.12 (0x00002b23cb5b7000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002b23cbb0b000)
>         librt.so.1 => /lib64/librt.so.1 (0x00002b23cbecc000)
>         libnuma.so.1 => /lib64/libnuma.so.1 (0x00002b23cc0d4000)
>         libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b23cc2e0000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002b23cc649000)
>         libibumad.so.3 => /lib64/libibumad.so.3 (0x00002b23cc84d000)
>         libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00002b23cca56000)
>         libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00002b23ccc68000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002b23ccf8a000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b23cd28c000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b23cd4a8000)
>         libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00002b23cd6be000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002b23cb393000)
>         libz.so.1 => /lib64/libz.so.1 (0x00002b23cd8fa000)
>         liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b23cdb10000)
>         libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00002b23cdd35000)
>         libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00002b23cdf84000)
>
>
>
> and a simple run fails like this:
>
>
>
> [mgx at mod-condo-login01 mv2]$  mpirun_rsh -np 1 mod-condo-c01
> /home/mgx/testing/mv2/hellow
>
> Fatal error in MPI_Init: Other MPI error, error stack:
>
> MPIR_Init_thread(514)..........:
>
> MPID_Init(359).................: channel initialization failed
>
> MPIDI_CH3_Init(131)............:
>
> MPIDI_CH3I_SHMEM_COLL_Init(932): write: Success
>
> [mod-condo-c01.ornl.gov:mpispawn_0][readline] Unexpected End-Of-File on
> file descriptor 5. MPI process died?
>
> [mod-condo-c01.ornl.gov:mpispawn_0][mtpmi_processops] Error while reading
> PMI socket. MPI process died?
>
> [mod-condo-c01.ornl.gov:mpispawn_0][child_handler] MPI process (rank: 0,
> pid: 106241) exited with status 1
>
> [mgx at mod-condo-login01 mv2]$ [mod-condo-c01.ornl.gov:mpispawn_0][report_error]
> connect() failed: Connection refused (111)
>
>
>
> I know I must be making some simple mistake; I am used to working with
> openmpi. Thanks!
>
>
>
> --- Michael
>
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
>