[mvapich-discuss] Getting Started Help

Galloway, Michael D. gallowaymd at ornl.gov
Tue Jun 7 18:36:14 EDT 2016


Hari,

I am getting this now:

[mgx at mod-condo-login02 output]$ cat mod_mv2_hello.6424.mod-condo-pbs01.err
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(514)..........:
MPID_Init(365).................: channel initialization failed
MPIDI_CH3_Init(505)............:
MPIDI_CH3I_SHMEM_Helper_fn(926): write: Success

My script is:

[mgx at mod-condo-login02 mv2]$ more mod_mv2_hello.pbs
#PBS -N mod_mv2_hello
#PBS -l nodes=1:ppn=32
#PBS -l walltime=24:00:00
#PBS -V
#PBS -q batch
##PBS -o ctest2
##PBS -j oe
#PBS -o /home/mgx/output/$PBS_JOBNAME.$PBS_JOBID.out
#PBS -e /home/mgx/output/$PBS_JOBNAME.$PBS_JOBID.err
#sleep 60
#hostname
mpiexec  -n 32  -hostfile $PBS_NODEFILE -env MV2_USE_SHMEM_COLL 0  /home/mgx/testing/mv2/hellow
#sleep 60

and my mpichversion:

[mgx at mod-condo-login02 mv2]$ mpichversion
MVAPICH2 Version:           2.2rc1
MVAPICH2 Release date: Tue Mar 29 22:00:00 EST 2016
MVAPICH2 Device:            ch3:mrail
MVAPICH2 configure:       --prefix=/software/tools/apps/mvapich/gnu/2.2rc1 --with-hwloc --with-pbs=/opt/torque --with-device=ch3:mrail --with-rdma=gen2
MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib   -O2
MVAPICH2 FC:     gfortran   -O2


From: <hari.subramoni at gmail.com> on behalf of Hari Subramoni <subramoni.1 at osu.edu>
Date: Tuesday, June 7, 2016 at 5:40 PM
To: Michael Galloway <gallowaymd at ornl.gov>
Cc: "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at cse.ohio-state.edu>
Subject: Re: [mvapich-discuss] Getting Started Help

Hello Michael,

Good to know that MVAPICH2-2.2rc1 works fine for you out of the box. May I assume that you are not interested in debugging the issue you were seeing with 2.1?

mpirun_rsh will give you very good startup performance. However, if you would prefer to launch through PBS, the following section of the userguide explains how to configure MVAPICH2 for that.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-360005.2.4
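
For illustration, a minimal job script along those lines could look like the sketch below (paths and the hellow binary location follow your earlier script and are only placeholders; adjust for your site):

  #!/bin/bash
  #PBS -N mv2_hello
  #PBS -l nodes=1:ppn=32
  #PBS -q batch
  # PBS writes the allocated hosts (one line per slot) to $PBS_NODEFILE,
  # which mpirun_rsh can consume directly via -hostfile.
  NP=$(wc -l < $PBS_NODEFILE)
  mpirun_rsh -np $NP -hostfile $PBS_NODEFILE /home/mgx/testing/mv2/hellow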

Please let me know if you face any issues with this.

Regards,
Hari.

On Tue, Jun 7, 2016 at 2:35 PM, Galloway, Michael D. <gallowaymd at ornl.gov> wrote:
Hari, thanks!

If I use MV2_USE_SHMEM_COLL=0 2.1 does indeed run.

[mgx at mod-condo-login02 mv2]$ mpirun_rsh -np 2 mod-condo-c01 mod-condo-c02 ./hellow
Hello world from process 0 of 2
Hello world from process 1 of 2

I built 2.2rc1, but there is no mpirun_rsh:

[mgx at mod-condo-login02 mv2]$ ls -l /software/tools/apps/mvapich/gnu/2.2rc1/bin/
total 10176
-rwxr-xr-x 1 root root 1403306 Jun  7 14:00 hydra_nameserver
-rwxr-xr-x 1 root root 1400230 Jun  7 14:00 hydra_persist
-rwxr-xr-x 1 root root 1652880 Jun  7 14:00 hydra_pmi_proxy
lrwxrwxrwx 1 root root       6 Jun  7 14:01 mpic++ -> mpicxx
-rwxr-xr-x 1 root root   10201 Jun  7 14:01 mpicc
-rwxr-xr-x 1 root root   13231 Jun  7 14:01 mpichversion
-rwxr-xr-x 1 root root    9762 Jun  7 14:01 mpicxx
lrwxrwxrwx 1 root root      13 Jun  7 14:00 mpiexec -> mpiexec.hydra
-rwxr-xr-x 1 root root 1918904 Jun  7 14:00 mpiexec.hydra
lrwxrwxrwx 1 root root       7 Jun  7 14:01 mpif77 -> mpifort
lrwxrwxrwx 1 root root       7 Jun  7 14:01 mpif90 -> mpifort
-rwxr-xr-x 1 root root   13516 Jun  7 14:01 mpifort
-rwxr-xr-x 1 root root   13191 Jun  7 14:01 mpiname
lrwxrwxrwx 1 root root      13 Jun  7 14:00 mpirun -> mpiexec.hydra
-rwxr-xr-x 1 root root 3956771 Jun  7 14:01 mpivars
-rwxr-xr-x 1 root root    3426 Jun  7 14:01 parkill




From: <hari.subramoni at gmail.com> on behalf of Hari Subramoni <subramoni.1 at osu.edu>
Date: Tuesday, June 7, 2016 at 12:35 PM
To: Michael Galloway <gallowaymd at ornl.gov>
Cc: "mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>" <mvapich-discuss at cse.ohio-state.edu<mailto:mvapich-discuss at cse.ohio-state.edu>>
Subject: Re: [mvapich-discuss] Getting Started Help

Hello Michael,

Are you running on an OpenPower system by any chance? If so, I would like to note that we introduced support for it in our latest release (please refer to point #3 below).

As a workaround, can you please try running after setting MV2_USE_SHMEM_COLL=0 and see if things pass?
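
For example (a sketch only; the node names and hellow path mirror your earlier commands), with mpirun_rsh the variable is passed on the command line just before the executable, and with mpiexec.hydra it can be passed via -env:

  mpirun_rsh -np 2 mod-condo-c01 mod-condo-c02 MV2_USE_SHMEM_COLL=0 ./hellow
  mpiexec -n 2 -env MV2_USE_SHMEM_COLL 0 ./hellow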

There are a few things I would like to note, and I would highly recommend following them.

1. We have a quick start guide available at the following location that explains how to get up and running:

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-quickstart.html

2. You seem to be using the nemesis interface (--with-device=ch3:nemesis:ib). We recommend using the OFA-IB-CH3 interface for best performance and latest functionality. Please refer to the following section of the userguide for more details on how to build MVAPICH2 for the OFA-IB-CH3 interface:

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2rc1-userguide.html#x1-120004.4
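
As a rough sketch (the install prefix and Torque path below are only examples modeled on your existing install; adjust as needed), a build for the OFA-IB-CH3 interface would be configured roughly as:

  ./configure --prefix=/software/tools/apps/mvapich/gnu/2.2rc1 \
              --with-device=ch3:mrail --with-rdma=gen2 \
              --with-pbs=/opt/torque
  make -j 8 && make install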

3. You seem to be using an older version of MVAPICH2. Given that you are starting out, I would recommend using the latest version, MVAPICH2-2.2rc1, so that you get the latest performance and feature enhancements. You can get the source tarball from the following site:

http://mvapich.cse.ohio-state.edu/downloads/

Regards,
Hari.

On Tue, Jun 7, 2016 at 9:05 AM, Galloway, Michael D. <gallowaymd at ornl.gov> wrote:
Alright, I will confess to being a n00b with mpich/mvapich2; I’m trying to understand how to build and run apps on our clusters. My build is this:

[mgx at mod-condo-login01 mv2]$ mpichversion
MVAPICH2 Version:           2.1
MVAPICH2 Release date: Fri Apr 03 20:00:00 EDT 2015
MVAPICH2 Device:            ch3:nemesis
MVAPICH2 configure:       --with-device=ch3:nemesis:ib --with-pbs=/opt/torque --enable-hwlock --prefix=/software/tools/apps/mvapich2/gcc4/2.1
MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran   -O2
MVAPICH2 FC:     gfortran   -O2

[mgx at mod-condo-login01 mv2]$ mpicc -v
mpicc for MVAPICH2 version 2.1
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)


Our cluster is IB fabric like:

[mgx at mod-condo-login01 mv2]$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.34.5000
        node_guid:                      e41d:2d03:007b:eff0
        sys_image_guid:                 e41d:2d03:007b:eff3
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1090120019
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               170
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

I build the simple hellow.c code thus:

[mgx at mod-condo-login01 mv2]$ mpicc hellow.c -o hellow
[mgx at mod-condo-login01 mv2]$ ldd hellow
                linux-vdso.so.1 =>  (0x00007ffee85e7000)
                libmpi.so.12 => /software/tools/apps/mvapich2/gcc4/2.1/lib/libmpi.so.12 (0x00002b23cb5b7000)
                libc.so.6 => /lib64/libc.so.6 (0x00002b23cbb0b000)
                librt.so.1 => /lib64/librt.so.1 (0x00002b23cbecc000)
                libnuma.so.1 => /lib64/libnuma.so.1 (0x00002b23cc0d4000)
                libxml2.so.2 => /lib64/libxml2.so.2 (0x00002b23cc2e0000)
                libdl.so.2 => /lib64/libdl.so.2 (0x00002b23cc649000)
                libibumad.so.3 => /lib64/libibumad.so.3 (0x00002b23cc84d000)
                libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00002b23cca56000)
                libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00002b23ccc68000)
                libm.so.6 => /lib64/libm.so.6 (0x00002b23ccf8a000)
                libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b23cd28c000)
                libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b23cd4a8000)
                libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00002b23cd6be000)
                /lib64/ld-linux-x86-64.so.2 (0x00002b23cb393000)
                libz.so.1 => /lib64/libz.so.1 (0x00002b23cd8fa000)
                liblzma.so.5 => /lib64/liblzma.so.5 (0x00002b23cdb10000)
                libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00002b23cdd35000)
                libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00002b23cdf84000)

and a simple run errors like this:

[mgx at mod-condo-login01 mv2]$  mpirun_rsh -np 1 mod-condo-c01 /home/mgx/testing/mv2/hellow
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(514)..........:
MPID_Init(359).................: channel initialization failed
MPIDI_CH3_Init(131)............:
MPIDI_CH3I_SHMEM_COLL_Init(932): write: Success
[mod-condo-c01.ornl.gov:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[mod-condo-c01.ornl.gov:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[mod-condo-c01.ornl.gov:mpispawn_0][child_handler] MPI process (rank: 0, pid: 106241) exited with status 1
[mgx at mod-condo-login01 mv2]$ [mod-condo-c01.ornl.gov:mpispawn_0][report_error] connect() failed: Connection refused (111)

I know I must be making some simple mistakes; I am used to working with Open MPI. Thanks!

--- Michael


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

