[mvapich-discuss] perhaps odd behavior..

Hari Subramoni subramoni.1 at osu.edu
Fri Jan 9 15:27:25 EST 2015


Hello Steve,

By default MVAPICH2 will identify all available HCAs and use the first port
on these HCAs for communication. However, it will only use one port unless
the user explicitly states that MVAPICH2 should use more than one port by
setting the "MV2_NUM_PORTS" environment variable.
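
As a rough sketch of how these settings might be passed at launch: the
script below only prints two candidate command lines and does not run
anything. It assumes mpirun_rsh as the launcher; the host names nodeA and
nodeB and the binary ./app are hypothetical placeholders, and the values
shown for the dual-port case are an assumption, not a tested
recommendation.

```shell
#!/bin/sh
# Dry-run sketch: compose (but do not execute) an mpirun_rsh command line
# that passes MVAPICH2 runtime parameters as NAME=VALUE pairs placed
# before the executable.
# nodeA, nodeB and ./app are hypothetical placeholders.
build_launch_cmd() {
    num_hcas=$1
    num_ports=$2
    printf 'mpirun_rsh -np 2 nodeA nodeB MV2_NUM_HCAS=%s MV2_NUM_PORTS=%s ./app\n' \
        "$num_hcas" "$num_ports"
}

# One card with two ports: request the second port explicitly (assumed values).
build_launch_cmd 1 2

# Two cards with one port each: the setting reported to work in this thread.
build_launch_cmd 2 1
```

Printing the command first makes it easy to compare exactly what was set
on the passing and failing systems.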

From your e-mail, I'm assuming that things run fine with "MV2_NUM_HCAS=2
MV2_NUM_PORTS=1" - is this correct?

At this point, I'm guessing that on the system where things are failing,
there is a bad HCA.

Could you please send us the output of ibv_devinfo from the system where
things are passing and from the system where things are failing? Also,
could you please configure MVAPICH2 in debug mode (--enable-g=dbg
--enable-fast=none), run it with "MV2_SHOW_ENV_INFO=1
MV2_DEBUG_SHOW_BACKTRACE=1", and send us the output?
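
The steps requested above could look roughly like the following. This is a
dry-run sketch that only prints the commands; the output file name, the
host names nodeA and nodeB, and ./app are hypothetical, mpirun_rsh is
assumed as the launcher, and the configure line would also need the
options from your existing build.

```shell
#!/bin/sh
# Dry-run sketch: print the diagnostic steps instead of executing them.
# The quoted 'EOF' keeps $(hostname) literal in the printed text.
# File names, host names and ./app are hypothetical placeholders.
print_diag_steps() {
    cat <<'EOF'
# 1. Capture the HCA/port layout on both the passing and the failing system
ibv_devinfo > ibv_devinfo.$(hostname).txt
# 2. Rebuild MVAPICH2 in debug mode (add your usual configure options)
./configure --enable-g=dbg --enable-fast=none
make && make install
# 3. Rerun the failing job with environment info and backtraces enabled
mpirun_rsh -np 2 nodeA nodeB MV2_SHOW_ENV_INFO=1 MV2_DEBUG_SHOW_BACKTRACE=1 ./app
EOF
}

print_diag_steps
```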

Regards,
Hari.

On Fri, Jan 9, 2015 at 3:02 PM, Steve Heistand <steve.heistand at nasa.gov>
wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> so we have the latest mvapich build:
>
> MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail
>
> Compilation
> CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
> CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
> F77: ifort -L/lib -L/lib -m64 -fpic  -O2
> FC: ifort -m64 -fpic  -O2
>
> Configuration
> - --with-device=ch3:mrail --with-rdma=gen2 CC=icc CXX=icpc F77=ifort
> FC=ifort CFLAGS=-fpic
> - -m64 CXXFLAGS=-fpic -m64 FFLAGS=-m64 -fpic FCFLAGS=-m64 -fpic
> --enable-f77 --enable-fc
> - --enable-cxx --enable-romio --enable-threads=default --with-hwloc
> -disable-multi-aliases
> - -enable-xrc=no -enable-hybrid --prefix=XXX --with-file-system=lustre
>
> it was compiled on and for the most part run on machines that have 1 IB
> card with dual
> ports. This is all fine so far.
> However when we run on a system that has dual cards each with a single
> port the job dies
> at startup.
>
> If I tell it that the system is dual hca single port via environment
> variables it runs fine.
>
> At this point I'm unsure whether it actually uses both ports on either
> configuration.
>
> I would have thought it would have probed the hardware to figure out what
> set up
> it had when it tried to bond to the multiple ports.
>
> unless it's actually crashing in the probe section of the mpi_init
> routines...
>
> thoughts?
>
> thanks
>
> s
>
>
> - --
> ************************************************************************
>  Steve Heistand                          NASA Ames Research Center
>  Email: steve.heistand at nasa.gov          Steve Heistand/Mail Stop 258-6
>  Work Phone: (650) 604-4369              Bldg. 258, Rm. 232-5
>  Scientific & HPC Application            P.O. Box 1
>  Development/Optimization                Moffett Field, CA 94035-0001
> ************************************************************************
>  "Any opinions expressed are those of our alien overlords, not my own."
>
> # For Remedy                        #
> #Action: Resolve                    #
> #Resolution: Resolved               #
> #Reason: No Further Action Required #
> #Tier1: User Code                   #
> #Tier2: Other                       #
> #Tier3: Assistance                  #
> #Notification: None                 #
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.14 (GNU/Linux)
>
> iEYEARECAAYFAlSwM78ACgkQoBCTJSAkVrGcowCfbb4olsTD75zUTpAUbU/RRXlI
> vPUAn06naxXOaR6ICj2YPSNoyIYKlqxy
> =fWil
> -----END PGP SIGNATURE-----