[mvapich-discuss] MPI communication problem with mvapich2-1.8a1p1

Jonathan Perkins perkinjo at cse.ohio-state.edu
Thu Feb 9 13:18:34 EST 2012


Hi all, this problem has been resolved, and the fix for the initial
problem is available in the nightly 1.7 branch.  The build failure
seen with the different configuration options when using
mcmodel=medium has also been fixed and will be available in the
nightly branch in the next day or so.

On Fri, Jan 27, 2012 at 7:02 PM, Nirmal Seenu <nirmal at fnal.gov> wrote:
> I couldn't launch the MPI processes with the version that was built
> with --disable-fast --enable-g=dbg; it fails with the following
> error message:
>
> [nirmal at cci001 ~]$ export
> PATH=/usr/local/mvapich2-1.8a1p1-gcc-test/bin:$PATH
> [nirmal at cci001 mvapich2-1.8a1p1-gcc-test]$ mpiexec ./IMB-MPI1
> Assertion failed in file mpid_vc.c at line 840: *max_id_p >= 0
> [cli_0]: aborting job:
> internal ABORT - process 0
> Assertion failed in file mpid_vc.c at line 840: *max_id_p >= 0
> [cli_4]: aborting job:
> internal ABORT - process 0
>
>
> The other version, built with --enable-fast, consistently hangs
> while running IMB after it completes PingPong and Sendrecv on 2 and
> 4 processes successfully:
>
> [nirmal at cci001 ~]$ export PATH=/usr/local/mvapich2-1.8a1p1-gcc/bin:$PATH
> [nirmal at cci001 ~]$ cd run-imb/mvapich2-1.8a1p1-gcc
> [nirmal at cci001 mvapich2-1.8a1p1-gcc]$ mpiexec ./IMB-MPI1
> #---------------------------------------------------
> #    Intel (R) MPI Benchmark Suite V3.2.3, MPI-1 part
> #---------------------------------------------------
> # Date                  : Fri Jan 27 17:49:15 2012
> # Machine               : x86_64
> # System                : Linux
> # Release               : 2.6.18-274.17.1.el5
> # Version               : #1 SMP Tue Jan 10 16:13:44 EST 2012
> # MPI Version           : 2.2
> # MPI Thread Environment:
> ...
> ...
> ...
> #-----------------------------------------------------------------------------
> # Benchmarking Sendrecv
> # #processes = 4
> # ( 60 additional processes waiting in MPI_Barrier)
> #-----------------------------------------------------------------------------
>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec] Mbytes/sec
> ...
> ...
>      2097152           20      3723.20      3726.20      3724.86 1073.48
>      4194304           10      7400.30      7424.88      7414.66 1077.46
>
>
> Nirmal
>
>
> On 1/27/2012 5:11 PM, Jonathan Perkins wrote:
>>
>> I'm not sure whether this is related to some interaction with
>> -mcmodel=medium.  Does this happen with both sets of options?  I'll
>> try to reproduce this build failure, but can you still send a trace
>> of the processes while they are hanging?
>>
>> Use your build options, but replace ``--enable-fast'' with
>> ``--disable-fast --enable-g=dbg''.
>>
>> On Fri, Jan 27, 2012 at 6:01 PM, Nirmal Seenu<nirmal at fnal.gov>  wrote:
>>>
>>> I am getting the following error during make with the options that you
>>> mentioned:
>>>
>>> make[4]: Entering directory
>>> `/usr/local/src/mvapich2-1.8a1p1/src/pm/hydra/tools/topo/hwloc/hwloc'
>>> Making all in src
>>> make[5]: Entering directory
>>> `/usr/local/src/mvapich2-1.8a1p1/src/pm/hydra/tools/topo/hwloc/hwloc/src'
>>>  CC     topology.lo
>>>  CC     traversal.lo
>>>  CC     distances.lo
>>>  CC     topology-synthetic.lo
>>>  CC     bind.lo
>>>  CC     cpuset.lo
>>>  CC     misc.lo
>>>  CC     topology-xml.lo
>>>  CC     topology-linux.lo
>>>  CC     topology-x86.lo
>>> topology-x86.c: In function 'look_proc':
>>>
>>> /usr/local/src/mvapich2-1.8a1p1/src/pm/hydra/tools/topo/hwloc/hwloc/include/private/cpuid.h:54:
>>> error: can't find a register in class 'BREG' while reloading 'asm'
>>> [the same error is repeated 10 times in all]
>>> make[5]: *** [topology-x86.lo] Error 1
>>> make[5]: Leaving directory
>>> `/usr/local/src/mvapich2-1.8a1p1/src/pm/hydra/tools/topo/hwloc/hwloc/src'
>>> make[4]: *** [all-recursive] Error 1
>>> make[4]: Leaving directory
>>> `/usr/local/src/mvapich2-1.8a1p1/src/pm/hydra/tools/topo/hwloc/hwloc'
>>> make[3]: *** [all-recursive] Error 1
>>> make[3]: Leaving directory `/usr/local/src/mvapich2-1.8a1p1/src/pm/hydra'
>>> make[2]: *** [all-redirect] Error 1
>>> make[2]: Leaving directory `/usr/local/src/mvapich2-1.8a1p1/src/pm'
>>> make[1]: *** [all-redirect] Error 2
>>> make[1]: Leaving directory `/usr/local/src/mvapich2-1.8a1p1/src'
>>> make: *** [all-redirect] Error 2
>>>
>>>
>>> I am able to build with the options that I mentioned in my previous email
>>> though.
>>>
>>> Nirmal
>>>
>>>
>>> On 01/27/2012 04:38 PM, Jonathan Perkins wrote:
>>>>
>>>>
>>>> Please try the following...
>>>> ./configure --prefix=/usr/local/mvapich2-1.8a1p1-gcc --enable-fast
>>>> --enable-f77 --enable-fc --enable-cxx --enable-romio --enable-mpe
>>>>
>>>> If you would like to try and provide stack traces to us use...
>>>> ./configure --prefix=/usr/local/mvapich2-1.8a1p1-gcc --disable-fast
>>>> --enable-g=dbg --enable-f77 --enable-fc --enable-cxx --enable-romio
>>>> --enable-mpe
>>>>
>>>> On Fri, Jan 27, 2012 at 5:31 PM, Nirmal Seenu<nirmal at fnal.gov>    wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I doubt that the options used to build MVAPICH2 are the problem
>>>>> here, as the remote MPI processes launch successfully and do a
>>>>> little bit of communication before they hang.
>>>>>
>>>>> I used the same options to build mvapich2-1.2p1, mvapich2-1.5,
>>>>> mvapich2-1.6rc2, and mvapich2-1.6-r4751, and they all work fine.
>>>>>
>>>>> What options do I need when building MVAPICH2 so that the
>>>>> mpiexec launcher uses the TM interface to launch MPI jobs?
>>>>>
>>>>> Nirmal
>>>>>
>>>>>
>>>>> On 01/27/2012 03:53 PM, Jonathan Perkins wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hello Nirmal, sorry to hear that you're having trouble.  Let me
>>>>>> suggest that you remove some of the options that you've specified at
>>>>>> the configure step.  We no longer support MPD so you should remove the
>>>>>> --enable-pmiport and --with-pm=mpd options.  I actually think it'll be
>>>>>> simpler for you to remove more options and then only add an option if
>>>>>> you need it and things are working.
>>>>>>
>>>>>> Please try the following configuration for MVAPICH2 and let us know if
>>>>>> you still have trouble or not.
>>>>>> ./configure --prefix=/usr/local/mvapich2-1.8a1p1-gcc --enable-fast
>>>>>> --enable-f77 --enable-fc --enable-cxx --enable-romio --enable-mpe
>>>>>>
>>>>>> On Fri, Jan 27, 2012 at 3:57 PM, Nirmal Seenu<nirmal at fnal.gov>
>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I am having trouble running the Intel MPI Benchmark (IMB
>>>>>>> 3.2.3; I run IMB-MPI1 without any options) on the latest
>>>>>>> version, MVAPICH2 1.8a1p1.
>>>>>>>
>>>>>>> The MPI processes get launched properly on the worker nodes,
>>>>>>> but the benchmark hangs within a few seconds of the launch and
>>>>>>> makes no further progress. I checked the InfiniBand fabric and
>>>>>>> everything is healthy. We mount Lustre over native IB on all
>>>>>>> the worker nodes, and the Lustre mounts are healthy as well.
>>>>>>>
>>>>>>> This is reproducible with MVAPICH2 compiled with GCC and with
>>>>>>> PGI compiler 11.7 as well.
>>>>>>>
>>>>>>> Details about the installation:
>>>>>>>
>>>>>>> The worker nodes run RHEL 5.3 with the latest kernel,
>>>>>>> 2.6.18-274.17.1.el5, and we use the InfiniBand drivers that
>>>>>>> are distributed as part of the kernel.
>>>>>>>
>>>>>>> The MVAPICH2 gcc version was compiled with the following
>>>>>>> compiler:
>>>>>>> gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50)
>>>>>>>
>>>>>>> The following options were used to compile MVAPICH2 and
>>>>>>> mpiexec:
>>>>>>>
>>>>>>> export CC=gcc
>>>>>>> export CXX=g++
>>>>>>> export F77=gfortran
>>>>>>> export FC=gfortran
>>>>>>>
>>>>>>> export CFLAGS=-mcmodel=medium
>>>>>>> export CXXFLAGS=-mcmodel=medium
>>>>>>> export FFLAGS=-mcmodel=medium
>>>>>>> export FCFLAGS=-mcmodel=medium
>>>>>>> export LDFLAGS=-mcmodel=medium
>>>>>>>
>>>>>>> MVAPICH2:
>>>>>>> ./configure --prefix=/usr/local/mvapich2-1.8a1p1-gcc --enable-fast
>>>>>>> --enable-f77 --enable-fc --enable-cxx --enable-romio --enable-pmiport
>>>>>>> --enable-mpe --with-pm=mpd --with-pmi=simple --with-thread-package
>>>>>>> --with-hwloc
>>>>>>>
>>>>>>> MPIEXEC:
>>>>>>> ./configure --prefix=/usr/local/mvapich2-1.8a1p1-gcc
>>>>>>> --with-pbs=/usr/local/pbs
>>>>>>> --with-mpicc=/usr/local/mvapich2-1.8a1p1-gcc/bin/mpicc
>>>>>>> --with-mpicxx=/usr/local/mvapich2-1.8a1p1-gcc/bin/mpicxx
>>>>>>> --with-mpif77=/usr/local/mvapich2-1.8a1p1-gcc/bin/mpif77
>>>>>>> --with-mpif90=/usr/local/mvapich2-1.8a1p1-gcc/bin/mpif90
>>>>>>> --disable-mpich-gm
>>>>>>> --disable-mpich-p4 --disable-mpich-rai --with-default-comm=pmi
>>>>>>>
>>>>>>> I was able to run the Intel MPI Benchmark using the following
>>>>>>> versions of MVAPICH2, compiled with the same version of gcc:
>>>>>>> mvapich2-1.2p1
>>>>>>> mvapich2-1.5
>>>>>>> mvapich2-1.6rc2
>>>>>>> mvapich2-1.6-r4751
>>>>>>>
>>>>>>> I will be more than happy to provide more details if needed.
>>>>>>> Thanks in advance for looking into this problem.
>>>>>>>
>>>>>>> Nirmal
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
