[mvapich-discuss] MVAPICH2 version 1.8 hangs on MPI_Finalize when using nemesis

Carson Holt Carson.Holt at oicr.on.ca
Thu Aug 16 18:05:47 EDT 2012


For ch3:mrail, I was able to find what was creating the 64-CPU threshold:
setting MV2_ON_DEMAND_THRESHOLD to a higher number (the default is 64)
removes it.
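
For example, it can be passed on the mpirun_rsh command line, something like
this (the hostfile and binary names here are just placeholders for my actual
job):

$ mpirun_rsh -np 128 -hostfile hosts MV2_ON_DEMAND_THRESHOLD=128 ./my_app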

Now I'll try the patch for nemesis, and let you know if that solves the
MPI_Finalize hang issue I see there.

Thanks,
Carson





On 12-08-16 5:59 PM, "Devendar Bureddy" <bureddy at cse.ohio-state.edu> wrote:

>Hi Carson
>
>MVAPICH2 internally uses its own memory allocation module called
>"ptmalloc".  It seems there is some weird interaction with shared libraries
>when Perl is used with MVAPICH2. You should not see this behavior with a C
>application.
>
>You can work around the issues with Perl in the following ways.
>
>1. With Nemesis:IB:
>
>This hang surfaces because of a ptmalloc initialization failure. The
>attached patch should fix it. Please follow the instructions below to apply
>the patch:
>
>$ cd mvapich2-1.8
>$ patch -p1 < diff.nemesis
>$ make && make install
>
>
>2. With --with-device=ch3:mrail --with-rdma=gen2:
>
>Adding "--disable-registration-cache" to the configure options should
>resolve the failures with more than 64 cores.
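>
>For reference, a configure line with that option added might look roughly
>like this (keep the rest of your existing configure options):
>
>$ ./configure --with-device=ch3:mrail --with-rdma=gen2 \
>      --disable-registration-cache
>$ make && make install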
>
>Please note that with both of the above fixes, MVAPICH2 will internally run
>without the "registration cache" feature, which is a performance
>optimization, so you may see some performance degradation. We are looking
>into this issue further and will get back to you.
>
>Let us know whether you are able to run larger-scale jobs with the above
>fixes for nemesis:ib and --with-device=ch3:mrail --with-rdma=gen2.
>
>Thanks
>Devendar
>
>On Thu, Aug 16, 2012 at 11:39 AM, Carson Holt <Carson.Holt at oicr.on.ca>
>wrote:
>> MVAPICH2 version 1.8 hangs on MPI_Finalize when using nemesis.  The hang
>> only happens when MVAPICH2 is configured with --with-device=ch3:nemesis:ib.
>> The changelog for version 1.8 says this bug was fixed, but it is still
>> happening.
>>
>> I am using shared libraries to call MVAPICH2 from Perl, but even if the
>> only calls are MPI_Init() followed immediately by MPI_Finalize(), it still
>> hangs (i.e. no Send or Recv calls at all).  I have attached a minimal test
>> script that consistently reproduces the error (run it from Perl with the
>> Inline::C module installed).
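>>
>> For reference, the C core of the test is essentially just the two calls
>> below (a standalone sketch; in my case the same calls are driven from Perl
>> through Inline::C rather than compiled on their own):
>>
>> #include <mpi.h>
>>
>> /* Minimal reproducer: initialize MPI and immediately finalize it. */
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);   /* no Send/Recv calls in between */
>>     MPI_Finalize();           /* hangs here with ch3:nemesis:ib */
>>     return 0;
>> }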
>>
>> Here is the output of the mpich2version script showing the configuration
>> used when installing MVAPICH2:
>> MVAPICH2 Version:     1.8
>> MVAPICH2 Release date: Mon Apr 30 14:56:40 EDT 2012
>> MVAPICH2 Device:       ch3:nemesis
>> MVAPICH2 configure:   --enable-romio --with-file-system=nfs+ufs
>> --enable-sharedlibs=gcc --with-ib-include=/usr/include
>> --with-ib-libpath=/usr/lib64 --enable-shared
>>--with-device=ch3:nemesis:ib
>> --enable-cxx
>> MVAPICH2 CC:   gcc -fPIC -D_GNU_SOURC -D_GNU_SOURCE   -DNDEBUG
>>-DNVALGRIND
>> -O2
>> MVAPICH2 CXX: g++ -fPIC  -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 F77: gfortran -fPIC  -O2
>> MVAPICH2 FC:   gfortran -fPIC  -O2
>>
>>
>> Also, when MVAPICH2 is installed with --with-device=ch3:mrail
>> --with-rdma=gen2 (the default Linux settings), everything works correctly
>> for my test script; however, I can only launch on up to exactly 64
>> processors (anything above that fails).  This happens with both mpirun_rsh
>> and mpiexec.  Is this a known limitation of the default interface?  With
>> the nemesis configuration, on the other hand, I don't hit a processor
>> limit, but my application freezes in the MPI_Finalize call.
>>
>> Thanks,
>> Carson
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
>
>
>-- 
>Devendar



