[mvapich-discuss] WRF 3.4.1 hangs at mpi_finalize with mvapich2-1.9rc1

Parker Norton parker.norton at gmail.com
Thu May 9 13:18:55 EDT 2013


Hello,

I have been successfully using mvapich2-1.5.1 with the Weather Research and
Forecasting (WRF) Model version 3.4.1.  Recently our cluster OS was updated,
which necessitated recompiling the WRF model and its required libraries.  I
chose mvapich2-1.9rc1 as the parallel library and compiled the software
successfully with the Intel compilers (version 11.1).

However, when I run the WRF model I see the following behavior.  For long
runs (73-day simulations) the log file indicates that the model completed
successfully, but the process then hangs and never terminates.  When I turn
on additional debugging output for the model, it appears to hang in the
MPI_Finalize call.

When I perform a 1-day run of the same model, it completes and terminates
normally, but I get the following additional messages in my log output:
     leaked context IDs detected: mask=0x2b5524fd3260 mask[0]=0x1ffffff
     In direct memory block for handle type GROUP, 2 handles are still allocated
     In direct memory block for handle type ATTR, 2 handles are still allocated
     In direct memory block for handle type KEYVAL, 1 handles are still allocated
     In direct memory block for handle type COMM, 7 handles are still allocated
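
To compare these leak reports across runs or library versions, the counts can be tallied per handle type.  The following is only a minimal sketch in Python, assuming the messages keep the wording shown above; the function name and the regular expression are hypothetical helpers of my own, not part of MVAPICH2 or WRF.

```python
import re
from collections import Counter

# Pattern assumed from the messages quoted above; other MPICH-derived
# builds may word them differently.
LEAK_RE = re.compile(
    r"handle type (\w+),\s*(\d+) handles? (?:are|is) still\s+allocated"
)


def tally_leaked_handles(log_text):
    """Return a Counter mapping handle type -> number of leaked handles."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for kind, number in LEAK_RE.findall(flat):
        counts[kind] += int(number)
    return counts


if __name__ == "__main__":
    sample = """
    In direct memory block for handle type GROUP, 2 handles are still
    allocated
    In direct memory block for handle type COMM, 7 handles are still
    allocated
    """
    for kind, number in tally_leaked_handles(sample).items():
        print(kind, number)
```

Running this over the log of each MPI rank would show whether every rank reports the same leaked handle types.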

I found a discussion at
http://www.nacad.ufrj.br/online/sgi/007-3773-018/sgi_html/ch10.html#Z1175712035tls
that indicated an MPI_Finalize hang is usually caused by unmatched or
incomplete send/recv requests.  During my searches I was not able to find
any discussions where others were experiencing this problem with WRF.

I also tried compiling mvapich2-1.7a2 to see if that version would work
correctly with WRF but it exhibits the same behavior.

I was able to get a binary of the mvapich2-1.5.1 library that I had been
using on the old system onto the new system and got it to work.  When I use
this rather dated version of the mvapich2 library the WRF model runs
without any problems or additional error/warning messages.

At this point I am able to run my WRF model with the older version of
mvapich2 but I would like to be able to take advantage of the improvements
and bug fixes in the newer versions.

The system I am on uses Infiniband to connect the nodes.  The configure
line I used is:

    ./configure --prefix=/usr/local/mvapich2-1.9rc1-intel11 \
        --enable-shared --enable-g=all --enable-error-messages=all \
        F77="ifort" FC="ifort" CC="icc" CXX="icpc"

Results from mpichversion:
    MVAPICH2 Version:         1.9rc1
    MVAPICH2 Release date:    Tue Apr 16 12:35:17 EDT 2013
    MVAPICH2 Device:          ch3:mrail
    MVAPICH2 configure:       --prefix=/usr/local/mvapich2-1.9rc1-intel11 --enable-shared --enable-g=all --enable-error-messages=all
    MVAPICH2 CC:      icc    -g -DNDEBUG -DNVALGRIND -O2
    MVAPICH2 CXX:     icpc   -g -DNDEBUG -DNVALGRIND -O2
    MVAPICH2 F77:     gfortran -L/lib -L/lib   -g -O2
    MVAPICH2 FC:      ifort   -g -O2


Any help or insights that could be offered in figuring this out would be
appreciated.  Please let me know if you have further questions.

Parker Norton