[mvapich-discuss] WRF 3.4.1 hangs at mpi_finalize with mvapich2-1.9rc1
Parker Norton
parker.norton at gmail.com
Thu May 9 13:18:55 EDT 2013
Hello,
I have been successfully using mvapich2-1.5.1 with the Weather Research and
Forecast Model (WRF) version 3.4.1. Recently our cluster OS was updated
which necessitated re-compiling the WRF model and required libraries. I
chose to use mvapich2-1.9rc1 for the parallel library. I successfully used
the Intel compilers (version 11.1) to compile the software.
However, when I run the WRF model I get the following behavior. For long
runs (73 day simulations) the logfile indicates the model completed
successfully but then just hangs, never terminating execution. When I turn
on additional debugging output for the model it appears to be hanging on
the MPI_Finalize call.
When I perform a 1-day run of the same model, it completes successfully
and terminates execution, but I get the following additional messages in
my log output:
leaked context IDs detected: mask=0x2b5524fd3260 mask[0]=0x1ffffff
In direct memory block for handle type GROUP, 2 handles are still allocated
In direct memory block for handle type ATTR, 2 handles are still allocated
In direct memory block for handle type KEYVAL, 1 handles are still allocated
In direct memory block for handle type COMM, 7 handles are still allocated
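To compare these messages between runs or between library versions, it can help to tally the leaked handles per type. The following is only a minimal sketch in Python, not part of MVAPICH2 or WRF: the function name `tally_leaked_handles` and the regular expression are assumptions inferred solely from the message format quoted above, and other library versions may word these diagnostics differently.

```python
import re

# Pattern inferred from the messages quoted above (assumption); other
# MPICH/MVAPICH2 versions may phrase these diagnostics differently.
LEAK_PATTERN = re.compile(
    r"handle type (\w+), (\d+) handles? (?:is|are) still allocated"
)

# Sample input copied from the log output above, including the line wrapping.
SAMPLE_LOG = """\
leaked context IDs detected: mask=0x2b5524fd3260 mask[0]=0x1ffffff
In direct memory block for handle type GROUP, 2 handles are still
allocated
In direct memory block for handle type ATTR, 2 handles are still
allocated
In direct memory block for handle type KEYVAL, 1 handles are still
allocated
In direct memory block for handle type COMM, 7 handles are still
allocated
"""


def tally_leaked_handles(log_text):
    """Return a dict mapping handle type to the number of leaked handles."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flattened = " ".join(log_text.split())
    counts = {}
    for handle_type, count in LEAK_PATTERN.findall(flattened):
        counts[handle_type] = counts.get(handle_type, 0) + int(count)
    return counts


if __name__ == "__main__":
    for handle_type, count in sorted(tally_leaked_handles(SAMPLE_LOG).items()):
        print(f"{handle_type}: {count}")
```

The tally only summarizes what the library already reports; it does not identify which communicator, group, or attribute in the application was left unfreed.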
I found a discussion at
http://www.nacad.ufrj.br/online/sgi/007-3773-018/sgi_html/ch10.html#Z1175712035tls
that indicated the problem with MPI_Finalize hanging was usually related to
unmatched or incomplete send/recv requests. During my searches I was not
able to find any discussions where others were experiencing problems with
WRF related to this.
I also tried compiling mvapich2-1.7a2 to see if that version would work
correctly with WRF but it exhibits the same behavior.
I was able to get a binary of the mvapich2-1.5.1 library that I had been
using on the old system onto the new system and got it to work. When I use
this rather dated version of the mvapich2 library the WRF model runs
without any problems or additional error/warning messages.
At this point I am able to run my WRF model with the older version of
mvapich2 but I would like to be able to take advantage of the improvements
and bug fixes in the newer versions.
The system I am on uses Infiniband to connect the nodes. The configure
line I used is:
./configure --prefix=/usr/local/mvapich2-1.9rc1-intel11
--enable-shared --enable-g=all --enable-error-messages=all F77="ifort"
FC="ifort" CC="icc" CXX="icpc"
Results from mpichversion:
MVAPICH2 Version: 1.9rc1
MVAPICH2 Release date: Tue Apr 16 12:35:17 EDT 2013
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --prefix=/usr/local/mvapich2-1.9rc1-intel11
--enable-shared --enable-g=all --enable-error-messages=all
MVAPICH2 CC: icc -g -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: icpc -g -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib -g -O2
MVAPICH2 FC: ifort -g -O2
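One check that can be run on output like this is whether the compilers the library reports match the ones passed to configure. In the output above, the F77 line reports gfortran, while the configure line sets F77="ifort". The following is a rough sketch in Python under stated assumptions: the helper names `parse_mpichversion` and `compiler_mismatches` are hypothetical, and the parsing relies only on the "MVAPICH2 <field>: <value>" layout shown above, with unprefixed lines treated as continuations of the previous field.

```python
# Sample input copied from the mpichversion output above.
SAMPLE_VERSION_OUTPUT = """\
MVAPICH2 Version: 1.9rc1
MVAPICH2 Release date: Tue Apr 16 12:35:17 EDT 2013
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --prefix=/usr/local/mvapich2-1.9rc1-intel11
--enable-shared --enable-g=all --enable-error-messages=all
MVAPICH2 CC: icc -g -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: icpc -g -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib -g -O2
MVAPICH2 FC: ifort -g -O2
"""


def parse_mpichversion(text, prefix="MVAPICH2 "):
    """Parse 'MVAPICH2 <field>: <value>' lines into a dict (hypothetical helper)."""
    fields = {}
    last_key = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(prefix) and ":" in line:
            # Split on the first colon only, so values such as "ch3:mrail"
            # or a timestamp keep their own colons.
            key, _, value = line[len(prefix):].partition(":")
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # Unprefixed line: treat it as a wrapped continuation.
            fields[last_key] += " " + line
    return fields


def compiler_mismatches(fields, requested):
    """Return {variable: (requested, reported)} where the two differ."""
    mismatches = {}
    for variable, wanted in requested.items():
        tokens = fields.get(variable, "").split()
        reported = tokens[0] if tokens else ""
        if reported != wanted:
            mismatches[variable] = (wanted, reported)
    return mismatches


if __name__ == "__main__":
    # Compiler settings taken from the configure line quoted earlier.
    requested = {"F77": "ifort", "FC": "ifort", "CC": "icc", "CXX": "icpc"}
    fields = parse_mpichversion(SAMPLE_VERSION_OUTPUT)
    for variable, (wanted, reported) in compiler_mismatches(fields, requested).items():
        print(f"{variable}: requested {wanted}, reported {reported}")
```

Whether this mismatch has anything to do with the hang is not established here; the sketch only makes the difference easy to spot.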
Any help or insights that could be offered in figuring this out would be
appreciated. Please let me know if you have further questions.
Parker Norton