[mvapich-discuss] WRF 3.4.1 hangs at mpi_finalize with mvapich2-1.9rc1

Devendar Bureddy bureddy at cse.ohio-state.edu
Thu May 9 14:45:53 EDT 2013


Hi Parker

We would like to reproduce the hang issue (with the 74-day simulation) to
analyze it further.  Can you please provide additional details such as the
WRF configuration, I/O configuration, and run-time details (number of
processes, run-time parameters, etc.)?

The additional messages with the 1-day run appear because some objects
(communicators, groups, etc.) are not freed in the application.  I think
these messages should not be a major concern here.
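
For illustration, here is a minimal sketch (not taken from WRF; the
communicator and group names are hypothetical) of the kind of cleanup that
avoids these reports, since builds configured with --enable-g=all list any
handles still allocated at MPI_Finalize:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm  dup_comm;      /* hypothetical duplicated communicator */
        MPI_Group world_group;   /* hypothetical group handle */

        MPI_Init(&argc, &argv);

        /* Objects the application creates ... */
        MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* ... and should release before finalizing; otherwise a debug
           build reports "N handles are still allocated" and
           "leaked context IDs detected" at finalize time. */
        MPI_Group_free(&world_group);
        MPI_Comm_free(&dup_comm);

        MPI_Finalize();
        return 0;
    }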


-Devendar




On Thu, May 9, 2013 at 1:18 PM, Parker Norton <parker.norton at gmail.com> wrote:

> Hello,
>
> I have been successfully using mvapich2-1.5.1 with the Weather Research
> and Forecasting (WRF) Model, version 3.4.1.  Recently our cluster OS was
> updated, which necessitated recompiling the WRF model and the required
> libraries.  I chose to use mvapich2-1.9rc1 for the parallel library.  I
> successfully used the Intel compilers (version 11.1) to compile the
> software.
>
> However, when I run the WRF model I get the following behavior.  For long
> runs (73-day simulations) the logfile indicates the model completed
> successfully, but then the job just hangs, never terminating execution.
> When I turn on additional debugging output for the model, it appears to be
> hanging on the MPI_Finalize call.
>
> When I perform a 1-day run of the same model, it completes and terminates
> execution successfully, but I get the following additional messages in my
> log output:
>      leaked context IDs detected: mask=0x2b5524fd3260 mask[0]=0x1ffffff
>      In direct memory block for handle type GROUP, 2 handles are still
> allocated
>      In direct memory block for handle type ATTR, 2 handles are still
> allocated
>      In direct memory block for handle type KEYVAL, 1 handles are still
> allocated
>      In direct memory block for handle type COMM, 7 handles are still
> allocated
>
> I found a discussion at
> http://www.nacad.ufrj.br/online/sgi/007-3773-018/sgi_html/ch10.html#Z1175712035tls
> that indicated the problem with MPI_Finalize hanging was usually related to
> unmatched or incomplete send/recv requests.  During my searches I was not
> able to find any discussions where others were experiencing problems with
> WRF related to this.
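
A hypothetical illustration of that failure mode (not code from WRF): a
nonblocking send that no rank ever receives, and that is never completed
with MPI_Wait or MPI_Test, can leave MPI_Finalize blocked on the
outstanding request.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, payload = 42;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* No matching receive is ever posted on rank 1, and the
               request is never waited on, so it stays pending. */
            MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        }

        /* With an outstanding, unmatched request, MPI_Finalize may
           block here depending on the library and message protocol. */
        MPI_Finalize();
        return 0;
    }
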
>
> I also tried compiling mvapich2-1.7a2 to see if that version would work
> correctly with WRF, but it exhibits the same behavior.
>
> I was able to copy the mvapich2-1.5.1 binary that I had been using on the
> old system onto the new system and got it working.  When I use this rather
> dated version of the mvapich2 library, the WRF model runs without any
> problems or additional error/warning messages.
>
> At this point I am able to run my WRF model with the older version of
> mvapich2, but I would like to be able to take advantage of the improvements
> and bug fixes in the newer versions.
>
> The system I am on uses InfiniBand to connect the nodes.  The configure
> line I used is:
>
>     ./configure --prefix=/usr/local/mvapich2-1.9rc1-intel11
> --enable-shared --enable-g=all --enable-error-messages=all F77="ifort"
> FC="ifort" CC="icc" CXX="icpc"
>
> Results from mpichversion:
>     MVAPICH2 Version:         1.9rc1
>     MVAPICH2 Release date:    Tue Apr 16 12:35:17 EDT 2013
>     MVAPICH2 Device:          ch3:mrail
>     MVAPICH2 configure:       --prefix=/usr/local/mvapich2-1.9rc1-intel11
> --enable-shared --enable-g=all --enable-error-messages=all
>     MVAPICH2 CC:      icc    -g -DNDEBUG -DNVALGRIND -O2
>     MVAPICH2 CXX:     icpc   -g -DNDEBUG -DNVALGRIND -O2
>     MVAPICH2 F77:     gfortran -L/lib -L/lib   -g -O2
>     MVAPICH2 FC:      ifort   -g -O2
>
>
> Any help or insights that could be offered in figuring this out would be
> appreciated.  Please let me know if you have further questions.
>
> Parker Norton
>
>


-- 
Devendar