[mvapich-discuss] Assertion problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Mon Apr 6 09:18:43 EDT 2009


Benjamin:
Thanks for using mvapich2.  I'll discuss this issue with the other
developers and get back to you once we have either resolved it or
pinpointed its cause.

On Mon, Apr 06, 2009 at 02:31:37PM +0200, Benjamin Fersch wrote:
> Dear List Members,
> 
> 
> I'm running the WRF-ARW V3.0.1.1 weather model on our InfiniBand HPC.
> The cluster is installed with OpenFabrics.
> 
> I usually submit my jobs on 10 to 12 nodes, and each node has 4 Opteron
> CPUs.
> 
> The program was compiled with the Portland compiler (pgi64-7.2-5) and mvapich2.
> 
> My problem is that the model does not run to completion. After about
> 6 hours of computation time, the following error shows up and the
> program stops.
> 
> wrf.exe: ch3u_rndv.c:333: MPIDI_CH3_PktHandler_RndvClrToSend: Assertion
> `sreq->mrail.rndv_buf_off == 0' failed
> 
> This error came up only once:
> wrf.exe: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion
> `rndv->protocol == VAPI_PROTOCOL_R3' failed
> 
> It seems that the more nodes I use, the sooner the crash occurs.
> Could this be a memory problem, such as an overflow?
> Running the program on a single CPU with mpirun does not result in a crash.
> 
> I'm not very experienced with MPI details.
> 
> Can anybody help with this problem?
> 
> 
> Thank you!
> 
> 
> Benjamin
> 
> 
> output of ofed_info:
> 
> OFED-1.3.1
> libibverbs:
> git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
> commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
> libmthca:
> git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
> commit 9501e698d257949acfab2edc90812602966dbcc9
> libmlx4:
> git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
> commit 3869d6dab7e12fe452270ca641f7dd7082b42482
> libehca:
> git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3
> commit fd898180cfa3b737f893f432a80b91bac3396325
> libipathverbs:
> git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3
> commit 82be4d81859d1fd2edf830220fe65a9923b80a46
> libcxgb3:
> git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3
> commit 6f7485feb244d8571fcab2292ef92c97bea48df0
> libnes:
> git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3
> commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3
> libibcm:
> git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3
> commit 53ec35f544bbc1838bbadc2210909c25a954a5e2
> librdmacm:
> git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3
> commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1
> dapl1:
> git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3
> commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779
> dapl2:
> git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3
> commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177
> libsdp:
> git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3
> commit c8102dccc502930442b23de658674d386456b350
> sdpnetstat:
> git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3
> commit 3341620a7259c4f7bdd4180864b98e260c3dc223
> srptools:
> git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
> commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d
> perftest:
> git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3
> commit 6321b5468f7293088cc003809049c02b176130d8
> qlvnictools:
> git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3
> commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd
> tvflash:
> git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3
> commit f5e7407a7f2058448df5e5320d9843f944427429
> mstflint:
> git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3
> commit 78bbd3d521a9078553a991111ffb6f76665b9ee9
> qperf:
> git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3
> commit 6221aabd038df0b7033e035378ca190641ed2295
> management:
> git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3
> commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf
> ibutils:
> git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3
> commit 7daf94fab6eaf307316326f3f49704e6080a1508
> ibsim:
> git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3
> commit 55113d9f919709c7c97ea41d29991941b9c8be70
> 
> ofa_kernel-1.3.1:
> Git:
> git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
> commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c
> 
> # MPI
> mvapich-1.0.1-2533.src.rpm
> mvapich2-1.0.3-1.src.rpm
> openmpi-1.2.6-1.src.rpm
> mpitests-3.0-773.src.rpm
> 
> 
> 
> -- 
> Dipl. Hydr. Benjamin Fersch
> 
> Institute for Meteorology and Climate Research (IMK-IFU)
> KIT Karlsruhe Institute of Technology (FZK)
> Kreuzeckbahnstraße 19
> 82467 Garmisch-Partenkirchen (Germany)
> 
> Phone: +49 8821 183-267
> Fax:   +49 8821 183-243
> 
> ________________________________________________________________________
> 
> Forschungszentrum Karlsruhe GmbH, Weberstraße 5, 76133 Karlsruhe
> 
> Amtsgericht Mannheim, HRB 100302
> Vorsitzende des Aufsichtsrates: MinDir'in Bärbel Brumme-Bothe
> Vorstand (Geschäftsführung): Prof. Dr. Eberhard Umbach (Vorsitzender);
> Dr. Alexander Kurz (stellv. Vorsitzender); Dr.-Ing. Peter Fritz;
> Prof. Dr.-Ing. Detlef Löhe; Prof. Dr. Horst Hippler;  Prof. Dr. Reinhard
> Maschuw;
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo