From ibatis2 at 163.com Tue Jan 1 07:12:05 2008
From: ibatis2 at 163.com (jetspeed)
Date: Tue Jan 1 07:23:01 2008
Subject: [mvapich-discuss] what package may I need?
Message-ID: <20080101201205.6c0a5aa8.ibatis2@163.com>

Hi all,
I use MVAPICH2 0.9.8 on PowerPC, RHEL4. When I use mpicc to compile HPL, I get many errors like the ones below (MVAPICH successfully compiled a simple MPI program, and MPICH2 on this machine successfully compiled HPL):

/usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4ac0): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_trylock@@GLIBC_2.3'
/usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4b10): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_unlock@@GLIBC_2.3'
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: ld returned 1 exit status
make[2]: *** [dexe.grd] Error 1

My glibc packages are glibc-2.3.4-2.19, glibc-devel-2.3.4-2.19.ppc, and glibc-devel-2.3.4-2.19.ppc64. What package do I need for a successful compile, or do you have any other suggestions? Thanks in advance.

From nilesh_awate at yahoo.com Wed Jan 2 03:50:02 2008
From: nilesh_awate at yahoo.com (nilesh awate)
Date: Wed Jan 2 04:00:17 2008
Subject: [mvapich-discuss] different mpiexec options
Message-ID: <985896.2893.qm@web94101.mail.in2.yahoo.com>

Hi Lei,
Thanks a bunch, my problem has been solved. I had actually seen the machinefile option in the help output, but I didn't find much about how to specify it in man mpiexec.
Thanks & regards,
Nilesh Awate
C-DAC R&D

----- Original Message ----
From: LEI CHAI
To: nilesh awate
Cc: mvapich-discuss@cse.ohio-state.edu
Sent: Tuesday, 1 January, 2008 12:37:56 AM
Subject: Re: [mvapich-discuss] different mpiexec options

Hi Nilesh,

You can map processes to machines by using the -machinefile option. For example, suppose you have four nodes, m[1-4], and you want to run 4 processes on a single node, say m1, without modifying mpd.hosts. You can run the program like this:

$ mpiexec -machinefile ./mf -n 4 ./a.out

where mf is a file containing the machine mapping, e.g.

$ cat mf
m1
m1
m1
m1

And if you want to run 4 processes on 2 nodes, then mf may look like this:

$ cat mf
m1
m2
m1
m2

More information about running MVAPICH2 can be found in the MVAPICH2 user guide: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2.html

Thanks,
Lei

-------------- next part --------------
Hi all,

I'm using mvapich2-1.0.1 with the OFED 1.2 uDAPL stack. I've set up 4 nodes and am using them, but when I run the following command:

mpiexec -n 4 ./mpitst

mpitst gets executed on all 4 nodes. Can I restrict its execution to only 2 nodes (without reducing the number of nodes in mpd.hosts) by specifying an option when running mpiexec? Which different options can we give to mpiexec? Suppose I want to run 4 instances of the executable on a single node with a quad-core CPU; how can I tell mpiexec to run them on that single node and let the other nodes remain idle?

Waiting for a reply.
Regards,
Nilesh Awate
C-DAC R&D
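As a minimal sketch (not part of the original thread) of how to verify where mpiexec actually places each process under a given -machinefile, the short MPI program below uses only standard MPI calls; the file name whereami.c and the machinefile name mf are made up for the example:

#include <mpi.h>
#include <stdio.h>

/* Each rank reports the host it runs on, so the effect of a
 * -machinefile mapping (e.g. 4 ranks pinned to one quad-core node)
 * can be checked directly. */
int main(int argc, char **argv)
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    printf("rank %d of %d is running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Compiled and launched the same way as the examples above, e.g.

$ mpicc -o whereami whereami.c
$ mpiexec -machinefile ./mf -n 4 ./whereami

each rank prints its host name, so a mapping such as the m1/m2 machinefiles shown in Lei's reply can be confirmed at run time.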
From ibatis2 at 163.com Thu Jan 3 05:54:03 2008
From: ibatis2 at 163.com (jetspeed)
Date: Thu Jan 3 05:59:24 2008
Subject: [mvapich-discuss] 64bit error?
Message-ID: <20080103185403.5b789147.ibatis2@163.com>

Hi all,
I use MVAPICH2 0.9.8 (OFED 1.2.5.4 for InfiniBand) on PowerPC, RHEL4. When I use mpicc to compile HPL, I get many errors like the ones below:

/usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4ac0): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_trylock@@GLIBC_2.3'
/usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4b10): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_unlock@@GLIBC_2.3'
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: ld returned 1 exit status
make[2]: *** [dexe.grd] Error 1

But after I set BINARY64 = 1 in the LAPACK makefile, which produces a 64-bit binary, and link that lapack.a into HPL, the compile succeeds. Running the xhpl program, however, gives the following errors:

rank 4 in job 6 inode01_42535 caused collective abort of all ranks
exit status of rank 4: killed by signal 9
rank 3 in job 6 inode01_42535 caused collective abort of all ranks
exit status of rank 3: killed by signal 9
rank 1 in job 6 inode01_42535 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 6 inode01_42535 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

The question is: what does the "R_PPC64_REL24 relocation" mean? How can I compile and run the HPL tests using MVAPICH2? Has anyone done this?

From ibatis2 at 163.com Thu Jan 3 09:22:48 2008
From: ibatis2 at 163.com (jetspeed)
Date: Thu Jan 3 09:28:17 2008
Subject: [mvapich-discuss] 64bit error?
In-Reply-To:
References: <20080103185403.5b789147.ibatis2@163.com>
Message-ID: <20080103222248.a905506e.ibatis2@163.com>

I tried MVAPICH2 1.0.1, and it works! But mpicc can't use the -m64 option (I compiled with the default make.mvapich2.ofa). I guess the MVAPICH2 0.9.8 in my OFED 1.2.5.4 was compiled as a 64-bit version, so it should use a 64-bit LAPACK. Is that right? Is there a setting to define 32-bit/64-bit? I see there is a script make.mvapich2.def that detects the architecture (my `uname -m` outputs ppc64).

On Thu, 3 Jan 2008 06:35:09 -0500 (EST) Dhabaleswar Panda wrote:
> Thanks for your note.
>
> Do you see the same problem with MVAPICH2 1.0.1 from our web site?
>
> Unfortunately, we do not have any working PowerPC system with RHEL4 to
> reproduce and analyze this problem.
>
> If you can provide us remote access to your system for some time, we will
> be happy to analyze and solve this. Let us know whether this will be
> feasible. Accordingly, I will ask one of my team members to be in touch
> with you regarding this.
> > DK > > On Thu, 3 Jan 2008, jetspeed wrote: > > > Hi,all > > I use Mvapich2 0.9.8(OFED1.2.5.4 for InfiniBand), on PowerPC, RHEL4 , when I use mpicc to compile hpl, I got many errors as below: > > > > /usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4ac0): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_trylock@@GLIBC_2.3' > > /usr/bin/ld: /usr/mpi/gcc/mvapich2-0.9.8-15/lib/libmpich.a(malloc.o)(.text+0x4b10): unresolvable R_PPC64_REL24 relocation against symbol `pthread_mutex_unlock@@GLIBC_2.3' > > /usr/bin/ld: final link failed: Nonrepresentable section on output > > collect2: ld returned 1 exit status > > make[2]: *** [dexe.grd] Error 1 > > > > > > but after I set the BINARY64 = 1 in the Lapack makefile which produces 64 bit binary, and link the lapack.a in HPL, the compile will succeed. but running the xhpl program, got errors as follow: > > > > rank 4 in job 6 inode01_42535 caused collective abort of all ranks > > exit status of rank 4: killed by signal 9 > > rank 3 in job 6 inode01_42535 caused collective abort of all ranks > > exit status of rank 3: killed by signal 9 > > rank 1 in job 6 inode01_42535 caused collective abort of all ranks > > exit status of rank 1: killed by signal 9 > > rank 0 in job 6 inode01_42535 caused collective abort of all ranks > > exit status of rank 0: killed by signal 9 > > > > > > the question is what does the "R_PPC64_REL24 relocation" mean? How can I compile and run the HPL tests by using Mvapich2 ? anyone done this ? > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From eborisch at ieee.org Thu Jan 3 10:24:08 2008 From: eborisch at ieee.org (Eric A. Borisch) Date: Thu Jan 3 10:24:18 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <1a7dd1d151.1d1511a7dd@osu.edu> References: <1a7dd1d151.1d1511a7dd@osu.edu> Message-ID: <392f95800801030724l209990f1g8609288378f70dac@mail.gmail.com> Lei, Thanks for the information. I would suggest that, if this can't be fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should be removed from the default compile options for the versions where it is (apparently) not fully supported. This is a very nasty bug. The MPI layer reports back no errors, but the data isn't actually transferred successfully. In addition, it presents as a timing / waiting error to the user, as all of the local (shared mem) peers transfer data successfully, so significant time can be spent chasing down a suspected user oversight for what is actually an error within the MPI layer. This would apply to the MVAPICH and MVAPICH2, in both the vapi and vapi_multirail makefiles. In addition, it should be documented that the LAZY_MEM_UNREGISTER switch is NOT compatible with vapi-based channels. Thanks, Eric On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > Hi Eric, > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. > > Thanks, > Lei > > > > ----- Original Message ----- > From: "Eric A. 
Borisch" > Date: Friday, December 21, 2007 10:23 am > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > I seem to be running into a memory registration issue. > > > > Observations: > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > into a > > local buffer on the root rank, I receive all of the data from any > > ranks that are running on the same machine, but only part (or none at > > all) of the data from ranks running on external machines. The transfer > > length is above the eager/rendezvous threshold. > > 2) Once the problem occurs, it is persistent. However, if I force > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > and then re-transfer, the correct data is received. (Same memory being > > transferred from / to.) > > 3) I've only witnessed problems occurring above the 4G (as > > returned by > > malloc()) memory range. > > 4) When I receive partial data from ranks, the data ends on a (4k) > > page bound. Data past this bound (which should have been updated) is > > unchanged during the transfer, yet both the sender and receiver report > > no errors. (This is very bad!) > > 5) Stepping through the code on both ends of the transfer shows the > > software agreeing on the (correct) length and location as far down as > > I can follow it. > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > no issues. (Other than the expected performance hit.) > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > 1.0 (vapi) > > 8) The user code is also sending data out (from a different buffer) > > over ethernet to a remote gui from the root node. > > > > I can't move to gen2 at this point -- we are using a vendor library > > for interfacing to another system, and this library uses VAPI. > > > > uname -a output: > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > It appears (perhaps this is obvious) that the assumption that memory > > registered (by the dreg.c code) remains registered until explicitly > > unregistered (again, by the dreg.c code) is being violated in some > > way. This, however, is wading in to uncharted (for me, at least) linux > > memory management waters. The user code is doing nothing to fiddle > > with registration in any explicit way. (With the exception of as > > mentioned in (2)) > > > > Please let me know what other information I can provide to resolve > > this. I'm still trying to put together a small test program to cause > > the problem, but have been unsuccessful so far. > > > > Thanks, > > Eric > > -- > > Eric A. Borisch > > eborisch@ieee.org > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > -- Eric A. Borisch eborisch@ieee.org From ben.held at staarinc.com Thu Jan 3 12:56:19 2008 From: ben.held at staarinc.com (Ben Held) Date: Thu Jan 3 12:56:30 2008 Subject: [mvapich-discuss] Building mvapich-based applications without access to infiniband system Message-ID: <00a401c84e31$f1eeeb30$d5ccc190$@held@staarinc.com> Our company offers a commercial product that we currently build for standard MPICH-1 and LAM. We have a client that has a new Infiniband Linux cluster that has MVAPICH installed on it. 
Our company does not own any infiniband hardware, but we are faced with providing an application for this customer's cluster. Is this possible and how do we proceed. It appears that the build process for mvapich automatically detects the hardware (that we don't have), so I have concerns that building mvapich here and the linking it into our app will result in a binary that will not run on their cluster. Thanks, Ben Ben Held Simulation Technology & Applied Research, Inc. 11520 N. Port Washington Rd., Suite 201 Mequon, WI 53092 P: 1.262.240.0291 x101 F: 1.262.240.0294 E: ben.held@staarinc.com http://www.staarinc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080103/ed499588/attachment.html From jsquyres at cisco.com Thu Jan 3 13:29:08 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu Jan 3 13:29:39 2008 Subject: [mvapich-discuss] Building mvapich-based applications without access to infiniband system In-Reply-To: <00a401c84e31$f1eeeb30$d5ccc190$@held@staarinc.com> References: <00a401c84e31$f1eeeb30$d5ccc190$@held@staarinc.com> Message-ID: <3C002364-889A-49C3-9F20-674974357287@cisco.com> Ben -- You might want to install the OFED stack, which comes with Open MPI, MVAPICH 1, and MVAPICH 2 pre-installed (all with IB support included). Cisco offers pre-made RPMs of these ins OFED distribution off cisco.com -- I believe that other vendors do as well, but I don't know the details. The pre-made binary RPMs avoids the problem of the build system trying to detect specific hardware when you have none. After installation, you can use the mpi-selector-menu command to select which MPI to use in OFED (see the man page for details). However, you're still in a bit of an odd spot in that you want to ship a product but don't have the hardware to test it on. Even if you get it to compile/link, you don't have a way to test whether it actually works or not. That's a real bummer (and could be a support nightmare). :-( FWIW, if budgets are tight, you could buy a pair of IB HCAs and connect them back-to-back without a switch for pretty cheap. This is nowhere near real testing, but at least it would give you some indication of whether your app works over an IB-enabled MPI or not. On Jan 3, 2008, at 12:56 PM, Ben Held wrote: > Our company offers a commercial product that we currently build for > standard MPICH-1 and LAM. We have a client that has a new > Infiniband Linux cluster that has MVAPICH installed on it. Our > company does not own any infiniband hardware, but we are faced with > providing an application for this customer?s cluster. Is this > possible and how do we proceed. It appears that the build process > for mvapich automatically detects the hardware (that we don?t have), > so I have concerns that building mvapich here and the linking it > into our app will result in a binary that will not run on their > cluster. > > Thanks, > Ben > > Ben Held > Simulation Technology & Applied Research, Inc. > 11520 N. 
Port Washington Rd., Suite 201 > Mequon, WI 53092 > P: 1.262.240.0291 x101 > F: 1.262.240.0294 > E: ben.held@staarinc.com > http://www.staarinc.com > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jeff Squyres Cisco Systems From tom.mitchell at qlogic.com Thu Jan 3 17:47:46 2008 From: tom.mitchell at qlogic.com (Tom Mitchell) Date: Thu Jan 3 17:48:09 2008 Subject: [mvapich-discuss] Building mvapich-based applications without access to infiniband system Message-ID: <20080103224746.GC29464@qlogic.com> On Jan 03 11:56, Ben Held wrote: > > Our company offers a commercial product that we currently build for standard > MPICH-1 and LAM. We have a client that has a new Infiniband Linux cluster > that has MVAPICH installed on it. Our company does not own any infiniband > hardware, but we are faced with providing an application for this > customer???s cluster. Is this possible and how do we proceed. It appears > that the build process for mvapich automatically detects the hardware (that > we don???t have), so I have concerns that building mvapich here and the > linking it into our app will result in a binary that will not run on their > cluster. Ben, Jeff Squyres had very good advice. I would like to add that MPI is an API not ABI. As you branch out you will have to pay attention to the binary stacks that you target at compile time. Examples might be HP-MPI, Cisco's MPI, QLogic's MPI, Open MPI, LAM, Intel MPI... and more including a customers hand crafted MPI. The MVAPICH that your client built for his Infiniband Linux cluster will have been compiled with a specific set of options and a specific compiler. Having built MVAPICH the client would have versions of the helper scripts mpicc, mpif77, mpif90... these scripts match the correct compiler to the correct library and almost all the other moving parts. If you look at the ABI issue for compilers in isolation you can find subtle things like Fortran logical True and False having underlying differences in the digital representation. For some Fortran compilers the logical .TRUE. and .FALSE. use the int pair 1 and 0. While others use 0 and -1.... getargs, memcpy are also other places I know where ABI mismatches can happen. The logical .TRUE. and .FALSE. case is interesting because correct Boolean logic transformations by the compiler can convert working code to code that fails in strange ways after ABI cross linking or a change in optimization.... This can be critical for Basic Linear Algebra packages.... where a researcher finds that compiler A gives +5% on library foo.so and compiler B gives +5% on library bar.so and then MPI was built with compiler C. Or worse ld search order finds unexpected and different packages out on nodes in a cluster. To research this a bit look at Open MPI. The Open MPI configure script and README does have comments that discuss and address the logical .TRUE. and .FALSE. issue. For Open MPI users the ompi_info command is valuable to rediscover which Fortran compiler Open MPI was configured with but may not have all the compiler flags (like "....Portland Group compilers provide the "-Munixlogical" option, and Intel compilers (version >= 8.) provide the "-fpscomp logicals" option...." Also the environment (see also alternatives) can get in the mix.... Now with gcc we also have gcc3 and gcc4 versions to watch... 
As you branch out your build environments need full and detailed records so you can reproduce/ debug these issues. Since MPI is an API you would do well to collect as many MPIs and compilers as you can find then build and test with each. In this case you only have the one additional customers MVAPICH and cluster to work with. That has the potential of making your life easy as long as the customer can give you access. It is the next handful of customers that makes things interesting. If you build your package on the customers cluster do log all you can about the cluster and build environment. A security fix, aptget, yum update, emerge world or up2date can change things that you do not expect ;-) Have fun, mitch > > > Thanks, > > Ben > > > Ben Held > Simulation Technology & Applied Research, Inc. > 11520 N. Port Washington Rd., Suite 201 > Mequon, WI 53092 > P: 1.262.240.0291 x101 > F: 1.262.240.0294 > E: [1]ben.held@staarinc.com > [2]http://www.staarinc.com > > References > > 1. mailto:ben.held@staarinc.com > 2. http://www.staarinc.com/ > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- T o m M i t c h e l l Host Solutions Group, QLogic Corp. http://www.qlogic.com http://support.qlogic.com From brian.budge at gmail.com Thu Jan 3 20:46:15 2008 From: brian.budge at gmail.com (Brian Budge) Date: Thu Jan 3 20:46:49 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB Message-ID: <5b7094580801031746h57c0e7f0i8f80b45e6f6918e7@mail.gmail.com> Hi all - I'm new to the list here... hi! I have been using OpenMPI for a while, and LAM before that, but new requirements keep pushing me to new implementations. In particular, I was interested in using infiniband (using OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is the library for that particular combination :) In any case, I installed MVAPICH, and I can boot the daemons, and run the ring speed test with no problems. When I run any programs with mpirun, however, I get an error when sending or receiving more than 8192 bytes. 
For example, if I run the bandwidth test from the benchmarks page (osu_bw.c), I get the following: --------------------------------------------------------------- budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out Thursday 06:16:00 burn burn-3 # OSU MPI Bandwidth Test v3.0 # Size Bandwidth (MB/s) 1 1.24 2 2.72 4 5.44 8 10.18 16 19.09 32 29.69 64 65.01 128 147.31 256 244.61 512 354.32 1024 367.91 2048 451.96 4096 550.66 8192 598.35 [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263 Fatal error in MPI_Waitall: Other MPI error, error stack: MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed (unknown)(): Other MPI error rank 1 in job 4 burn_37156 caused collective abort of all ranks exit status of rank 1: killed by signal 9 --------------------------------------------------------------- I get a similar problem with the latency test, however, the protocol that is complained about is different: -------------------------------------------------------------------- budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out Thursday 09:21:20 # OSU MPI Latency Test v3.0 # Size Latency (us) 0 3.93 1 4.07 2 4.06 4 3.82 8 3.98 16 4.03 32 4.00 64 4.28 128 5.22 256 5.88 512 8.65 1024 9.11 2048 11.53 4096 16.17 8192 25.67 [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263 Fatal error in MPI_Recv: Other MPI error, error stack: MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed (unknown)(): Other MPI error rank 1 in job 5 burn_37156 caused collective abort of all ranks -------------------------------------------------------------------- The protocols (0 and 8126589) are consistent if I run the program multiple times. Anyone have any ideas? If you need more info, please let me know. Thanks, Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080103/28e0f7e4/attachment.html From lexa at adam.botik.ru Fri Jan 4 09:03:22 2008 From: lexa at adam.botik.ru (Alexei I. Adamovich) Date: Fri Jan 4 09:03:48 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <392f95800801030724l209990f1g8609288378f70dac@mail.gmail.com> References: <1a7dd1d151.1d1511a7dd@osu.edu> <392f95800801030724l209990f1g8609288378f70dac@mail.gmail.com> Message-ID: <20080104140322.GA27951@adam.botik.ru> Eric, what is the version of glibc you are using? I've found the following message on Wolfram Gloger's malloc homepage (http://www.malloc.de/en/index.html): WG> ... WG> New ptmalloc2 release Jun 5th, 2006! WG> WG> Here you can download the current snapshot of ptmalloc2 (C source WG> code), the second version of ptmalloc based on Doug Lea's WG> malloc-2.7.x. This code has already been included in WG> glibc-2.3.x. In multi-thread Applications, ptmalloc2 is currently WG> slightly more memory-efficient than ptmalloc3. WG> WG> .. So, I guess, the usage of more fresh glibc could be a solution. Please, inform me if you have evaluated this possibility already. In case you have RPM-based Linux distribution, you could found your current glibc version using 'rpm -qa | grep -i libc' command. Lei, am I wrong? 
Is the ptmalloc2 being used only as a thread-safe version of malloc, or possibly there is a more sufficient reason for using just the ptmalloc2 source code supplied? Sincerely, Alexei I. Adamovich On Thu, Jan 03, 2008 at 09:24:08AM -0600, Eric A. Borisch wrote: > Lei, > > Thanks for the information. I would suggest that, if this can't be > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should > be removed from the default compile options for the versions where it > is (apparently) not fully supported. > > This is a very nasty bug. The MPI layer reports back no errors, but > the data isn't actually transferred successfully. In addition, it > presents as a timing / waiting error to the user, as all of the local > (shared mem) peers transfer data successfully, so significant time can > be spent chasing down a suspected user oversight for what is actually > an error within the MPI layer. > > This would apply to the MVAPICH and MVAPICH2, in both the vapi and > vapi_multirail makefiles. > > In addition, it should be documented that the LAZY_MEM_UNREGISTER > switch is NOT compatible with vapi-based channels. > > Thanks, > Eric > > On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > > Hi Eric, > > > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. > > > > Thanks, > > Lei > > > > > > > > ----- Original Message ----- > > From: "Eric A. Borisch" > > Date: Friday, December 21, 2007 10:23 am > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > > > I seem to be running into a memory registration issue. > > > > > > Observations: > > > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > > into a > > > local buffer on the root rank, I receive all of the data from any > > > ranks that are running on the same machine, but only part (or none at > > > all) of the data from ranks running on external machines. The transfer > > > length is above the eager/rendezvous threshold. > > > 2) Once the problem occurs, it is persistent. However, if I force > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > > and then re-transfer, the correct data is received. (Same memory being > > > transferred from / to.) > > > 3) I've only witnessed problems occurring above the 4G (as > > > returned by > > > malloc()) memory range. > > > 4) When I receive partial data from ranks, the data ends on a (4k) > > > page bound. Data past this bound (which should have been updated) is > > > unchanged during the transfer, yet both the sender and receiver report > > > no errors. (This is very bad!) > > > 5) Stepping through the code on both ends of the transfer shows the > > > software agreeing on the (correct) length and location as far down as > > > I can follow it. > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > > no issues. (Other than the expected performance hit.) > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > > 1.0 (vapi) > > > 8) The user code is also sending data out (from a different buffer) > > > over ethernet to a remote gui from the root node. 
> > > > > > I can't move to gen2 at this point -- we are using a vendor library > > > for interfacing to another system, and this library uses VAPI. > > > > > > uname -a output: > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > > > It appears (perhaps this is obvious) that the assumption that memory > > > registered (by the dreg.c code) remains registered until explicitly > > > unregistered (again, by the dreg.c code) is being violated in some > > > way. This, however, is wading in to uncharted (for me, at least) linux > > > memory management waters. The user code is doing nothing to fiddle > > > with registration in any explicit way. (With the exception of as > > > mentioned in (2)) > > > > > > Please let me know what other information I can provide to resolve > > > this. I'm still trying to put together a small test program to cause > > > the problem, but have been unsuccessful so far. > > > > > > Thanks, > > > Eric > > > -- > > > Eric A. Borisch > > > eborisch@ieee.org > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > -- > Eric A. Borisch > eborisch@ieee.org > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From panda at cse.ohio-state.edu Fri Jan 4 13:23:20 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri Jan 4 13:23:26 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <392f95800801030724l209990f1g8609288378f70dac@mail.gmail.com> Message-ID: Hi Eric, Thanks for your suggestions. We will make these changes to vapi and vapi_multirail devices and add the information to the user guides too. Thanks, DK > Lei, > > Thanks for the information. I would suggest that, if this can't be > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should > be removed from the default compile options for the versions where it > is (apparently) not fully supported. > > This is a very nasty bug. The MPI layer reports back no errors, but > the data isn't actually transferred successfully. In addition, it > presents as a timing / waiting error to the user, as all of the local > (shared mem) peers transfer data successfully, so significant time can > be spent chasing down a suspected user oversight for what is actually > an error within the MPI layer. > > This would apply to the MVAPICH and MVAPICH2, in both the vapi and > vapi_multirail makefiles. > > In addition, it should be documented that the LAZY_MEM_UNREGISTER > switch is NOT compatible with vapi-based channels. > > Thanks, > Eric > > On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > > Hi Eric, > > > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. > > > > Thanks, > > Lei > > > > > > > > ----- Original Message ----- > > From: "Eric A. 
Borisch" > > Date: Friday, December 21, 2007 10:23 am > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > > > I seem to be running into a memory registration issue. > > > > > > Observations: > > > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > > into a > > > local buffer on the root rank, I receive all of the data from any > > > ranks that are running on the same machine, but only part (or none at > > > all) of the data from ranks running on external machines. The transfer > > > length is above the eager/rendezvous threshold. > > > 2) Once the problem occurs, it is persistent. However, if I force > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > > and then re-transfer, the correct data is received. (Same memory being > > > transferred from / to.) > > > 3) I've only witnessed problems occurring above the 4G (as > > > returned by > > > malloc()) memory range. > > > 4) When I receive partial data from ranks, the data ends on a (4k) > > > page bound. Data past this bound (which should have been updated) is > > > unchanged during the transfer, yet both the sender and receiver report > > > no errors. (This is very bad!) > > > 5) Stepping through the code on both ends of the transfer shows the > > > software agreeing on the (correct) length and location as far down as > > > I can follow it. > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > > no issues. (Other than the expected performance hit.) > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > > 1.0 (vapi) > > > 8) The user code is also sending data out (from a different buffer) > > > over ethernet to a remote gui from the root node. > > > > > > I can't move to gen2 at this point -- we are using a vendor library > > > for interfacing to another system, and this library uses VAPI. > > > > > > uname -a output: > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > > > It appears (perhaps this is obvious) that the assumption that memory > > > registered (by the dreg.c code) remains registered until explicitly > > > unregistered (again, by the dreg.c code) is being violated in some > > > way. This, however, is wading in to uncharted (for me, at least) linux > > > memory management waters. The user code is doing nothing to fiddle > > > with registration in any explicit way. (With the exception of as > > > mentioned in (2)) > > > > > > Please let me know what other information I can provide to resolve > > > this. I'm still trying to put together a small test program to cause > > > the problem, but have been unsuccessful so far. > > > > > > Thanks, > > > Eric > > > -- > > > Eric A. Borisch > > > eborisch@ieee.org > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > -- > Eric A. 
Borisch > eborisch@ieee.org > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Fri Jan 4 14:03:07 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Fri Jan 4 14:03:12 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <20080104140322.GA27951@adam.botik.ru> Message-ID: Alexei, ptmalloc2 is being used in our case to provide enhanced performance and correctness. To speed up communications we cache registration of memory regions (a costly operation) that are being used for communication. To provide correct behavior we need to intercept malloc/free and friends so old registrations can be flushed (otherwise the virtual->physical mapping can change, leading to incorrect results). Matt On Fri, 4 Jan 2008, Alexei I. Adamovich wrote: > Eric, > > what is the version of glibc you are using? > > I've found the following message on Wolfram Gloger's malloc homepage > (http://www.malloc.de/en/index.html): > > WG> ... > WG> New ptmalloc2 release Jun 5th, 2006! > WG> > WG> Here you can download the current snapshot of ptmalloc2 (C source > WG> code), the second version of ptmalloc based on Doug Lea's > WG> malloc-2.7.x. This code has already been included in > WG> glibc-2.3.x. In multi-thread Applications, ptmalloc2 is currently > WG> slightly more memory-efficient than ptmalloc3. > WG> > WG> .. > > So, I guess, the usage of more fresh glibc could be a solution. > > Please, inform me if you have evaluated this possibility already. > > In case you have RPM-based Linux distribution, you could found > your current glibc version using > > 'rpm -qa | grep -i libc' > > command. > > > Lei, > > am I wrong? Is the ptmalloc2 being used only as a thread-safe version of malloc, > or possibly there is a more sufficient reason for using just the ptmalloc2 > source code supplied? > > Sincerely, > > Alexei I. Adamovich > > On Thu, Jan 03, 2008 at 09:24:08AM -0600, Eric A. Borisch wrote: > > Lei, > > > > Thanks for the information. I would suggest that, if this can't be > > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should > > be removed from the default compile options for the versions where it > > is (apparently) not fully supported. > > > > This is a very nasty bug. The MPI layer reports back no errors, but > > the data isn't actually transferred successfully. In addition, it > > presents as a timing / waiting error to the user, as all of the local > > (shared mem) peers transfer data successfully, so significant time can > > be spent chasing down a suspected user oversight for what is actually > > an error within the MPI layer. > > > > This would apply to the MVAPICH and MVAPICH2, in both the vapi and > > vapi_multirail makefiles. > > > > In addition, it should be documented that the LAZY_MEM_UNREGISTER > > switch is NOT compatible with vapi-based channels. > > > > Thanks, > > Eric > > > > On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > > > Hi Eric, > > > > > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. 
> > > > > > Thanks, > > > Lei > > > > > > > > > > > > ----- Original Message ----- > > > From: "Eric A. Borisch" > > > Date: Friday, December 21, 2007 10:23 am > > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > > > > > I seem to be running into a memory registration issue. > > > > > > > > Observations: > > > > > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > > > into a > > > > local buffer on the root rank, I receive all of the data from any > > > > ranks that are running on the same machine, but only part (or none at > > > > all) of the data from ranks running on external machines. The transfer > > > > length is above the eager/rendezvous threshold. > > > > 2) Once the problem occurs, it is persistent. However, if I force > > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > > > and then re-transfer, the correct data is received. (Same memory being > > > > transferred from / to.) > > > > 3) I've only witnessed problems occurring above the 4G (as > > > > returned by > > > > malloc()) memory range. > > > > 4) When I receive partial data from ranks, the data ends on a (4k) > > > > page bound. Data past this bound (which should have been updated) is > > > > unchanged during the transfer, yet both the sender and receiver report > > > > no errors. (This is very bad!) > > > > 5) Stepping through the code on both ends of the transfer shows the > > > > software agreeing on the (correct) length and location as far down as > > > > I can follow it. > > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > > > no issues. (Other than the expected performance hit.) > > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > > > 1.0 (vapi) > > > > 8) The user code is also sending data out (from a different buffer) > > > > over ethernet to a remote gui from the root node. > > > > > > > > I can't move to gen2 at this point -- we are using a vendor library > > > > for interfacing to another system, and this library uses VAPI. > > > > > > > > uname -a output: > > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > > > > > It appears (perhaps this is obvious) that the assumption that memory > > > > registered (by the dreg.c code) remains registered until explicitly > > > > unregistered (again, by the dreg.c code) is being violated in some > > > > way. This, however, is wading in to uncharted (for me, at least) linux > > > > memory management waters. The user code is doing nothing to fiddle > > > > with registration in any explicit way. (With the exception of as > > > > mentioned in (2)) > > > > > > > > Please let me know what other information I can provide to resolve > > > > this. I'm still trying to put together a small test program to cause > > > > the problem, but have been unsuccessful so far. > > > > > > > > Thanks, > > > > Eric > > > > -- > > > > Eric A. Borisch > > > > eborisch@ieee.org > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > > > > > > -- > > Eric A. 
Borisch > > eborisch@ieee.org > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From eborisch at ieee.org Fri Jan 4 14:30:13 2008 From: eborisch at ieee.org (Eric A. Borisch) Date: Fri Jan 4 14:30:20 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: References: <20080104140322.GA27951@adam.botik.ru> Message-ID: <392f95800801041130v593fb6bax7ea02d0796bf5818@mail.gmail.com> Matt, I'm curious if this is something that could be correctly handled in the vapi variants or if this is something that is fundamentally possible with gen2 but not vapi. In response to Alexi's question, I'm running glibc-2.3.4-2.25 Thanks, Eric On Jan 4, 2008 1:03 PM, Matthew Koop wrote: > Alexei, > > ptmalloc2 is being used in our case to provide enhanced performance and > correctness. To speed up communications we cache registration of memory > regions (a costly operation) that are being used for communication. To > provide correct behavior we need to intercept malloc/free and friends so > old registrations can be flushed (otherwise the virtual->physical mapping > can change, leading to incorrect results). > > Matt > > > On Fri, 4 Jan 2008, Alexei I. Adamovich wrote: > > > Eric, > > > > what is the version of glibc you are using? > > > > I've found the following message on Wolfram Gloger's malloc homepage > > (http://www.malloc.de/en/index.html): > > > > WG> ... > > WG> New ptmalloc2 release Jun 5th, 2006! > > WG> > > WG> Here you can download the current snapshot of ptmalloc2 (C source > > WG> code), the second version of ptmalloc based on Doug Lea's > > WG> malloc-2.7.x. This code has already been included in > > WG> glibc-2.3.x. In multi-thread Applications, ptmalloc2 is currently > > WG> slightly more memory-efficient than ptmalloc3. > > WG> > > WG> .. > > > > So, I guess, the usage of more fresh glibc could be a solution. > > > > Please, inform me if you have evaluated this possibility already. > > > > In case you have RPM-based Linux distribution, you could found > > your current glibc version using > > > > 'rpm -qa | grep -i libc' > > > > command. > > > > > > Lei, > > > > am I wrong? Is the ptmalloc2 being used only as a thread-safe version of malloc, > > or possibly there is a more sufficient reason for using just the ptmalloc2 > > source code supplied? > > > > Sincerely, > > > > Alexei I. Adamovich > > > > On Thu, Jan 03, 2008 at 09:24:08AM -0600, Eric A. Borisch wrote: > > > Lei, > > > > > > Thanks for the information. I would suggest that, if this can't be > > > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should > > > be removed from the default compile options for the versions where it > > > is (apparently) not fully supported. > > > > > > This is a very nasty bug. The MPI layer reports back no errors, but > > > the data isn't actually transferred successfully. In addition, it > > > presents as a timing / waiting error to the user, as all of the local > > > (shared mem) peers transfer data successfully, so significant time can > > > be spent chasing down a suspected user oversight for what is actually > > > an error within the MPI layer. 
> > > > > > This would apply to the MVAPICH and MVAPICH2, in both the vapi and > > > vapi_multirail makefiles. > > > > > > In addition, it should be documented that the LAZY_MEM_UNREGISTER > > > switch is NOT compatible with vapi-based channels. > > > > > > Thanks, > > > Eric > > > > > > On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > > > > Hi Eric, > > > > > > > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. > > > > > > > > Thanks, > > > > Lei > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Eric A. Borisch" > > > > Date: Friday, December 21, 2007 10:23 am > > > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > > > > > > > I seem to be running into a memory registration issue. > > > > > > > > > > Observations: > > > > > > > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > > > > into a > > > > > local buffer on the root rank, I receive all of the data from any > > > > > ranks that are running on the same machine, but only part (or none at > > > > > all) of the data from ranks running on external machines. The transfer > > > > > length is above the eager/rendezvous threshold. > > > > > 2) Once the problem occurs, it is persistent. However, if I force > > > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > > > > and then re-transfer, the correct data is received. (Same memory being > > > > > transferred from / to.) > > > > > 3) I've only witnessed problems occurring above the 4G (as > > > > > returned by > > > > > malloc()) memory range. > > > > > 4) When I receive partial data from ranks, the data ends on a (4k) > > > > > page bound. Data past this bound (which should have been updated) is > > > > > unchanged during the transfer, yet both the sender and receiver report > > > > > no errors. (This is very bad!) > > > > > 5) Stepping through the code on both ends of the transfer shows the > > > > > software agreeing on the (correct) length and location as far down as > > > > > I can follow it. > > > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > > > > no issues. (Other than the expected performance hit.) > > > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > > > > 1.0 (vapi) > > > > > 8) The user code is also sending data out (from a different buffer) > > > > > over ethernet to a remote gui from the root node. > > > > > > > > > > I can't move to gen2 at this point -- we are using a vendor library > > > > > for interfacing to another system, and this library uses VAPI. > > > > > > > > > > uname -a output: > > > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > > > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > > > > > > > It appears (perhaps this is obvious) that the assumption that memory > > > > > registered (by the dreg.c code) remains registered until explicitly > > > > > unregistered (again, by the dreg.c code) is being violated in some > > > > > way. This, however, is wading in to uncharted (for me, at least) linux > > > > > memory management waters. 
The user code is doing nothing to fiddle > > > > > with registration in any explicit way. (With the exception of as > > > > > mentioned in (2)) > > > > > > > > > > Please let me know what other information I can provide to resolve > > > > > this. I'm still trying to put together a small test program to cause > > > > > the problem, but have been unsuccessful so far. > > > > > > > > > > Thanks, > > > > > Eric > > > > > -- > > > > > Eric A. Borisch > > > > > eborisch@ieee.org > > > > > _______________________________________________ > > > > > mvapich-discuss mailing list > > > > > mvapich-discuss@cse.ohio-state.edu > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Eric A. Borisch > > > eborisch@ieee.org > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > -- Eric A. Borisch eborisch@ieee.org Howard Roark laughed. From ben.held at staarinc.com Fri Jan 4 15:59:24 2008 From: ben.held at staarinc.com (Ben Held) Date: Fri Jan 4 15:59:36 2008 Subject: [mvapich-discuss] Troubles building/installing OFEM 1.2 on Fedora Core 4 64-bit Message-ID: <009c01c84f14$b00a89c0$101f9d40$@held@staarinc.com> We are seeing a failure during the install process (out of rpmbuild) on a Fedora Core 4 64-bit system. The tail of the log is here: Hunk #1 succeeded at 456 (offset 156 lines). Hunk #2 succeeded at 569 (offset 75 lines). Hunk #3 succeeded at 672 (offset 157 lines). Hunk #4 succeeded at 1444 (offset 281 lines). Hunk #5 succeeded at 1340 (offset 157 lines). Hunk #6 succeeded at 1791 with fuzz 1 (offset 498 lines). /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/kernel_patches/backport/2.6.11_FC4/use r_mad_3935_to_2_6_11_FC4.patch patching file drivers/infiniband/core/user_mad.c patch: **** malformed patch at line 12: @@ -827,13 +952,13 @@ static int ib_umad_init_port(struct ib_d Failed to apply patch: /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/kernel_patches/backport/2.6.11_FC4/use r_mad_3935_to_2_6_11_FC4.patch error: Bad exit status from /var/tmp/rpm-tmp.88475 (%install) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.88475 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root /var/tmp/OFED' --defi ne 'configure_options --with-cxgb3-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with- user_access-mod --with-addr_trans-mod --with-rds-mod ' --define 'KVERSION 2.6.11-1.1369_FC4smp' --define 'KSRC /lib/modules/2.6.11-1.1369_FC4smp/b uild' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network-scripts' --define 'modprob e_update 1' --define 'include_ipoib_conf 1' /usr/etc/OFED-1.2-rc5/SRPMS/ofa_kernel-1.2-rc5.src.rpm" Any ideas? Regards, Ben Held Simulation Technology & Applied Research, Inc. 11520 N. 
Port Washington Rd., Suite 201 Mequon, WI 53092 P: 1.262.240.0291 x101 F: 1.262.240.0294 E: ben.held@staarinc.com http://www.staarinc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080104/28c0a6d1/attachment-0001.html From koop at cse.ohio-state.edu Fri Jan 4 17:27:16 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Fri Jan 4 17:27:26 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <392f95800801041130v593fb6bax7ea02d0796bf5818@mail.gmail.com> Message-ID: Eric, The problem is not an inherent issue with VAPI. Similar support could be ported to the VAPI device as well. Thus far, we have been including new features in the OpenFabrics/Gen2 device as vendors have mostly moved support to Gen2. Matt On Fri, 4 Jan 2008, Eric A. Borisch wrote: > Matt, > > I'm curious if this is something that could be correctly handled in > the vapi variants or if this is something that is fundamentally > possible with gen2 but not vapi. > > In response to Alexi's question, I'm running glibc-2.3.4-2.25 > > Thanks, > Eric > > On Jan 4, 2008 1:03 PM, Matthew Koop wrote: > > Alexei, > > > > ptmalloc2 is being used in our case to provide enhanced performance and > > correctness. To speed up communications we cache registration of memory > > regions (a costly operation) that are being used for communication. To > > provide correct behavior we need to intercept malloc/free and friends so > > old registrations can be flushed (otherwise the virtual->physical mapping > > can change, leading to incorrect results). > > > > Matt > > > > > > On Fri, 4 Jan 2008, Alexei I. Adamovich wrote: > > > > > Eric, > > > > > > what is the version of glibc you are using? > > > > > > I've found the following message on Wolfram Gloger's malloc homepage > > > (http://www.malloc.de/en/index.html): > > > > > > WG> ... > > > WG> New ptmalloc2 release Jun 5th, 2006! > > > WG> > > > WG> Here you can download the current snapshot of ptmalloc2 (C source > > > WG> code), the second version of ptmalloc based on Doug Lea's > > > WG> malloc-2.7.x. This code has already been included in > > > WG> glibc-2.3.x. In multi-thread Applications, ptmalloc2 is currently > > > WG> slightly more memory-efficient than ptmalloc3. > > > WG> > > > WG> .. > > > > > > So, I guess, the usage of more fresh glibc could be a solution. > > > > > > Please, inform me if you have evaluated this possibility already. > > > > > > In case you have RPM-based Linux distribution, you could found > > > your current glibc version using > > > > > > 'rpm -qa | grep -i libc' > > > > > > command. > > > > > > > > > Lei, > > > > > > am I wrong? Is the ptmalloc2 being used only as a thread-safe version of malloc, > > > or possibly there is a more sufficient reason for using just the ptmalloc2 > > > source code supplied? > > > > > > Sincerely, > > > > > > Alexei I. Adamovich > > > > > > On Thu, Jan 03, 2008 at 09:24:08AM -0600, Eric A. Borisch wrote: > > > > Lei, > > > > > > > > Thanks for the information. I would suggest that, if this can't be > > > > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should > > > > be removed from the default compile options for the versions where it > > > > is (apparently) not fully supported. > > > > > > > > This is a very nasty bug. The MPI layer reports back no errors, but > > > > the data isn't actually transferred successfully. 
In addition, it > > > > presents as a timing / waiting error to the user, as all of the local > > > > (shared mem) peers transfer data successfully, so significant time can > > > > be spent chasing down a suspected user oversight for what is actually > > > > an error within the MPI layer. > > > > > > > > This would apply to the MVAPICH and MVAPICH2, in both the vapi and > > > > vapi_multirail makefiles. > > > > > > > > In addition, it should be documented that the LAZY_MEM_UNREGISTER > > > > switch is NOT compatible with vapi-based channels. > > > > > > > > Thanks, > > > > Eric > > > > > > > > On Dec 21, 2007 5:29 PM, LEI CHAI wrote: > > > > > Hi Eric, > > > > > > > > > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests. > > > > > > > > > > Thanks, > > > > > Lei > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Eric A. Borisch" > > > > > Date: Friday, December 21, 2007 10:23 am > > > > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 > > > > > > > > > > > I seem to be running into a memory registration issue. > > > > > > > > > > > > Observations: > > > > > > > > > > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) > > > > > > into a > > > > > > local buffer on the root rank, I receive all of the data from any > > > > > > ranks that are running on the same machine, but only part (or none at > > > > > > all) of the data from ranks running on external machines. The transfer > > > > > > length is above the eager/rendezvous threshold. > > > > > > 2) Once the problem occurs, it is persistent. However, if I force > > > > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point > > > > > > and then re-transfer, the correct data is received. (Same memory being > > > > > > transferred from / to.) > > > > > > 3) I've only witnessed problems occurring above the 4G (as > > > > > > returned by > > > > > > malloc()) memory range. > > > > > > 4) When I receive partial data from ranks, the data ends on a (4k) > > > > > > page bound. Data past this bound (which should have been updated) is > > > > > > unchanged during the transfer, yet both the sender and receiver report > > > > > > no errors. (This is very bad!) > > > > > > 5) Stepping through the code on both ends of the transfer shows the > > > > > > software agreeing on the (correct) length and location as far down as > > > > > > I can follow it. > > > > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows > > > > > > no issues. (Other than the expected performance hit.) > > > > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2- > > > > > > 1.0 (vapi) > > > > > > 8) The user code is also sending data out (from a different buffer) > > > > > > over ethernet to a remote gui from the root node. > > > > > > > > > > > > I can't move to gen2 at this point -- we are using a vendor library > > > > > > for interfacing to another system, and this library uses VAPI. 
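Eric mentions he has not yet managed to reduce this to a small test program. A minimal sketch of the communication pattern he describes, large non-blocking transfers gathered on the root rank with MPI_Waitall() and then verified byte by byte, might look like the following. This is a hypothetical reproduction attempt rather than his actual code, and the 1 MiB message size is simply an assumption chosen to sit well above the eager/rendezvous threshold.

----------------------------------------------------------------------
/* Hypothetical test pattern for the symptom described above: rank 0
 * posts non-blocking receives from every other rank, the other ranks
 * send large buffers, and rank 0 verifies the payload after
 * MPI_Waitall().  With a broken registration cache the Waitall can
 * complete "successfully" while part of the received data is stale.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (1 << 20)   /* assumed to be above the rendezvous threshold */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        char *buf = malloc((size_t)(size - 1) * NBYTES);
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));

        for (int src = 1; src < size; src++)
            MPI_Irecv(buf + (size_t)(src - 1) * NBYTES, NBYTES, MPI_CHAR,
                      src, 0, MPI_COMM_WORLD, &reqs[src - 1]);
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);

        /* Verify every byte actually arrived; stale pages show up here
         * even though MPI reported no error. */
        for (int src = 1; src < size; src++)
            for (size_t i = 0; i < NBYTES; i++)
                if (buf[(size_t)(src - 1) * NBYTES + i] != (char)(src & 0x7f)) {
                    printf("rank 0: bad byte %zu from rank %d\n", i, src);
                    break;
                }
        free(buf);
        free(reqs);
    } else {
        char *buf = malloc(NBYTES);
        MPI_Request req;
        for (size_t i = 0; i < NBYTES; i++)
            buf[i] = (char)(rank & 0x7f);
        MPI_Isend(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

The interesting case is presumably running this in a loop while the surrounding application frees and reallocates the buffers between iterations, since a stale cached registration only matters once the same virtual address is reused for different physical pages.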
> > > > > > > > > > > > uname -a output: > > > > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST > > > > > > 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > > > > > Intel SE7520JR2 motherboards. 4G physical ram on each node. > > > > > > > > > > > > It appears (perhaps this is obvious) that the assumption that memory > > > > > > registered (by the dreg.c code) remains registered until explicitly > > > > > > unregistered (again, by the dreg.c code) is being violated in some > > > > > > way. This, however, is wading in to uncharted (for me, at least) linux > > > > > > memory management waters. The user code is doing nothing to fiddle > > > > > > with registration in any explicit way. (With the exception of as > > > > > > mentioned in (2)) > > > > > > > > > > > > Please let me know what other information I can provide to resolve > > > > > > this. I'm still trying to put together a small test program to cause > > > > > > the problem, but have been unsuccessful so far. > > > > > > > > > > > > Thanks, > > > > > > Eric > > > > > > -- > > > > > > Eric A. Borisch > > > > > > eborisch@ieee.org > > > > > > _______________________________________________ > > > > > > mvapich-discuss mailing list > > > > > > mvapich-discuss@cse.ohio-state.edu > > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Eric A. Borisch > > > > eborisch@ieee.org > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > -- > Eric A. Borisch > eborisch@ieee.org > > Howard Roark laughed. > From panda at cse.ohio-state.edu Fri Jan 4 17:31:53 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri Jan 4 17:31:58 2008 Subject: [mvapich-discuss] Troubles building/installing OFEM 1.2 on Fedora Core 4 64-bit In-Reply-To: <009c01c84f14$b00a89c0$101f9d40$@held@staarinc.com> Message-ID: Ben - Sorry to know that you are experiencing problems in building/installing OFED 1.2 on Fedora Core 4 64 bit system. FYI, the latest released version of OFED 1.2 is OFED 1.2.5.4. Regarding your rpm build errors, I am forwarding your note to `ewg' and `general' lists of Open Fabrics. More experienced users on these two lists can give you prompt feedbacks and guidance on the basic OFED installation issues. Thanks, DK On Fri, 4 Jan 2008, Ben Held wrote: > We are seeing a failure during the install process (out of rpmbuild) on a > Fedora Core 4 64-bit system. The tail of the log is here: > > > > Hunk #1 succeeded at 456 (offset 156 lines). > > Hunk #2 succeeded at 569 (offset 75 lines). > > Hunk #3 succeeded at 672 (offset 157 lines). > > Hunk #4 succeeded at 1444 (offset 281 lines). > > Hunk #5 succeeded at 1340 (offset 157 lines). > > Hunk #6 succeeded at 1791 with fuzz 1 (offset 498 lines). 
> > > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/kernel_patches/backport/2.6.11_FC4/use > r_mad_3935_to_2_6_11_FC4.patch > > patching file drivers/infiniband/core/user_mad.c > > patch: **** malformed patch at line 12: @@ -827,13 +952,13 @@ static int > ib_umad_init_port(struct ib_d > > > > Failed to apply patch: > /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/kernel_patches/backport/2.6.11_FC4/use > r_mad_3935_to_2_6_11_FC4.patch > > error: Bad exit status from /var/tmp/rpm-tmp.88475 (%install) > > > > > > RPM build errors: > > user vlad does not exist - using root > > group vlad does not exist - using root > > user vlad does not exist - using root > > group vlad does not exist - using root > > Bad exit status from /var/tmp/rpm-tmp.88475 (%install) > > ERROR: Failed executing "rpmbuild --rebuild --define '_topdir > /var/tmp/OFEDRPM' --define '_prefix /usr' --define 'build_root > /var/tmp/OFED' --defi > > ne 'configure_options --with-cxgb3-mod --with-ipoib-mod --with-mthca-mod > --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with- > > user_access-mod --with-addr_trans-mod --with-rds-mod ' --define 'KVERSION > 2.6.11-1.1369_FC4smp' --define 'KSRC /lib/modules/2.6.11-1.1369_FC4smp/b > > uild' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' > --define 'NETWORK_CONF_DIR /etc/sysconfig/network-scripts' --define 'modprob > > e_update 1' --define 'include_ipoib_conf 1' > /usr/etc/OFED-1.2-rc5/SRPMS/ofa_kernel-1.2-rc5.src.rpm" > > > > > > Any ideas? > > > > Regards, > > > > Ben Held > Simulation Technology & Applied Research, Inc. > 11520 N. Port Washington Rd., Suite 201 > Mequon, WI 53092 > P: 1.262.240.0291 x101 > F: 1.262.240.0294 > E: ben.held@staarinc.com > http://www.staarinc.com > > > > > > From jsquyres at cisco.com Fri Jan 4 17:48:41 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Fri Jan 4 17:49:00 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: References: Message-ID: <247F99BD-D5E2-415E-9078-55FF4071756A@cisco.com> On Jan 4, 2008, at 5:27 PM, Matthew Koop wrote: > The problem is not an inherent issue with VAPI. Similar support > could be > ported to the VAPI device as well. Thus far, we have been including > new > features in the OpenFabrics/Gen2 device as vendors have mostly moved > support to Gen2. I'll second this: Cisco is doing all of its new HPC IB development with the OpenFabrics stack (and has been over over a year). Open MPI has dropped VAPI support in its upcoming v1.3 release. We encourage all of our HPC customers to upgrade from VAPI-based stacks to OFED if possible. -- Jeff Squyres Cisco Systems From eborisch at ieee.org Fri Jan 4 17:53:13 2008 From: eborisch at ieee.org (Eric A. Borisch) Date: Fri Jan 4 17:53:21 2008 Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2 In-Reply-To: <247F99BD-D5E2-415E-9078-55FF4071756A@cisco.com> References: <247F99BD-D5E2-415E-9078-55FF4071756A@cisco.com> Message-ID: <392f95800801041453n4d73c5bbxa6e21d02d0a3829b@mail.gmail.com> I plan to move over and will as soon as the vendor we share an interface with (also via Infiniband, but not MPI) moves over. (Out of the scope of this discussion. :) For now, it sounds like I'll be turning off this (LAZY_MEM_UNREGISTER) option in the vapi code and running in that fashion. Thanks for the help, Eric On Jan 4, 2008 4:48 PM, Jeff Squyres wrote: > On Jan 4, 2008, at 5:27 PM, Matthew Koop wrote: > > > The problem is not an inherent issue with VAPI. 
Similar support > > could be > > ported to the VAPI device as well. Thus far, we have been including > > new > > features in the OpenFabrics/Gen2 device as vendors have mostly moved > > support to Gen2. > > I'll second this: Cisco is doing all of its new HPC IB development > with the OpenFabrics stack (and has been over over a year). Open MPI > has dropped VAPI support in its upcoming v1.3 release. > > We encourage all of our HPC customers to upgrade from VAPI-based > stacks to OFED if possible. > > -- > Jeff Squyres > Cisco Systems > > -- Eric A. Borisch eborisch@ieee.org From brian.budge at gmail.com Fri Jan 4 18:04:33 2008 From: brian.budge at gmail.com (Brian Budge) Date: Fri Jan 4 18:04:42 2008 Subject: [mvapich-discuss] Re: unrecognized protocol for send/recv over 8KB In-Reply-To: <5b7094580801031746h57c0e7f0i8f80b45e6f6918e7@mail.gmail.com> References: <5b7094580801031746h57c0e7f0i8f80b45e6f6918e7@mail.gmail.com> Message-ID: <5b7094580801041504h392f7889vbe4712bfa8a71d46@mail.gmail.com> Hi again - I noticed this in the benchmark code: int large_message_size = 8192; Does MVAPICH internally treat messages over 8192 bytes differently than those under 8 KB? Could this be something wrong with how I've configured infiniband? I had a program running OpenMPI already over IB on the system, but maybe I need to configure something special for MVAPICH? Sorry if I appear to be grasping at straws... but I am ;) Thanks, Brian On Jan 3, 2008 5:46 PM, Brian Budge wrote: > Hi all - > > I'm new to the list here... hi! I have been using OpenMPI for a while, > and LAM before that, but new requirements keep pushing me to new > implementations. In particular, I was interested in using infiniband (using > OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is > the library for that particular combination :) > > In any case, I installed MVAPICH, and I can boot the daemons, and run the > ring speed test with no problems. When I run any programs with mpirun, > however, I get an error when sending or receiving more than 8192 bytes. 
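The 8 KB boundary Brian notices is where the errors start mentioning "rndv req to send" and MPIDI_CH3_RndvSend, i.e. where the transfer leaves the eager path and takes the rendezvous (large-message) path; the large_message_size constant in the benchmark marks the same switch-over. A tiny probe that exchanges one message just under and one just over 8 KiB can confirm that it is the protocol switch, rather than anything about the payload, that matters. This is a hedged diagnostic sketch, not part of the OSU benchmarks.

----------------------------------------------------------------------
/* Minimal probe for the 8 KiB boundary: ping-pong one message just
 * below and one just above the typical eager/rendezvous switch-over.
 * If only the second transfer fails, the problem is in the rendezvous
 * (large-message) path rather than in the payload itself.
 * Run with two ranks, e.g.:  mpirun -np 2 ./a.out
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void exchange(int rank, int bytes)
{
    char *buf = calloc(bytes, 1);
    if (rank == 0) {
        MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d bytes: ok\n", bytes);
    } else if (rank == 1) {
        MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    free(buf);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    exchange(rank, 8192 - 64);   /* should stay on the eager path */
    exchange(rank, 8192 + 64);   /* should take the rendezvous path */

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

Compiled with the same mpicc and launched with mpirun -np 2 just like osu_bw, a clean 8128-byte exchange followed by an abort on the 8256-byte one would point squarely at the rendezvous path of the build or transport configuration.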
> > For example, if I run the bandwidth test from the benchmarks page > (osu_bw.c), I get the following: > --------------------------------------------------------------- > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > Thursday 06:16:00 > burn > burn-3 > # OSU MPI Bandwidth Test v3.0 > # Size Bandwidth (MB/s) > 1 1.24 > 2 2.72 > 4 5.44 > 8 10.18 > 16 19.09 > 32 29.69 > 64 65.01 > 128 147.31 > 256 244.61 > 512 354.32 > 1024 367.91 > 2048 451.96 > 4096 550.66 > 8192 598.35 > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send > Internal Error: invalid error code ffffffff (Ring Index out of range) in > MPIDI_CH3_RndvSend:263 > Fatal error in MPI_Waitall: > Other MPI error, error stack: > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > status_array=0xdb3140) failed > (unknown)(): Other MPI error > rank 1 in job 4 burn_37156 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > --------------------------------------------------------------- > > I get a similar problem with the latency test, however, the protocol that > is complained about is different: > -------------------------------------------------------------------- > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > Thursday 09:21:20 > # OSU MPI Latency Test v3.0 > # Size Latency (us) > 0 3.93 > 1 4.07 > 2 4.06 > 4 3.82 > 8 3.98 > 16 4.03 > 32 4.00 > 64 4.28 > 128 5.22 > 256 5.88 > 512 8.65 > 1024 9.11 > 2048 11.53 > 4096 16.17 > 8192 25.67 > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to > send > Internal Error: invalid error code ffffffff (Ring Index out of range) in > MPIDI_CH3_RndvSend:263 > Fatal error in MPI_Recv: > Other MPI error, error stack: > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > (unknown)(): Other MPI error > rank 1 in job 5 burn_37156 caused collective abort of all ranks > -------------------------------------------------------------------- > > The protocols (0 and 8126589) are consistent if I run the program multiple > times. > > Anyone have any ideas? If you need more info, please let me know. > > Thanks, > Brian > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080104/fb86b07d/attachment-0001.html From huanwei at cse.ohio-state.edu Fri Jan 4 21:12:46 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Fri Jan 4 21:12:52 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: Message-ID: Hi Brian, Thanks for letting us know this problem. Would you please let us know some more details to help us locate the issue. 1) More details on your platform. 2) Exact version of mvapich2 you are using. Is it from OFED package? or some version from our website. 3) If it is from our website, did you change anything from the default compiling scripts? Thanks. -- Wei > I'm new to the list here... hi! I have been using OpenMPI for a while, and > LAM before that, but new requirements keep pushing me to new > implementations. In particular, I was interested in using infiniband (using > OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is the > library for that particular combination :) > > In any case, I installed MVAPICH, and I can boot the daemons, and run the > ring speed test with no problems. 
When I run any programs with mpirun, > however, I get an error when sending or receiving more than 8192 bytes. > > For example, if I run the bandwidth test from the benchmarks page > (osu_bw.c), I get the following: > --------------------------------------------------------------- > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > Thursday 06:16:00 > burn > burn-3 > # OSU MPI Bandwidth Test v3.0 > # Size Bandwidth (MB/s) > 1 1.24 > 2 2.72 > 4 5.44 > 8 10.18 > 16 19.09 > 32 29.69 > 64 65.01 > 128 147.31 > 256 244.61 > 512 354.32 > 1024 367.91 > 2048 451.96 > 4096 550.66 > 8192 598.35 > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send > Internal Error: invalid error code ffffffff (Ring Index out of range) in > MPIDI_CH3_RndvSend:263 > Fatal error in MPI_Waitall: > Other MPI error, error stack: > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > status_array=0xdb3140) failed > (unknown)(): Other MPI error > rank 1 in job 4 burn_37156 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > --------------------------------------------------------------- > > I get a similar problem with the latency test, however, the protocol that is > complained about is different: > -------------------------------------------------------------------- > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > Thursday 09:21:20 > # OSU MPI Latency Test v3.0 > # Size Latency (us) > 0 3.93 > 1 4.07 > 2 4.06 > 4 3.82 > 8 3.98 > 16 4.03 > 32 4.00 > 64 4.28 > 128 5.22 > 256 5.88 > 512 8.65 > 1024 9.11 > 2048 11.53 > 4096 16.17 > 8192 25.67 > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to > send > Internal Error: invalid error code ffffffff (Ring Index out of range) in > MPIDI_CH3_RndvSend:263 > Fatal error in MPI_Recv: > Other MPI error, error stack: > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > (unknown)(): Other MPI error > rank 1 in job 5 burn_37156 caused collective abort of all ranks > -------------------------------------------------------------------- > > The protocols (0 and 8126589) are consistent if I run the program multiple > times. > > Anyone have any ideas? If you need more info, please let me know. > > Thanks, > Brian > From brian.budge at gmail.com Fri Jan 4 21:23:58 2008 From: brian.budge at gmail.com (Brian Budge) Date: Fri Jan 4 21:24:07 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: References: Message-ID: <5b7094580801041823y10e4a565x36833e66431b94e1@mail.gmail.com> Hi Wei - I am running gentoo linux on amd64, 2 or 4 opteron 8216 per node. Kernel is 2.6.23-gentoo-r4 SMP. I have infiniband built into the kernel: CONFIG_INFINIBAND=y CONFIG_INFINIBAND_USER_MAD=y CONFIG_INFINIBAND_USER_ACCESS=y CONFIG_INFINIBAND_USER_MEM=y CONFIG_INFINIBAND_ADDR_TRANS=y CONFIG_INFINIBAND_MTHCA=y CONFIG_INFINIBAND_MTHCA_DEBUG=y CONFIG_INFINIBAND_AMSO1100=y CONFIG_MLX4_INFINIBAND=y CONFIG_INFINIBAND_IPOIB=y CONFIG_INFINIBAND_IPOIB_DEBUG=y I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay addition to the standard gentoo packages. I have also tried 1.0 with the same results. I compiled with multithreading turned on (haven't tried without this, but the sample codes I am initially testing are not multithreaded, although my application is). I also tried with or without rdma with no change. The script seems to be setting the build for SMALL_CLUSTER. 
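One separate sanity check, given that the library was built with multithreading turned on for an application that is itself multi-threaded: confirm at run time that the build actually grants MPI_THREAD_MULTIPLE instead of silently downgrading the thread level. The check below is generic MPI, not anything MVAPICH2-specific.

----------------------------------------------------------------------
/* Report which thread level the MPI library actually provides.  An
 * application whose threads call MPI concurrently needs 'provided'
 * to come back as MPI_THREAD_MULTIPLE.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("requested MPI_THREAD_MULTIPLE (%d), provided %d\n",
               MPI_THREAD_MULTIPLE, provided);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

If provided comes back lower than MPI_THREAD_MULTIPLE, concurrent MPI calls from multiple threads are not guaranteed to work even once the transport problem above is resolved.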
Let me know what other information would be useful. Thanks, Brian On Jan 4, 2008 6:12 PM, wei huang wrote: > Hi Brian, > > Thanks for letting us know this problem. Would you please let us know some > more details to help us locate the issue. > > 1) More details on your platform. > > 2) Exact version of mvapich2 you are using. Is it from OFED package? or > some version from our website. > > 3) If it is from our website, did you change anything from the default > compiling scripts? > > Thanks. > > -- Wei > > I'm new to the list here... hi! I have been using OpenMPI for a while, > and > > LAM before that, but new requirements keep pushing me to new > > implementations. In particular, I was interested in using infiniband > (using > > OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is > the > > library for that particular combination :) > > > > In any case, I installed MVAPICH, and I can boot the daemons, and run > the > > ring speed test with no problems. When I run any programs with mpirun, > > however, I get an error when sending or receiving more than 8192 bytes. > > > > For example, if I run the bandwidth test from the benchmarks page > > (osu_bw.c), I get the following: > > --------------------------------------------------------------- > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > Thursday 06:16:00 > > burn > > burn-3 > > # OSU MPI Bandwidth Test v3.0 > > # Size Bandwidth (MB/s) > > 1 1.24 > > 2 2.72 > > 4 5.44 > > 8 10.18 > > 16 19.09 > > 32 29.69 > > 64 65.01 > > 128 147.31 > > 256 244.61 > > 512 354.32 > > 1024 367.91 > > 2048 451.96 > > 4096 550.66 > > 8192 598.35 > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to > send > > Internal Error: invalid error code ffffffff (Ring Index out of range) in > > MPIDI_CH3_RndvSend:263 > > Fatal error in MPI_Waitall: > > Other MPI error, error stack: > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > status_array=0xdb3140) failed > > (unknown)(): Other MPI error > > rank 1 in job 4 burn_37156 caused collective abort of all ranks > > exit status of rank 1: killed by signal 9 > > --------------------------------------------------------------- > > > > I get a similar problem with the latency test, however, the protocol > that is > > complained about is different: > > -------------------------------------------------------------------- > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > Thursday 09:21:20 > > # OSU MPI Latency Test v3.0 > > # Size Latency (us) > > 0 3.93 > > 1 4.07 > > 2 4.06 > > 4 3.82 > > 8 3.98 > > 16 4.03 > > 32 4.00 > > 64 4.28 > > 128 5.22 > > 256 5.88 > > 512 8.65 > > 1024 9.11 > > 2048 11.53 > > 4096 16.17 > > 8192 25.67 > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req > to > > send > > Internal Error: invalid error code ffffffff (Ring Index out of range) in > > MPIDI_CH3_RndvSend:263 > > Fatal error in MPI_Recv: > > Other MPI error, error stack: > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, > tag=1, > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > (unknown)(): Other MPI error > > rank 1 in job 5 burn_37156 caused collective abort of all ranks > > -------------------------------------------------------------------- > > > > The protocols (0 and 8126589) are consistent if I run the program > multiple > > times. > > > > Anyone have any ideas? If you need more info, please let me know. 
> > > > Thanks, > > Brian > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080104/86c75792/attachment.html From tziporet at dev.mellanox.co.il Sun Jan 6 06:34:07 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun Jan 6 07:11:00 2008 Subject: [ewg] Re: [mvapich-discuss] Troubles building/installing OFEM 1.2 on Fedora Core 4 64-bit In-Reply-To: References: Message-ID: <4780BCAF.2020806@mellanox.co.il> Dhabaleswar Panda wrote: > Ben - Sorry to know that you are experiencing problems in > building/installing OFED 1.2 on Fedora Core 4 64 bit system. > > FYI, the latest released version of OFED 1.2 is OFED 1.2.5.4. > > Regarding your rpm build errors, I am forwarding your note to `ewg' and > `general' lists of Open Fabrics. More experienced users on these two > lists can give you prompt feedbacks and guidance on the basic OFED > installation issues. > > We do not support Fedora Core 4 with OFED 1.2 and 1.2.5 I suggest you move to Fedora Core 6 Tziporet From huanwei at cse.ohio-state.edu Sun Jan 6 09:38:20 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Sun Jan 6 09:38:25 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: <5b7094580801041823y10e4a565x36833e66431b94e1@mail.gmail.com> Message-ID: Hi Brian, > I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay > addition to the standard gentoo packages. I have also tried 1.0 with the > same results. > > I compiled with multithreading turned on (haven't tried without this, but > the sample codes I am initially testing are not multithreaded, although my > application is). I also tried with or without rdma with no change. The > script seems to be setting the build for SMALL_CLUSTER. So you are using make.mvapich2.ofa to compile the package? I am a bit confused about ''I also tried with or without rdma with no change''. What exact change you made here? Also, SMALL_CLUSTER is obsolete for ofa stack... -- Wei > > Let me know what other information would be useful. > > Thanks, > Brian > > > > On Jan 4, 2008 6:12 PM, wei huang wrote: > > > Hi Brian, > > > > Thanks for letting us know this problem. Would you please let us know some > > more details to help us locate the issue. > > > > 1) More details on your platform. > > > > 2) Exact version of mvapich2 you are using. Is it from OFED package? or > > some version from our website. > > > > 3) If it is from our website, did you change anything from the default > > compiling scripts? > > > > Thanks. > > > > -- Wei > > > I'm new to the list here... hi! I have been using OpenMPI for a while, > > and > > > LAM before that, but new requirements keep pushing me to new > > > implementations. In particular, I was interested in using infiniband > > (using > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is > > the > > > library for that particular combination :) > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, and run > > the > > > ring speed test with no problems. When I run any programs with mpirun, > > > however, I get an error when sending or receiving more than 8192 bytes. 
> > > > > > For example, if I run the bandwidth test from the benchmarks page > > > (osu_bw.c), I get the following: > > > --------------------------------------------------------------- > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > Thursday 06:16:00 > > > burn > > > burn-3 > > > # OSU MPI Bandwidth Test v3.0 > > > # Size Bandwidth (MB/s) > > > 1 1.24 > > > 2 2.72 > > > 4 5.44 > > > 8 10.18 > > > 16 19.09 > > > 32 29.69 > > > 64 65.01 > > > 128 147.31 > > > 256 244.61 > > > 512 354.32 > > > 1024 367.91 > > > 2048 451.96 > > > 4096 550.66 > > > 8192 598.35 > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to > > send > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in > > > MPIDI_CH3_RndvSend:263 > > > Fatal error in MPI_Waitall: > > > Other MPI error, error stack: > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > status_array=0xdb3140) failed > > > (unknown)(): Other MPI error > > > rank 1 in job 4 burn_37156 caused collective abort of all ranks > > > exit status of rank 1: killed by signal 9 > > > --------------------------------------------------------------- > > > > > > I get a similar problem with the latency test, however, the protocol > > that is > > > complained about is different: > > > -------------------------------------------------------------------- > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > Thursday 09:21:20 > > > # OSU MPI Latency Test v3.0 > > > # Size Latency (us) > > > 0 3.93 > > > 1 4.07 > > > 2 4.06 > > > 4 3.82 > > > 8 3.98 > > > 16 4.03 > > > 32 4.00 > > > 64 4.28 > > > 128 5.22 > > > 256 5.88 > > > 512 8.65 > > > 1024 9.11 > > > 2048 11.53 > > > 4096 16.17 > > > 8192 25.67 > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req > > to > > > send > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in > > > MPIDI_CH3_RndvSend:263 > > > Fatal error in MPI_Recv: > > > Other MPI error, error stack: > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, > > tag=1, > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > (unknown)(): Other MPI error > > > rank 1 in job 5 burn_37156 caused collective abort of all ranks > > > -------------------------------------------------------------------- > > > > > > The protocols (0 and 8126589) are consistent if I run the program > > multiple > > > times. > > > > > > Anyone have any ideas? If you need more info, please let me know. > > > > > > Thanks, > > > Brian > > > > > > > > From nilesh_awate at yahoo.com Mon Jan 7 01:15:26 2008 From: nilesh_awate at yahoo.com (nilesh awate) Date: Mon Jan 7 01:15:36 2008 Subject: [mvapich-discuss] protocol used for MPI_FInaize in mvapich2 Message-ID: <80583.69286.qm@web94115.mail.in2.yahoo.com> Hi all, I'm using mvapich2-1.0.1 with OFED1.2(udapl stack) To know the flow of MPI_FInalize i put some debug statement in source code & tried simple mpi test code (only init & finalize api) I observed there is shutting down/closing protocol (in which every process does 2dto) some body plz tell how these dto (function trace of MPI_Finalize) happen what is exact protocol is mvapich follows. thanking, Nilesh 5, 50, 500, 5000 - Store N number of mails in your inbox. Go to http://help.yahoo.com/l/in/yahoo/mail/yahoomail/tools/tools-08.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080107/0c655951/attachment.html From methier at CGR.Harvard.edu Mon Jan 7 10:49:14 2008 From: methier at CGR.Harvard.edu (Michael Ethier) Date: Mon Jan 7 10:49:22 2008 Subject: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED Message-ID: Hello, I am new to this forum and hoping someone can help solve the following problem for me. We have a modeling application that initializes and runs fine using an ordinary Ethernet connection. When we compile using the Infiniband software package (mvapich-0.9.9) and run, the application fails with the following at then end: [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error IBV_WC_LOC_LEN_ERR, code=1, dest rank=1 at line 388 in file viacheck.c [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line 2552 in file viacheck.c mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 : moorcroft8 ] forrtl: error (78): process killed (SIGTERM) forrtl: error (78): process killed (SIGTERM) done. This occurs at the initialization phase it seems when communication starts between different nodes. If I set the hostfile to contain the same node so that all the cpus used are on 1 node, it initializes fine and runs. We are using Redhat Enterprise 4 Update 5 on x86_64 uname -a Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux In addition we are using mvapich-0.9.9 for our Infiniband software package, and Intel 9.1: [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version icc (ICC) 9.1 20070510 Copyright (C) 1985-2007 Intel Corporation. All rights reserved. [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version ifort (IFORT) 9.1 20070510 Copyright (C) 1985-2007 Intel Corporation. All rights reserved. We are using the rsh communication protocol for this: /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........ Can anyone suggest how this problem can be solved ? Thank You in advance, Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080107/df28a63c/attachment.html From koop at cse.ohio-state.edu Mon Jan 7 12:26:01 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Jan 7 12:26:08 2008 Subject: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED In-Reply-To: Message-ID: Michael, Do other more simple benchmarks work (e.g. osu_benchmarks/osu_bw)? If they do, this is something we'd like to take a closer look at. I'd be interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue: e.g. mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec Matt On Mon, 7 Jan 2008, Michael Ethier wrote: > Hello, > > > > I am new to this forum and hoping someone can help solve the following > problem for me. > > > > We have a modeling application that initializes and runs fine using an > ordinary Ethernet connection. 
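Matt's first question is whether a simpler benchmark runs across the two nodes at all. If the installed osu_benchmarks are not convenient to use, an equivalent bare-bones check is a few lines of MPI that report where each rank landed and then perform one small collective; this is a generic sketch, not one of the OSU tests.

----------------------------------------------------------------------
/* Bare-bones cross-node check for the failure described above: each
 * rank prints the host it landed on, then all ranks perform one small
 * collective.  If this already aborts when the hostfile spans two
 * machines, the problem is in basic inter-node communication rather
 * than in the application.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, size, host);

    int sum = 0;
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d (expected %d)\n", sum, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

Launching it with the same command line used for the application, e.g. mpirun_rsh -rsh -np 3 -hostfile ./hostfile ./a.out, keeps the process placement identical, which matters here because the failure only appears when the ranks span nodes.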
> > > > When we compile using the Infiniband software package (mvapich-0.9.9) > and run, the application fails with the following > > at then end: > > > > [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error > IBV_WC_LOC_LEN_ERR, code=1, dest rank=1 > > at line 388 in file viacheck.c > > [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, > code=16 > > at line 2552 in file viacheck.c > > mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 : > moorcroft8 ] > > forrtl: error (78): process killed (SIGTERM) > > forrtl: error (78): process killed (SIGTERM) > > done. > > > > This occurs at the initialization phase it seems when communication > starts between different nodes. > > If I set the hostfile to contain the same node so that all the cpus used > are on 1 node, it initializes fine and runs. > > > > We are using Redhat Enterprise 4 Update 5 on x86_64 > > > > uname -a > > Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 > x86_64 x86_64 x86_64 GNU/Linux > > > > In addition we are using mvapich-0.9.9 for our Infiniband software > package, and Intel 9.1: > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version > > icc (ICC) 9.1 20070510 > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version > > ifort (IFORT) 9.1 20070510 > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > We are using the rsh communication protocol for this: > > /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........ > > > > Can anyone suggest how this problem can be solved ? > > > > Thank You in advance, > > Mike > > > > From koop at cse.ohio-state.edu Mon Jan 7 12:27:38 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Jan 7 12:27:44 2008 Subject: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED In-Reply-To: Message-ID: Michael, Also, is your code making any system calls or forking? Matt On Mon, 7 Jan 2008, Matthew Koop wrote: > Michael, > > Do other more simple benchmarks work (e.g. osu_benchmarks/osu_bw)? > > If they do, this is something we'd like to take a closer look at. I'd be > interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue: > > e.g. > mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec > > > Matt > > On Mon, 7 Jan 2008, Michael Ethier wrote: > > > Hello, > > > > > > > > I am new to this forum and hoping someone can help solve the following > > problem for me. > > > > > > > > We have a modeling application that initializes and runs fine using an > > ordinary Ethernet connection. > > > > > > > > When we compile using the Infiniband software package (mvapich-0.9.9) > > and run, the application fails with the following > > > > at then end: > > > > > > > > [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error > > IBV_WC_LOC_LEN_ERR, code=1, dest rank=1 > > > > at line 388 in file viacheck.c > > > > [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, > > code=16 > > > > at line 2552 in file viacheck.c > > > > mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 : > > moorcroft8 ] > > > > forrtl: error (78): process killed (SIGTERM) > > > > forrtl: error (78): process killed (SIGTERM) > > > > done. > > > > > > > > This occurs at the initialization phase it seems when communication > > starts between different nodes. 
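Matt's follow-up about system calls and forking points at a known pitfall with RDMA transports: system() and popen() fork the process, and forking while buffers are pinned and registered with the HCA interacts badly with copy-on-write pages unless explicit fork support (libibverbs offers ibv_fork_init() for this) is in place. The schematic below only illustrates the pattern being asked about; it is not Michael's code, and whether any particular fork is actually harmful depends on the stack in use.

----------------------------------------------------------------------
/* Illustration of the pattern Matt is asking about: shelling out
 * (which forks) while MPI and its registered memory are active.
 * Running such calls before MPI_Init() or after MPI_Finalize()
 * side-steps the question entirely.  Schematic example only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Safe: no registered memory exists yet. */
    (void)system("hostname");

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Questionable with RDMA transports: system() forks the process
     * while registered memory regions are live.  If the application
     * does this, it is worth testing with such calls removed or moved
     * outside the MPI_Init/MPI_Finalize window. */
    if (rank == 0)
        (void)system("date");

    MPI_Finalize();

    /* Safe again: communication resources have been torn down. */
    (void)system("hostname");
    return 0;
}
----------------------------------------------------------------------

If the model does shell out or fork between MPI_Init() and MPI_Finalize(), temporarily removing those calls, or moving them outside that window as sketched above, is a quick way to test Matt's hypothesis.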
> > > > If I set the hostfile to contain the same node so that all the cpus used > > are on 1 node, it initializes fine and runs. > > > > > > > > We are using Redhat Enterprise 4 Update 5 on x86_64 > > > > > > > > uname -a > > > > Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 > > x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > In addition we are using mvapich-0.9.9 for our Infiniband software > > package, and Intel 9.1: > > > > > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version > > > > icc (ICC) 9.1 20070510 > > > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version > > > > ifort (IFORT) 9.1 20070510 > > > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > > > > > We are using the rsh communication protocol for this: > > > > /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........ > > > > > > > > Can anyone suggest how this problem can be solved ? > > > > > > > > Thank You in advance, > > > > Mike > > > > > > > > > > From brian.budge at gmail.com Mon Jan 7 12:30:24 2008 From: brian.budge at gmail.com (Brian Budge) Date: Mon Jan 7 12:32:49 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: References: <5b7094580801041823y10e4a565x36833e66431b94e1@mail.gmail.com> Message-ID: <5b7094580801070930j26608c5qef31b73fa4d426e7@mail.gmail.com> Hi Wei - I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference. When I build with rdma, this adds the following: export LIBS="${LIBS} -lrdmacm" export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM" It seems that I am using the make.mvapich2.detect script to build. It asks me for my interface, and gives me the option for the mellanox interface, which I choose. I just tried a fresh install directly from the tarball instead of using the gentoo package. Now the program completes (goes beyond 8K message), but my bandwidth isn't very good. Running the osu_bw.c test, I get about 250 MB/s maximum. It seems like IB isn't being used. I did the following: ./make.mvapich2.detect #, and chose the mellanox option ./configure --enable-threads=multiple make make install So it seems that the package is doing something to enable infiniband that I am not doing with the tarball. Conversely, the tarball can run without crashing. Advice? Thanks, Brian On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> wrote: > Hi Brian, > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science > overlay > > addition to the standard gentoo packages. I have also tried 1.0 with > the > > same results. > > > > I compiled with multithreading turned on (haven't tried without this, > but > > the sample codes I am initially testing are not multithreaded, although > my > > application is). I also tried with or without rdma with no change. The > > > script seems to be setting the build for SMALL_CLUSTER. > > So you are using make.mvapich2.ofa to compile the package? I am a bit > confused about ''I also tried with or without rdma with no change''. What > exact change you made here? Also, SMALL_CLUSTER is obsolete for ofa > stack... > > -- Wei > > > > > Let me know what other information would be useful. > > > > Thanks, > > Brian > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang wrote: > > > > > Hi Brian, > > > > > > Thanks for letting us know this problem. 
Would you please let us know > some > > > more details to help us locate the issue. > > > > > > 1) More details on your platform. > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED package? > or > > > some version from our website. > > > > > > 3) If it is from our website, did you change anything from the default > > > > compiling scripts? > > > > > > Thanks. > > > > > > -- Wei > > > > I'm new to the list here... hi! I have been using OpenMPI for a > while, > > > and > > > > LAM before that, but new requirements keep pushing me to new > > > > implementations. In particular, I was interested in using > infiniband > > > (using > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that > MVAPICH is > > > the > > > > library for that particular combination :) > > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, and > run > > > the > > > > ring speed test with no problems. When I run any programs with > mpirun, > > > > however, I get an error when sending or receiving more than 8192 > bytes. > > > > > > > > For example, if I run the bandwidth test from the benchmarks page > > > > (osu_bw.c), I get the following: > > > > --------------------------------------------------------------- > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > Thursday 06:16:00 > > > > burn > > > > burn-3 > > > > # OSU MPI Bandwidth Test v3.0 > > > > # Size Bandwidth (MB/s) > > > > 1 1.24 > > > > 2 2.72 > > > > 4 5.44 > > > > 8 10.18 > > > > 16 19.09 > > > > 32 29.69 > > > > 64 65.01 > > > > 128 147.31 > > > > 256 244.61 > > > > 512 354.32 > > > > 1024 367.91 > > > > 2048 451.96 > > > > 4096 550.66 > > > > 8192 598.35 > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to > > > send > > > > Internal Error: invalid error code ffffffff (Ring Index out of > range) in > > > > MPIDI_CH3_RndvSend:263 > > > > Fatal error in MPI_Waitall: > > > > Other MPI error, error stack: > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > status_array=0xdb3140) failed > > > > (unknown)(): Other MPI error > > > > rank 1 in job 4 burn_37156 caused collective abort of all ranks > > > > exit status of rank 1: killed by signal 9 > > > > --------------------------------------------------------------- > > > > > > > > I get a similar problem with the latency test, however, the protocol > > > that is > > > > complained about is different: > > > > -------------------------------------------------------------------- > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > Thursday 09:21:20 > > > > # OSU MPI Latency Test v3.0 > > > > # Size Latency (us) > > > > 0 3.93 > > > > 1 4.07 > > > > 2 4.06 > > > > 4 3.82 > > > > 8 3.98 > > > > 16 4.03 > > > > 32 4.00 > > > > 64 4.28 > > > > 128 5.22 > > > > 256 5.88 > > > > 512 8.65 > > > > 1024 9.11 > > > > 2048 11.53 > > > > 4096 16.17 > > > > 8192 25.67 > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv > req > > > to > > > > send > > > > Internal Error: invalid error code ffffffff (Ring Index out of > range) in > > > > MPIDI_CH3_RndvSend:263 > > > > Fatal error in MPI_Recv: > > > > Other MPI error, error stack: > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, > > > tag=1, > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > (unknown)(): Other MPI error > > > > rank 1 in job 5 burn_37156 caused collective abort of all ranks > > > > -------------------------------------------------------------------- > > > > > > > > 
The protocols (0 and 8126589) are consistent if I run the program > > > multiple > > > > times. > > > > > > > > Anyone have any ideas? If you need more info, please let me know. > > > > > > > > Thanks, > > > > Brian > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080107/4ae6be47/attachment-0001.html From methier at CGR.Harvard.edu Mon Jan 7 13:10:03 2008 From: methier at CGR.Harvard.edu (Michael Ethier) Date: Mon Jan 7 13:10:12 2008 Subject: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED In-Reply-To: References: Message-ID: Hi Matthew, The osu_bw test ran ok as seen below. I added the VIADEV_USE_COALESCE=0 variable to the command line and in the environment, and it made no difference, I set get the same errors. #!/bin/tcsh setenv VIADEV_USE_COALESCE 0 /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile ./hostfile VIADEV_USE_COALESCE=0 ./raflesi -f ./EDRAFLES_IN Thank You, Mike The benchmark test: foo.test script has in it #!/bin/tcsh /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 -hostfile ./hostfile VIADEV_USE_COALESCE=0 /usr/mpi/intel/mvapich-0.9.9/tests/osu_benchmarks-2.2/osu_bw [gb16@moorcrofth run]$ ./foo.test # OSU MPI Bandwidth Test (Version 2.2) # Size Bandwidth (MB/s) 1 0.135198 2 0.273329 4 0.540415 8 1.087788 16 2.179976 32 4.371585 64 8.668233 128 17.290726 256 34.458536 512 68.269511 1024 129.384822 2048 239.992676 4096 392.348909 8192 542.819870 16384 452.196563 32768 625.604678 65536 764.094184 131072 836.010006 262144 871.899242 524288 890.772813 1048576 901.838432 2097152 906.494955 4194304 909.296621 [gb16@moorcrofth run]$ more ./hostfile moorcrofth moorcroft8 moorcroft11 -----Original Message----- From: Matthew Koop [mailto:koop@cse.ohio-state.edu] Sent: Monday, January 07, 2008 12:26 PM To: Michael Ethier Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event IBV_EVENT_QP_LAST_WQE_REACHED Michael, Do other more simple benchmarks work (e.g. osu_benchmarks/osu_bw)? If they do, this is something we'd like to take a closer look at. I'd be interested to know if setting VIADEV_USE_COALESCE=0 resolves the issue: e.g. mpirun_rsh -np 2 h1 h2 VIADEV_USE_COALESCE=0 ./exec Matt On Mon, 7 Jan 2008, Michael Ethier wrote: > Hello, > > > > I am new to this forum and hoping someone can help solve the following > problem for me. > > > > We have a modeling application that initializes and runs fine using an > ordinary Ethernet connection. > > > > When we compile using the Infiniband software package (mvapich-0.9.9) > and run, the application fails with the following > > at then end: > > > > [0:moorcrofth] Abort: [moorcrofth:0] Got completion with error > IBV_WC_LOC_LEN_ERR, code=1, dest rank=1 > > at line 388 in file viacheck.c > > [0:moorcrofth] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, > code=16 > > at line 2552 in file viacheck.c > > mpirun_rsh: Abort signaled from [0 : moorcrofth] remote host is [1 : > moorcroft8 ] > > forrtl: error (78): process killed (SIGTERM) > > forrtl: error (78): process killed (SIGTERM) > > done. > > > > This occurs at the initialization phase it seems when communication > starts between different nodes. > > If I set the hostfile to contain the same node so that all the cpus used > are on 1 node, it initializes fine and runs. 
> > > > We are using Redhat Enterprise 4 Update 5 on x86_64 > > > > uname -a > > Linux moorcrofth 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 > x86_64 x86_64 x86_64 GNU/Linux > > > > In addition we are using mvapich-0.9.9 for our Infiniband software > package, and Intel 9.1: > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpicc --version > > icc (ICC) 9.1 20070510 > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > [gb16@moorcrofth 60]$ /usr/mpi/intel/mvapich-0.9.9/bin/mpif90 --version > > ifort (IFORT) 9.1 20070510 > > Copyright (C) 1985-2007 Intel Corporation. All rights reserved. > > > > We are using the rsh communication protocol for this: > > /usr/mpi/intel/mvapich-0.9.9/bin/mpirun_rsh -rsh -np 3 ........ > > > > Can anyone suggest how this problem can be solved ? > > > > Thank You in advance, > > Mike > > > > From koop at cse.ohio-state.edu Mon Jan 7 16:21:24 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Jan 7 16:21:33 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: <5b7094580801070930j26608c5qef31b73fa4d426e7@mail.gmail.com> Message-ID: Brian, The make.mvapich.detect script is just a helper script (not meant to be executed directly). You need to use the make.mvapich.ofa script, which will call configure and make for you with the correct arguments. More information can be found in our MVAPICH2 user guide under "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" https://mvapich.cse.ohio-state.edu/support/ Let us know if you have any other problems. Matt On Mon, 7 Jan 2008, Brian Budge wrote: > Hi Wei - > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference. > > When I build with rdma, this adds the following: > export LIBS="${LIBS} -lrdmacm" > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM" > > It seems that I am using the make.mvapich2.detect script to build. It asks > me for my interface, and gives me the option for the mellanox interface, > which I choose. > > I just tried a fresh install directly from the tarball instead of using the > gentoo package. Now the program completes (goes beyond 8K message), but my > bandwidth isn't very good. Running the osu_bw.c test, I get about 250 MB/s > maximum. It seems like IB isn't being used. > > I did the following: > ./make.mvapich2.detect #, and chose the mellanox option > ./configure --enable-threads=multiple > make > make install > > So it seems that the package is doing something to enable infiniband that I > am not doing with the tarball. Conversely, the tarball can run without > crashing. > > Advice? > > Thanks, > Brian > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> wrote: > > > Hi Brian, > > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science > > overlay > > > addition to the standard gentoo packages. I have also tried 1.0 with > > the > > > same results. > > > > > > I compiled with multithreading turned on (haven't tried without this, > > but > > > the sample codes I am initially testing are not multithreaded, although > > my > > > application is). I also tried with or without rdma with no change. The > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > So you are using make.mvapich2.ofa to compile the package? I am a bit > > confused about ''I also tried with or without rdma with no change''. What > > exact change you made here? Also, SMALL_CLUSTER is obsolete for ofa > > stack... 
> > > > -- Wei > > > > > > > > Let me know what other information would be useful. > > > > > > Thanks, > > > Brian > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang wrote: > > > > > > > Hi Brian, > > > > > > > > Thanks for letting us know this problem. Would you please let us know > > some > > > > more details to help us locate the issue. > > > > > > > > 1) More details on your platform. > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED package? > > or > > > > some version from our website. > > > > > > > > 3) If it is from our website, did you change anything from the default > > > > > > compiling scripts? > > > > > > > > Thanks. > > > > > > > > -- Wei > > > > > I'm new to the list here... hi! I have been using OpenMPI for a > > while, > > > > and > > > > > LAM before that, but new requirements keep pushing me to new > > > > > implementations. In particular, I was interested in using > > infiniband > > > > (using > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that > > MVAPICH is > > > > the > > > > > library for that particular combination :) > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, and > > run > > > > the > > > > > ring speed test with no problems. When I run any programs with > > mpirun, > > > > > however, I get an error when sending or receiving more than 8192 > > bytes. > > > > > > > > > > For example, if I run the bandwidth test from the benchmarks page > > > > > (osu_bw.c), I get the following: > > > > > --------------------------------------------------------------- > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > Thursday 06:16:00 > > > > > burn > > > > > burn-3 > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > # Size Bandwidth (MB/s) > > > > > 1 1.24 > > > > > 2 2.72 > > > > > 4 5.44 > > > > > 8 10.18 > > > > > 16 19.09 > > > > > 32 29.69 > > > > > 64 65.01 > > > > > 128 147.31 > > > > > 256 244.61 > > > > > 512 354.32 > > > > > 1024 367.91 > > > > > 2048 451.96 > > > > > 4096 550.66 > > > > > 8192 598.35 > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to > > > > send > > > > > Internal Error: invalid error code ffffffff (Ring Index out of > > range) in > > > > > MPIDI_CH3_RndvSend:263 > > > > > Fatal error in MPI_Waitall: > > > > > Other MPI error, error stack: > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > > status_array=0xdb3140) failed > > > > > (unknown)(): Other MPI error > > > > > rank 1 in job 4 burn_37156 caused collective abort of all ranks > > > > > exit status of rank 1: killed by signal 9 > > > > > --------------------------------------------------------------- > > > > > > > > > > I get a similar problem with the latency test, however, the protocol > > > > that is > > > > > complained about is different: > > > > > -------------------------------------------------------------------- > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > Thursday 09:21:20 > > > > > # OSU MPI Latency Test v3.0 > > > > > # Size Latency (us) > > > > > 0 3.93 > > > > > 1 4.07 > > > > > 2 4.06 > > > > > 4 3.82 > > > > > 8 3.98 > > > > > 16 4.03 > > > > > 32 4.00 > > > > > 64 4.28 > > > > > 128 5.22 > > > > > 256 5.88 > > > > > 512 8.65 > > > > > 1024 9.11 > > > > > 2048 11.53 > > > > > 4096 16.17 > > > > > 8192 25.67 > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv > > req > > > > to > > > > > send > > > > > Internal Error: invalid error code ffffffff (Ring Index 
out of > > range) in > > > > > MPIDI_CH3_RndvSend:263 > > > > > Fatal error in MPI_Recv: > > > > > Other MPI error, error stack: > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, > > > > tag=1, > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > (unknown)(): Other MPI error > > > > > rank 1 in job 5 burn_37156 caused collective abort of all ranks > > > > > -------------------------------------------------------------------- > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the program > > > > multiple > > > > > times. > > > > > > > > > > Anyone have any ideas? If you need more info, please let me know. > > > > > > > > > > Thanks, > > > > > Brian > > > > > > > > > > > > > > > > > > > > > From brian.budge at gmail.com Mon Jan 7 19:15:09 2008 From: brian.budge at gmail.com (Brian Budge) Date: Mon Jan 7 19:15:21 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: References: <5b7094580801070930j26608c5qef31b73fa4d426e7@mail.gmail.com> Message-ID: <5b7094580801071615y29148164v469332e1e3e7aa83@mail.gmail.com> Hi Matt - I have now done the install from the ofa build file, and I can boot and run the ring test, but now when I run the osu_bw.c benchmark, the executable dies in MPI_Init(). The things I altered in make.mvapich2.ofa were: OPEN_IB_HOME=${OPEN_IB_HOME:-/usr} SHARED_LIBS=${SHARED_LIBS:-yes} and on the configure line I added: --disable-f77 --disable-f90 Here is the error message that I am getting: rank 1 in job 1 burn_60139 caused collective abort of all ranks exit status of rank 1: killed by signal 9 Thanks, Brian On Jan 7, 2008 1:21 PM, Matthew Koop wrote: > Brian, > > The make.mvapich.detect script is just a helper script (not meant to be > executed directly). You need to use the make.mvapich.ofa script, which > will call configure and make for you with the correct arguments. > > More information can be found in our MVAPICH2 user guide under > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" > > https://mvapich.cse.ohio-state.edu/support/ > > Let us know if you have any other problems. > > Matt > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > Hi Wei - > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no > difference. > > > > When I build with rdma, this adds the following: > > export LIBS="${LIBS} -lrdmacm" > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM" > > > > It seems that I am using the make.mvapich2.detect script to build. It > asks > > me for my interface, and gives me the option for the mellanox interface, > > which I choose. > > > > I just tried a fresh install directly from the tarball instead of using > the > > gentoo package. Now the program completes (goes beyond 8K message), but > my > > bandwidth isn't very good. Running the osu_bw.c test, I get about 250 > MB/s > > maximum. It seems like IB isn't being used. > > > > I did the following: > > ./make.mvapich2.detect #, and chose the mellanox option > > ./configure --enable-threads=multiple > > make > > make install > > > > So it seems that the package is doing something to enable infiniband > that I > > am not doing with the tarball. Conversely, the tarball can run without > > crashing. > > > > Advice? > > > > Thanks, > > Brian > > > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> wrote: > > > > > Hi Brian, > > > > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science > > > overlay > > > > addition to the standard gentoo packages. 
I have also tried 1.0with > > > the > > > > same results. > > > > > > > > I compiled with multithreading turned on (haven't tried without > this, > > > but > > > > the sample codes I am initially testing are not multithreaded, > although > > > my > > > > application is). I also tried with or without rdma with no change. > The > > > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > > > So you are using make.mvapich2.ofa to compile the package? I am a bit > > > confused about ''I also tried with or without rdma with no change''. > What > > > exact change you made here? Also, SMALL_CLUSTER is obsolete for ofa > > > stack... > > > > > > -- Wei > > > > > > > > > > > Let me know what other information would be useful. > > > > > > > > Thanks, > > > > Brian > > > > > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang > wrote: > > > > > > > > > Hi Brian, > > > > > > > > > > Thanks for letting us know this problem. Would you please let us > know > > > some > > > > > more details to help us locate the issue. > > > > > > > > > > 1) More details on your platform. > > > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED > package? > > > or > > > > > some version from our website. > > > > > > > > > > 3) If it is from our website, did you change anything from the > default > > > > > > > > compiling scripts? > > > > > > > > > > Thanks. > > > > > > > > > > -- Wei > > > > > > I'm new to the list here... hi! I have been using OpenMPI for a > > > while, > > > > > and > > > > > > LAM before that, but new requirements keep pushing me to new > > > > > > implementations. In particular, I was interested in using > > > infiniband > > > > > (using > > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that > > > MVAPICH is > > > > > the > > > > > > library for that particular combination :) > > > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, > and > > > run > > > > > the > > > > > > ring speed test with no problems. When I run any programs with > > > mpirun, > > > > > > however, I get an error when sending or receiving more than 8192 > > > bytes. 
> > > > > > > > > > > > For example, if I run the bandwidth test from the benchmarks > page > > > > > > (osu_bw.c), I get the following: > > > > > > --------------------------------------------------------------- > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > Thursday 06:16:00 > > > > > > burn > > > > > > burn-3 > > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > > # Size Bandwidth (MB/s) > > > > > > 1 1.24 > > > > > > 2 2.72 > > > > > > 4 5.44 > > > > > > 8 10.18 > > > > > > 16 19.09 > > > > > > 32 29.69 > > > > > > 64 65.01 > > > > > > 128 147.31 > > > > > > 256 244.61 > > > > > > 512 354.32 > > > > > > 1024 367.91 > > > > > > 2048 451.96 > > > > > > 4096 550.66 > > > > > > 8192 598.35 > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv > req to > > > > > send > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of > > > range) in > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > Fatal error in MPI_Waitall: > > > > > > Other MPI error, error stack: > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > > > status_array=0xdb3140) failed > > > > > > (unknown)(): Other MPI error > > > > > > rank 1 in job 4 burn_37156 caused collective abort of all > ranks > > > > > > exit status of rank 1: killed by signal 9 > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > I get a similar problem with the latency test, however, the > protocol > > > > > that is > > > > > > complained about is different: > > > > > > > -------------------------------------------------------------------- > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > Thursday 09:21:20 > > > > > > # OSU MPI Latency Test v3.0 > > > > > > # Size Latency (us) > > > > > > 0 3.93 > > > > > > 1 4.07 > > > > > > 2 4.06 > > > > > > 4 3.82 > > > > > > 8 3.98 > > > > > > 16 4.03 > > > > > > 32 4.00 > > > > > > 64 4.28 > > > > > > 128 5.22 > > > > > > 256 5.88 > > > > > > 512 8.65 > > > > > > 1024 9.11 > > > > > > 2048 11.53 > > > > > > 4096 16.17 > > > > > > 8192 25.67 > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from > rndv > > > req > > > > > to > > > > > > send > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of > > > range) in > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > Fatal error in MPI_Recv: > > > > > > Other MPI error, error stack: > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, > src=0, > > > > > tag=1, > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > > (unknown)(): Other MPI error > > > > > > rank 1 in job 5 burn_37156 caused collective abort of all > ranks > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the > program > > > > > multiple > > > > > > times. > > > > > > > > > > > > Anyone have any ideas? If you need more info, please let me > know. > > > > > > > > > > > > Thanks, > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080107/b3f5f91e/attachment-0001.html From koop at cse.ohio-state.edu Mon Jan 7 20:12:26 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Jan 7 20:12:33 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: <5b7094580801071615y29148164v469332e1e3e7aa83@mail.gmail.com> Message-ID: Brian, Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI) test that ships with OFED? This will make sure that your basic InfiniBand setup is working properly. Did any other error message print out other than the one you gave? Matt On Mon, 7 Jan 2008, Brian Budge wrote: > Hi Matt - > > I have now done the install from the ofa build file, and I can boot and run > the ring test, but now when I run the osu_bw.c benchmark, the executable > dies in MPI_Init(). > > The things I altered in make.mvapich2.ofa were: > > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr} > SHARED_LIBS=${SHARED_LIBS:-yes} > > and on the configure line I added: > --disable-f77 --disable-f90 > > Here is the error message that I am getting: > > rank 1 in job 1 burn_60139 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > > Thanks, > Brian > > On Jan 7, 2008 1:21 PM, Matthew Koop wrote: > > > Brian, > > > > The make.mvapich.detect script is just a helper script (not meant to be > > executed directly). You need to use the make.mvapich.ofa script, which > > will call configure and make for you with the correct arguments. > > > > More information can be found in our MVAPICH2 user guide under > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" > > > > https://mvapich.cse.ohio-state.edu/support/ > > > > Let us know if you have any other problems. > > > > Matt > > > > > > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > Hi Wei - > > > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no > > difference. > > > > > > When I build with rdma, this adds the following: > > > export LIBS="${LIBS} -lrdmacm" > > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM" > > > > > > It seems that I am using the make.mvapich2.detect script to build. It > > asks > > > me for my interface, and gives me the option for the mellanox interface, > > > which I choose. > > > > > > I just tried a fresh install directly from the tarball instead of using > > the > > > gentoo package. Now the program completes (goes beyond 8K message), but > > my > > > bandwidth isn't very good. Running the osu_bw.c test, I get about 250 > > MB/s > > > maximum. It seems like IB isn't being used. > > > > > > I did the following: > > > ./make.mvapich2.detect #, and chose the mellanox option > > > ./configure --enable-threads=multiple > > > make > > > make install > > > > > > So it seems that the package is doing something to enable infiniband > > that I > > > am not doing with the tarball. Conversely, the tarball can run without > > > crashing. > > > > > > Advice? > > > > > > Thanks, > > > Brian > > > > > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> wrote: > > > > > > > Hi Brian, > > > > > > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science > > > > overlay > > > > > addition to the standard gentoo packages. I have also tried 1.0with > > > > the > > > > > same results. 
> > > > > > > > > > I compiled with multithreading turned on (haven't tried without > > this, > > > > but > > > > > the sample codes I am initially testing are not multithreaded, > > although > > > > my > > > > > application is). I also tried with or without rdma with no change. > > The > > > > > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > > > > > So you are using make.mvapich2.ofa to compile the package? I am a bit > > > > confused about ''I also tried with or without rdma with no change''. > > What > > > > exact change you made here? Also, SMALL_CLUSTER is obsolete for ofa > > > > stack... > > > > > > > > -- Wei > > > > > > > > > > > > > > Let me know what other information would be useful. > > > > > > > > > > Thanks, > > > > > Brian > > > > > > > > > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang > > wrote: > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > Thanks for letting us know this problem. Would you please let us > > know > > > > some > > > > > > more details to help us locate the issue. > > > > > > > > > > > > 1) More details on your platform. > > > > > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED > > package? > > > > or > > > > > > some version from our website. > > > > > > > > > > > > 3) If it is from our website, did you change anything from the > > default > > > > > > > > > > compiling scripts? > > > > > > > > > > > > Thanks. > > > > > > > > > > > > -- Wei > > > > > > > I'm new to the list here... hi! I have been using OpenMPI for a > > > > while, > > > > > > and > > > > > > > LAM before that, but new requirements keep pushing me to new > > > > > > > implementations. In particular, I was interested in using > > > > infiniband > > > > > > (using > > > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that > > > > MVAPICH is > > > > > > the > > > > > > > library for that particular combination :) > > > > > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, > > and > > > > run > > > > > > the > > > > > > > ring speed test with no problems. When I run any programs with > > > > mpirun, > > > > > > > however, I get an error when sending or receiving more than 8192 > > > > bytes. 
> > > > > > > > > > > > > > For example, if I run the bandwidth test from the benchmarks > > page > > > > > > > (osu_bw.c), I get the following: > > > > > > > --------------------------------------------------------------- > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > Thursday 06:16:00 > > > > > > > burn > > > > > > > burn-3 > > > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > > > # Size Bandwidth (MB/s) > > > > > > > 1 1.24 > > > > > > > 2 2.72 > > > > > > > 4 5.44 > > > > > > > 8 10.18 > > > > > > > 16 19.09 > > > > > > > 32 29.69 > > > > > > > 64 65.01 > > > > > > > 128 147.31 > > > > > > > 256 244.61 > > > > > > > 512 354.32 > > > > > > > 1024 367.91 > > > > > > > 2048 451.96 > > > > > > > 4096 550.66 > > > > > > > 8192 598.35 > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv > > req to > > > > > > send > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of > > > > range) in > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > Fatal error in MPI_Waitall: > > > > > > > Other MPI error, error stack: > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > > > > status_array=0xdb3140) failed > > > > > > > (unknown)(): Other MPI error > > > > > > > rank 1 in job 4 burn_37156 caused collective abort of all > > ranks > > > > > > > exit status of rank 1: killed by signal 9 > > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > > > I get a similar problem with the latency test, however, the > > protocol > > > > > > that is > > > > > > > complained about is different: > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > Thursday 09:21:20 > > > > > > > # OSU MPI Latency Test v3.0 > > > > > > > # Size Latency (us) > > > > > > > 0 3.93 > > > > > > > 1 4.07 > > > > > > > 2 4.06 > > > > > > > 4 3.82 > > > > > > > 8 3.98 > > > > > > > 16 4.03 > > > > > > > 32 4.00 > > > > > > > 64 4.28 > > > > > > > 128 5.22 > > > > > > > 256 5.88 > > > > > > > 512 8.65 > > > > > > > 1024 9.11 > > > > > > > 2048 11.53 > > > > > > > 4096 16.17 > > > > > > > 8192 25.67 > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from > > rndv > > > > req > > > > > > to > > > > > > > send > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of > > > > range) in > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > Fatal error in MPI_Recv: > > > > > > > Other MPI error, error stack: > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, > > src=0, > > > > > > tag=1, > > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > > > (unknown)(): Other MPI error > > > > > > > rank 1 in job 5 burn_37156 caused collective abort of all > > ranks > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the > > program > > > > > > multiple > > > > > > > times. > > > > > > > > > > > > > > Anyone have any ideas? If you need more info, please let me > > know. 
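The failure described above begins exactly where the message size crosses 8 KB, which, going by the subject of this thread and the ch3_rndvtransfer.c messages, is where the transfer is handed to the rendezvous path. A minimal stand-alone reproducer could look like the sketch below; the 16384-byte MPI_CHAR message, the source and the tag follow the MPI_Recv error stack quoted above, while the file name and build line are only illustrative.

/* rndv_test.c: send one message just above the 8 KB eager range so the
 * transfer takes the rendezvous path discussed in this thread.
 * Illustrative build/run:  mpicc rndv_test.c -o rndv_test
 *                          mpirun -np 2 ./rndv_test
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 16384   /* > 8192 bytes, i.e. beyond the eager range */

int main(int argc, char **argv)
{
    char buf[MSG_BYTES];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        memset(buf, 'a', MSG_BYTES);
        MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1: received %d bytes via the rendezvous path\n",
               MSG_BYTES);
    }

    MPI_Finalize();
    return 0;
}

If a program this small fails the same way while an 8192-byte message still goes through, the problem is isolated to the rendezvous protocol selection rather than to anything in the benchmarks themselves.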
> > > > > > > > > > > > > > Thanks, > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From brian.budge at gmail.com Tue Jan 8 11:27:57 2008 From: brian.budge at gmail.com (Brian Budge) Date: Tue Jan 8 11:28:14 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: References: <5b7094580801071615y29148164v469332e1e3e7aa83@mail.gmail.com> Message-ID: <5b7094580801080827i7624cf3n61ebd6c1a3b0edb3@mail.gmail.com> Hi Matt - ibv_rc_pingpong worked, and I decided to try a new clean install, and it seems to be working quite a bit better now. I must have somehow added some nasty stuff to the Makefile during my previous attempts. Here is the output: # OSU MPI Bandwidth Test v3.0 # Size Bandwidth (MB/s) 1 1.18 2 2.59 4 4.92 8 10.38 16 20.31 32 40.12 64 77.14 128 144.37 256 241.72 512 362.12 1024 471.01 2048 546.45 4096 581.47 8192 600.65 16384 611.52 32768 632.87 65536 642.27 131072 646.30 262144 644.22 524288 644.15 1048576 649.36 2097152 662.55 4194304 672.55 How do these numbers look for a 10 Gb SDR HCA? Thanks for your help! Brian On Jan 7, 2008 5:12 PM, Matthew Koop wrote: > Brian, > > Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI) > test that ships with OFED? This will make sure that your basic InfiniBand > setup is working properly. > > Did any other error message print out other than the one you gave? > > Matt > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > Hi Matt - > > > > I have now done the install from the ofa build file, and I can boot and > run > > the ring test, but now when I run the osu_bw.c benchmark, the executable > > > dies in MPI_Init(). > > > > The things I altered in make.mvapich2.ofa were: > > > > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr} > > SHARED_LIBS=${SHARED_LIBS:-yes} > > > > and on the configure line I added: > > --disable-f77 --disable-f90 > > > > Here is the error message that I am getting: > > > > rank 1 in job 1 burn_60139 caused collective abort of all ranks > > exit status of rank 1: killed by signal 9 > > > > Thanks, > > Brian > > > > On Jan 7, 2008 1:21 PM, Matthew Koop wrote: > > > > > Brian, > > > > > > The make.mvapich.detect script is just a helper script (not meant to > be > > > executed directly). You need to use the make.mvapich.ofa script, which > > > will call configure and make for you with the correct arguments. > > > > > > More information can be found in our MVAPICH2 user guide under > > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" > > > > > > https://mvapich.cse.ohio-state.edu/support/ > > > > > > Let us know if you have any other problems. > > > > > > Matt > > > > > > > > > > > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > > > Hi Wei - > > > > > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no > > > difference. > > > > > > > > When I build with rdma, this adds the following: > > > > export LIBS="${LIBS} -lrdmacm" > > > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH > -DRDMA_CM" > > > > > > > > It seems that I am using the make.mvapich2.detect script to build. > It > > > asks > > > > me for my interface, and gives me the option for the mellanox > interface, > > > > which I choose. > > > > > > > > I just tried a fresh install directly from the tarball instead of > using > > > the > > > > gentoo package. Now the program completes (goes beyond 8K message), > but > > > my > > > > bandwidth isn't very good. Running the osu_bw.c test, I get about > 250 > > > MB/s > > > > maximum. 
It seems like IB isn't being used. > > > > > > > > I did the following: > > > > ./make.mvapich2.detect #, and chose the mellanox option > > > > ./configure --enable-threads=multiple > > > > make > > > > make install > > > > > > > > So it seems that the package is doing something to enable infiniband > > > that I > > > > am not doing with the tarball. Conversely, the tarball can run > without > > > > crashing. > > > > > > > > Advice? > > > > > > > > Thanks, > > > > Brian > > > > > > > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> > wrote: > > > > > > > > > Hi Brian, > > > > > > > > > > > I am using the openib-mvapich2-1.0.1 package in the > gentoo-science > > > > > overlay > > > > > > addition to the standard gentoo packages. I have also tried > 1.0with > > > > > the > > > > > > same results. > > > > > > > > > > > > I compiled with multithreading turned on (haven't tried without > > > this, > > > > > but > > > > > > the sample codes I am initially testing are not multithreaded, > > > although > > > > > my > > > > > > application is). I also tried with or without rdma with no > change. > > > The > > > > > > > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > > > > > > > So you are using make.mvapich2.ofa to compile the package? I am a > bit > > > > > confused about ''I also tried with or without rdma with no > change''. > > > What > > > > > exact change you made here? Also, SMALL_CLUSTER is obsolete for > ofa > > > > > stack... > > > > > > > > > > -- Wei > > > > > > > > > > > > > > > > > Let me know what other information would be useful. > > > > > > > > > > > > Thanks, > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang < huanwei@cse.ohio-state.edu> > > > wrote: > > > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > > Thanks for letting us know this problem. Would you please let > us > > > know > > > > > some > > > > > > > more details to help us locate the issue. > > > > > > > > > > > > > > 1) More details on your platform. > > > > > > > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED > > > package? > > > > > or > > > > > > > some version from our website. > > > > > > > > > > > > > > 3) If it is from our website, did you change anything from the > > > default > > > > > > > > > > > > compiling scripts? > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > -- Wei > > > > > > > > I'm new to the list here... hi! I have been using OpenMPI > for a > > > > > while, > > > > > > > and > > > > > > > > LAM before that, but new requirements keep pushing me to new > > > > > > > > implementations. In particular, I was interested in using > > > > > infiniband > > > > > > > (using > > > > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems > that > > > > > MVAPICH is > > > > > > > the > > > > > > > > library for that particular combination :) > > > > > > > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the > daemons, > > > and > > > > > run > > > > > > > the > > > > > > > > ring speed test with no problems. When I run any programs > with > > > > > mpirun, > > > > > > > > however, I get an error when sending or receiving more than > 8192 > > > > > bytes. 
> > > > > > > > > > > > > > > > For example, if I run the bandwidth test from the benchmarks > > > page > > > > > > > > (osu_bw.c), I get the following: > > > > > > > > > --------------------------------------------------------------- > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > Thursday 06:16:00 > > > > > > > > burn > > > > > > > > burn-3 > > > > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > > > > # Size Bandwidth (MB/s) > > > > > > > > 1 1.24 > > > > > > > > 2 2.72 > > > > > > > > 4 5.44 > > > > > > > > 8 10.18 > > > > > > > > 16 19.09 > > > > > > > > 32 29.69 > > > > > > > > 64 65.01 > > > > > > > > 128 147.31 > > > > > > > > 256 244.61 > > > > > > > > 512 354.32 > > > > > > > > 1024 367.91 > > > > > > > > 2048 451.96 > > > > > > > > 4096 550.66 > > > > > > > > 8192 598.35 > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from > rndv > > > req to > > > > > > > send > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out > of > > > > > range) in > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > Fatal error in MPI_Waitall: > > > > > > > > Other MPI error, error stack: > > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > > > > > status_array=0xdb3140) failed > > > > > > > > (unknown)(): Other MPI error > > > > > > > > rank 1 in job 4 burn_37156 caused collective abort of all > > > > ranks > > > > > > > > exit status of rank 1: killed by signal 9 > > > > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > I get a similar problem with the latency test, however, the > > > protocol > > > > > > > that is > > > > > > > > complained about is different: > > > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > Thursday 09:21:20 > > > > > > > > # OSU MPI Latency Test v3.0 > > > > > > > > # Size Latency (us) > > > > > > > > 0 3.93 > > > > > > > > 1 4.07 > > > > > > > > 2 4.06 > > > > > > > > 4 3.82 > > > > > > > > 8 3.98 > > > > > > > > 16 4.03 > > > > > > > > 32 4.00 > > > > > > > > 64 4.28 > > > > > > > > 128 5.22 > > > > > > > > 256 5.88 > > > > > > > > 512 8.65 > > > > > > > > 1024 9.11 > > > > > > > > 2048 11.53 > > > > > > > > 4096 16.17 > > > > > > > > 8192 25.67 > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type > from > > > rndv > > > > > req > > > > > > > to > > > > > > > > send > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out > of > > > > > range) in > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > Fatal error in MPI_Recv: > > > > > > > > Other MPI error, error stack: > > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, > > > src=0, > > > > > > > tag=1, > > > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > > > > (unknown)(): Other MPI error > > > > > > > > rank 1 in job 5 burn_37156 caused collective abort of all > > > ranks > > > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the > > > program > > > > > > > multiple > > > > > > > > times. > > > > > > > > > > > > > > > > Anyone have any ideas? If you need more info, please let me > > > > know. 
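One detail worth noting in the error stacks quoted above: osu_bw posts a window of non-blocking transfers and completes them in a single batch, which is why its failure is reported from MPI_Waitall with count=64. A stripped-down sketch of that windowed pattern follows; the window, message size and iteration count are illustrative, and this is not the OSU source itself.

/* bw_window.c: the windowed non-blocking pattern used by bandwidth
 * tests such as osu_bw, shown only to explain why the failure above is
 * reported from MPI_Waitall.  All numbers here are illustrative.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW    64        /* matches count=64 in the error stack */
#define MSG_BYTES 16384
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, i, it;
    char *buf, ack[4] = {0};
    MPI_Request req[WINDOW];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one slot per outstanding message so in-flight receives do not overlap */
    buf = malloc((size_t)WINDOW * MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (i = 0; i < WINDOW; i++)
                MPI_Isend(buf + (size_t)i * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                          1, 100, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(ack, 4, MPI_CHAR, 1, 101, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (i = 0; i < WINDOW; i++)
                MPI_Irecv(buf + (size_t)i * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                          0, 100, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(ack, 4, MPI_CHAR, 0, 101, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f MB/s\n", MSG_BYTES,
               (double)MSG_BYTES * WINDOW * ITERS / 1.0e6 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}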
> > > > > > > > > > > > > > > > Thanks, > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080108/00bde41f/attachment-0001.html From brian.budge at gmail.com Tue Jan 8 12:37:30 2008 From: brian.budge at gmail.com (Brian Budge) Date: Tue Jan 8 12:37:39 2008 Subject: [mvapich-discuss] vbuf registration Message-ID: <5b7094580801080937s60124ac2pa8f451d42d49b61f@mail.gmail.com> Hi all - My program is running near to completion, and then dies, complaining: [vbuf.c 184] Cannot register vbuf region rank 1 in job 13 burn_40823 caused collective abort of all ranks exit status of rank 1: killed by signal 9 I can run several of the osu benchmarks without any problem. Addtionally, when I run my app without MPI, I can use mmap with the MAP_LOCKED flag, but when I run with MPI, the first mmap with MAP_LOCKED fails, saying that some resources weren't available. If I remove the MAP_LOCKED flag, I successfully mmap. These issues may or may not be related (ie. maybe my locked limit is magically reduced when I run using MPI, and mlock is used in conjunction with vbuf registration?). Thanks, Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080108/cb6f0685/attachment.html From koop at cse.ohio-state.edu Tue Jan 8 13:36:21 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue Jan 8 13:36:29 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: <5b7094580801080827i7624cf3n61ebd6c1a3b0edb3@mail.gmail.com> Message-ID: Brian, Good to hear that the microbenchmarks are working now. Whether the numbers you have are good or not is dependant on the platform. Is this a PCI-X or PCI-Express card? You can expect 900 MB/sec for SDR PCI-Express. Matt On Tue, 8 Jan 2008, Brian Budge wrote: > Hi Matt - > > ibv_rc_pingpong worked, and I decided to try a new clean install, and it > seems to be working quite a bit better now. I must have somehow added some > nasty stuff to the Makefile during my previous attempts. > > Here is the output: > > # OSU MPI Bandwidth Test v3.0 > # Size Bandwidth (MB/s) > 1 1.18 > 2 2.59 > 4 4.92 > 8 10.38 > 16 20.31 > 32 40.12 > 64 77.14 > 128 144.37 > 256 241.72 > 512 362.12 > 1024 471.01 > 2048 546.45 > 4096 581.47 > 8192 600.65 > 16384 611.52 > 32768 632.87 > 65536 642.27 > 131072 646.30 > 262144 644.22 > 524288 644.15 > 1048576 649.36 > 2097152 662.55 > 4194304 672.55 > > How do these numbers look for a 10 Gb SDR HCA? > > Thanks for your help! > Brian > > On Jan 7, 2008 5:12 PM, Matthew Koop wrote: > > > Brian, > > > > Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI) > > test that ships with OFED? This will make sure that your basic InfiniBand > > setup is working properly. > > > > Did any other error message print out other than the one you gave? > > > > Matt > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > Hi Matt - > > > > > > I have now done the install from the ofa build file, and I can boot and > > run > > > the ring test, but now when I run the osu_bw.c benchmark, the executable > > > > > dies in MPI_Init(). 
> > > > > > The things I altered in make.mvapich2.ofa were: > > > > > > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr} > > > SHARED_LIBS=${SHARED_LIBS:-yes} > > > > > > and on the configure line I added: > > > --disable-f77 --disable-f90 > > > > > > Here is the error message that I am getting: > > > > > > rank 1 in job 1 burn_60139 caused collective abort of all ranks > > > exit status of rank 1: killed by signal 9 > > > > > > Thanks, > > > Brian > > > > > > On Jan 7, 2008 1:21 PM, Matthew Koop wrote: > > > > > > > Brian, > > > > > > > > The make.mvapich.detect script is just a helper script (not meant to > > be > > > > executed directly). You need to use the make.mvapich.ofa script, which > > > > will call configure and make for you with the correct arguments. > > > > > > > > More information can be found in our MVAPICH2 user guide under > > > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" > > > > > > > > https://mvapich.cse.ohio-state.edu/support/ > > > > > > > > Let us know if you have any other problems. > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > > > > > Hi Wei - > > > > > > > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no > > > > difference. > > > > > > > > > > When I build with rdma, this adds the following: > > > > > export LIBS="${LIBS} -lrdmacm" > > > > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH > > -DRDMA_CM" > > > > > > > > > > It seems that I am using the make.mvapich2.detect script to build. > > It > > > > asks > > > > > me for my interface, and gives me the option for the mellanox > > interface, > > > > > which I choose. > > > > > > > > > > I just tried a fresh install directly from the tarball instead of > > using > > > > the > > > > > gentoo package. Now the program completes (goes beyond 8K message), > > but > > > > my > > > > > bandwidth isn't very good. Running the osu_bw.c test, I get about > > 250 > > > > MB/s > > > > > maximum. It seems like IB isn't being used. > > > > > > > > > > I did the following: > > > > > ./make.mvapich2.detect #, and chose the mellanox option > > > > > ./configure --enable-threads=multiple > > > > > make > > > > > make install > > > > > > > > > > So it seems that the package is doing something to enable infiniband > > > > that I > > > > > am not doing with the tarball. Conversely, the tarball can run > > without > > > > > crashing. > > > > > > > > > > Advice? > > > > > > > > > > Thanks, > > > > > Brian > > > > > > > > > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> > > wrote: > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > I am using the openib-mvapich2-1.0.1 package in the > > gentoo-science > > > > > > overlay > > > > > > > addition to the standard gentoo packages. I have also tried > > 1.0with > > > > > > the > > > > > > > same results. > > > > > > > > > > > > > > I compiled with multithreading turned on (haven't tried without > > > > this, > > > > > > but > > > > > > > the sample codes I am initially testing are not multithreaded, > > > > although > > > > > > my > > > > > > > application is). I also tried with or without rdma with no > > change. > > > > The > > > > > > > > > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > > > > > > > > > So you are using make.mvapich2.ofa to compile the package? I am a > > bit > > > > > > confused about ''I also tried with or without rdma with no > > change''. > > > > What > > > > > > exact change you made here? 
Also, SMALL_CLUSTER is obsolete for > > ofa > > > > > > stack... > > > > > > > > > > > > -- Wei > > > > > > > > > > > > > > > > > > > > Let me know what other information would be useful. > > > > > > > > > > > > > > Thanks, > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang < huanwei@cse.ohio-state.edu> > > > > wrote: > > > > > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > > > > Thanks for letting us know this problem. Would you please let > > us > > > > know > > > > > > some > > > > > > > > more details to help us locate the issue. > > > > > > > > > > > > > > > > 1) More details on your platform. > > > > > > > > > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from OFED > > > > package? > > > > > > or > > > > > > > > some version from our website. > > > > > > > > > > > > > > > > 3) If it is from our website, did you change anything from the > > > > default > > > > > > > > > > > > > > compiling scripts? > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > -- Wei > > > > > > > > > I'm new to the list here... hi! I have been using OpenMPI > > for a > > > > > > while, > > > > > > > > and > > > > > > > > > LAM before that, but new requirements keep pushing me to new > > > > > > > > > implementations. In particular, I was interested in using > > > > > > infiniband > > > > > > > > (using > > > > > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems > > that > > > > > > MVAPICH is > > > > > > > > the > > > > > > > > > library for that particular combination :) > > > > > > > > > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the > > daemons, > > > > and > > > > > > run > > > > > > > > the > > > > > > > > > ring speed test with no problems. When I run any programs > > with > > > > > > mpirun, > > > > > > > > > however, I get an error when sending or receiving more than > > 8192 > > > > > > bytes. 
> > > > > > > > > > > > > > > > > > For example, if I run the bandwidth test from the benchmarks > > > > page > > > > > > > > > (osu_bw.c), I get the following: > > > > > > > > > > > --------------------------------------------------------------- > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > > Thursday 06:16:00 > > > > > > > > > burn > > > > > > > > > burn-3 > > > > > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > > > > > # Size Bandwidth (MB/s) > > > > > > > > > 1 1.24 > > > > > > > > > 2 2.72 > > > > > > > > > 4 5.44 > > > > > > > > > 8 10.18 > > > > > > > > > 16 19.09 > > > > > > > > > 32 29.69 > > > > > > > > > 64 65.01 > > > > > > > > > 128 147.31 > > > > > > > > > 256 244.61 > > > > > > > > > 512 354.32 > > > > > > > > > 1024 367.91 > > > > > > > > > 2048 451.96 > > > > > > > > > 4096 550.66 > > > > > > > > > 8192 598.35 > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from > > rndv > > > > req to > > > > > > > > send > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out > > of > > > > > > range) in > > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > > Fatal error in MPI_Waitall: > > > > > > > > > Other MPI error, error stack: > > > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, > > > > > > > > > status_array=0xdb3140) failed > > > > > > > > > (unknown)(): Other MPI error > > > > > > > > > rank 1 in job 4 burn_37156 caused collective abort of all > > > > > > ranks > > > > > > > > > exit status of rank 1: killed by signal 9 > > > > > > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > > > I get a similar problem with the latency test, however, the > > > > protocol > > > > > > > > that is > > > > > > > > > complained about is different: > > > > > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > > Thursday 09:21:20 > > > > > > > > > # OSU MPI Latency Test v3.0 > > > > > > > > > # Size Latency (us) > > > > > > > > > 0 3.93 > > > > > > > > > 1 4.07 > > > > > > > > > 2 4.06 > > > > > > > > > 4 3.82 > > > > > > > > > 8 3.98 > > > > > > > > > 16 4.03 > > > > > > > > > 32 4.00 > > > > > > > > > 64 4.28 > > > > > > > > > 128 5.22 > > > > > > > > > 256 5.88 > > > > > > > > > 512 8.65 > > > > > > > > > 1024 9.11 > > > > > > > > > 2048 11.53 > > > > > > > > > 4096 16.17 > > > > > > > > > 8192 25.67 > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type > > from > > > > rndv > > > > > > req > > > > > > > > to > > > > > > > > > send > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out > > of > > > > > > range) in > > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > > Fatal error in MPI_Recv: > > > > > > > > > Other MPI error, error stack: > > > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, > > > > src=0, > > > > > > > > tag=1, > > > > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > > > > > (unknown)(): Other MPI error > > > > > > > > > rank 1 in job 5 burn_37156 caused collective abort of all > > > > ranks > > > > > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run the > > > > program > > > > > > > > multiple > > > > > > > > > times. 
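The latency test, in contrast, is a plain blocking ping-pong, so the same rendezvous failure surfaces directly in MPI_Recv once the size reaches 16384 bytes. A minimal sketch of that pattern, with an illustrative iteration count:

/* pingpong.c: blocking ping-pong in the style of a latency test; the
 * first exchange above 8 KB is where the rendezvous error appears.
 */
#include <mpi.h>
#include <stdio.h>

#define MSG_BYTES 16384
#define ITERS     1000

int main(int argc, char **argv)
{
    static char buf[MSG_BYTES];
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", MSG_BYTES,
               (t1 - t0) * 1.0e6 / ITERS / 2.0);

    MPI_Finalize();
    return 0;
}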
> > > > > > > > > > > > > > > > > > Anyone have any ideas? If you need more info, please let me > > > > > > know. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > From brian.budge at gmail.com Tue Jan 8 13:41:27 2008 From: brian.budge at gmail.com (Brian Budge) Date: Tue Jan 8 13:41:43 2008 Subject: [mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd) In-Reply-To: References: <5b7094580801080827i7624cf3n61ebd6c1a3b0edb3@mail.gmail.com> Message-ID: <5b7094580801081041t5ab4057dy98bc666c52d93ed3@mail.gmail.com> Hmmm, this is a PCI-Express setup. Are there some variables I should be tweaking? Thanks, Brai On Jan 8, 2008 10:36 AM, Matthew Koop wrote: > Brian, > > Good to hear that the microbenchmarks are working now. Whether the numbers > you have are good or not is dependant on the platform. Is this a PCI-X or > PCI-Express card? You can expect 900 MB/sec for SDR PCI-Express. > > Matt > > On Tue, 8 Jan 2008, Brian Budge wrote: > > > Hi Matt - > > > > ibv_rc_pingpong worked, and I decided to try a new clean install, and it > > seems to be working quite a bit better now. I must have somehow added > some > > nasty stuff to the Makefile during my previous attempts. > > > > Here is the output: > > > > # OSU MPI Bandwidth Test v3.0 > > # Size Bandwidth (MB/s) > > 1 1.18 > > 2 2.59 > > 4 4.92 > > 8 10.38 > > 16 20.31 > > 32 40.12 > > 64 77.14 > > 128 144.37 > > 256 241.72 > > 512 362.12 > > 1024 471.01 > > 2048 546.45 > > 4096 581.47 > > 8192 600.65 > > 16384 611.52 > > 32768 632.87 > > 65536 642.27 > > 131072 646.30 > > 262144 644.22 > > 524288 644.15 > > 1048576 649.36 > > 2097152 662.55 > > 4194304 672.55 > > > > How do these numbers look for a 10 Gb SDR HCA? > > > > Thanks for your help! > > Brian > > > > On Jan 7, 2008 5:12 PM, Matthew Koop wrote: > > > > > Brian, > > > > > > Can you try the ibv_rc_pingpong program, which is a low-level > (non-MPI) > > > test that ships with OFED? This will make sure that your basic > InfiniBand > > > setup is working properly. > > > > > > Did any other error message print out other than the one you gave? > > > > > > Matt > > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > > > Hi Matt - > > > > > > > > I have now done the install from the ofa build file, and I can boot > and > > > run > > > > the ring test, but now when I run the osu_bw.c benchmark, the > executable > > > > > > > dies in MPI_Init(). > > > > > > > > The things I altered in make.mvapich2.ofa were: > > > > > > > > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr} > > > > SHARED_LIBS=${SHARED_LIBS:-yes} > > > > > > > > and on the configure line I added: > > > > --disable-f77 --disable-f90 > > > > > > > > Here is the error message that I am getting: > > > > > > > > rank 1 in job 1 burn_60139 caused collective abort of all ranks > > > > exit status of rank 1: killed by signal 9 > > > > > > > > Thanks, > > > > Brian > > > > > > > > On Jan 7, 2008 1:21 PM, Matthew Koop > wrote: > > > > > > > > > Brian, > > > > > > > > > > The make.mvapich.detect script is just a helper script (not meant > to > > > be > > > > > executed directly). You need to use the make.mvapich.ofa script, > which > > > > > will call configure and make for you with the correct arguments. 
> > > > > > > > > > More information can be found in our MVAPICH2 user guide under > > > > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP" > > > > > > > > > > https://mvapich.cse.ohio-state.edu/support/ > > > > > > > > > > Let us know if you have any other problems. > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, 7 Jan 2008, Brian Budge wrote: > > > > > > > > > > > Hi Wei - > > > > > > > > > > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no > > > > > difference. > > > > > > > > > > > > When I build with rdma, this adds the following: > > > > > > export LIBS="${LIBS} -lrdmacm" > > > > > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH > > > -DRDMA_CM" > > > > > > > > > > > > It seems that I am using the make.mvapich2.detect script to > build. > > > It > > > > > asks > > > > > > me for my interface, and gives me the option for the mellanox > > > interface, > > > > > > which I choose. > > > > > > > > > > > > I just tried a fresh install directly from the tarball instead > of > > > using > > > > > the > > > > > > gentoo package. Now the program completes (goes beyond 8K > message), > > > but > > > > > my > > > > > > bandwidth isn't very good. Running the osu_bw.c test, I get > about > > > 250 > > > > > MB/s > > > > > > maximum. It seems like IB isn't being used. > > > > > > > > > > > > I did the following: > > > > > > ./make.mvapich2.detect #, and chose the mellanox option > > > > > > ./configure --enable-threads=multiple > > > > > > make > > > > > > make install > > > > > > > > > > > > So it seems that the package is doing something to enable > infiniband > > > > > that I > > > > > > am not doing with the tarball. Conversely, the tarball can run > > > without > > > > > > crashing. > > > > > > > > > > > > Advice? > > > > > > > > > > > > Thanks, > > > > > > Brian > > > > > > > > > > > > On Jan 6, 2008 6:38 AM, wei huang < huanwei@cse.ohio-state.edu> > > > wrote: > > > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > > > I am using the openib-mvapich2-1.0.1 package in the > > > gentoo-science > > > > > > > overlay > > > > > > > > addition to the standard gentoo packages. I have also tried > > > 1.0with > > > > > > > the > > > > > > > > same results. > > > > > > > > > > > > > > > > I compiled with multithreading turned on (haven't tried > without > > > > > this, > > > > > > > but > > > > > > > > the sample codes I am initially testing are not > multithreaded, > > > > > although > > > > > > > my > > > > > > > > application is). I also tried with or without rdma with no > > > change. > > > > > The > > > > > > > > > > > > > > > script seems to be setting the build for SMALL_CLUSTER. > > > > > > > > > > > > > > So you are using make.mvapich2.ofa to compile the package? I > am a > > > bit > > > > > > > confused about ''I also tried with or without rdma with no > > > change''. > > > > > What > > > > > > > exact change you made here? Also, SMALL_CLUSTER is obsolete > for > > > ofa > > > > > > > stack... > > > > > > > > > > > > > > -- Wei > > > > > > > > > > > > > > > > > > > > > > > Let me know what other information would be useful. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Jan 4, 2008 6:12 PM, wei huang < > huanwei@cse.ohio-state.edu> > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Brian, > > > > > > > > > > > > > > > > > > Thanks for letting us know this problem. 
Would you please > let > > > us > > > > > know > > > > > > > some > > > > > > > > > more details to help us locate the issue. > > > > > > > > > > > > > > > > > > 1) More details on your platform. > > > > > > > > > > > > > > > > > > 2) Exact version of mvapich2 you are using. Is it from > OFED > > > > > package? > > > > > > > or > > > > > > > > > some version from our website. > > > > > > > > > > > > > > > > > > 3) If it is from our website, did you change anything from > the > > > > > default > > > > > > > > > > > > > > > > compiling scripts? > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > -- Wei > > > > > > > > > > I'm new to the list here... hi! I have been using > OpenMPI > > > for a > > > > > > > while, > > > > > > > > > and > > > > > > > > > > LAM before that, but new requirements keep pushing me to > new > > > > > > > > > > implementations. In particular, I was interested in > using > > > > > > > infiniband > > > > > > > > > (using > > > > > > > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems > > > that > > > > > > > MVAPICH is > > > > > > > > > the > > > > > > > > > > library for that particular combination :) > > > > > > > > > > > > > > > > > > > > In any case, I installed MVAPICH, and I can boot the > > > daemons, > > > > > and > > > > > > > run > > > > > > > > > the > > > > > > > > > > ring speed test with no problems. When I run any > programs > > > with > > > > > > > mpirun, > > > > > > > > > > however, I get an error when sending or receiving more > than > > > 8192 > > > > > > > bytes. > > > > > > > > > > > > > > > > > > > > For example, if I run the bandwidth test from the > benchmarks > > > > > page > > > > > > > > > > (osu_bw.c), I get the following: > > > > > > > > > > > > > --------------------------------------------------------------- > > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > > > Thursday 06:16:00 > > > > > > > > > > burn > > > > > > > > > > burn-3 > > > > > > > > > > # OSU MPI Bandwidth Test v3.0 > > > > > > > > > > # Size Bandwidth (MB/s) > > > > > > > > > > 1 1.24 > > > > > > > > > > 2 2.72 > > > > > > > > > > 4 5.44 > > > > > > > > > > 8 10.18 > > > > > > > > > > 16 19.09 > > > > > > > > > > 32 29.69 > > > > > > > > > > 64 65.01 > > > > > > > > > > 128 147.31 > > > > > > > > > > 256 244.61 > > > > > > > > > > 512 354.32 > > > > > > > > > > 1024 367.91 > > > > > > > > > > 2048 451.96 > > > > > > > > > > 4096 550.66 > > > > > > > > > > 8192 598.35 > > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from > > > rndv > > > > > req to > > > > > > > > > send > > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index > out > > > of > > > > > > > range) in > > > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > > > Fatal error in MPI_Waitall: > > > > > > > > > > Other MPI error, error stack: > > > > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, > req_array=0xdb21a0, > > > > > > > > > > status_array=0xdb3140) failed > > > > > > > > > > (unknown)(): Other MPI error > > > > > > > > > > rank 1 in job 4 burn_37156 caused collective abort of > all > > > > > > > > ranks > > > > > > > > > > exit status of rank 1: killed by signal 9 > > > > > > > > > > > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > > > > > I get a similar problem with the latency test, however, > the > > > > > protocol > > > > > > > > > that is > > > > > > > > > > complained about is different: > > > > > > > > > > > > > > > > 
-------------------------------------------------------------------- > > > > > > > > > > > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out > > > > > > > > > > Thursday 09:21:20 > > > > > > > > > > # OSU MPI Latency Test v3.0 > > > > > > > > > > # Size Latency (us) > > > > > > > > > > 0 3.93 > > > > > > > > > > 1 4.07 > > > > > > > > > > 2 4.06 > > > > > > > > > > 4 3.82 > > > > > > > > > > 8 3.98 > > > > > > > > > > 16 4.03 > > > > > > > > > > 32 4.00 > > > > > > > > > > 64 4.28 > > > > > > > > > > 128 5.22 > > > > > > > > > > 256 5.88 > > > > > > > > > > 512 8.65 > > > > > > > > > > 1024 9.11 > > > > > > > > > > 2048 11.53 > > > > > > > > > > 4096 16.17 > > > > > > > > > > 8192 25.67 > > > > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 > type > > > from > > > > > rndv > > > > > > > req > > > > > > > > > to > > > > > > > > > > send > > > > > > > > > > Internal Error: invalid error code ffffffff (Ring Index > out > > > of > > > > > > > range) in > > > > > > > > > > MPIDI_CH3_RndvSend:263 > > > > > > > > > > Fatal error in MPI_Recv: > > > > > > > > > > Other MPI error, error stack: > > > > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, > MPI_CHAR, > > > > > src=0, > > > > > > > > > tag=1, > > > > > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed > > > > > > > > > > (unknown)(): Other MPI error > > > > > > > > > > rank 1 in job 5 burn_37156 caused collective abort of > all > > > > > ranks > > > > > > > > > > > > > > > > -------------------------------------------------------------------- > > > > > > > > > > > > > > > > > > > > The protocols (0 and 8126589) are consistent if I run > the > > > > > program > > > > > > > > > multiple > > > > > > > > > > times. > > > > > > > > > > > > > > > > > > > > Anyone have any ideas? If you need more info, please > let me > > > > > > > > know. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080108/f181b9c1/attachment-0001.html From brian.budge at gmail.com Tue Jan 8 19:42:36 2008 From: brian.budge at gmail.com (Brian Budge) Date: Tue Jan 8 19:42:43 2008 Subject: [mvapich-discuss] Re: vbuf registration In-Reply-To: <5b7094580801080937s60124ac2pa8f451d42d49b61f@mail.gmail.com> References: <5b7094580801080937s60124ac2pa8f451d42d49b61f@mail.gmail.com> Message-ID: <5b7094580801081642r55c6ce75paebf420f915e5314@mail.gmail.com> Hi again - It looks as though this problem was due to the fact that I am running zsh as my shell, and so my unlimit commands weren't being executed because with zsh the .zshrc file isn't loaded with MPI, I believe it's loading the .zshenv file instead. Thanks, Brian On Jan 8, 2008 9:37 AM, Brian Budge wrote: > Hi all - > > My program is running near to completion, and then dies, complaining: > > [vbuf.c 184] Cannot register vbuf region > rank 1 in job 13 burn_40823 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > > I can run several of the osu benchmarks without any problem. > > Addtionally, when I run my app without MPI, I can use mmap with the > MAP_LOCKED flag, but when I run with MPI, the first mmap with MAP_LOCKED > fails, saying that some resources weren't available. 
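Both symptoms quoted here, the vbuf registration failure and the mmap with MAP_LOCKED that only fails under MPI, depend on the locked-memory limit the MPI processes actually inherit, which is exactly what the zsh startup-file issue described above would affect. A small diagnostic sketch (not part of MVAPICH2) that each rank can run right after MPI_Init to print the limit it really got:

/* memlock_check.c: print the RLIMIT_MEMLOCK each MPI rank inherited, to
 * confirm whether shell limit settings (for example ones set in .zshrc
 * but not .zshenv) reached the remotely started processes.
 */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    int rank;
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("rank %d: RLIMIT_MEMLOCK = unlimited\n", rank);
        else
            printf("rank %d: RLIMIT_MEMLOCK = %llu bytes\n",
                   rank, (unsigned long long)rl.rlim_cur);
    } else {
        perror("getrlimit");
    }

    MPI_Finalize();
    return 0;
}

If the value printed under mpirun is much lower than what ulimit -l reports in an interactive shell, registration failures like the one above are to be expected; raising the limit in a file that non-interactive shells read, as worked out above for zsh, or in the system-wide limits configuration typically resolves it.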
If I remove the > MAP_LOCKED flag, I successfully mmap. > > These issues may or may not be related (ie. maybe my locked limit is > magically reduced when I run using MPI, and mlock is used in conjunction > with vbuf registration?). > > Thanks, > Brian > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080108/da53f7a8/attachment.html From brian.budge at gmail.com Tue Jan 8 19:57:19 2008 From: brian.budge at gmail.com (Brian Budge) Date: Tue Jan 8 19:57:27 2008 Subject: [mvapich-discuss] problems with MPI + GPU Message-ID: <5b7094580801081657n4ff2571el2d3b331bd813124@mail.gmail.com> Hi all - Sorry for all the traffic, but I'm getting very close to being able to reliably run my application with mvapich2. The problem I am having now is with GPUs. I am running an application which uses GPUs and the CUDA programming environment to accelerate computation. It's exciting stuff, and depending on the problem, I see 2 to 6x speedup (I am running a ray tracing type application). Everything works if I run without MPI, but if I run with mvapich2, my GPU initialization fails about 75% of the time, making my runs quite unreliable. In the 25% when the device initializes, everything else works fine. Now, I'm not sure what could possibly cause this, and I could see this problem cropping up due to any of the following factors: 1) bug in mvapich2 2) bug in CUDA 3) bug in OFED IB stuff Does anyone have any ideas how to even begin tracking this down? Could it be something like infiniband device initialization walking into NVIDIA's memory space? I'm grasping at straws here ;) Thanks, Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080108/91425665/attachment.html From panda at cse.ohio-state.edu Wed Jan 9 00:11:16 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed Jan 9 00:11:24 2008 Subject: [mvapich-discuss] problems with MPI + GPU In-Reply-To: <5b7094580801081657n4ff2571el2d3b331bd813124@mail.gmail.com> Message-ID: Hi Brian, > Hi all - > > Sorry for all the traffic, but I'm getting very close to being able to > reliably run my application with mvapich2. Good to know this. > The problem I am having now is with GPUs. I am running an application > which uses GPUs and the CUDA programming environment to accelerate > computation. It's exciting stuff, and depending on the problem, I see 2 to > 6x speedup (I am running a ray tracing type application). Everything works > if I run without MPI, but if I run with mvapich2, my GPU initialization > fails about 75% of the time, making my runs quite unreliable. In the 25% > when the device initializes, everything else works fine. Unfortunately, we have not tested MVAPICH2 + IB (OFED) + GPU (with CUDA). If anybody else in this list has experience in running MVAPICH2 in this mode, they can indicate their experience. You can also post a note regarding this to the OFED general list. > Now, I'm not sure what could possibly cause this, and I could see this > problem cropping up due to any of the following factors: > > 1) bug in mvapich2 > 2) bug in CUDA > 3) bug in OFED IB stuff > > Does anyone have any ideas how to even begin tracking this down? Could it > be something like infiniband device initialization walking into NVIDIA's > memory space? > I'm grasping at straws here ;) Can you run basic MPICH2 (from Argonne) with Ethernet + GPU (with CUDA)? 
This will isolate IB-specific issues with IB/OFED and provide more insights to this problem. Thanks, DK > Thanks, > Brian > From brian.budge at gmail.com Wed Jan 9 11:25:37 2008 From: brian.budge at gmail.com (Brian Budge) Date: Wed Jan 9 11:25:47 2008 Subject: [mvapich-discuss] problems with MPI + GPU In-Reply-To: References: <5b7094580801081657n4ff2571el2d3b331bd813124@mail.gmail.com> Message-ID: <5b7094580801090825k43e92122s621a142f4eb822ba@mail.gmail.com> Hi DK - I just rebuilt mvapich2 with tcp instead of ofa, and now my program reliably executes. I'll post something to the OFED list if I can find it. Thanks, Brian On Jan 8, 2008 9:11 PM, Dhabaleswar Panda wrote: > Hi Brian, > > > Hi all - > > > > Sorry for all the traffic, but I'm getting very close to being able to > > reliably run my application with mvapich2. > > Good to know this. > > > The problem I am having now is with GPUs. I am running an application > > which uses GPUs and the CUDA programming environment to accelerate > > computation. It's exciting stuff, and depending on the problem, I see 2 > to > > 6x speedup (I am running a ray tracing type application). Everything > works > > if I run without MPI, but if I run with mvapich2, my GPU initialization > > fails about 75% of the time, making my runs quite unreliable. In the > 25% > > when the device initializes, everything else works fine. > > Unfortunately, we have not tested MVAPICH2 + IB (OFED) + GPU (with CUDA). > If anybody else in this list has experience in running MVAPICH2 in this > mode, they can indicate their experience. > > You can also post a note regarding this to the OFED general list. > > > Now, I'm not sure what could possibly cause this, and I could see this > > problem cropping up due to any of the following factors: > > > > 1) bug in mvapich2 > > 2) bug in CUDA > > 3) bug in OFED IB stuff > > > > Does anyone have any ideas how to even begin tracking this down? Could > it > > be something like infiniband device initialization walking into NVIDIA's > > memory space? > > I'm grasping at straws here ;) > > Can you run basic MPICH2 (from Argonne) with Ethernet + GPU (with CUDA)? > This will isolate IB-specific issues with IB/OFED and provide more > insights to this problem. > > Thanks, > > DK > > > Thanks, > > Brian > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080109/5796db3e/attachment.html From Durga.Choudhury at drs-ss.com Wed Jan 9 12:06:48 2008 From: Durga.Choudhury at drs-ss.com (Choudhury, Durga) Date: Wed Jan 9 12:06:24 2008 Subject: [mvapich-discuss] problems with MPI + GPU In-Reply-To: <5b7094580801090825k43e92122s621a142f4eb822ba@mail.gmail.com> References: <5b7094580801081657n4ff2571el2d3b331bd813124@mail.gmail.com> <5b7094580801090825k43e92122s621a142f4eb822ba@mail.gmail.com> Message-ID: Brian I would be very interested to know what, if any, solution you found to this issue. Please post your findings to the list, or at list send it to me individually. Thank you. Durga ________________________________ From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Brian Budge Sent: Wednesday, January 09, 2008 11:26 AM To: Dhabaleswar Panda Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] problems with MPI + GPU Hi DK - I just rebuilt mvapich2 with tcp instead of ofa, and now my program reliably executes. I'll post something to the OFED list if I can find it. 
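One way to begin tracking this down, in the spirit of the isolation test suggested above, is to have every rank probe the CUDA runtime immediately after MPI_Init and print the exact error string it gets, so the failing runs at least show whether device enumeration or device selection is what breaks. A diagnostic sketch, assuming the CUDA runtime API is called from plain C; the rank-to-device mapping and the build line are illustrative only:

/* gpu_init_check.c: after MPI_Init, each rank queries the CUDA runtime
 * and reports the exact error string, to show where GPU initialization
 * fails when started under mpirun.  Diagnostic sketch only.
 * Illustrative build: mpicc gpu_init_check.c -I$CUDA_HOME/include \
 *                     -L$CUDA_HOME/lib64 -lcudart -o gpu_init_check
 */
#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ndev = 0;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess) {
        printf("rank %d: cudaGetDeviceCount failed: %s\n",
               rank, cudaGetErrorString(err));
    } else if (ndev == 0) {
        printf("rank %d: no CUDA devices visible\n", rank);
    } else {
        printf("rank %d: %d CUDA device(s) visible\n", rank, ndev);
        err = cudaSetDevice(rank % ndev);
        if (err != cudaSuccess)
            printf("rank %d: cudaSetDevice failed: %s\n",
                   rank, cudaGetErrorString(err));
        else
            printf("rank %d: using device %d\n", rank, rank % ndev);
    }

    MPI_Finalize();
    return 0;
}

Running the same sketch against both the ofa and the tcp builds would show whether the failures track the InfiniBand path, matching what was observed here with the tcp rebuild.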
Thanks, Brian On Jan 8, 2008 9:11 PM, Dhabaleswar Panda < panda@cse.ohio-state.edu> wrote: Hi Brian, > Hi all - > > Sorry for all the traffic, but I'm getting very close to being able to > reliably run my application with mvapich2. Good to know this. > The problem I am having now is with GPUs. I am running an application > which uses GPUs and the CUDA programming environment to accelerate > computation. It's exciting stuff, and depending on the problem, I see 2 to > 6x speedup (I am running a ray tracing type application). Everything works > if I run without MPI, but if I run with mvapich2, my GPU initialization > fails about 75% of the time, making my runs quite unreliable. In the 25% > when the device initializes, everything else works fine. Unfortunately, we have not tested MVAPICH2 + IB (OFED) + GPU (with CUDA). If anybody else in this list has experience in running MVAPICH2 in this mode, they can indicate their experience. You can also post a note regarding this to the OFED general list. > Now, I'm not sure what could possibly cause this, and I could see this > problem cropping up due to any of the following factors: > > 1) bug in mvapich2 > 2) bug in CUDA > 3) bug in OFED IB stuff > > Does anyone have any ideas how to even begin tracking this down? Could it > be something like infiniband device initialization walking into NVIDIA's > memory space? > I'm grasping at straws here ;) Can you run basic MPICH2 (from Argonne) with Ethernet + GPU (with CUDA)? This will isolate IB-specific issues with IB/OFED and provide more insights to this problem. Thanks, DK > Thanks, > Brian > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080109/2f240c81/attachment-0001.html From chail at cse.ohio-state.edu Wed Jan 9 13:59:35 2008 From: chail at cse.ohio-state.edu (Lei Chai) Date: Wed Jan 9 14:01:55 2008 Subject: [mvapich-discuss] Re: protocol used for MPI_FInaize in mvapich2 Message-ID: Hi Nilesh, I'm not sure what exact protocol you are looking for. You probably can take a look at src/mpid/osu_ch3/src/mpid_finalize.c to find out. Thanks, Lei ----- Original Message ----- From: nilesh awate Date: Monday, January 7, 2008 1:15 am Subject: protocol used for MPI_FInaize in mvapich2 > > Hi all, > > I'm using mvapich2-1.0.1 with OFED1.2(udapl stack) > > To know the flow of MPI_FInalize i put some debug statement in > source code & tried > simple mpi test code (only init & finalize api) > I observed there is shutting down/closing protocol (in which every > process does > 2dto) > some body plz tell how these dto (function trace of MPI_Finalize) > happenwhat is exact protocol is mvapich follows. > > thanking, > Nilesh > > > > 5, 50, 500, 5000 - Store N number of mails in your inbox. Go > to http://help.yahoo.com/l/in/yahoo/mail/yahoomail/tools/tools-08.html From sshaw at sgi.com Wed Jan 9 14:40:21 2008 From: sshaw at sgi.com (Scott Shaw) Date: Wed Jan 9 14:40:58 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 Message-ID: <9BEB932202A05B488722B05D2374A1DA04BF73A0@mtv-amer001e--3.americas.sgi.com> Hi, On several clusters we are experiencing the same issues originally posted on Oct 11, 2007 regarding "error closing socket at end of mpirun_rsh" job. Running the mpi test with one core works, no error is generated but n+1 cores error is generated. Is there a patch available which addresses the "Termination socket read failed" error message? 
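[Editor's note: the mpi_test source is not included anywhere in this thread. For readers trying to reproduce the report, a minimal test matching the output shown below would look roughly like the following sketch; the exact contents of the real mpi_test.c are an assumption.]

/* mpi_test.c -- assumed reconstruction; the real source is not in this thread.
 * Each rank announces itself, calls MPI_Finalize, and reports that it is
 * exiting.  The "Termination socket read failed" message comes from
 * mpirun_rsh during teardown, after the ranks have already finalized. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Rank=%d present and calling MPI_Finalize\n", rank);
    fflush(stdout);

    MPI_Finalize();

    printf("Rank=%d bailing, nicely\n", rank);
    fflush(stdout);
    return 0;
}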
I have tested three different clusters and each cluster exhibits the same error. I also check the "mvapich-discuss" archives and still did not see a resolution. I am currently running mvapich v0.9.9 which is bundled with ofed v1.2. r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test Rank=0 present and calling MPI_Finalize Rank=0 bailing, nicely r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test Rank=1 present and calling MPI_Finalize Rank=0 present and calling MPI_Finalize Rank=0 bailing, nicely Termination socket read failed: Bad file descriptor Rank=1 bailing, nicely r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test Rank=1 present and calling MPI_Finalize Rank=3 present and calling MPI_Finalize Rank=0 present and calling MPI_Finalize Rank=2 present and calling MPI_Finalize Rank=0 bailing, nicely Termination socket read failed: Bad file descriptor Rank=3 bailing, nicely Rank=1 bailing, nicely Rank=2 bailing, nicely Thanks, Scott From nilesh_awate at yahoo.com Thu Jan 10 00:04:59 2008 From: nilesh_awate at yahoo.com (nilesh awate) Date: Thu Jan 10 00:05:13 2008 Subject: [mvapich-discuss] Re: protocol used for MPI_FInaize in mvapich2 Message-ID: <160208.58352.qm@web94108.mail.in2.yahoo.com> Hi Lei, thanks for reply, Actually i've already gone thro' src/mpid/osu_ch3/src/mpid_finalize.c this file for both (mvapich2-.0.9.8 & 1.0.1) i found difference in mpid_finalize.c of both the version. There is dto(2 per process) in MPID_Finalize() function, but while dequeue we got only 3 events (ideally we should get 4 successfull event) One more thing while tracing MPID_Finalize() function i did'nt find call for MPID_Send() which do send/rdma-write (In mvapich2-0.9.8 i found MPID_CH3_iStartMsg call which do send/rdma write) will you plz explain how MPID_Finalize is working in mvapich2-1.0.1 waiting for reply, thanking, Nilesh ----- Original Message ---- From: Lei Chai To: nilesh awate Cc: mvapich-discuss@cse.ohio-state.edu; lei chai Sent: Thursday, 10 January, 2008 12:29:35 AM Subject: Re: protocol used for MPI_FInaize in mvapich2 Hi Nilesh, I'm not sure what exact protocol you are looking for. You probably can take a look at src/mpid/osu_ch3/src/mpid_finalize.c to find out. Thanks, Lei ----- Original Message ----- From: nilesh awate Date: Monday, January 7, 2008 1:15 am Subject: protocol used for MPI_FInaize in mvapich2 > > Hi all, > > I'm using mvapich2-1.0.1 with OFED1..2(udapl stack) > > To know the flow of MPI_FInalize i put some debug statement in > source code & tried > simple mpi test code (only init & finalize api) > I observed there is shutting down/closing protocol (in which every > process does > 2dto) > some body plz tell how these dto (function trace of MPI_Finalize) > happenwhat is exact protocol is mvapich follows. > > thanking, > Nilesh > > > > 5, 50, 500, 5000 - Store N number of mails in your inbox. Go > to http://help.yahoo.com/l/in/yahoo/mail/yahoomail/tools/tools-08.html 5, 50, 500, 5000 - Store N number of mails in your inbox. Go to http://help.yahoo.com/l/in/yahoo/mail/yahoomail/tools/tools-08.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080110/549a8003/attachment.html From panda at cse.ohio-state.edu Sat Jan 12 09:24:50 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sat Jan 12 09:24:57 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 In-Reply-To: <9BEB932202A05B488722B05D2374A1DA04BF73A0@mtv-amer001e--3.americas.sgi.com> Message-ID: Hi Scott, As we discussed off-line, you have access to a solution to this problem. Let us know how it works. This solution is also being available with the enhanced and strengthened mpirun_rsh of mvapich 1.0 version. Thanks, DK On Wed, 9 Jan 2008, Scott Shaw wrote: > Hi, > On several clusters we are experiencing the same issues originally > posted on Oct 11, 2007 regarding "error closing socket at end of > mpirun_rsh" job. Running the mpi test with one core works, no error is > generated but n+1 cores error is generated. > > Is there a patch available which addresses the "Termination socket read > failed" error message? I have tested three different clusters and each > cluster exhibits the same error. I also check the "mvapich-discuss" > archives and still did not see a resolution. > > I am currently running mvapich v0.9.9 which is bundled with ofed v1.2. > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test > Rank=0 present and calling MPI_Finalize > Rank=0 bailing, nicely > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test > Rank=1 present and calling MPI_Finalize > Rank=0 present and calling MPI_Finalize > Rank=0 bailing, nicely > Termination socket read failed: Bad file descriptor > Rank=1 bailing, nicely > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test > Rank=1 present and calling MPI_Finalize > Rank=3 present and calling MPI_Finalize > Rank=0 present and calling MPI_Finalize > Rank=2 present and calling MPI_Finalize > Rank=0 bailing, nicely > Termination socket read failed: Bad file descriptor > Rank=3 bailing, nicely > Rank=1 bailing, nicely > Rank=2 bailing, nicely > > Thanks, > Scott > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From sshaw at sgi.com Mon Jan 14 10:11:21 2008 From: sshaw at sgi.com (Scott Shaw) Date: Mon Jan 14 10:12:43 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 In-Reply-To: References: <9BEB932202A05B488722B05D2374A1DA04BF73A0@mtv-amer001e--3.americas.sgi.com> Message-ID: <9BEB932202A05B488722B05D2374A1DA04C4BF9C@mtv-amer001e--3.americas.sgi.com> Hi DK, I installed and tested the NMCAC mvapich patches. Rerunning simple MPI tests still a problem. What seems interesting is the "termination failed" message does not happen on cluster nodes with drives only our diskless clusters. Another interesting data point is that this error can occur when just using "-np 2", two cores, on the same node so this might rule out networking issues? Following is an email I sent to Michel and Kevin regarding this issue. Would it help if I provide you access to a cluster for testing purposes? Thanks, Scott Thursday, January 10, 2008 3:07 PM Hi Michel, Kevin - I have downloaded the rpms from the location Michel provided. I extracted the rpms in my home directory instead of messing with what's currently installed on orbit6.americas. 
I recompiled the application and linked against the new/revised mvapich libs and I still get the termination failed message. Several applications like NEMO which are built against mvapich showed the same failure which prompted me to post the question to the mvapich mail alias. A customer reviewing the result files will be suspicious of this error messages and _if_ the analysis completed successfully. So this could be a potential issue to customers review benchmark results. Any ideas how to proceed? service0 /store/sshaw> pwd /nas/store/sshaw rpm2cpio mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio -ivd service0 /store/sshaw> setenv LD_LIBRARY_PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib service0 /store/sshaw> module load intel-compilers-9 service0 /store/sshaw> module list Currently Loaded Modulefiles: 1) intel-cc-9/9.1.052 2) intel-fc-9/9.1.052 3) intel-compilers-9 service0 /store/sshaw> setenv PATH /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH} service0 /store/sshaw> which mpirun_rsh /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh service0 /store/sshaw> mpicc mpi_test.c -o mpi_test -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib service0 /store/sshaw> /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test Rank=0 present and calling MPI_Finalize Rank=0 present and calling MPI_Finalize Rank=0 bailing, nicely Termination socket read failed: Bad file descriptor Rank=0 bailing, nicely Thanks, Scott > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: Saturday, January 12, 2008 9:25 AM > To: Scott Shaw > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] error closing socket at end of mpirun_rsh > original posted Oct 11 > > Hi Scott, > > As we discussed off-line, you have access to a solution to this problem. > Let us know how it works. This solution is also being available with the > enhanced and strengthened mpirun_rsh of mvapich 1.0 version. > > Thanks, > > DK > > > On Wed, 9 Jan 2008, Scott Shaw wrote: > > > Hi, > > On several clusters we are experiencing the same issues originally > > posted on Oct 11, 2007 regarding "error closing socket at end of > > mpirun_rsh" job. Running the mpi test with one core works, no error is > > generated but n+1 cores error is generated. > > > > Is there a patch available which addresses the "Termination socket read > > failed" error message? I have tested three different clusters and each > > cluster exhibits the same error. I also check the "mvapich-discuss" > > archives and still did not see a resolution. > > > > I am currently running mvapich v0.9.9 which is bundled with ofed v1.2. 
> > > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test > > Rank=0 present and calling MPI_Finalize > > Rank=0 bailing, nicely > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test > > Rank=1 present and calling MPI_Finalize > > Rank=0 present and calling MPI_Finalize > > Rank=0 bailing, nicely > > Termination socket read failed: Bad file descriptor > > Rank=1 bailing, nicely > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test > > Rank=1 present and calling MPI_Finalize > > Rank=3 present and calling MPI_Finalize > > Rank=0 present and calling MPI_Finalize > > Rank=2 present and calling MPI_Finalize > > Rank=0 bailing, nicely > > Termination socket read failed: Bad file descriptor > > Rank=3 bailing, nicely > > Rank=1 bailing, nicely > > Rank=2 bailing, nicely > > > > Thanks, > > Scott > > > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From panda at cse.ohio-state.edu Mon Jan 14 10:26:22 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Jan 14 10:26:31 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 In-Reply-To: <9BEB932202A05B488722B05D2374A1DA04C4BF9C@mtv-amer001e--3.americas.sgi.com> Message-ID: Hi Scott, Sorry to know that you are still encountering the problem on some systems. Thanks for offering to provide us access to a test cluster. This will be very helpful. Please send me the remote access information and one of my team members will work closely with you to resolve this problem. Thanks, DK On Mon, 14 Jan 2008, Scott Shaw wrote: > Hi DK, > I installed and tested the NMCAC mvapich patches. Rerunning simple MPI > tests still a problem. What seems interesting is the "termination > failed" message does not happen on cluster nodes with drives only our > diskless clusters. Another interesting data point is that this error can > occur when just using "-np 2", two cores, on the same node so this might > rule out networking issues? > > Following is an email I sent to Michel and Kevin regarding this issue. > Would it help if I provide you access to a cluster for testing purposes? > > > Thanks, > Scott > > Thursday, January 10, 2008 3:07 PM > Hi Michel, Kevin - > I have downloaded the rpms from the location Michel provided. I > extracted the rpms in my home directory instead of messing with what's > currently installed on orbit6.americas. I recompiled the application and > linked against the new/revised mvapich libs and I still get the > termination failed message. Several applications like NEMO which are > built against mvapich showed the same failure which prompted me to post > the question to the mvapich mail alias. A customer reviewing the result > files will be suspicious of this error messages and _if_ the analysis > completed successfully. So this could be a potential issue to customers > review benchmark results. Any ideas how to proceed? 
> > service0 /store/sshaw> pwd > /nas/store/sshaw > > rpm2cpio > mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio > -ivd rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio > -ivd > > > service0 /store/sshaw> setenv LD_LIBRARY_PATH > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib > service0 /store/sshaw> module load intel-compilers-9 service0 > /store/sshaw> module list Currently Loaded Modulefiles: > 1) intel-cc-9/9.1.052 2) intel-fc-9/9.1.052 3) intel-compilers-9 > > service0 /store/sshaw> setenv PATH > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH} > service0 /store/sshaw> which mpirun_rsh > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh > > service0 /store/sshaw> mpicc mpi_test.c -o mpi_test > -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib > > service0 /store/sshaw> > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 > -hostfile ./hfile ./mpi_test Rank=0 present and calling MPI_Finalize > Rank=0 present and calling MPI_Finalize Rank=0 bailing, nicely > Termination socket read failed: Bad file descriptor Rank=0 bailing, > nicely > > > Thanks, > Scott > > > > > > -----Original Message----- > > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > > Sent: Saturday, January 12, 2008 9:25 AM > > To: Scott Shaw > > Cc: mvapich-discuss@cse.ohio-state.edu > > Subject: Re: [mvapich-discuss] error closing socket at end of > mpirun_rsh > > original posted Oct 11 > > > > Hi Scott, > > > > As we discussed off-line, you have access to a solution to this > problem. > > Let us know how it works. This solution is also being available with > the > > enhanced and strengthened mpirun_rsh of mvapich 1.0 version. > > > > Thanks, > > > > DK > > > > > > On Wed, 9 Jan 2008, Scott Shaw wrote: > > > > > Hi, > > > On several clusters we are experiencing the same issues originally > > > posted on Oct 11, 2007 regarding "error closing socket at end of > > > mpirun_rsh" job. Running the mpi test with one core works, no error > is > > > generated but n+1 cores error is generated. > > > > > > Is there a patch available which addresses the "Termination socket > read > > > failed" error message? I have tested three different clusters and > each > > > cluster exhibits the same error. I also check the "mvapich-discuss" > > > archives and still did not see a resolution. > > > > > > I am currently running mvapich v0.9.9 which is bundled with ofed > v1.2. 
> > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test > > > Rank=0 present and calling MPI_Finalize > > > Rank=0 bailing, nicely > > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test > > > Rank=1 present and calling MPI_Finalize > > > Rank=0 present and calling MPI_Finalize > > > Rank=0 bailing, nicely > > > Termination socket read failed: Bad file descriptor > > > Rank=1 bailing, nicely > > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test > > > Rank=1 present and calling MPI_Finalize > > > Rank=3 present and calling MPI_Finalize > > > Rank=0 present and calling MPI_Finalize > > > Rank=2 present and calling MPI_Finalize > > > Rank=0 bailing, nicely > > > Termination socket read failed: Bad file descriptor > > > Rank=3 bailing, nicely > > > Rank=1 bailing, nicely > > > Rank=2 bailing, nicely > > > > > > Thanks, > > > Scott > > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > From sshaw at sgi.com Mon Jan 14 13:18:48 2008 From: sshaw at sgi.com (Scott Shaw) Date: Mon Jan 14 13:20:22 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 In-Reply-To: References: <9BEB932202A05B488722B05D2374A1DA04C4BF9C@mtv-amer001e--3.americas.sgi.com> Message-ID: <9BEB932202A05B488722B05D2374A1DA04C4C0CE@mtv-amer001e--3.americas.sgi.com> DK, I submitted a user account request to our support team and should have an account created later this afternoon. We have two ICE clusters available from the internet and not sure which one will be used so I will provide a hostname in a bit. The user account requested: Userid: osu_support Temp Passwd: sgisgi4u Thank you again for continued support. Scott > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: Monday, January 14, 2008 10:26 AM > To: Scott Shaw > Cc: mvapich-discuss@cse.ohio-state.edu; Dhabaleswar Panda > Subject: RE: [mvapich-discuss] error closing socket at end of mpirun_rsh > original posted Oct 11 > > Hi Scott, > > Sorry to know that you are still encountering the problem on some systems. > Thanks for offering to provide us access to a test cluster. This will be > very helpful. Please send me the remote access information and one of my > team members will work closely with you to resolve this problem. > > Thanks, > > DK > > On Mon, 14 Jan 2008, Scott Shaw wrote: > > > Hi DK, > > I installed and tested the NMCAC mvapich patches. Rerunning simple MPI > > tests still a problem. What seems interesting is the "termination > > failed" message does not happen on cluster nodes with drives only our > > diskless clusters. Another interesting data point is that this error can > > occur when just using "-np 2", two cores, on the same node so this might > > rule out networking issues? > > > > Following is an email I sent to Michel and Kevin regarding this issue. > > Would it help if I provide you access to a cluster for testing purposes? > > > > > > Thanks, > > Scott > > > > Thursday, January 10, 2008 3:07 PM > > Hi Michel, Kevin - > > I have downloaded the rpms from the location Michel provided. I > > extracted the rpms in my home directory instead of messing with what's > > currently installed on orbit6.americas. 
I recompiled the application and > > linked against the new/revised mvapich libs and I still get the > > termination failed message. Several applications like NEMO which are > > built against mvapich showed the same failure which prompted me to post > > the question to the mvapich mail alias. A customer reviewing the result > > files will be suspicious of this error messages and _if_ the analysis > > completed successfully. So this could be a potential issue to customers > > review benchmark results. Any ideas how to proceed? > > > > service0 /store/sshaw> pwd > > /nas/store/sshaw > > > > rpm2cpio > > mvapich_intel-test-SGINoShip-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio > > -ivd rpm2cpio mvapich_intel-0.9.9-1326sgi503rp2michel.x86_64.rpm | cpio > > -ivd > > > > > > service0 /store/sshaw> setenv LD_LIBRARY_PATH > > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib > > service0 /store/sshaw> module load intel-compilers-9 service0 > > /store/sshaw> module list Currently Loaded Modulefiles: > > 1) intel-cc-9/9.1.052 2) intel-fc-9/9.1.052 3) intel-compilers-9 > > > > service0 /store/sshaw> setenv PATH > > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin:${PATH} > > service0 /store/sshaw> which mpirun_rsh > > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh > > > > service0 /store/sshaw> mpicc mpi_test.c -o mpi_test > > -L/store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/lib > > > > service0 /store/sshaw> > > /store/sshaw/nmcac/usr/mpi/mvapich-0.9.9/intel/bin/mpirun_rsh -np 2 > > -hostfile ./hfile ./mpi_test Rank=0 present and calling MPI_Finalize > > Rank=0 present and calling MPI_Finalize Rank=0 bailing, nicely > > Termination socket read failed: Bad file descriptor Rank=0 bailing, > > nicely > > > > > > Thanks, > > Scott > > > > > > > > > > > -----Original Message----- > > > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > > > Sent: Saturday, January 12, 2008 9:25 AM > > > To: Scott Shaw > > > Cc: mvapich-discuss@cse.ohio-state.edu > > > Subject: Re: [mvapich-discuss] error closing socket at end of > > mpirun_rsh > > > original posted Oct 11 > > > > > > Hi Scott, > > > > > > As we discussed off-line, you have access to a solution to this > > problem. > > > Let us know how it works. This solution is also being available with > > the > > > enhanced and strengthened mpirun_rsh of mvapich 1.0 version. > > > > > > Thanks, > > > > > > DK > > > > > > > > > On Wed, 9 Jan 2008, Scott Shaw wrote: > > > > > > > Hi, > > > > On several clusters we are experiencing the same issues originally > > > > posted on Oct 11, 2007 regarding "error closing socket at end of > > > > mpirun_rsh" job. Running the mpi test with one core works, no error > > is > > > > generated but n+1 cores error is generated. > > > > > > > > Is there a patch available which addresses the "Termination socket > > read > > > > failed" error message? I have tested three different clusters and > > each > > > > cluster exhibits the same error. I also check the "mvapich-discuss" > > > > archives and still did not see a resolution. > > > > > > > > I am currently running mvapich v0.9.9 which is bundled with ofed > > v1.2. 
> > > > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 1 -hostfile ./hfile ./mpi_test > > > > Rank=0 present and calling MPI_Finalize > > > > Rank=0 bailing, nicely > > > > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 2 -hostfile ./hfile ./mpi_test > > > > Rank=1 present and calling MPI_Finalize > > > > Rank=0 present and calling MPI_Finalize > > > > Rank=0 bailing, nicely > > > > Termination socket read failed: Bad file descriptor > > > > Rank=1 bailing, nicely > > > > > > > > r1i0n0 /store/sshaw> mpirun_rsh -np 4 -hostfile ./hfile ./mpi_test > > > > Rank=1 present and calling MPI_Finalize > > > > Rank=3 present and calling MPI_Finalize > > > > Rank=0 present and calling MPI_Finalize > > > > Rank=2 present and calling MPI_Finalize > > > > Rank=0 bailing, nicely > > > > Termination socket read failed: Bad file descriptor > > > > Rank=3 bailing, nicely > > > > Rank=1 bailing, nicely > > > > Rank=2 bailing, nicely > > > > > > > > Thanks, > > > > Scott > > > > > > > > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > From sshaw at sgi.com Mon Jan 14 13:46:57 2008 From: sshaw at sgi.com (Scott Shaw) Date: Mon Jan 14 13:48:41 2008 Subject: [mvapich-discuss] error closing socket at end of mpirun_rsh original posted Oct 11 References: <9BEB932202A05B488722B05D2374A1DA04C4BF9C@mtv-amer001e--3.americas.sgi.com> Message-ID: <9BEB932202A05B488722B05D2374A1DA04C4C119@mtv-amer001e--3.americas.sgi.com> This email was not intended for the mail alias and I have removed the account. Sorry for the wasted bandwidth and thanks to those who have replied regarding the my original email. Scott From ben.held at staarinc.com Thu Jan 17 12:37:34 2008 From: ben.held at staarinc.com (Ben Held) Date: Thu Jan 17 12:37:43 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c Message-ID: <01d401c8592f$a5048150$ef0d83f0$@held@staarinc.com> We have recently built our MPI application using MVAPICH1 under LINUX and are seeing certain runs fail (success or failure seems to be a function of the # of processes - 8 will work, 16 will fail, 32 will work, etc). This code has been thoroughly testing using the standard MPICH (Ethernet based) and LAM and everything is fine. Does this error: Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c Mean anything? This is a new cluster (8 node, 8 cores per node) has been tested under using stress tests provided by the cluster manufacturer (Microway). This is out of my area of expertise and this is the first IB based system I have worked on. Any thoughts? Regards, Ben Held Simulation Technology & Applied Research, Inc. 11520 N. Port Washington Rd., Suite 201 Mequon, WI 53092 P: 1.262.240.0291 x101 F: 1.262.240.0294 E: ben.held@staarinc.com http://www.staarinc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080117/0e5be848/attachment.html From koop at cse.ohio-state.edu Thu Jan 17 22:28:50 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Thu Jan 17 22:28:58 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c In-Reply-To: <01d401c8592f$a5048150$ef0d83f0$@held@staarinc.com> Message-ID: Ben, Sorry to hear about this issue. 
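[Editor's note: the "unable to register vbuf DMA buffer" abort is normally a locked-memory (memory registration) problem, which is why the questions that follow ask for the output of 'ulimit -l'; the same RLIMIT_MEMLOCK ceiling is also what the earlier MAP_LOCKED mmap report runs into. A small per-node check using only standard POSIX calls might look like the sketch below. The usual remedy of raising the memlock entry in /etc/security/limits.conf, and restarting whatever daemon launches the MPI processes so it inherits the new limit, is a general observation and not something confirmed for this particular cluster.]

/* memlock_check.c -- illustrative sketch, not from the thread.
 * Prints the soft and hard RLIMIT_MEMLOCK values so every node in a
 * hostfile can be checked for a locked-memory ceiling that would make
 * InfiniBand buffer registration fail. */
#include <stdio.h>
#include <sys/resource.h>

static void print_limit(const char *name, rlim_t v)
{
    if (v == RLIM_INFINITY)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %llu KB\n", name, (unsigned long long)(v / 1024));
}

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    print_limit("soft locked-memory limit", rl.rlim_cur);
    print_limit("hard locked-memory limit", rl.rlim_max);
    return 0;
}

[Launching this through the MPI job starter across the hostfile shows whether the limit reported by an interactive shell is really what the MPI processes themselves inherit.]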
Can you give me some more details on your installation -- what distro are you using and is OFED being used? Also, what version of MVAPICH are you using? Additionally, what is the output of 'ulimit -l' on your system (or equivalent shell command). You may want to check all nodes. Memory registration generally does not fail unless the amount of lockable memory is too low. Matt On Thu, 17 Jan 2008, Ben Held wrote: > We have recently built our MPI application using MVAPICH1 under LINUX and > are seeing certain runs fail (success or failure seems to be a function of > the # of processes - 8 will work, 16 will fail, 32 will work, etc). This > code has been thoroughly testing using the standard MPICH (Ethernet based) > and LAM and everything is fine. > > > > Does this error: > > > > Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c > > > > Mean anything? This is a new cluster (8 node, 8 cores per node) has been > tested under using stress tests provided by the cluster manufacturer > (Microway). This is out of my area of expertise and this is the first IB > based system I have worked on. > > > > Any thoughts? > > > Regards, > > > > Ben Held > Simulation Technology & Applied Research, Inc. > 11520 N. Port Washington Rd., Suite 201 > Mequon, WI 53092 > P: 1.262.240.0291 x101 > F: 1.262.240.0294 > E: ben.held@staarinc.com > http://www.staarinc.com > > > > > > From dstuebe at umassd.edu Fri Jan 18 13:53:49 2008 From: dstuebe at umassd.edu (David Stuebe) Date: Fri Jan 18 13:54:02 2008 Subject: [mvapich-discuss] VALGRIND on MVAPICH2 F90 code In-Reply-To: <1f31dac10801181020k2b791539qbf3281c471724843@mail.gmail.com> References: <1f31dac10801181020k2b791539qbf3281c471724843@mail.gmail.com> Message-ID: <1f31dac10801181053k700c6a1bvf79c4490989ea695@mail.gmail.com> Hello MVAPICH and VALGRIND I am a research associate at UMASSD. I work on a numerical ocean model, fvcom, written in F90. We have recently run into problems: forrtl: error (78): process killed (SIGTERM) forrtl: error (78): process killed (SIGTERM) forrtl: error (78): process killed (SIGTERM) mpiexec: Warning: tasks 0-1,3 exited with status 1. mpiexec: Warning: task 2 died with signal 11 (Segmentation fault). The error is problem size dependent The error is compiler optimization dependent. The error only occurs when running on more than one node. (in the example error above, I used 2 procs. per node, on 2 nodes) If I run on four procs in one node, the code passes! The only clue that I have is that the problem seems to be related to subroutines which use explicit shape arrays - but I have checked all the upper and lower bounds. Running under valgrind or compiling with '-check all' in ifort allows the routine to pass? It seems my only hope for tracing this mess is using valgrind, but I am having trouble using valgrind on our cluster. It does run but I am concerned that it is not running properly. The mpi_init call alone results in hundreds of errors in the mpi and vapi libraries including leaks, uninitialized memory use/conditionals and invalid read/writes. Has anyone had success using valgrind with mvapich2? Valgrind also found problems with the fvcom fortran code but most of these seemed to go away when I increased the max-framestack. None of the remaining errors seem to be related to what causes the sigsev when I run without valgrind. Selected system info: Nodes are Dell 1850. 
Intel Xeon EM64-T Network is Infiniband PCI-EX 4X System is Rocks 4.2 Thread model: posix gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) ifort Version 9.1 mpif90 for mvapich2-1.0 valgrind-3.2.3 mpiexec-0.82 Again, all of these tools/libraries seem to work fine under normal tests, but this particular combination of code and model case is causing a real mess! Thanks for any help you can offer! David -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080118/e6f046a8/attachment-0001.html From ben.held at staarinc.com Fri Jan 18 14:18:02 2008 From: ben.held at staarinc.com (Ben Held) Date: Fri Jan 18 14:18:15 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c In-Reply-To: References: <01d401c8592f$a5048150$ef0d83f0$@held@staarinc.com> Message-ID: <00c001c85a06$d8861a70$89924f50$@held@staarinc.com> Matt, The version of MVAPICH is mvapich_gcc-0.9.9-1458. I believe this is part of the OFED distro - it was installed by the manuf. Of the cluster. ulimit -l reports 131072 on all nodes. Ben -----Original Message----- From: Matthew Koop [mailto:koop@cse.ohio-state.edu] Sent: Thursday, January 17, 2008 9:29 PM To: Ben Held Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c Ben, Sorry to hear about this issue. Can you give me some more details on your installation -- what distro are you using and is OFED being used? Also, what version of MVAPICH are you using? Additionally, what is the output of 'ulimit -l' on your system (or equivalent shell command). You may want to check all nodes. Memory registration generally does not fail unless the amount of lockable memory is too low. Matt On Thu, 17 Jan 2008, Ben Held wrote: > We have recently built our MPI application using MVAPICH1 under LINUX and > are seeing certain runs fail (success or failure seems to be a function of > the # of processes - 8 will work, 16 will fail, 32 will work, etc). This > code has been thoroughly testing using the standard MPICH (Ethernet based) > and LAM and everything is fine. > > > > Does this error: > > > > Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c > > > > Mean anything? This is a new cluster (8 node, 8 cores per node) has been > tested under using stress tests provided by the cluster manufacturer > (Microway). This is out of my area of expertise and this is the first IB > based system I have worked on. > > > > Any thoughts? > > > Regards, > > > > Ben Held > Simulation Technology & Applied Research, Inc. > 11520 N. Port Washington Rd., Suite 201 > Mequon, WI 53092 > P: 1.262.240.0291 x101 > F: 1.262.240.0294 > E: ben.held@staarinc.com > http://www.staarinc.com > > > > > > From joseph.hargitai at nyu.edu Fri Jan 18 18:05:36 2008 From: joseph.hargitai at nyu.edu (Joseph Hargitai) Date: Fri Jan 18 18:05:43 2008 Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu Message-ID: hi all: While submitting two identical mpi jobs (-np 4) with differetnt datasets for a dual socket quadocore node using two distinct pbs/mpiexec submission both jobs end up on the first processor, such they use 4 cores of the first cpu, none of the second. This results in 8 processes on cpu 1, with a load of about 8-9, both jobs producing output okay, but obviously the choice would be to have them on distinct cpus. 
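[Editor's note: as the replies further down explain, the built-in CPU affinity can be disabled with MV2_ENABLE_AFFINITY=0 (MVAPICH2) or VIADEV_ENABLE_AFFINITY=0 (MVAPICH), after which placement is left to the user. For the follow-up request below about pinning ranks to specific cores such as 1,2 and 5,6, one illustrative workaround, not a feature of these MVAPICH releases, is to pin from inside the application right after MPI_Init using Linux sched_setaffinity; the core list and the rank-to-node assumptions in the sketch are examples only.]

/* pin_rank.c -- illustrative sketch; core map 1,2,5,6 is only an example.
 * Pins each rank to one core from a fixed list after MPI_Init, assuming
 * MVAPICH's own affinity has been disabled via the environment variables
 * mentioned in the replies, and assuming either a single node or four
 * consecutively numbered ranks per node.  Whether cores 1,2,5,6 really
 * land on different sockets depends on how the kernel enumerates them. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    static const int core_map[] = { 1, 2, 5, 6 };
    int rank, ncores = sizeof(core_map) / sizeof(core_map[0]);
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    CPU_SET(core_map[rank % ncores], &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
    else
        printf("rank %d pinned to core %d\n", rank, core_map[rank % ncores]);

    /* ... application work ... */

    MPI_Finalize();
    return 0;
}

[On most Linux installs the same effect can be had without code changes by wrapping the executable with taskset -c, if the launcher permits it.]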
When one of these mpi jobs meet 4 others regular serial jobs submitted without mpiexec, all 8 cores are populated. I did read on the group list about the first mpiexec session being the master, but not reserving the first 4 cores such allowing the possibility for the next mpi job to end up on the same cpu. best, joseph From chai.15 at osu.edu Fri Jan 18 19:00:16 2008 From: chai.15 at osu.edu (LEI CHAI) Date: Fri Jan 18 19:00:25 2008 Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu Message-ID: <4cb724ab37.4ab374cb72@osu.edu> Hi Joseph, Could you try disable the cpu affinity feature in mvapich/mvapich2, e.g. mvapich2: $ mpiexec -n 4 -env MV2_ENABLE_AFFINITY 0 ./a.out or mvapich: $ mpirun_rsh -np 4 VIADEV_ENABLE_AFFINITY=0 ./a.out Thanks, Lei ----- Original Message ----- From: Joseph Hargitai Date: Friday, January 18, 2008 6:05 pm Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu > > > hi all: > > While submitting two identical mpi jobs (-np 4) with differetnt > datasets for a dual socket quadocore node using two distinct > pbs/mpiexec submission both jobs end up on the first processor, > such they use 4 cores of the first cpu, none of the second. This > results in 8 processes on cpu 1, with a load of about 8-9, both > jobs producing output okay, but obviously the choice would be to > have them on distinct cpus. > > When one of these mpi jobs meet 4 others regular serial jobs > submitted without mpiexec, all 8 cores are populated. > > I did read on the group list about the first mpiexec session being > the master, but not reserving the first 4 cores such allowing the > possibility for the next mpi job to end up on the same cpu. > > best, > joseph > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From joseph.hargitai at nyu.edu Fri Jan 18 19:49:30 2008 From: joseph.hargitai at nyu.edu (Joseph Hargitai) Date: Fri Jan 18 19:49:38 2008 Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu In-Reply-To: <4cb724ab37.4ab374cb72@osu.edu> References: <4cb724ab37.4ab374cb72@osu.edu> Message-ID: Sure, we'll try. I thought it was only an mvapich2 only option. We have .9, so we'll see. While we are at it - is there an actual way to assign with either version to specific cores? We have a few large multinode obs that can only use 4 out of 8 cores per node due to bus and memory limitations. Is there a way to distribute the processes to cores 1,2 and 5,6 ? ie to skip 0, and split the other 4 to different chips? j ----- Original Message ----- From: LEI CHAI Date: Friday, January 18, 2008 7:00 pm Subject: Re: [mvapich-discuss] mpiexec/mvapich places processes on same cpu > Hi Joseph, > > Could you try disable the cpu affinity feature in mvapich/mvapich2, e.g. > > mvapich2: > $ mpiexec -n 4 -env MV2_ENABLE_AFFINITY 0 ./a.out > > or mvapich: > $ mpirun_rsh -np 4 VIADEV_ENABLE_AFFINITY=0 ./a.out > > Thanks, > Lei > > > ----- Original Message ----- > From: Joseph Hargitai > Date: Friday, January 18, 2008 6:05 pm > Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu > > > > > > > hi all: > > > > While submitting two identical mpi jobs (-np 4) with differetnt > > datasets for a dual socket quadocore node using two distinct > > pbs/mpiexec submission both jobs end up on the first processor, > > such they use 4 cores of the first cpu, none of the second. 
This > > results in 8 processes on cpu 1, with a load of about 8-9, both > > jobs producing output okay, but obviously the choice would be to > > have them on distinct cpus. > > > > When one of these mpi jobs meet 4 others regular serial jobs > > submitted without mpiexec, all 8 cores are populated. > > > > I did read on the group list about the first mpiexec session being > > the master, but not reserving the first 4 cores such allowing the > > possibility for the next mpi job to end up on the same cpu. > > > > best, > > joseph > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From koop at cse.ohio-state.edu Fri Jan 18 19:55:17 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Fri Jan 18 19:55:24 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c In-Reply-To: <00c001c85a06$d8861a70$89924f50$@held@staarinc.com> Message-ID: Ben, The maximum locked memory you are allowing on the system is lower than is expected. Can you try increasing that value to closer to the maximum memory of the node? Matt On Fri, 18 Jan 2008, Ben Held wrote: > Matt, > > The version of MVAPICH is mvapich_gcc-0.9.9-1458. I believe this is part of > the OFED distro - it was installed by the manuf. Of the cluster. > > ulimit -l reports 131072 on all nodes. > > Ben > -----Original Message----- > From: Matthew Koop [mailto:koop@cse.ohio-state.edu] > Sent: Thursday, January 17, 2008 9:29 PM > To: Ben Held > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at > line 211 in file vbuf.c > > Ben, > > Sorry to hear about this issue. Can you give me some more details on your > installation -- what distro are you using and is OFED being used? Also, > what version of MVAPICH are you using? > > Additionally, what is the output of 'ulimit -l' on your system (or > equivalent shell command). You may want to check all nodes. Memory > registration generally does not fail unless the amount of lockable memory > is too low. > > Matt > > On Thu, 17 Jan 2008, Ben Held wrote: > > > We have recently built our MPI application using MVAPICH1 under LINUX and > > are seeing certain runs fail (success or failure seems to be a function of > > the # of processes - 8 will work, 16 will fail, 32 will work, etc). This > > code has been thoroughly testing using the standard MPICH (Ethernet based) > > and LAM and everything is fine. > > > > > > > > Does this error: > > > > > > > > Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c > > > > > > > > Mean anything? This is a new cluster (8 node, 8 cores per node) has been > > tested under using stress tests provided by the cluster manufacturer > > (Microway). This is out of my area of expertise and this is the first IB > > based system I have worked on. > > > > > > > > Any thoughts? > > > > > > Regards, > > > > > > > > Ben Held > > Simulation Technology & Applied Research, Inc. > > 11520 N. 
Port Washington Rd., Suite 201 > > Mequon, WI 53092 > > P: 1.262.240.0291 x101 > > F: 1.262.240.0294 > > E: ben.held@staarinc.com > > http://www.staarinc.com > > > > > > > > > > > > > > From pasha at dev.mellanox.co.il Sun Jan 20 04:45:09 2008 From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha)) Date: Sun Jan 20 04:45:38 2008 Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu In-Reply-To: References: <4cb724ab37.4ab374cb72@osu.edu> Message-ID: <47931825.3000305@dev.mellanox.co.il> > While we are at it - is there an actual way to assign with either version to specific cores? We have a few large multinode obs that can only use 4 out of 8 cores per node due to bus and memory limitations. Is there a way to distribute the processes to cores 1,2 and 5,6 ? ie to skip 0, and split the other 4 to different chips? > Sounds good. As we have a option to specify HCA/port per rank it may be nice to have option to specify core. Should not be very complicated to implement. (I'm talking about mvapich1) Mvapich team, what do you think ? Pasha > j > > ----- Original Message ----- > From: LEI CHAI > Date: Friday, January 18, 2008 7:00 pm > Subject: Re: [mvapich-discuss] mpiexec/mvapich places processes on same cpu > > >> Hi Joseph, >> >> Could you try disable the cpu affinity feature in mvapich/mvapich2, e.g. >> >> mvapich2: >> $ mpiexec -n 4 -env MV2_ENABLE_AFFINITY 0 ./a.out >> >> or mvapich: >> $ mpirun_rsh -np 4 VIADEV_ENABLE_AFFINITY=0 ./a.out >> >> Thanks, >> Lei >> >> >> ----- Original Message ----- >> From: Joseph Hargitai >> Date: Friday, January 18, 2008 6:05 pm >> Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu >> >> >>> hi all: >>> >>> While submitting two identical mpi jobs (-np 4) with differetnt >>> datasets for a dual socket quadocore node using two distinct >>> pbs/mpiexec submission both jobs end up on the first processor, >>> such they use 4 cores of the first cpu, none of the second. This >>> results in 8 processes on cpu 1, with a load of about 8-9, both >>> jobs producing output okay, but obviously the choice would be to >>> have them on distinct cpus. >>> >>> When one of these mpi jobs meet 4 others regular serial jobs >>> submitted without mpiexec, all 8 cores are populated. >>> >>> I did read on the group list about the first mpiexec session being >>> the master, but not reserving the first 4 cores such allowing the >>> possibility for the next mpi job to end up on the same cpu. >>> >>> best, >>> joseph >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss@cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>> >>> > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > -- Pavel Shamis (Pasha) Mellanox Technologies From wyu at ornl.gov Sun Jan 20 14:44:47 2008 From: wyu at ornl.gov (Weikuan Yu) Date: Sun Jan 20 15:48:08 2008 Subject: [mvapich-discuss] MVAPICH-1.0 Bandwidth drop with large window size Message-ID: Hi, I noticed a bandwidth problem with MVAPICH-1.0 UD implementation. Using osu_(bi)bw.c programs, if the window size is increased to 128 or above, the bandwidth can be quite low. I would assume there is a congestion in some queue processing. The difference should be readily reproducible. So I would not post the details numbers here. Let me know if you have some parameters to tune it up or need more info. 
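[Editor's note: the "window size" in osu_bw/osu_bibw is the number of non-blocking sends kept in flight per iteration. The schematic loop below is a simplification, not the actual OSU benchmark source; it shows why a larger window puts more pressure on the UD send queues and the zero-copy pool discussed in the reply that follows.]

/* bw_window.c -- simplified sketch of the osu_bw pattern, not the actual
 * benchmark source.  Rank 0 keeps WINDOW non-blocking sends in flight
 * before waiting; rank 1 pre-posts matching receives and returns a short
 * acknowledgement after each window. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW  128          /* the "window size" under discussion */
#define MSGSIZE (64 * 1024)  /* one example message size */
#define ITERS   100

int main(int argc, char **argv)
{
    int rank, i, it;
    char *buf;
    MPI_Request req[WINDOW];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(MSGSIZE);
    memset(buf, 0, MSGSIZE);

    t0 = MPI_Wtime();
    for (it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (i = 0; i < WINDOW; i++)
                MPI_Isend(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (i = 0; i < WINDOW; i++)
                MPI_Irecv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.2f MB/s\n",
               (double)MSGSIZE * WINDOW * ITERS / ((t1 - t0) * 1.0e6));

    free(buf);
    MPI_Finalize();
    return 0;
}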
-- Weikuan Yu http://ft.ornl.gov/~wyu/ From koop at cse.ohio-state.edu Sun Jan 20 16:48:41 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Sun Jan 20 16:48:49 2008 Subject: [mvapich-discuss] MVAPICH-1.0 Bandwidth drop with large window size In-Reply-To: Message-ID: Weikuan, Thanks for pointing this out. This is known issue that we are working on addressing -- the queue length gets too long when the zero copy path is not used. If you have applications where a large window will be used you can increase the zero copy pool for the time being: e.g. ../bin/mpirun_rsh -np 2 host1 host22 MV_UD_ZCOPY_QPS=128 ./bw Thanks, Matt On Sun, 20 Jan 2008, Weikuan Yu wrote: > Hi, > > I noticed a bandwidth problem with MVAPICH-1.0 UD implementation. Using > osu_(bi)bw.c programs, if the window size is increased to 128 or above, the > bandwidth can be quite low. I would assume there is a congestion in some > queue processing. > > The difference should be readily reproducible. So I would not post the > details numbers here. Let me know if you have some parameters to tune it up > or need more info. > > -- > Weikuan Yu > http://ft.ornl.gov/~wyu/ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From chai.15 at osu.edu Sun Jan 20 20:20:38 2008 From: chai.15 at osu.edu (LEI CHAI) Date: Sun Jan 20 20:20:48 2008 Subject: [mvapich-discuss] mpiexec/mvapich places processes on same cpu Message-ID: <8ce8287730.877308ce82@osu.edu> Hi Joseph and Pasha, Thank you for the suggestions. We already plan to add this feature to MVAPICH. It will be available in upcoming releases. Lei ----- Original Message ----- From: "Pavel Shamis (Pasha)" Date: Sunday, January 20, 2008 4:45 am Subject: Re: [mvapich-discuss] mpiexec/mvapich places processes on same cpu > > > While we are at it - is there an actual way to assign with > either version to specific cores? We have a few large multinode > obs that can only use 4 out of 8 cores per node due to bus and > memory limitations. Is there a way to distribute the processes to > cores 1,2 and 5,6 ? ie to skip 0, and split the other 4 to > different chips? > > > Sounds good. As we have a option to specify HCA/port per rank it > may be > nice to have option to specify core. Should not be very > complicated to > implement. (I'm talking about mvapich1) > Mvapich team, what do you think ? > > Pasha > > j > > > > ----- Original Message ----- > > From: LEI CHAI > > Date: Friday, January 18, 2008 7:00 pm > > Subject: Re: [mvapich-discuss] mpiexec/mvapich places processes > on same cpu > > > > > >> Hi Joseph, > >> > >> Could you try disable the cpu affinity feature in > mvapich/mvapich2, e.g. > >> > >> mvapich2: > >> $ mpiexec -n 4 -env MV2_ENABLE_AFFINITY 0 ./a.out > >> > >> or mvapich: > >> $ mpirun_rsh -np 4 VIADEV_ENABLE_AFFINITY=0 ./a.out > >> > >> Thanks, > >> Lei > >> > >> > >> ----- Original Message ----- > >> From: Joseph Hargitai > >> Date: Friday, January 18, 2008 6:05 pm > >> Subject: [mvapich-discuss] mpiexec/mvapich places processes on > same cpu > >> > >> > >>> hi all: > >>> > >>> While submitting two identical mpi jobs (-np 4) with > differetnt > >>> datasets for a dual socket quadocore node using two distinct > >>> pbs/mpiexec submission both jobs end up on the first > processor, > >>> such they use 4 cores of the first cpu, none of the second. 
> This > >>> results in 8 processes on cpu 1, with a load of about 8-9, > both > >>> jobs producing output okay, but obviously the choice would be > to > >>> have them on distinct cpus. > >>> > >>> When one of these mpi jobs meet 4 others regular serial jobs > >>> submitted without mpiexec, all 8 cores are populated. > >>> > >>> I did read on the group list about the first mpiexec session > being > >>> the master, but not reserving the first 4 cores such allowing > the > >>> possibility for the next mpi job to end up on the same cpu. > >>> > >>> best, > >>> joseph > >>> _______________________________________________ > >>> mvapich-discuss mailing list > >>> mvapich-discuss@cse.ohio-state.edu > >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>> > >>> > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > -- > Pavel Shamis (Pasha) > Mellanox Technologies > > From weikuan.yu at gmail.com Mon Jan 21 07:59:01 2008 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Mon Jan 21 07:59:08 2008 Subject: [mvapich-discuss] MVAPICH-1.0 Bandwidth drop with large window size In-Reply-To: References: Message-ID: <47949715.4030503@gmail.com> Matthew Koop wrote: > Weikuan, > > Thanks for pointing this out. This is known issue that we are working on > addressing -- the queue length gets too long when the zero copy path is > not used. If you have applications where a large window will be used you > can increase the zero copy pool for the time being: > > e.g. > ../bin/mpirun_rsh -np 2 host1 host22 MV_UD_ZCOPY_QPS=128 ./bw Thanks, Matt. Yes, that helps. --Weikuan > > Thanks, > > Matt > > On Sun, 20 Jan 2008, Weikuan Yu wrote: > > > Hi, > > > > I noticed a bandwidth problem with MVAPICH-1.0 UD implementation. Using > > osu_(bi)bw.c programs, if the window size is increased to 128 or > above, the > > bandwidth can be quite low. I would assume there is a congestion in some > > queue processing. > > > > The difference should be readily reproducible. So I would not post the > > details numbers here. Let me know if you have some parameters to tune > it up > > or need more info. > > > > -- > > Weikuan Yu > > http://ft.ornl.gov/~wyu/ > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From panda at cse.ohio-state.edu Tue Jan 22 16:32:58 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue Jan 22 16:33:08 2008 Subject: [mvapich-discuss] BLCR fix for MPI problems (fwd) Message-ID: I have received this note from Dr. Paul Hargrove (designer of BLCR). He has released BLCR 0.6.3 and it incorporates fixes for MPI problems. Some of you had indicated on this list earlier about experiencing `restart' problem with MVAPICH2+BLCR on some platforms. May I request you to update your BLCR version to 0.6.3. Please let us know if you encounter any further problems. Thanks, DK ---------- Forwarded message ---------- Date: Tue, 22 Jan 2008 12:20:02 -0800 From: Paul H. Hargrove To: Dhabaleswar Panda , Joshua Hursey Subject: BLCR fix for MPI problems D.K and Josh, You two have forwarded to me reports from your respective MPI user communities of problems w/ BLCR, in particular reports of corrupted floating-point results on the x86-64 architecture. 
I am pleased top let you know that I've just released BLCR 0.6.3 to finally fix that problem. You should receive (may all ready have) the BLCR 0.6.3 release announcement in a separate e-mail. Since this fixes a serious problem that has affected many of your users, I'd appreciate it if you could forward the BLCR release announcement to your respective mailing lists. Thanks, -Paul -- Paul H. Hargrove PHHargrove@lbl.gov Future Technologies Group HPC Research Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900 From dstuebe at umassd.edu Wed Jan 23 14:29:54 2008 From: dstuebe at umassd.edu (David Stuebe) Date: Wed Jan 23 14:30:06 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? Message-ID: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> Hello MVAPICH I have found a strange bug in MVAPICH2 using IFORT. The behavior is very strange indeed - it seems to be related to how ifort deals with passing pointers to the MVAPICH FORTRAN 90 INTERFACE. The MPI call returns successfully, but later calls to a dummy subroutine cause a sigsev. Please look at the following code: !================================================================================= !================================================================================= !================================================================================= ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT ! WRITEN BY: DAVID STUEBE ! DATE: JAN 23, 2008 ! ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest ! ! KNOWN BEHAVIOR: ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS WITHOUT AN INTERFACE - ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING VALID DATA. ! ! COMMENTS: ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT. ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS ! EXTREMELY DIFFICULT TO DEBUG! ! ! CONDITIONS FOR OCCURANCE: ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... ! ie Running inside one SMP box does not crash. ! ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV ! RUNNING UNDER MPIEXEC0.82 FOR PBS, ! ONLY SOME PROCESSES SIGSEV ? ! ! ENVIRONMENTAL INFO: ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X ! SYSTEM: ROCKS 4.2 ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) ! ! IFORT/ICC: ! Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 ! ! MVAPICH2: mpif90 for mvapich2-1.0 ! ./configure --prefix=/usr/local/share/mvapich2/1.0 --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd --enable-f90 --enable-cxx --disable-romio --without-mpe ! !================================================================================= !================================================================================= !================================================================================= Module vars USE MPI implicit none integer :: n,m,MYID,NPROCS integer :: ipt integer, allocatable, target :: data(:,:) contains subroutine alloc_vars implicit none integer Status allocate(data(n,m),stat=status) if (status /=0) then write(ipt,*) "allocation error" stop end if data = 0 end subroutine alloc_vars SUBROUTINE INIT_MPI_ENV(ID,NP) !===================================================================================| ! 
INITIALIZE MPI ENVIRONMENT | !===================================================================================| INTEGER, INTENT(OUT) :: ID,NP INTEGER IERR IERR=0 CALL MPI_INIT(IERR) IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID END SUBROUTINE INIT_MPI_ENV !==============================================================================| SUBROUTINE PSHUTDOWN !==============================================================================| INTEGER IERR IERR=0 CALL MPI_FINALIZE(IERR) if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID close(IPT) STOP END SUBROUTINE PSHUTDOWN SUBROUTINE CONTIGUOUS_WORKS IMPLICIT NONE INTEGER, pointer :: ptest(:,:) INTEGER :: IERR, I,J write(ipt,*) "START CONTIGUOUS:" n=2000 ! Set size here... m=n+10 call alloc_vars write(ipt,*) "ALLOCATED DATA" ptest => data(1:N,1:N) IF (MYID == 0) ptest=6 write(ipt,*) "Made POINTER" call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID write(ipt,*) "BROADCAST Data; a value:",data(1,6) DO I = 1,N DO J = 1,N if(data(I,J) /= 6) & & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) END DO DO J = N+1,M if(data(I,J) /= 0) & & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) END DO END DO ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE ! THAT USE AN EXPLICIT SHAPE ARRAY write(ipt,*) "CALLING DUMMY1" CALL DUMMY1 write(ipt,*) "CALLING DUMMY2" call Dummy2(m,n) write(ipt,*) "CALLING DUMMY3" call Dummy3 write(ipt,*) "FINISHED!" END SUBROUTINE CONTIGUOUS_WORKS SUBROUTINE NON_CONTIGUOUS_FAILS IMPLICIT NONE INTEGER, pointer :: ptest(:,:) INTEGER :: IERR, I,J write(ipt,*) "START NON_CONTIGUOUS:" m=200 ! Set size here - crash is size dependent! n=m+10 call alloc_vars write(ipt,*) "ALLOCATED DATA" ptest => data(1:M,1:M) !=================================================== ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? !=================================================== ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG IF (MYID == 0) ptest=6 write(ipt,*) "Made POINTER" call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" write(ipt,*) "BROADCAST Data; a value:",data(1,6) DO I = 1,M DO J = 1,M if(data(J,I) /= 6) & & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) END DO DO J = M+1,N if(data(J,I) /= 0) & & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) END DO END DO ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE ! THAT USE AN EXPLICIT SHAPE ARRAY write(ipt,*) "CALLING DUMMY1" CALL DUMMY1 write(ipt,*) "CALLING DUMMY2" call Dummy2(m,n) ! SHOULD CRASH HERE! write(ipt,*) "CALLING DUMMY3" call Dummy3 write(ipt,*) "FINISHED!" END SUBROUTINE NON_CONTIGUOUS_FAILS End Module vars Program main USE vars implicit none CALL INIT_MPI_ENV(MYID,NPROCS) ipt=myid+10 OPEN(ipt) write(ipt,*) "Start memory test!" CALL NON_CONTIGUOUS_FAILS ! CALL CONTIGUOUS_WORKS write(ipt,*) "End memory test!" CALL PSHUTDOWN END Program main ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE ! 
DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE SUBROUTINE DUMMY1 USE vars implicit none real, dimension(m) :: my_data write(ipt,*) "m,n",m,n write(ipt,*) "DUMMY 1", size(my_data) END SUBROUTINE DUMMY1 SUBROUTINE DUMMY2(i,j) USE vars implicit none INTEGER, INTENT(IN) ::i,j real, dimension(i,j) :: my_data write(ipt,*) "start: DUMMY 2", size(my_data) END SUBROUTINE DUMMY2 SUBROUTINE DUMMY3 USE vars implicit none real, dimension(m,n) :: my_data write(ipt,*) "start: DUMMY 3", size(my_data) END SUBROUTINE DUMMY3 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080123/a231fe79/attachment-0001.html From curtisbr at cse.ohio-state.edu Wed Jan 23 15:23:03 2008 From: curtisbr at cse.ohio-state.edu (Brian Curtis) Date: Wed Jan 23 15:23:13 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> Message-ID: <4797A227.5030002@cse.ohio-state.edu> David, Sorry to hear you are experience problems with the MVAPICH2 Fortran 90 interface. I will be investigating this issue, but need some additional information about your setup. What is the exact version of MVAPICH2 1.0 you are utilizing (daily tarball or release)? Have you tried MVAPICH2 1.0.1? Brian David Stuebe wrote: > Hello MVAPICH > I have found a strange bug in MVAPICH2 using IFORT. The behavior is very > strange indeed - it seems to be related to how ifort deals with passing > pointers to the MVAPICH FORTRAN 90 INTERFACE. > The MPI call returns successfully, but later calls to a dummy subroutine > cause a sigsev. > > Please look at the following code: > > !================================================================================= > !================================================================================= > !================================================================================= > ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT > ! WRITEN BY: DAVID STUEBE > ! DATE: JAN 23, 2008 > ! > ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest > ! > ! KNOWN BEHAVIOR: > ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF > ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS WITHOUT AN > INTERFACE - > ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING VALID DATA. > ! > ! COMMENTS: > ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON > ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT. > ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS > ! EXTREMELY DIFFICULT TO DEBUG! > ! > ! CONDITIONS FOR OCCURANCE: > ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' > ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? > ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... > ! ie Running inside one SMP box does not crash. > ! > ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV > ! RUNNING UNDER MPIEXEC0.82 FOR PBS, > ! ONLY SOME PROCESSES SIGSEV ? > ! > ! ENVIRONMENTAL INFO: > ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X > ! SYSTEM: ROCKS 4.2 > ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) > ! > ! IFORT/ICC: > ! Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, > ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 > ! > ! MVAPICH2: mpif90 for mvapich2-1.0 > ! 
./configure --prefix=/usr/local/share/mvapich2/1.0 > --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd --enable-f90 > --enable-cxx --disable-romio --without-mpe > ! > !================================================================================= > !================================================================================= > !================================================================================= > > Module vars > USE MPI > implicit none > > > integer :: n,m,MYID,NPROCS > integer :: ipt > > integer, allocatable, target :: data(:,:) > > contains > > subroutine alloc_vars > implicit none > > integer Status > > allocate(data(n,m),stat=status) > if (status /=0) then > write(ipt,*) "allocation error" > stop > end if > > data = 0 > > end subroutine alloc_vars > > SUBROUTINE INIT_MPI_ENV(ID,NP) > !===================================================================================| > ! INITIALIZE MPI > ENVIRONMENT | > !===================================================================================| > INTEGER, INTENT(OUT) :: ID,NP > INTEGER IERR > > IERR=0 > > CALL MPI_INIT(IERR) > IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID > CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) > IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID > CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) > IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID > > END SUBROUTINE INIT_MPI_ENV > > > !==============================================================================| > SUBROUTINE PSHUTDOWN > > !==============================================================================| > INTEGER IERR > > IERR=0 > CALL MPI_FINALIZE(IERR) > if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID > close(IPT) > STOP > > END SUBROUTINE PSHUTDOWN > > > SUBROUTINE CONTIGUOUS_WORKS > IMPLICIT NONE > INTEGER, pointer :: ptest(:,:) > INTEGER :: IERR, I,J > > > write(ipt,*) "START CONTIGUOUS:" > n=2000 ! Set size here... > m=n+10 > > call alloc_vars > write(ipt,*) "ALLOCATED DATA" > ptest => data(1:N,1:N) > > IF (MYID == 0) ptest=6 > write(ipt,*) "Made POINTER" > > call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID > > write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > DO I = 1,N > DO J = 1,N > if(data(I,J) /= 6) & > & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > END DO > > DO J = N+1,M > if(data(I,J) /= 0) & > & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > END DO > > END DO > > ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE > ! THAT USE AN EXPLICIT SHAPE ARRAY > write(ipt,*) "CALLING DUMMY1" > CALL DUMMY1 > > write(ipt,*) "CALLING DUMMY2" > call Dummy2(m,n) > > write(ipt,*) "CALLING DUMMY3" > call Dummy3 > write(ipt,*) "FINISHED!" > > END SUBROUTINE CONTIGUOUS_WORKS > > SUBROUTINE NON_CONTIGUOUS_FAILS > IMPLICIT NONE > INTEGER, pointer :: ptest(:,:) > INTEGER :: IERR, I,J > > > write(ipt,*) "START NON_CONTIGUOUS:" > > m=200 ! Set size here - crash is size dependent! > n=m+10 > > call alloc_vars > write(ipt,*) "ALLOCATED DATA" > ptest => data(1:M,1:M) > > !=================================================== > ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? > !=================================================== > ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT > ! CALL DUMMY2 ! 
THIS ONE 'FIXES' THE BUG > > IF (MYID == 0) ptest=6 > write(ipt,*) "Made POINTER" > > call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" > > write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > DO I = 1,M > DO J = 1,M > if(data(J,I) /= 6) & > & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > END DO > > DO J = M+1,N > if(data(J,I) /= 0) & > & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > END DO > END DO > > ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE > ! THAT USE AN EXPLICIT SHAPE ARRAY > write(ipt,*) "CALLING DUMMY1" > CALL DUMMY1 > > write(ipt,*) "CALLING DUMMY2" > call Dummy2(m,n) ! SHOULD CRASH HERE! > > write(ipt,*) "CALLING DUMMY3" > call Dummy3 > write(ipt,*) "FINISHED!" > > END SUBROUTINE NON_CONTIGUOUS_FAILS > > > End Module vars > > > Program main > USE vars > implicit none > > > CALL INIT_MPI_ENV(MYID,NPROCS) > > ipt=myid+10 > OPEN(ipt) > > > write(ipt,*) "Start memory test!" > > CALL NON_CONTIGUOUS_FAILS > > ! CALL CONTIGUOUS_WORKS > > write(ipt,*) "End memory test!" > > CALL PSHUTDOWN > > END Program main > > > > ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS > ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE > ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > > SUBROUTINE DUMMY1 > USE vars > implicit none > real, dimension(m) :: my_data > > write(ipt,*) "m,n",m,n > > write(ipt,*) "DUMMY 1", size(my_data) > > END SUBROUTINE DUMMY1 > > > SUBROUTINE DUMMY2(i,j) > USE vars > implicit none > INTEGER, INTENT(IN) ::i,j > > > real, dimension(i,j) :: my_data > > write(ipt,*) "start: DUMMY 2", size(my_data) > > > END SUBROUTINE DUMMY2 > > SUBROUTINE DUMMY3 > USE vars > implicit none > > > real, dimension(m,n) :: my_data > > > write(ipt,*) "start: DUMMY 3", size(my_data) > > > END SUBROUTINE DUMMY3 > > > ------------------------------------------------------------------------ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From curtisbr at cse.ohio-state.edu Fri Jan 25 12:31:28 2008 From: curtisbr at cse.ohio-state.edu (Brian Curtis) Date: Fri Jan 25 12:31:37 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> Message-ID: <479A1CF0.1040307@cse.ohio-state.edu> David, I did some research on this issue and it looks like you have posted the bug with Intel. Please let us know what you find out. Brian David Stuebe wrote: > Hi Brian > > I downloaded the public release, it seems silly but I am not sure how to get > a rev number from the source... there does not seem to be a '-version' > option that gives more info, although I did not look to hard. > > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on the > cluster I will try 1.0.1 and see if it goes away. > > In the mean time please let me know if you can recreate the problem? > > David > > PS - Just want to make sure you understand my issue, I think it is a bad > idea to try and pass a non-contiguous F90 memory pointer, I should not do > that... but the way that it breaks has caused me headaches for weeks now. If > it reliably caused a sigsev on entering MPI_BCAST that would be great! As it > is it is really hard to trace the problem. 
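For what it's worth, one way to sidestep the hidden copy-in/copy-out temporary entirely is to broadcast an explicitly contiguous copy of the block instead of the strided pointer. This is only a rough sketch against your test program (it reuses your variable names data, M, MYID and IERR, adds a new temporary tmp, and has not been run against your exact setup), not a statement about what MVAPICH2 or ifort should do:

    INTEGER, ALLOCATABLE :: tmp(:,:)

    ALLOCATE(tmp(M,M))
    IF (MYID == 0) tmp = data(1:M,1:M)   ! explicit, contiguous copy on the root
    CALL MPI_BCAST(tmp, M*M, MPI_INTEGER, 0, MPI_COMM_WORLD, IERR)
    data(1:M,1:M) = tmp                  ! copy the received block back on every rank
    DEALLOCATE(tmp)

Since tmp is an allocatable array rather than a pointer to an array section, it is already contiguous, so the compiler does not need to create a temporary behind your back when it passes it to MPI_BCAST.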
> > > > > On Jan 23, 2008 3:23 PM, Brian Curtis wrote: > > >> David, >> >> Sorry to hear you are experience problems with the MVAPICH2 Fortran 90 >> interface. I will be investigating this issue, but need some additional >> information about your setup. What is the exact version of MVAPICH2 1.0 >> you are utilizing (daily tarball or release)? Have you tried MVAPICH2 >> 1.0.1? >> >> Brian >> >> David Stuebe wrote: >> >>> Hello MVAPICH >>> I have found a strange bug in MVAPICH2 using IFORT. The behavior is very >>> strange indeed - it seems to be related to how ifort deals with passing >>> pointers to the MVAPICH FORTRAN 90 INTERFACE. >>> The MPI call returns successfully, but later calls to a dummy subroutine >>> cause a sigsev. >>> >>> Please look at the following code: >>> >>> >>> >> !================================================================================= >> >> !================================================================================= >> >> !================================================================================= >> >>> ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT >>> ! WRITEN BY: DAVID STUEBE >>> ! DATE: JAN 23, 2008 >>> ! >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest >>> ! >>> ! KNOWN BEHAVIOR: >>> ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF >>> ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS WITHOUT AN >>> INTERFACE - >>> ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING VALID >>> >> DATA. >> >>> ! >>> ! COMMENTS: >>> ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - SHAME ON >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR NOT. >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS >>> ! EXTREMELY DIFFICULT TO DEBUG! >>> ! >>> ! CONDITIONS FOR OCCURANCE: >>> ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' >>> ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? >>> ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... >>> ! ie Running inside one SMP box does not crash. >>> ! >>> ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV >>> ! RUNNING UNDER MPIEXEC0.82 FOR PBS, >>> ! ONLY SOME PROCESSES SIGSEV ? >>> ! >>> ! ENVIRONMENTAL INFO: >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X >>> ! SYSTEM: ROCKS 4.2 >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) >>> ! >>> ! IFORT/ICC: >>> ! Intel(R) Fortran Compiler for Intel(R) EM64T-based applications, >>> ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 >>> ! >>> ! MVAPICH2: mpif90 for mvapich2-1.0 >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0 >>> --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd --enable-f90 >>> --enable-cxx --disable-romio --without-mpe >>> ! >>> >>> >> !================================================================================= >> >> !================================================================================= >> >> !================================================================================= >> >>> Module vars >>> USE MPI >>> implicit none >>> >>> >>> integer :: n,m,MYID,NPROCS >>> integer :: ipt >>> >>> integer, allocatable, target :: data(:,:) >>> >>> contains >>> >>> subroutine alloc_vars >>> implicit none >>> >>> integer Status >>> >>> allocate(data(n,m),stat=status) >>> if (status /=0) then >>> write(ipt,*) "allocation error" >>> stop >>> end if >>> >>> data = 0 >>> >>> end subroutine alloc_vars >>> >>> SUBROUTINE INIT_MPI_ENV(ID,NP) >>> >>> >> !===================================================================================| >> >>> ! 
INITIALIZE MPI >>> ENVIRONMENT | >>> >>> >> !===================================================================================| >> >>> INTEGER, INTENT(OUT) :: ID,NP >>> INTEGER IERR >>> >>> IERR=0 >>> >>> CALL MPI_INIT(IERR) >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID >>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID >>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID >>> >>> END SUBROUTINE INIT_MPI_ENV >>> >>> >>> >>> >> !==============================================================================| >> >>> SUBROUTINE PSHUTDOWN >>> >>> >>> >> !==============================================================================| >> >>> INTEGER IERR >>> >>> IERR=0 >>> CALL MPI_FINALIZE(IERR) >>> if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID >>> close(IPT) >>> STOP >>> >>> END SUBROUTINE PSHUTDOWN >>> >>> >>> SUBROUTINE CONTIGUOUS_WORKS >>> IMPLICIT NONE >>> INTEGER, pointer :: ptest(:,:) >>> INTEGER :: IERR, I,J >>> >>> >>> write(ipt,*) "START CONTIGUOUS:" >>> n=2000 ! Set size here... >>> m=n+10 >>> >>> call alloc_vars >>> write(ipt,*) "ALLOCATED DATA" >>> ptest => data(1:N,1:N) >>> >>> IF (MYID == 0) ptest=6 >>> write(ipt,*) "Made POINTER" >>> >>> call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID >>> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) >>> >>> DO I = 1,N >>> DO J = 1,N >>> if(data(I,J) /= 6) & >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) >>> END DO >>> >>> DO J = N+1,M >>> if(data(I,J) /= 0) & >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) >>> END DO >>> >>> END DO >>> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE >>> ! THAT USE AN EXPLICIT SHAPE ARRAY >>> write(ipt,*) "CALLING DUMMY1" >>> CALL DUMMY1 >>> >>> write(ipt,*) "CALLING DUMMY2" >>> call Dummy2(m,n) >>> >>> write(ipt,*) "CALLING DUMMY3" >>> call Dummy3 >>> write(ipt,*) "FINISHED!" >>> >>> END SUBROUTINE CONTIGUOUS_WORKS >>> >>> SUBROUTINE NON_CONTIGUOUS_FAILS >>> IMPLICIT NONE >>> INTEGER, pointer :: ptest(:,:) >>> INTEGER :: IERR, I,J >>> >>> >>> write(ipt,*) "START NON_CONTIGUOUS:" >>> >>> m=200 ! Set size here - crash is size dependent! >>> n=m+10 >>> >>> call alloc_vars >>> write(ipt,*) "ALLOCATED DATA" >>> ptest => data(1:M,1:M) >>> >>> !=================================================== >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? >>> !=================================================== >>> ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT >>> ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG >>> >>> IF (MYID == 0) ptest=6 >>> write(ipt,*) "Made POINTER" >>> >>> call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" >>> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) >>> >>> DO I = 1,M >>> DO J = 1,M >>> if(data(J,I) /= 6) & >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) >>> END DO >>> >>> DO J = M+1,N >>> if(data(J,I) /= 0) & >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) >>> END DO >>> END DO >>> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN ITERFACE >>> ! THAT USE AN EXPLICIT SHAPE ARRAY >>> write(ipt,*) "CALLING DUMMY1" >>> CALL DUMMY1 >>> >>> write(ipt,*) "CALLING DUMMY2" >>> call Dummy2(m,n) ! SHOULD CRASH HERE! >>> >>> write(ipt,*) "CALLING DUMMY3" >>> call Dummy3 >>> write(ipt,*) "FINISHED!" 
>>> >>> END SUBROUTINE NON_CONTIGUOUS_FAILS >>> >>> >>> End Module vars >>> >>> >>> Program main >>> USE vars >>> implicit none >>> >>> >>> CALL INIT_MPI_ENV(MYID,NPROCS) >>> >>> ipt=myid+10 >>> OPEN(ipt) >>> >>> >>> write(ipt,*) "Start memory test!" >>> >>> CALL NON_CONTIGUOUS_FAILS >>> >>> ! CALL CONTIGUOUS_WORKS >>> >>> write(ipt,*) "End memory test!" >>> >>> CALL PSHUTDOWN >>> >>> END Program main >>> >>> >>> >>> ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE >>> ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE >>> >>> SUBROUTINE DUMMY1 >>> USE vars >>> implicit none >>> real, dimension(m) :: my_data >>> >>> write(ipt,*) "m,n",m,n >>> >>> write(ipt,*) "DUMMY 1", size(my_data) >>> >>> END SUBROUTINE DUMMY1 >>> >>> >>> SUBROUTINE DUMMY2(i,j) >>> USE vars >>> implicit none >>> INTEGER, INTENT(IN) ::i,j >>> >>> >>> real, dimension(i,j) :: my_data >>> >>> write(ipt,*) "start: DUMMY 2", size(my_data) >>> >>> >>> END SUBROUTINE DUMMY2 >>> >>> SUBROUTINE DUMMY3 >>> USE vars >>> implicit none >>> >>> >>> real, dimension(m,n) :: my_data >>> >>> >>> write(ipt,*) "start: DUMMY 3", size(my_data) >>> >>> >>> END SUBROUTINE DUMMY3 >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss@cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>> >>> > > From ben.held at staarinc.com Sun Jan 27 20:07:18 2008 From: ben.held at staarinc.com (Ben Held) Date: Sun Jan 27 20:07:30 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c In-Reply-To: References: <00c001c85a06$d8861a70$89924f50$@held@staarinc.com> Message-ID: <004e01c8614a$214d5710$63e80530$@held@staarinc.com> Matt, We have set it to ulimited and are still seeing the same failure. Any other suggestions? Regards, Ben -----Original Message----- From: Matthew Koop [mailto:koop@cse.ohio-state.edu] Sent: Friday, January 18, 2008 6:55 PM To: Ben Held Cc: mvapich-discuss@cse.ohio-state.edu Subject: RE: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c Ben, The maximum locked memory you are allowing on the system is lower than is expected. Can you try increasing that value to closer to the maximum memory of the node? Matt On Fri, 18 Jan 2008, Ben Held wrote: > Matt, > > The version of MVAPICH is mvapich_gcc-0.9.9-1458. I believe this is part of > the OFED distro - it was installed by the manuf. Of the cluster. > > ulimit -l reports 131072 on all nodes. > > Ben > -----Original Message----- > From: Matthew Koop [mailto:koop@cse.ohio-state.edu] > Sent: Thursday, January 17, 2008 9:29 PM > To: Ben Held > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at > line 211 in file vbuf.c > > Ben, > > Sorry to hear about this issue. Can you give me some more details on your > installation -- what distro are you using and is OFED being used? Also, > what version of MVAPICH are you using? > > Additionally, what is the output of 'ulimit -l' on your system (or > equivalent shell command). You may want to check all nodes. Memory > registration generally does not fail unless the amount of lockable memory > is too low. 
> > Matt > > On Thu, 17 Jan 2008, Ben Held wrote: > > > We have recently built our MPI application using MVAPICH1 under LINUX and > > are seeing certain runs fail (success or failure seems to be a function of > > the # of processes - 8 will work, 16 will fail, 32 will work, etc). This > > code has been thoroughly testing using the standard MPICH (Ethernet based) > > and LAM and everything is fine. > > > > > > > > Does this error: > > > > > > > > Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c > > > > > > > > Mean anything? This is a new cluster (8 node, 8 cores per node) has been > > tested under using stress tests provided by the cluster manufacturer > > (Microway). This is out of my area of expertise and this is the first IB > > based system I have worked on. > > > > > > > > Any thoughts? > > > > > > Regards, > > > > > > > > Ben Held > > Simulation Technology & Applied Research, Inc. > > 11520 N. Port Washington Rd., Suite 201 > > Mequon, WI 53092 > > P: 1.262.240.0291 x101 > > F: 1.262.240.0294 > > E: ben.held@staarinc.com > > http://www.staarinc.com > > > > > > > > > > > > > > From jsquyres at cisco.com Sun Jan 27 20:26:33 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Sun Jan 27 20:27:06 2008 Subject: [mvapich-discuss] Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c In-Reply-To: <004e01c8614a$214d5710$63e80530$@held@staarinc.com> References: <00c001c85a06$d8861a70$89924f50$@held@staarinc.com> <004e01c8614a$214d5710$63e80530$@held@staarinc.com> Message-ID: Are you running through a resource manager (such as Torque/PBS, SLURM, SGE/N1GE, LSF, ...etc.)? Resource managers will usually have different limits for jobs launched through queueing mechanisms vs. normal ssh-launched interactive logins. A good test is to launch a job *through the resource manager* than runs "ulimit -l" (or whatever flavor of ulimit is appropriate for your shell) and see what value you get. IIRC, the MVAPICH web pages/documentation have some good docs on how to set the ulimit properly...? You might want to check those out for some more details. On Jan 27, 2008, at 8:07 PM, Ben Held wrote: > Matt, > > We have set it to ulimited and are still seeing the same failure. > > Any other suggestions? > > Regards, > Ben > > -----Original Message----- > From: Matthew Koop [mailto:koop@cse.ohio-state.edu] > Sent: Friday, January 18, 2008 6:55 PM > To: Ben Held > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: RE: [mvapich-discuss] Abort: unable to register vbuf DMA > buffer at > line 211 in file vbuf.c > > Ben, > > The maximum locked memory you are allowing on the system is lower > than is > expected. Can you try increasing that value to closer to the maximum > memory of the node? > > Matt > > On Fri, 18 Jan 2008, Ben Held wrote: > >> Matt, >> >> The version of MVAPICH is mvapich_gcc-0.9.9-1458. I believe this >> is part > of >> the OFED distro - it was installed by the manuf. Of the cluster. >> >> ulimit -l reports 131072 on all nodes. >> >> Ben >> -----Original Message----- >> From: Matthew Koop [mailto:koop@cse.ohio-state.edu] >> Sent: Thursday, January 17, 2008 9:29 PM >> To: Ben Held >> Cc: mvapich-discuss@cse.ohio-state.edu >> Subject: Re: [mvapich-discuss] Abort: unable to register vbuf DMA >> buffer > at >> line 211 in file vbuf.c >> >> Ben, >> >> Sorry to hear about this issue. Can you give me some more details >> on your >> installation -- what distro are you using and is OFED being used? >> Also, >> what version of MVAPICH are you using? 
>> >> Additionally, what is the output of 'ulimit -l' on your system (or >> equivalent shell command). You may want to check all nodes. Memory >> registration generally does not fail unless the amount of lockable >> memory >> is too low. >> >> Matt >> >> On Thu, 17 Jan 2008, Ben Held wrote: >> >>> We have recently built our MPI application using MVAPICH1 under >>> LINUX > and >>> are seeing certain runs fail (success or failure seems to be a >>> function > of >>> the # of processes - 8 will work, 16 will fail, 32 will work, etc). > This >>> code has been thoroughly testing using the standard MPICH (Ethernet > based) >>> and LAM and everything is fine. >>> >>> >>> >>> Does this error: >>> >>> >>> >>> Abort: unable to register vbuf DMA buffer at line 211 in file vbuf.c >>> >>> >>> >>> Mean anything? This is a new cluster (8 node, 8 cores per node) has > been >>> tested under using stress tests provided by the cluster manufacturer >>> (Microway). This is out of my area of expertise and this is the >>> first > IB >>> based system I have worked on. >>> >>> >>> >>> Any thoughts? >>> >>> >>> Regards, >>> >>> >>> >>> Ben Held >>> Simulation Technology & Applied Research, Inc. >>> 11520 N. Port Washington Rd., Suite 201 >>> Mequon, WI 53092 >>> P: 1.262.240.0291 x101 >>> F: 1.262.240.0294 >>> E: ben.held@staarinc.com >>> http://www.staarinc.com >>> >>> >>> >>> >>> >>> >> >> > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jeff Squyres Cisco Systems From jbernstein at penguincomputing.com Mon Jan 28 16:27:25 2008 From: jbernstein at penguincomputing.com (Joshua Bernstein) Date: Mon Jan 28 16:52:34 2008 Subject: [mvapich-discuss] On "Got Completion" and IBV_EVENT Errors Message-ID: <479E48BD.2010806@penguincomputing.com> Hi All, I've seen various posts about this error including something that seems related from this month, though I never see any resolution. http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-January/001340.html When I run a very simple (cpi for example) MVAPICH job using the ch_gen2 transport, the job starts up, but just seems to hang. After a bit of time I am left with this: [1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12 [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 This smells like a timeout! So, after reading through some of the archives I came across the envar VIADEV_USE_SHMEM_COLL, so setting this variable to: VIADEV_USE_SHMEM_COLL=0 seems to allow the job to get a little further. Because now I get STDIO from the process before the hang: ... Hello from Process 0 on n2 Hello from Process 1 on n2 ... Once again I reach a hang, though this is right where the sample program tries to do some MPI communication. The output is as follows: [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line 2554 in file viacheck.c Again, I've read through the archives and have determined that everything seems to check out. ibchecknet and other ibv_ and ib_ commands come up clean. Also the osu_* sample tests exhibit the exact same behavior. I'm totally left in the dark now, so any help would be greatly appreciated. 
Running: RHEL4u6, OFED1.2, and MVAPICH 0.9.9 -Joshua Bernstein Software Engineer Penguin Computing From koop at cse.ohio-state.edu Tue Jan 29 11:05:07 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue Jan 29 11:05:18 2008 Subject: [mvapich-discuss] On "Got Completion" and IBV_EVENT Errors In-Reply-To: <479E48BD.2010806@penguincomputing.com> Message-ID: Joshua, So are you able to run `ibv_rc_pingpong' with a variety of message sizes? You may want to double-check that the cables between machines are well connected as well. With the earlier request you cited, the issue didn't occur for simple microbenchmarks, only with an application. We have previously seen issues when fork or system calls are used in applications (due to incompatibilities with the underlying OpenFabrics drivers). It seems that your issue is more likely to be a setup issue. What does ulimit -l report on your compute nodes? Also, it is unlikely that VIADEV_USE_SHMEM_COLL is causing any issue -- turning off this option means there is less communication in the init phase (which allows you to get to the stdout statements). Thanks, Matt On Mon, 28 Jan 2008, Joshua Bernstein wrote: > Hi All, > > I've seen various posts about this error including something that seems > related from this month, though I never see any resolution. > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-January/001340.html > > When I run a very simple (cpi for example) MVAPICH job using the ch_gen2 > transport, the job starts up, but just seems to hang.
After a bit of > time I am left with this: > > [1:n2] Abort: [n2:1] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12 > [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 > > This smells like a timeout! So, after reading through some of the > archives I came across the envar VIADEV_USE_SHMEM_COLL, so setting this > variable to: > > VIADEV_USE_SHMEM_COLL=0 > > seems to allow the job to get a little further. Because now I get STDIO > from the process before the hang: > > ... > Hello from Process 0 on n2 > Hello from Process 1 on n2 > ... > > Once again I reach a hang, though this is right where the sample program > tries to do some MPI communication. The output is as follows: > > [0:n2] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 > at line 2554 in file viacheck.c > > Again, I've read through the archives and have determined that > everything seems to check out. ibchecknet and other ibv_ and ib_ > commands come up clean. Also the osu_* sample tests exhibit the exact > same behavior. > > I'm totally left in the dark now, so any help would be greatly appreciated. > > Running: RHEL4u6, OFED1.2, and MVAPICH 0.9.9 > > -Joshua Bernstein > Software Engineer > Penguin Computing > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From friedman at ats.ucla.edu Tue Jan 29 18:13:04 2008 From: friedman at ats.ucla.edu (Scott A. Friedman) Date: Tue Jan 29 18:25:21 2008 Subject: [mvapich-discuss] Help with polled desc error Message-ID: <479FB300.8080403@ats.ucla.edu> Hi We have found applications crashing with the following error: [113] Abort: Error code in polled desc! at line 1229 in file rdma_iba_priv.c rank 113 in job 1 n90_57923 caused collective abort of all ranks exit status of rank 113: killed by signal 9 Have not been able to find anything useful on this on the web. Hopefully someone here can shed some light on it. Using mvapich2-1.0.1 Would have to check on the exact build number but it is from the last month or so. Thanks Scott Friedman UCLA From huanwei at cse.ohio-state.edu Wed Jan 30 14:24:33 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Wed Jan 30 14:24:43 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <479FB300.8080403@ats.ucla.edu> Message-ID: Hi Scott, Thanks for letting us know the problem. From your description, however, it looks like there are some problems with your system setup. The program has not passed initialization phase yet. Would you please verify that your system setup is correct by running IB level ibv_* benchmarks? Those benchmarks are standard components of OFED installation and should be available on your systems already. Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Tue, 29 Jan 2008, Scott A. Friedman wrote: > Hi > > We have found applications crashing with the following error: > > [113] Abort: Error code in polled desc! > at line 1229 in file rdma_iba_priv.c > rank 113 in job 1 n90_57923 caused collective abort of all ranks > exit status of rank 113: killed by signal 9 > > Have not been able to find anything useful on this on the web. Hopefully > someone here can shed some light on it. > > Using mvapich2-1.0.1 > > Would have to check on the exact build number but it is from the last > month or so. 
> > Thanks > Scott Friedman > UCLA > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From Mike.Colonno at spacex.com Wed Jan 30 17:10:50 2008 From: Mike.Colonno at spacex.com (Mike Colonno) Date: Wed Jan 30 17:33:25 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: <479FB300.8080403@ats.ucla.edu> Message-ID: We have experienced this as well, and it appears to be a function of Intel's compilers. This application running on just n = 12 produced: rank 11 in job 3 node1_32811 caused collective abort of all ranks exit status of rank 11: killed by signal 9 rank 6 in job 3 node1_32811 caused collective abort of all ranks exit status of rank 6: killed by signal 9 rank 5 in job 3 node1_32811 caused collective abort of all ranks exit status of rank 5: killed by signal 11 rank 4 in job 3 node1_32811 caused collective abort of all ranks exit status of rank 4: killed by signal 9 But the same job on n <=8 nodes runs fine. This happens for at least 3 different MPI codes so it must be a function on the MVAPICH2 compile and / or compiling MVAPICH2 applications using Intel's compilers. I ran all of the ibv_* tests which appear to return nominal output. The job above ran on just 3 different machines with 4 processes on each machine (2x quad-core Xeons, x64, Red Hat Enterprise 4.5). I built MVAPICH2 1.0.1 as well using Intel C++ / Fortran compilers, version 10.1. Is there any way to generate more detailed debug info to see exactly where these processes run into trouble? Thanks, Michael R. Colonno, Ph.D.?| Chief Aerodynamic Engineer Space Exploration Technologies 1 Rocket Road Hawthorne, CA 90250 W:?310?363 6263 | M: 310?570 3299 | F: 310?363 6001 | www.spacex.com -- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written?Consent?of Space Exploration Technologies -- -----Original Message----- From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of wei huang Sent: Wednesday, January 30, 2008 11:25 AM To: Scott A. Friedman Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] Help with polled desc error Hi Scott, Thanks for letting us know the problem. From your description, however, it looks like there are some problems with your system setup. The program has not passed initialization phase yet. Would you please verify that your system setup is correct by running IB level ibv_* benchmarks? Those benchmarks are standard components of OFED installation and should be available on your systems already. Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Tue, 29 Jan 2008, Scott A. Friedman wrote: > Hi > > We have found applications crashing with the following error: > > [113] Abort: Error code in polled desc! > at line 1229 in file rdma_iba_priv.c > rank 113 in job 1 n90_57923 caused collective abort of all ranks > exit status of rank 113: killed by signal 9 > > Have not been able to find anything useful on this on the web. Hopefully > someone here can shed some light on it. > > Using mvapich2-1.0.1 > > Would have to check on the exact build number but it is from the last > month or so. 
> > Thanks > Scott Friedman > UCLA > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > _______________________________________________ mvapich-discuss mailing list mvapich-discuss@cse.ohio-state.edu http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From friedman at ucla.edu Wed Jan 30 17:42:24 2008 From: friedman at ucla.edu (Scott A. Friedman) Date: Wed Jan 30 17:42:32 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: <479FB300.8080403@ats.ucla.edu> Message-ID: <47A0FD50.5030001@ucla.edu> The low level ibv tests work fine. Scott From huanwei at cse.ohio-state.edu Wed Jan 30 18:21:37 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Wed Jan 30 18:21:47 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <47A0F70C.70509@ats.ucla.edu> Message-ID: Hi Scott, On how many processes (and how many nodes) you ran your program? Do you have any environmental variables when you are running the program? Does the error happen on simple test like cpi? Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Wed, 30 Jan 2008, Scott A. Friedman wrote: > The low level ibv tests work fine. From friedman at ats.ucla.edu Wed Jan 30 17:15:40 2008 From: friedman at ats.ucla.edu (Scott A. Friedman) Date: Wed Jan 30 20:08:05 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: <479FB300.8080403@ats.ucla.edu> Message-ID: <47A0F70C.70509@ats.ucla.edu> It does not appear to be specifically related to the intel compiler - we rebuilt everything using gcc last night and had the same problem. This is a new cluster, so I am double checking things are up properly - they appear to be so far - but we see this problem on a couple of other IB clusters here at UCLA as well. mvapich2 1.0.1 and OFED 1.2.5.4, CentOS 5.x, just so we are all on the same page. Scott Mike Colonno wrote: > We have experienced this as well, and it appears to be a function of Intel's compilers. This application running on just n = 12 produced: > > rank 11 in job 3 node1_32811 caused collective abort of all ranks > exit status of rank 11: killed by signal 9 > rank 6 in job 3 node1_32811 caused collective abort of all ranks > exit status of rank 6: killed by signal 9 > rank 5 in job 3 node1_32811 caused collective abort of all ranks > exit status of rank 5: killed by signal 11 > rank 4 in job 3 node1_32811 caused collective abort of all ranks > exit status of rank 4: killed by signal 9 > > But the same job on n <=8 nodes runs fine. This happens for at least 3 different MPI codes so it must be a function on the MVAPICH2 compile and / or compiling MVAPICH2 applications using Intel's compilers. I ran all of the ibv_* tests which appear to return nominal output. The job above ran on just 3 different machines with 4 processes on each machine (2x quad-core Xeons, x64, Red Hat Enterprise 4.5). I built MVAPICH2 1.0.1 as well using Intel C++ / Fortran compilers, version 10.1. Is there any way to generate more detailed debug info to see exactly where these processes run into trouble? > > Thanks, > > Michael R. Colonno, Ph.D. 
| Chief Aerodynamic Engineer > Space Exploration Technologies > 1 Rocket Road > Hawthorne, CA 90250 > W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com > > -- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies -- > > > > -----Original Message----- > From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of wei huang > Sent: Wednesday, January 30, 2008 11:25 AM > To: Scott A. Friedman > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Help with polled desc error > > Hi Scott, > > Thanks for letting us know the problem. From your description, however, it > looks like there are some problems with your system setup. The program has > not passed initialization phase yet. > > Would you please verify that your system setup is correct by running IB > level ibv_* benchmarks? Those benchmarks are standard components of OFED > installation and should be available on your systems already. > > Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Tue, 29 Jan 2008, Scott A. Friedman wrote: > >> Hi >> >> We have found applications crashing with the following error: >> >> [113] Abort: Error code in polled desc! >> at line 1229 in file rdma_iba_priv.c >> rank 113 in job 1 n90_57923 caused collective abort of all ranks >> exit status of rank 113: killed by signal 9 >> >> Have not been able to find anything useful on this on the web. Hopefully >> someone here can shed some light on it. >> >> Using mvapich2-1.0.1 >> >> Would have to check on the exact build number but it is from the last >> month or so. >> >> Thanks >> Scott Friedman >> UCLA >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From friedman at ucla.edu Wed Jan 30 21:32:43 2008 From: friedman at ucla.edu (Scott A. Friedman) Date: Wed Jan 30 21:32:52 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: Message-ID: <47A1334B.5030906@ucla.edu> My co-worker passed this along... Yes, the error happens on the cpi.c program too. It happened 2 times among the 9 cases I ran. I was using 128 processes (on 32 4-core nodes). --- and another... It happens for a simple MPI program which just does MPI_Init and MPI_Finalize and print out number of processors. It happened for anything from 4 nodes (16 processors ) and more. What environment variables should we look for? Thanks, Scott wei huang wrote: > Hi Scott, > > On how many processes (and how many nodes) you ran your program? Do you > have any environmental variables when you are running the program? Does > the error happen on simple test like cpi? > > Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > >> The low level ibv tests work fine. 
> > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From schuang at ats.ucla.edu Wed Jan 30 19:23:33 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Wed Jan 30 21:36:12 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <47A10E92.30501@ats.ucla.edu> References: <47A10E92.30501@ats.ucla.edu> Message-ID: <20080131002333.GA22541@ats.ucla.edu> Hi, Yes, the error happens on the cpi.c program too. It happened 2 times among the 9 cases I ran. I was using 128 processes (on 32 4-core nodes). Shao-Ching Huang UCLA On Wed, Jan 30, 2008 at 03:56:02PM -0800, Scott Friedman wrote: > Guys, can you answer this for me? > > Scott > > Subject: Re: [mvapich-discuss] Help with polled desc error > From: wei huang > To: "Scott A. Friedman" > cc: mvapich-discuss@cse.ohio-state.edu > Date: Wed, 30 Jan 2008 18:21:37 -0500 (EST) > > Hi Scott, > > On how many processes (and how many nodes) you ran your program? Do you > have any environmental variables when you are running the program? Does > the error happen on simple test like cpi? > > Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > The low level ibv tests work fine. > From Terrence.LIAO at total.com Thu Jan 31 08:46:18 2008 From: Terrence.LIAO at total.com (Terrence.LIAO@total.com) Date: Thu Jan 31 10:31:39 2008 Subject: [mvapich-discuss] where can I find similar env setting on mvapich as these three: MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX Message-ID: My MPI code die on MPI-IO using mvapich 1.0.? On SGI Altix, the problem was solved by tuning these 3 parameters: ??? MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX However, there are SGI specific,? does mvapich have similar parameters? Thank?you?very?much. --?Terrence -------------------------------------------------------- Terrence?Liao TOTAL?E&P?RESEARCH?&?TECHNOLOGY?USA,?LLC Email:?terrence.liao@total.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080131/f0f4d75c/attachment.html From weikuan.yu at gmail.com Thu Jan 31 12:31:04 2008 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Thu Jan 31 12:31:19 2008 Subject: [mvapich-discuss] where can I find similar env setting on mvapich as these three: MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX In-Reply-To: References: Message-ID: <47A205D8.4010903@gmail.com> Hi, Terrence, Your report is intriguing, hence my questions: 1) What does your MPI code do? how does it die? 2) What system you are running with? What file system you are using? 3) What are the three parameters for? How did they solve your problem? Any detail info? --Weikuan Terrence.LIAO@total.com wrote: > My MPI code die on MPI-IO using mvapich 1.0. On SGI Altix, the problem > was solved by tuning these 3 parameters: > MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX > However, there are SGI specific, does mvapich have similar parameters? > > Thank you very much. 
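One note in the meantime: as far as I know, MVAPICH and MVAPICH2 inherit MPICH's dynamic allocation of communicators, datatypes and groups, so there are no fixed-size tables to enlarge the way SGI's MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX do. If the Altix tuning only papered over MPI objects being leaked in the I/O path, the portable fix is to free them after each use. A rough, purely illustrative sketch (the routine and variable names here are invented, not taken from your code):

    subroutine write_block(fh, gsizes, lsizes, starts, buf, ierr)
      use mpi
      implicit none
      integer, intent(in)  :: fh, gsizes(2), lsizes(2), starts(2)
      real,    intent(in)  :: buf(:,:)
      integer, intent(out) :: ierr
      integer :: filetype
      integer(kind=MPI_OFFSET_KIND) :: disp

      disp = 0
      call MPI_TYPE_CREATE_SUBARRAY(2, gsizes, lsizes, starts, &
           MPI_ORDER_FORTRAN, MPI_REAL, filetype, ierr)
      call MPI_TYPE_COMMIT(filetype, ierr)
      call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL, filetype, 'native', &
           MPI_INFO_NULL, ierr)
      call MPI_FILE_WRITE_ALL(fh, buf, size(buf), MPI_REAL, &
           MPI_STATUS_IGNORE, ierr)
      call MPI_TYPE_FREE(filetype, ierr)   ! free the view type on every call
    end subroutine write_block

Whether that applies to your failure depends on the answers to the questions above.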
> > -- Terrence > -------------------------------------------------------- > Terrence Liao > TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC > Email: terrence.liao@total.com > > > ------------------------------------------------------------------------ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From curtisbr at cse.ohio-state.edu Thu Jan 31 12:36:01 2008 From: curtisbr at cse.ohio-state.edu (Brian Curtis) Date: Thu Jan 31 12:36:20 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> <479A1CF0.1040307@cse.ohio-state.edu> <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> Message-ID: David, The MPI-2 documentation goes into great detail on issues with Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/ node236.htm#Node236). The conditions you are seeing should be directed to Intel. Brian On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > > Hi again Brian > > I just ran my test code on our cluster using ifort 10.1.011 and > MVAPICH 1.0.1, but the behavior is still the same. > > Have you had a chance to try it on any of your test machines? > > David > > > > > On Jan 25, 2008 12:31 PM, Brian Curtis state.edu> wrote: > David, > > I did some research on this issue and it looks like you have posted > the > bug with Intel. Please let us know what you find out. > > > Brian > > David Stuebe wrote: > > Hi Brian > > > > I downloaded the public release, it seems silly but I am not sure > how to get > > a rev number from the source... there does not seem to be a '- > version' > > option that gives more info, although I did not look to hard. > > > > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on > the > > cluster I will try 1.0.1 and see if it goes away. > > > > In the mean time please let me know if you can recreate the problem? > > > > David > > > > PS - Just want to make sure you understand my issue, I think it > is a bad > > idea to try and pass a non-contiguous F90 memory pointer, I > should not do > > that... but the way that it breaks has caused me headaches for > weeks now. If > > it reliably caused a sigsev on entering MPI_BCAST that would be > great! As it > > is it is really hard to trace the problem. > > > > > > > > > > On Jan 23, 2008 3:23 PM, Brian Curtis state.edu> wrote: > > > > > >> David, > >> > >> Sorry to hear you are experience problems with the MVAPICH2 > Fortran 90 > >> interface. I will be investigating this issue, but need some > additional > >> information about your setup. What is the exact version of > MVAPICH2 1.0 > >> you are utilizing (daily tarball or release)? Have you tried > MVAPICH2 > >> 1.0.1? > >> > >> Brian > >> > >> David Stuebe wrote: > >> > >>> Hello MVAPICH > >>> I have found a strange bug in MVAPICH2 using IFORT. The > behavior is very > >>> strange indeed - it seems to be related to how ifort deals with > passing > >>> pointers to the MVAPICH FORTRAN 90 INTERFACE. > >>> The MPI call returns successfully, but later calls to a dummy > subroutine > >>> cause a sigsev. > >>> > >>> Please look at the following code: > >>> > >>> > >>> > >> ! > ====================================================================== > =========== > >> > >> ! 
> ====================================================================== > =========== > >> > >> ! > ====================================================================== > =========== > >> > >>> ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT > >>> ! WRITEN BY: DAVID STUEBE > >>> ! DATE: JAN 23, 2008 > >>> ! > >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest > >>> ! > >>> ! KNOWN BEHAVIOR: > >>> ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE OF > >>> ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS > WITHOUT AN > >>> INTERFACE - > >>> ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING > VALID > >>> > >> DATA. > >> > >>> ! > >>> ! COMMENTS: > >>> ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - > SHAME ON > >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR > NOT. > >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS > >>> ! EXTREMELY DIFFICULT TO DEBUG! > >>> ! > >>> ! CONDITIONS FOR OCCURANCE: > >>> ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' > >>> ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? > >>> ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... > >>> ! ie Running inside one SMP box does not crash. > >>> ! > >>> ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV > >>> ! RUNNING UNDER MPIEXEC0.82 FOR PBS, > >>> ! ONLY SOME PROCESSES SIGSEV ? > >>> ! > >>> ! ENVIRONMENTAL INFO: > >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X > >>> ! SYSTEM: ROCKS 4.2 > >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) > >>> ! > >>> ! IFORT/ICC: > >>> ! Intel(R) Fortran Compiler for Intel(R) EM64T-based > applications, > >>> ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 > >>> ! > >>> ! MVAPICH2: mpif90 for mvapich2-1.0 > >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0 > >>> --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd -- > enable-f90 > >>> --enable-cxx --disable-romio --without-mpe > >>> ! > >>> > >>> > >> ! > ====================================================================== > =========== > >> > >> ! > ====================================================================== > =========== > >> > >> ! > ====================================================================== > =========== > >> > >>> Module vars > >>> USE MPI > >>> implicit none > >>> > >>> > >>> integer :: n,m,MYID,NPROCS > >>> integer :: ipt > >>> > >>> integer, allocatable, target :: data(:,:) > >>> > >>> contains > >>> > >>> subroutine alloc_vars > >>> implicit none > >>> > >>> integer Status > >>> > >>> allocate(data(n,m),stat=status) > >>> if (status /=0) then > >>> write(ipt,*) "allocation error" > >>> stop > >>> end if > >>> > >>> data = 0 > >>> > >>> end subroutine alloc_vars > >>> > >>> SUBROUTINE INIT_MPI_ENV(ID,NP) > >>> > >>> > >> ! > ====================================================================== > =============| > >> > >>> ! INITIALIZE MPI > >>> > ENVIRONMENT | > >>> > >>> > >> ! > ====================================================================== > =============| > >> > >>> INTEGER, INTENT(OUT) :: ID,NP > >>> INTEGER IERR > >>> > >>> IERR=0 > >>> > >>> CALL MPI_INIT(IERR) > >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID > >>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) > >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID > >>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) > >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID > >>> > >>> END SUBROUTINE INIT_MPI_ENV > >>> > >>> > >>> > >>> > >> ! 
> ====================================================================== > ========| > >> > >>> SUBROUTINE PSHUTDOWN > >>> > >>> > >>> > >> ! > ====================================================================== > ========| > >> > >>> INTEGER IERR > >>> > >>> IERR=0 > >>> CALL MPI_FINALIZE(IERR) > >>> if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID > >>> close(IPT) > >>> STOP > >>> > >>> END SUBROUTINE PSHUTDOWN > >>> > >>> > >>> SUBROUTINE CONTIGUOUS_WORKS > >>> IMPLICIT NONE > >>> INTEGER, pointer :: ptest(:,:) > >>> INTEGER :: IERR, I,J > >>> > >>> > >>> write(ipt,*) "START CONTIGUOUS:" > >>> n=2000 ! Set size here... > >>> m=n+10 > >>> > >>> call alloc_vars > >>> write(ipt,*) "ALLOCATED DATA" > >>> ptest => data(1:N,1:N) > >>> > >>> IF (MYID == 0) ptest=6 > >>> write(ipt,*) "Made POINTER" > >>> > >>> call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID > >>> > >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > >>> > >>> DO I = 1,N > >>> DO J = 1,N > >>> if(data(I,J) /= 6) & > >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > >>> END DO > >>> > >>> DO J = N+1,M > >>> if(data(I,J) /= 0) & > >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > >>> END DO > >>> > >>> END DO > >>> > >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > ITERFACE > >>> ! THAT USE AN EXPLICIT SHAPE ARRAY > >>> write(ipt,*) "CALLING DUMMY1" > >>> CALL DUMMY1 > >>> > >>> write(ipt,*) "CALLING DUMMY2" > >>> call Dummy2(m,n) > >>> > >>> write(ipt,*) "CALLING DUMMY3" > >>> call Dummy3 > >>> write(ipt,*) "FINISHED!" > >>> > >>> END SUBROUTINE CONTIGUOUS_WORKS > >>> > >>> SUBROUTINE NON_CONTIGUOUS_FAILS > >>> IMPLICIT NONE > >>> INTEGER, pointer :: ptest(:,:) > >>> INTEGER :: IERR, I,J > >>> > >>> > >>> write(ipt,*) "START NON_CONTIGUOUS:" > >>> > >>> m=200 ! Set size here - crash is size dependent! > >>> n=m+10 > >>> > >>> call alloc_vars > >>> write(ipt,*) "ALLOCATED DATA" > >>> ptest => data(1:M,1:M) > >>> > >>> !=================================================== > >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? > >>> !=================================================== > >>> ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT > >>> ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG > >>> > >>> IF (MYID == 0) ptest=6 > >>> write(ipt,*) "Made POINTER" > >>> > >>> call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" > >>> > >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > >>> > >>> DO I = 1,M > >>> DO J = 1,M > >>> if(data(J,I) /= 6) & > >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > >>> END DO > >>> > >>> DO J = M+1,N > >>> if(data(J,I) /= 0) & > >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > >>> END DO > >>> END DO > >>> > >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > ITERFACE > >>> ! THAT USE AN EXPLICIT SHAPE ARRAY > >>> write(ipt,*) "CALLING DUMMY1" > >>> CALL DUMMY1 > >>> > >>> write(ipt,*) "CALLING DUMMY2" > >>> call Dummy2(m,n) ! SHOULD CRASH HERE! > >>> > >>> write(ipt,*) "CALLING DUMMY3" > >>> call Dummy3 > >>> write(ipt,*) "FINISHED!" > >>> > >>> END SUBROUTINE NON_CONTIGUOUS_FAILS > >>> > >>> > >>> End Module vars > >>> > >>> > >>> Program main > >>> USE vars > >>> implicit none > >>> > >>> > >>> CALL INIT_MPI_ENV(MYID,NPROCS) > >>> > >>> ipt=myid+10 > >>> OPEN(ipt) > >>> > >>> > >>> write(ipt,*) "Start memory test!" > >>> > >>> CALL NON_CONTIGUOUS_FAILS > >>> > >>> ! CALL CONTIGUOUS_WORKS > >>> > >>> write(ipt,*) "End memory test!" 
> >>> > >>> CALL PSHUTDOWN > >>> > >>> END Program main > >>> > >>> > >>> > >>> ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS > >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE > >>> ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > >>> > >>> SUBROUTINE DUMMY1 > >>> USE vars > >>> implicit none > >>> real, dimension(m) :: my_data > >>> > >>> write(ipt,*) "m,n",m,n > >>> > >>> write(ipt,*) "DUMMY 1", size(my_data) > >>> > >>> END SUBROUTINE DUMMY1 > >>> > >>> > >>> SUBROUTINE DUMMY2(i,j) > >>> USE vars > >>> implicit none > >>> INTEGER, INTENT(IN) ::i,j > >>> > >>> > >>> real, dimension(i,j) :: my_data > >>> > >>> write(ipt,*) "start: DUMMY 2", size(my_data) > >>> > >>> > >>> END SUBROUTINE DUMMY2 > >>> > >>> SUBROUTINE DUMMY3 > >>> USE vars > >>> implicit none > >>> > >>> > >>> real, dimension(m,n) :: my_data > >>> > >>> > >>> write(ipt,*) "start: DUMMY 3", size(my_data) > >>> > >>> > >>> END SUBROUTINE DUMMY3 > >>> > >>> > >>> > ---------------------------------------------------------------------- > -- > >>> > >>> _______________________________________________ > >>> mvapich-discuss mailing list > >>> mvapich-discuss@cse.ohio-state.edu > >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>> > >>> > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080131/76b5f701/attachment-0001.html From huanwei at cse.ohio-state.edu Thu Jan 31 13:05:06 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Thu Jan 31 13:05:16 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <47A1334B.5030906@ucla.edu> Message-ID: Hi Scott, We went up to 256 processes (32 nodes) and did not see the problem in few hundred runs (cpi). Thus, to narrow down the problem, we want to make sure the fabrics and system setup are ok. To diagnose this, we suggest you running mpiGraph program from http://sourceforge.net/projects/mpigraph. This test stresses the interconnects. It should fail at a much higher frequency than simple cpi program if there is a problem with your system setup. Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Wed, 30 Jan 2008, Scott A. Friedman wrote: > My co-worker passed this along... > > Yes, the error happens on the cpi.c program too. It happened 2 times > among the 9 cases I ran. > > I was using 128 processes (on 32 4-core nodes). > > --- > > and another... > > It happens for a simple MPI program which just does MPI_Init and > MPI_Finalize and print out number of processors. It happened for > anything from 4 nodes (16 processors ) and more. > > What environment variables should we look for? > > Thanks, > Scott > > wei huang wrote: > > Hi Scott, > > > > On how many processes (and how many nodes) you ran your program? Do you > > have any environmental variables when you are running the program? Does > > the error happen on simple test like cpi? > > > > Thanks. > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > >> The low level ibv tests work fine. 
> > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From jsquyres at cisco.com Thu Jan 31 13:06:46 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu Jan 31 13:07:32 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> <479A1CF0.1040307@cse.ohio-state.edu> <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> Message-ID: <5747B25C-4D73-4C88-BC22-8821F71E2034@cisco.com> Brian is completely correct - if the F90 compiler chooses to make temporary buffers in order to pass array subsections to non-blocking MPI functions, there's little that an MPI implementation can do. Simply put: MPI requires that when you use non-blocking communications, the buffer must be available until you call some flavor of MPI_TEST or MPI_WAIT to complete the communication. I don't know of any way for an MPI implementation to know whether it has been handed a temporary buffer (e.g., one that a compiler silently created to pass an array subsection). Do you know if there is a way? On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote: > David, > > The MPI-2 documentation goes into great detail on issues with > Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236 > ). The conditions you are seeing should be directed to Intel. > > > Brian > > > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > >> >> Hi again Brian >> >> I just ran my test code on our cluster using ifort 10.1.011 and >> MVAPICH 1.0.1, but the behavior is still the same. >> >> Have you had a chance to try it on any of your test machines? >> >> David >> >> >> >> >> On Jan 25, 2008 12:31 PM, Brian Curtis > state.edu> wrote: >> David, >> >> I did some research on this issue and it looks like you have posted >> the >> bug with Intel. Please let us know what you find out. >> >> >> Brian >> >> David Stuebe wrote: >> > Hi Brian >> > >> > I downloaded the public release, it seems silly but I am not sure >> how to get >> > a rev number from the source... there does not seem to be a '- >> version' >> > option that gives more info, although I did not look to hard. >> > >> > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on >> the >> > cluster I will try 1.0.1 and see if it goes away. >> > >> > In the mean time please let me know if you can recreate the >> problem? >> > >> > David >> > >> > PS - Just want to make sure you understand my issue, I think it >> is a bad >> > idea to try and pass a non-contiguous F90 memory pointer, I >> should not do >> > that... but the way that it breaks has caused me headaches for >> weeks now. If >> > it reliably caused a sigsev on entering MPI_BCAST that would be >> great! As it >> > is it is really hard to trace the problem. >> > >> > >> > >> > >> > On Jan 23, 2008 3:23 PM, Brian Curtis > state.edu> wrote: >> > >> > >> >> David, >> >> >> >> Sorry to hear you are experience problems with the MVAPICH2 >> Fortran 90 >> >> interface. I will be investigating this issue, but need some >> additional >> >> information about your setup. What is the exact version of >> MVAPICH2 1.0 >> >> you are utilizing (daily tarball or release)? Have you tried >> MVAPICH2 >> >> 1.0.1? 
>> >>
>> >> Brian
>> >>
>> >> David Stuebe wrote:
>> >>> [... David's test program and the trailing list footers were quoted in full here; snipped as duplicates of the listing and footers above ...]

--
Jeff Squyres
Cisco Systems

From dstuebe at umassd.edu Thu Jan 31 13:32:23 2008
From: dstuebe at umassd.edu (David Stuebe)
Date: Thu Jan 31 13:32:43 2008
Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug?
In-Reply-To: <5747B25C-4D73-4C88-BC22-8821F71E2034@cisco.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> <479A1CF0.1040307@cse.ohio-state.edu> <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> <5747B25C-4D73-4C88-BC22-8821F71E2034@cisco.com> Message-ID: <1f31dac10801311032m5e09064bo6858b7d45f57ca26@mail.gmail.com> Hi Jeff, Brian Maybe I don't fully understand all the issues involved but I did read through several web sites that discuss the dangers of passing temporary arrays to non blocking MPI calls. Is MPI_BCAST non-blocking - I assumed that was a blocking call anyway? Again, my concern is that MPI call returns the data on all processors as (perhaps, naively) expected, it is later in the program that an alloc called on entry to a different subroutine for an explicit shape array causes a sig sev. There is further evidence that it is an MPI issue because the problem is memory-size dependent, and only occurs when run using more than one node, using mvapich2.0. MPICH2.0 when I tested that on our cluster which does not have infiniband. Have you had a chance to experiment with the demo code that I sent. I think the behavior warrants a little further investigation. Thanks David On Jan 31, 2008 1:06 PM, Jeff Squyres wrote: > Brian is completely correct - if the F90 compiler chooses to make > temporary buffers in order to pass array subsections to non-blocking > MPI functions, there's little that an MPI implementation can do. > Simply put: MPI requires that when you use non-blocking > communications, the buffer must be available until you call some > flavor of MPI_TEST or MPI_WAIT to complete the communication. > > I don't know of any way for an MPI implementation to know whether it > has been handed a temporary buffer (e.g., one that a compiler silently > created to pass an array subsection). Do you know if there is a way? > > > > On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote: > > > David, > > > > The MPI-2 documentation goes into great detail on issues with > > Fortran-90 bindings ( > http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236 > > ). The conditions you are seeing should be directed to Intel. > > > > > > Brian > > > > > > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > > > >> > >> Hi again Brian > >> > >> I just ran my test code on our cluster using ifort 10.1.011 and > >> MVAPICH 1.0.1, but the behavior is still the same. > >> > >> Have you had a chance to try it on any of your test machines? > >> > >> David > >> > >> > >> > >> > >> On Jan 25, 2008 12:31 PM, Brian Curtis >> state.edu> wrote: > >> David, > >> > >> I did some research on this issue and it looks like you have posted > >> the > >> bug with Intel. Please let us know what you find out. > >> > >> > >> Brian > >> > >> David Stuebe wrote: > >> > Hi Brian > >> > > >> > I downloaded the public release, it seems silly but I am not sure > >> how to get > >> > a rev number from the source... there does not seem to be a '- > >> version' > >> > option that gives more info, although I did not look to hard. > >> > > >> > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on > >> the > >> > cluster I will try 1.0.1 and see if it goes away. > >> > > >> > In the mean time please let me know if you can recreate the > >> problem? 
> >> >
> >> > David
> >> >
> >> > [... the rest of the quoted exchange, including another full copy of the test program, snipped as a duplicate of the messages above ...]
> >> >>> ! 
DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > >> >>> > >> >>> SUBROUTINE DUMMY1 > >> >>> USE vars > >> >>> implicit none > >> >>> real, dimension(m) :: my_data > >> >>> > >> >>> write(ipt,*) "m,n",m,n > >> >>> > >> >>> write(ipt,*) "DUMMY 1", size(my_data) > >> >>> > >> >>> END SUBROUTINE DUMMY1 > >> >>> > >> >>> > >> >>> SUBROUTINE DUMMY2(i,j) > >> >>> USE vars > >> >>> implicit none > >> >>> INTEGER, INTENT(IN) ::i,j > >> >>> > >> >>> > >> >>> real, dimension(i,j) :: my_data > >> >>> > >> >>> write(ipt,*) "start: DUMMY 2", size(my_data) > >> >>> > >> >>> > >> >>> END SUBROUTINE DUMMY2 > >> >>> > >> >>> SUBROUTINE DUMMY3 > >> >>> USE vars > >> >>> implicit none > >> >>> > >> >>> > >> >>> real, dimension(m,n) :: my_data > >> >>> > >> >>> > >> >>> write(ipt,*) "start: DUMMY 3", size(my_data) > >> >>> > >> >>> > >> >>> END SUBROUTINE DUMMY3 > >> >>> > >> >>> > >> >>> > >> > ------------------------------------------------------------------------ > >> >>> > >> >>> _______________________________________________ > >> >>> mvapich-discuss mailing list > >> >>> mvapich-discuss@cse.ohio-state.edu > >> >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> >>> > >> >>> > >> > > >> > > >> > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > -- > Jeff Squyres > Cisco Systems > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080131/4053f5a3/attachment-0001.html From curtisbr at cse.ohio-state.edu Thu Jan 31 14:28:04 2008 From: curtisbr at cse.ohio-state.edu (Brian Curtis) Date: Thu Jan 31 14:28:21 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: <1f31dac10801311032m5e09064bo6858b7d45f57ca26@mail.gmail.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> <479A1CF0.1040307@cse.ohio-state.edu> <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> <5747B25C-4D73-4C88-BC22-8821F71E2034@cisco.com> <1f31dac10801311032m5e09064bo6858b7d45f57ca26@mail.gmail.com> Message-ID: <7882E549-127B-4FBD-B0D8-D5CBA874C951@cse.ohio-state.edu> David, Have you compiled and testing the application with other f90 compilers? Brian On Jan 31, 2008, at 1:32 PM, David Stuebe wrote: > > Hi Jeff, Brian > > Maybe I don't fully understand all the issues involved but I did > read through several web sites that discuss the dangers of passing > temporary arrays to non blocking MPI calls. Is MPI_BCAST non- > blocking - I assumed that was a blocking call anyway? > > Again, my concern is that MPI call returns the data on all > processors as (perhaps, naively) expected, it is later in the > program that an alloc called on entry to a different subroutine for > an explicit shape array causes a sig sev. There is further evidence > that it is an MPI issue because the problem is memory-size > dependent, and only occurs when run using more than one node, using > mvapich2.0. MPICH2.0 when I tested that on our cluster which does > not have infiniband. > > Have you had a chance to experiment with the demo code that I sent. > I think the behavior warrants a little further investigation. 
> > Thanks
> > David
> >
> > On Jan 31, 2008 1:06 PM, Jeff Squyres wrote:
> [... the rest of the quoted thread snipped as a duplicate of the messages and test program above ...]
> >> >>> ! 
DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > >> >>> > >> >>> SUBROUTINE DUMMY1 > >> >>> USE vars > >> >>> implicit none > >> >>> real, dimension(m) :: my_data > >> >>> > >> >>> write(ipt,*) "m,n",m,n > >> >>> > >> >>> write(ipt,*) "DUMMY 1", size(my_data) > >> >>> > >> >>> END SUBROUTINE DUMMY1 > >> >>> > >> >>> > >> >>> SUBROUTINE DUMMY2(i,j) > >> >>> USE vars > >> >>> implicit none > >> >>> INTEGER, INTENT(IN) ::i,j > >> >>> > >> >>> > >> >>> real, dimension(i,j) :: my_data > >> >>> > >> >>> write(ipt,*) "start: DUMMY 2", size(my_data) > >> >>> > >> >>> > >> >>> END SUBROUTINE DUMMY2 > >> >>> > >> >>> SUBROUTINE DUMMY3 > >> >>> USE vars > >> >>> implicit none > >> >>> > >> >>> > >> >>> real, dimension(m,n) :: my_data > >> >>> > >> >>> > >> >>> write(ipt,*) "start: DUMMY 3", size(my_data) > >> >>> > >> >>> > >> >>> END SUBROUTINE DUMMY3 > >> >>> > >> >>> > >> >>> > >> > ---------------------------------------------------------------------- > -- > >> >>> > >> >>> _______________________________________________ > >> >>> mvapich-discuss mailing list > >> >>> mvapich-discuss@cse.ohio-state.edu > >> >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> >>> > >> >>> > >> > > >> > > >> > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > -- > Jeff Squyres > Cisco Systems > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080131/bd22c795/attachment-0001.html From jsquyres at cisco.com Thu Jan 31 14:53:48 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu Jan 31 14:54:29 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: <1f31dac10801311032m5e09064bo6858b7d45f57ca26@mail.gmail.com> References: <1f31dac10801231129n65e6851ei49085e071cf1492e@mail.gmail.com> <4797A227.5030002@cse.ohio-state.edu> <1f31dac10801231303v645eab2bu8208e9719456f8dd@mail.gmail.com> <479A1CF0.1040307@cse.ohio-state.edu> <1f31dac10801310859l65fabf4fp9950ea8575266e5c@mail.gmail.com> <5747B25C-4D73-4C88-BC22-8821F71E2034@cisco.com> <1f31dac10801311032m5e09064bo6858b7d45f57ca26@mail.gmail.com> Message-ID: <4BCD4A22-812E-42F2-83A1-FBB5B91216D8@cisco.com> On Jan 31, 2008, at 1:32 PM, David Stuebe wrote: > Maybe I don't fully understand all the issues involved but I did > read through several web sites that discuss the dangers of passing > temporary arrays to non blocking MPI calls. Is MPI_BCAST non- > blocking - I assumed that was a blocking call anyway? Yes it is; my bad for not noticing that that is what you were asking about. :-) > Again, my concern is that MPI call returns the data on all > processors as (perhaps, naively) expected, it is later in the > program that an alloc called on entry to a different subroutine for > an explicit shape array causes a sig sev. There is further evidence > that it is an MPI issue because the problem is memory-size > dependent, and only occurs when run using more than one node, using > mvapich2.0. MPICH2.0 when I tested that on our cluster which does > not have infiniband. Looking at your example code, I don't understand all of the F90 syntax to fully appreciate what's going on. It looks like you're passing an array subset to MPI_BCAST and when that is a non-contiguous buffer, problems *may* occur later. Is that what you're trying to say? 
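To make the pattern under discussion concrete, here is a minimal sketch -- hypothetical names, not taken from David's program -- of a non-contiguous array section being broadcast, together with the explicit contiguous-copy workaround that avoids relying on any compiler-generated temporary:

  program bcast_section_sketch
    use mpi
    implicit none
    integer, parameter :: m = 200
    integer, allocatable, target :: adata(:,:)
    integer, pointer             :: asec(:,:)
    integer, allocatable         :: tmp(:,:)
    integer :: myid, ierr

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

    allocate(adata(m+10, m))      ! 10 extra rows per column
    adata = 0
    asec => adata(1:m, 1:m)       ! non-contiguous section: a 10-element gap between
                                  ! columns, so the compiler may hand MPI_BCAST a
                                  ! hidden copy-in/copy-out temporary
    if (myid == 0) asec = 6

    ! explicit workaround: broadcast a contiguous copy under the program's control
    allocate(tmp(m, m))
    tmp = adata(1:m, 1:m)
    call MPI_BCAST(tmp, m*m, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
    adata(1:m, 1:m) = tmp
    deallocate(tmp)

    call MPI_FINALIZE(ierr)
  end program bcast_section_sketch

With the copy made explicitly, the buffer handed to MPI_BCAST is ordinary contiguous memory whose lifetime the program controls, so neither the compiler's temporary-buffer choices nor the library's registration cache enters the picture.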
My *guesses/speculation* are: - perhaps there's an issue with compiler-provided temporary buffers that are registered by MVAPICH and then later freed by the compiler, but somehow evade being unregistered by MVAPICH (this could lead to heap corruption that manifests later) - I don't know if fortran compilers are allowed to move buffers at will, such as in garbage collection and/or memory compacting schemes (do you?) -- this could lead to a similar problem that I describe above Again, these are pure speculation. I really don't know how F90 compilers work, and I don't know what MVAPICH does with registered memory caching and/or progress threads, so further speculation is fairly pointless. :-) MVAPICH developers: can you comment on this? And just to be sure -- you compiled MVAPICH with the same compilers that you're using, with the same levels of optimization, etc., right? > On Jan 31, 2008 1:06 PM, Jeff Squyres wrote: > Brian is completely correct - if the F90 compiler chooses to make > temporary buffers in order to pass array subsections to non-blocking > MPI functions, there's little that an MPI implementation can do. > Simply put: MPI requires that when you use non-blocking > communications, the buffer must be available until you call some > flavor of MPI_TEST or MPI_WAIT to complete the communication. > > I don't know of any way for an MPI implementation to know whether it > has been handed a temporary buffer (e.g., one that a compiler silently > created to pass an array subsection). Do you know if there is a way? > > > > On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote: > > > David, > > > > The MPI-2 documentation goes into great detail on issues with > > Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236 > > ). The conditions you are seeing should be directed to Intel. > > > > > > Brian > > > > > > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > > > >> > >> Hi again Brian > >> > >> I just ran my test code on our cluster using ifort 10.1.011 and > >> MVAPICH 1.0.1, but the behavior is still the same. > >> > >> Have you had a chance to try it on any of your test machines? > >> > >> David > >> > >> > >> > >> > >> On Jan 25, 2008 12:31 PM, Brian Curtis >> state.edu> wrote: > >> David, > >> > >> I did some research on this issue and it looks like you have posted > >> the > >> bug with Intel. Please let us know what you find out. > >> > >> > >> Brian > >> > >> David Stuebe wrote: > >> > Hi Brian > >> > > >> > I downloaded the public release, it seems silly but I am not sure > >> how to get > >> > a rev number from the source... there does not seem to be a '- > >> version' > >> > option that gives more info, although I did not look to hard. > >> > > >> > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on > >> the > >> > cluster I will try 1.0.1 and see if it goes away. > >> > > >> > In the mean time please let me know if you can recreate the > >> problem? > >> > > >> > David > >> > > >> > PS - Just want to make sure you understand my issue, I think it > >> is a bad > >> > idea to try and pass a non-contiguous F90 memory pointer, I > >> should not do > >> > that... but the way that it breaks has caused me headaches for > >> weeks now. If > >> > it reliably caused a sigsev on entering MPI_BCAST that would be > >> great! As it > >> > is it is really hard to trace the problem. 
> >> >
> >> > On Jan 23, 2008 3:23 PM, Brian Curtis wrote:
> >> > [... the rest of the quoted thread snipped as a duplicate of the earlier messages and the test program above ...]

--
Jeff Squyres
Cisco Systems

From koop at cse.ohio-state.edu Thu Jan 31 16:28:30 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Thu Jan 31 16:28:39 2008
Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug?
In-Reply-To: <4BCD4A22-812E-42F2-83A1-FBB5B91216D8@cisco.com>
Message-ID: 

David,

Can you try running with a lower optimization level or a different compiler? I'd also like to encourage you to update to the OpenFabrics/Gen2 stack instead of VAPI.
If you want to rule out the memory registration issue that Jeff has suggested you can try it without. Since you are using VAPI (which is no longer where we add new features), you will have to remove the -DLAZY_MEM_UNREGISTER flag from the make.mvapich.vapi CFLAGS and recompile MPI and your application. The OpenFabrics device has enhanced memory registration code. Matt On Thu, 31 Jan 2008, Jeff Squyres wrote: > On Jan 31, 2008, at 1:32 PM, David Stuebe wrote: > > > Maybe I don't fully understand all the issues involved but I did > > read through several web sites that discuss the dangers of passing > > temporary arrays to non blocking MPI calls. Is MPI_BCAST non- > > blocking - I assumed that was a blocking call anyway? > > Yes it is; my bad for not noticing that that is what you were asking > about. :-) > > > Again, my concern is that MPI call returns the data on all > > processors as (perhaps, naively) expected, it is later in the > > program that an alloc called on entry to a different subroutine for > > an explicit shape array causes a sig sev. There is further evidence > > that it is an MPI issue because the problem is memory-size > > dependent, and only occurs when run using more than one node, using > > mvapich2.0. MPICH2.0 when I tested that on our cluster which does > > not have infiniband. > > Looking at your example code, I don't understand all of the F90 syntax > to fully appreciate what's going on. It looks like you're passing an > array subset to MPI_BCAST and when that is a non-contiguous buffer, > problems *may* occur later. Is that what you're trying to say? > > My *guesses/speculation* are: > > - perhaps there's an issue with compiler-provided temporary buffers > that are registered by MVAPICH and then later freed by the compiler, > but somehow evade being unregistered by MVAPICH (this could lead to > heap corruption that manifests later) > > - I don't know if fortran compilers are allowed to move buffers at > will, such as in garbage collection and/or memory compacting schemes > (do you?) -- this could lead to a similar problem that I describe above > > Again, these are pure speculation. I really don't know how F90 > compilers work, and I don't know what MVAPICH does with registered > memory caching and/or progress threads, so further speculation is > fairly pointless. :-) > > MVAPICH developers: can you comment on this? > > And just to be sure -- you compiled MVAPICH with the same compilers > that you're using, with the same levels of optimization, etc., right? > > > On Jan 31, 2008 1:06 PM, Jeff Squyres wrote: > > Brian is completely correct - if the F90 compiler chooses to make > > temporary buffers in order to pass array subsections to non-blocking > > MPI functions, there's little that an MPI implementation can do. > > Simply put: MPI requires that when you use non-blocking > > communications, the buffer must be available until you call some > > flavor of MPI_TEST or MPI_WAIT to complete the communication. > > > > I don't know of any way for an MPI implementation to know whether it > > has been handed a temporary buffer (e.g., one that a compiler silently > > created to pass an array subsection). Do you know if there is a way? > > > > > > > > On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote: > > > > > David, > > > > > > The MPI-2 documentation goes into great detail on issues with > > > Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236 > > > ). The conditions you are seeing should be directed to Intel. 
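For reference, one way to sidestep the copy-in/copy-out problem being discussed here is to hand MPI_BCAST an explicitly contiguous buffer instead of the strided pointer. The sketch below is not from the original poster's program; it only reuses the module variables (data, m, MYID, ipt) from the posted test code, and the subroutine name and the buf array are made up for illustration:

      SUBROUTINE BCAST_VIA_CONTIGUOUS_COPY
      USE vars                             ! reuses data, m, MYID, ipt from the test code
      IMPLICIT NONE
      INTEGER, ALLOCATABLE :: buf(:,:)     ! explicit contiguous copy (hypothetical)
      INTEGER :: IERR

      ALLOCATE(buf(M,M))
      IF (MYID == 0) buf = data(1:M,1:M)   ! root fills the contiguous buffer

      ! buf is contiguous, so the compiler has no reason to create a hidden temporary
      CALL MPI_BCAST(buf,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR)
      IF (IERR /= 0) WRITE(ipt,*) "BAD BCAST", MYID

      data(1:M,1:M) = buf                  ! every rank copies the result back
      DEALLOCATE(buf)

      END SUBROUTINE BCAST_VIA_CONTIGUOUS_COPY

The extra copies cost memory and time, but the buffer MPI sees is owned by the caller for the entire call, so none of the compiler-temporary speculation above applies to it.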
> > > > > > > > > Brian > > > > > > > > > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > > > > > >> > > >> Hi again Brian > > >> > > >> I just ran my test code on our cluster using ifort 10.1.011 and > > >> MVAPICH 1.0.1, but the behavior is still the same. > > >> > > >> Have you had a chance to try it on any of your test machines? > > >> > > >> David > > >> > > >> > > >> > > >> > > >> On Jan 25, 2008 12:31 PM, Brian Curtis > >> state.edu> wrote: > > >> David, > > >> > > >> I did some research on this issue and it looks like you have posted > > >> the > > >> bug with Intel. Please let us know what you find out. > > >> > > >> > > >> Brian > > >> > > >> David Stuebe wrote: > > >> > Hi Brian > > >> > > > >> > I downloaded the public release, it seems silly but I am not sure > > >> how to get > > >> > a rev number from the source... there does not seem to be a '- > > >> version' > > >> > option that gives more info, although I did not look to hard. > > >> > > > >> > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on > > >> the > > >> > cluster I will try 1.0.1 and see if it goes away. > > >> > > > >> > In the mean time please let me know if you can recreate the > > >> problem? > > >> > > > >> > David > > >> > > > >> > PS - Just want to make sure you understand my issue, I think it > > >> is a bad > > >> > idea to try and pass a non-contiguous F90 memory pointer, I > > >> should not do > > >> > that... but the way that it breaks has caused me headaches for > > >> weeks now. If > > >> > it reliably caused a sigsev on entering MPI_BCAST that would be > > >> great! As it > > >> > is it is really hard to trace the problem. > > >> > > > >> > > > >> > > > >> > > > >> > On Jan 23, 2008 3:23 PM, Brian Curtis > >> state.edu> wrote: > > >> > > > >> > > > >> >> David, > > >> >> > > >> >> Sorry to hear you are experience problems with the MVAPICH2 > > >> Fortran 90 > > >> >> interface. I will be investigating this issue, but need some > > >> additional > > >> >> information about your setup. What is the exact version of > > >> MVAPICH2 1.0 > > >> >> you are utilizing (daily tarball or release)? Have you tried > > >> MVAPICH2 > > >> >> 1.0.1? > > >> >> > > >> >> Brian > > >> >> > > >> >> David Stuebe wrote: > > >> >> > > >> >>> Hello MVAPICH > > >> >>> I have found a strange bug in MVAPICH2 using IFORT. The > > >> behavior is very > > >> >>> strange indeed - it seems to be related to how ifort deals with > > >> passing > > >> >>> pointers to the MVAPICH FORTRAN 90 INTERFACE. > > >> >>> The MPI call returns successfully, but later calls to a dummy > > >> subroutine > > >> >>> cause a sigsev. > > >> >>> > > >> >>> Please look at the following code: > > >> >>> > > >> >>> > > >> >>> > > >> >> ! > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> > > ===================================================================== > > >> >> > > >> >> ! > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> > > ===================================================================== > > >> >> > > >> >> ! > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> > > ===================================================================== > > >> >> > > >> >>> ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT > > >> >>> ! WRITEN BY: DAVID STUEBE > > >> >>> ! DATE: JAN 23, 2008 > > >> >>> ! > > >> >>> ! 
COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest > > >> >>> ! > > >> >>> ! KNOWN BEHAVIOR: > > >> >>> ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE > > >> OF > > >> >>> ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS > > >> WITHOUT AN > > >> >>> INTERFACE - > > >> >>> ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING > > >> VALID > > >> >>> > > >> >> DATA. > > >> >> > > >> >>> ! > > >> >>> ! COMMENTS: > > >> >>> ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - > > >> SHAME ON > > >> >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR > > >> NOT. > > >> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS > > >> >>> ! EXTREMELY DIFFICULT TO DEBUG! > > >> >>> ! > > >> >>> ! CONDITIONS FOR OCCURANCE: > > >> >>> ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' > > >> >>> ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? > > >> >>> ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... > > >> >>> ! ie Running inside one SMP box does not crash. > > >> >>> ! > > >> >>> ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV > > >> >>> ! RUNNING UNDER MPIEXEC0.82 FOR PBS, > > >> >>> ! ONLY SOME PROCESSES SIGSEV ? > > >> >>> ! > > >> >>> ! ENVIRONMENTAL INFO: > > >> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X > > >> >>> ! SYSTEM: ROCKS 4.2 > > >> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) > > >> >>> ! > > >> >>> ! IFORT/ICC: > > >> >>> ! Intel(R) Fortran Compiler for Intel(R) EM64T-based > > >> applications, > > >> >>> ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 > > >> >>> ! > > >> >>> ! MVAPICH2: mpif90 for mvapich2-1.0 > > >> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0 > > >> >>> --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd -- > > >> enable-f90 > > >> >>> --enable-cxx --disable-romio --without-mpe > > >> >>> ! > > > > >> > > ===================================================================== > > >> >> > > >> >>> Module vars > > >> >>> USE MPI > > >> >>> implicit none > > >> >>> > > >> >>> > > >> >>> integer :: n,m,MYID,NPROCS > > >> >>> integer :: ipt > > >> >>> > > >> >>> integer, allocatable, target :: data(:,:) > > >> >>> > > >> >>> contains > > >> >>> > > >> >>> subroutine alloc_vars > > >> >>> implicit none > > >> >>> > > >> >>> integer Status > > >> >>> > > >> >>> allocate(data(n,m),stat=status) > > >> >>> if (status /=0) then > > >> >>> write(ipt,*) "allocation error" > > >> >>> stop > > >> >>> end if > > >> >>> > > >> >>> data = 0 > > >> >>> > > >> >>> end subroutine alloc_vars > > >> >>> > > >> >>> SUBROUTINE INIT_MPI_ENV(ID,NP) > > >> >>> > > >> >>> > > >> >> ! > > > > >> > > ====================================================================| > > >> >> > > >> >>> ! INITIALIZE MPI > > >> >>> > > >> ENVIRONMENT | > > >> >>> > > >> >>> > > >> >> ! > > > > >> > > ====================================================================| > > >> >> > > >> >>> INTEGER, INTENT(OUT) :: ID,NP > > >> >>> INTEGER IERR > > >> >>> > > >> >>> IERR=0 > > >> >>> > > >> >>> CALL MPI_INIT(IERR) > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID > > >> >>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID > > >> >>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID > > >> >>> > > >> >>> END SUBROUTINE INIT_MPI_ENV > > >> >>> > > >> >>> > > >> >>> > > >> >>> > > >> >> ! 
> > =============================| > > >> >> > > >> >>> SUBROUTINE PSHUTDOWN > > >> >>> > > >> >>> > > >> >>> > > >> >> ! > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> = > > >> > > ====================================================================| > > >> >> > > >> >>> INTEGER IERR > > >> >>> > > >> >>> IERR=0 > > >> >>> CALL MPI_FINALIZE(IERR) > > >> >>> if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID > > >> >>> close(IPT) > > >> >>> STOP > > >> >>> > > >> >>> END SUBROUTINE PSHUTDOWN > > >> >>> > > >> >>> > > >> >>> SUBROUTINE CONTIGUOUS_WORKS > > >> >>> IMPLICIT NONE > > >> >>> INTEGER, pointer :: ptest(:,:) > > >> >>> INTEGER :: IERR, I,J > > >> >>> > > >> >>> > > >> >>> write(ipt,*) "START CONTIGUOUS:" > > >> >>> n=2000 ! Set size here... > > >> >>> m=n+10 > > >> >>> > > >> >>> call alloc_vars > > >> >>> write(ipt,*) "ALLOCATED DATA" > > >> >>> ptest => data(1:N,1:N) > > >> >>> > > >> >>> IF (MYID == 0) ptest=6 > > >> >>> write(ipt,*) "Made POINTER" > > >> >>> > > >> >>> call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > > >> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID > > >> >>> > > >> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > >> >>> > > >> >>> DO I = 1,N > > >> >>> DO J = 1,N > > >> >>> if(data(I,J) /= 6) & > > >> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > > >> >>> END DO > > >> >>> > > >> >>> DO J = N+1,M > > >> >>> if(data(I,J) /= 0) & > > >> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > > >> >>> END DO > > >> >>> > > >> >>> END DO > > >> >>> > > >> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > > >> ITERFACE > > >> >>> ! THAT USE AN EXPLICIT SHAPE ARRAY > > >> >>> write(ipt,*) "CALLING DUMMY1" > > >> >>> CALL DUMMY1 > > >> >>> > > >> >>> write(ipt,*) "CALLING DUMMY2" > > >> >>> call Dummy2(m,n) > > >> >>> > > >> >>> write(ipt,*) "CALLING DUMMY3" > > >> >>> call Dummy3 > > >> >>> write(ipt,*) "FINISHED!" > > >> >>> > > >> >>> END SUBROUTINE CONTIGUOUS_WORKS > > >> >>> > > >> >>> SUBROUTINE NON_CONTIGUOUS_FAILS > > >> >>> IMPLICIT NONE > > >> >>> INTEGER, pointer :: ptest(:,:) > > >> >>> INTEGER :: IERR, I,J > > >> >>> > > >> >>> > > >> >>> write(ipt,*) "START NON_CONTIGUOUS:" > > >> >>> > > >> >>> m=200 ! Set size here - crash is size dependent! > > >> >>> n=m+10 > > >> >>> > > >> >>> call alloc_vars > > >> >>> write(ipt,*) "ALLOCATED DATA" > > >> >>> ptest => data(1:M,1:M) > > >> >>> > > >> >>> !=================================================== > > >> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? > > >> >>> !=================================================== > > >> >>> ! CALL DUMMY1 ! THIS ONE HAS NO EFFECT > > >> >>> ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG > > >> >>> > > >> >>> IF (MYID == 0) ptest=6 > > >> >>> write(ipt,*) "Made POINTER" > > >> >>> > > >> >>> call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > > >> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" > > >> >>> > > >> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > >> >>> > > >> >>> DO I = 1,M > > >> >>> DO J = 1,M > > >> >>> if(data(J,I) /= 6) & > > >> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > > >> >>> END DO > > >> >>> > > >> >>> DO J = M+1,N > > >> >>> if(data(J,I) /= 0) & > > >> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > > >> >>> END DO > > >> >>> END DO > > >> >>> > > >> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > > >> ITERFACE > > >> >>> ! 
THAT USE AN EXPLICIT SHAPE ARRAY > > >> >>> write(ipt,*) "CALLING DUMMY1" > > >> >>> CALL DUMMY1 > > >> >>> > > >> >>> write(ipt,*) "CALLING DUMMY2" > > >> >>> call Dummy2(m,n) ! SHOULD CRASH HERE! > > >> >>> > > >> >>> write(ipt,*) "CALLING DUMMY3" > > >> >>> call Dummy3 > > >> >>> write(ipt,*) "FINISHED!" > > >> >>> > > >> >>> END SUBROUTINE NON_CONTIGUOUS_FAILS > > >> >>> > > >> >>> > > >> >>> End Module vars > > >> >>> > > >> >>> > > >> >>> Program main > > >> >>> USE vars > > >> >>> implicit none > > >> >>> > > >> >>> > > >> >>> CALL INIT_MPI_ENV(MYID,NPROCS) > > >> >>> > > >> >>> ipt=myid+10 > > >> >>> OPEN(ipt) > > >> >>> > > >> >>> > > >> >>> write(ipt,*) "Start memory test!" > > >> >>> > > >> >>> CALL NON_CONTIGUOUS_FAILS > > >> >>> > > >> >>> ! CALL CONTIGUOUS_WORKS > > >> >>> > > >> >>> write(ipt,*) "End memory test!" > > >> >>> > > >> >>> CALL PSHUTDOWN > > >> >>> > > >> >>> END Program main > > >> >>> > > >> >>> > > >> >>> > > >> >>> ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS > > >> >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE > > >> >>> ! DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > > >> >>> > > >> >>> SUBROUTINE DUMMY1 > > >> >>> USE vars > > >> >>> implicit none > > >> >>> real, dimension(m) :: my_data > > >> >>> > > >> >>> write(ipt,*) "m,n",m,n > > >> >>> > > >> >>> write(ipt,*) "DUMMY 1", size(my_data) > > >> >>> > > >> >>> END SUBROUTINE DUMMY1 > > >> >>> > > >> >>> > > >> >>> SUBROUTINE DUMMY2(i,j) > > >> >>> USE vars > > >> >>> implicit none > > >> >>> INTEGER, INTENT(IN) ::i,j > > >> >>> > > >> >>> > > >> >>> real, dimension(i,j) :: my_data > > >> >>> > > >> >>> write(ipt,*) "start: DUMMY 2", size(my_data) > > >> >>> > > >> >>> > > >> >>> END SUBROUTINE DUMMY2 > > >> >>> > > >> >>> SUBROUTINE DUMMY3 > > >> >>> USE vars > > >> >>> implicit none > > >> >>> > > >> >>> > > >> >>> real, dimension(m,n) :: my_data > > >> >>> > > >> >>> > > >> >>> write(ipt,*) "start: DUMMY 3", size(my_data) > > >> >>> > > >> >>> > > >> >>> END SUBROUTINE DUMMY3 > > > -- > Jeff Squyres > Cisco Systems > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Thu Jan 31 16:37:23 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Thu Jan 31 16:37:33 2008 Subject: [mvapich-discuss] SIGSEV in F90: An MPI bug? In-Reply-To: Message-ID: David, Just to add, your code runs without error on the OpenFabrics/Gen2 device of MVAPICH2. This was using the 9.1 ICC/IFORT compiler for the MPI library and application, default settings. I'd suggest updating if at all possible to OpenFabrics, since VAPI is not expected to have additional features and updates. Matt On Thu, 31 Jan 2008, Matthew Koop wrote: > David, > > Can you try running with a lower optimization level or a different > compiler? > > I'd also like to encourage you to update to the OpenFabrics/Gen2 stack > instead of VAPI. If you want to rule out the memory registration issue > that Jeff has suggested you can try it without. Since you are using VAPI > (which is no longer where we add new features), you will have to remove > the -DLAZY_MEM_UNREGISTER flag from the make.mvapich.vapi CFLAGS and > recompile MPI and your application. The OpenFabrics device has enhanced > memory registration code. 
> > Matt > > > On Thu, 31 Jan 2008, Jeff Squyres wrote: > > > On Jan 31, 2008, at 1:32 PM, David Stuebe wrote: > > > > > Maybe I don't fully understand all the issues involved but I did > > > read through several web sites that discuss the dangers of passing > > > temporary arrays to non blocking MPI calls. Is MPI_BCAST non- > > > blocking - I assumed that was a blocking call anyway? > > > > Yes it is; my bad for not noticing that that is what you were asking > > about. :-) > > > > > Again, my concern is that MPI call returns the data on all > > > processors as (perhaps, naively) expected, it is later in the > > > program that an alloc called on entry to a different subroutine for > > > an explicit shape array causes a sig sev. There is further evidence > > > that it is an MPI issue because the problem is memory-size > > > dependent, and only occurs when run using more than one node, using > > > mvapich2.0. MPICH2.0 when I tested that on our cluster which does > > > not have infiniband. > > > > Looking at your example code, I don't understand all of the F90 syntax > > to fully appreciate what's going on. It looks like you're passing an > > array subset to MPI_BCAST and when that is a non-contiguous buffer, > > problems *may* occur later. Is that what you're trying to say? > > > > My *guesses/speculation* are: > > > > - perhaps there's an issue with compiler-provided temporary buffers > > that are registered by MVAPICH and then later freed by the compiler, > > but somehow evade being unregistered by MVAPICH (this could lead to > > heap corruption that manifests later) > > > > - I don't know if fortran compilers are allowed to move buffers at > > will, such as in garbage collection and/or memory compacting schemes > > (do you?) -- this could lead to a similar problem that I describe above > > > > Again, these are pure speculation. I really don't know how F90 > > compilers work, and I don't know what MVAPICH does with registered > > memory caching and/or progress threads, so further speculation is > > fairly pointless. :-) > > > > MVAPICH developers: can you comment on this? > > > > And just to be sure -- you compiled MVAPICH with the same compilers > > that you're using, with the same levels of optimization, etc., right? > > > > > On Jan 31, 2008 1:06 PM, Jeff Squyres wrote: > > > Brian is completely correct - if the F90 compiler chooses to make > > > temporary buffers in order to pass array subsections to non-blocking > > > MPI functions, there's little that an MPI implementation can do. > > > Simply put: MPI requires that when you use non-blocking > > > communications, the buffer must be available until you call some > > > flavor of MPI_TEST or MPI_WAIT to complete the communication. > > > > > > I don't know of any way for an MPI implementation to know whether it > > > has been handed a temporary buffer (e.g., one that a compiler silently > > > created to pass an array subsection). Do you know if there is a way? > > > > > > > > > > > > On Jan 31, 2008, at 12:36 PM, Brian Curtis wrote: > > > > > > > David, > > > > > > > > The MPI-2 documentation goes into great detail on issues with > > > > Fortran-90 bindings (http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node236 > > > > ). The conditions you are seeing should be directed to Intel. 
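One way to test the temporary-buffer guess directly is to compare the address of the underlying array with the address a callee actually receives. The helper below is a sketch only: PRINT_BUF_ADDR is a made-up name, and it is deliberately left without an explicit interface so the actual argument is passed through sequence association, the same situation the MPI-2 Fortran-90 notes referenced above warn about.

      SUBROUTINE PRINT_BUF_ADDR(buf)
      USE vars
      IMPLICIT NONE
      INTEGER :: buf(*)
      INTEGER(KIND=MPI_ADDRESS_KIND) :: addr
      INTEGER :: IERR
      CALL MPI_GET_ADDRESS(buf(1), addr, IERR)
      WRITE(ipt,*) "address seen by the callee:", addr
      END SUBROUTINE PRINT_BUF_ADDR

Called next to the broadcast in NON_CONTIGUOUS_FAILS, roughly like this (addr would need to be declared there with KIND=MPI_ADDRESS_KIND as well):

      CALL MPI_GET_ADDRESS(data(1,1), addr, IERR)
      WRITE(ipt,*) "address of data(1,1):     ", addr
      CALL PRINT_BUF_ADDR(ptest)   ! a different number means a temporary copy was passed

If the two numbers differ for the non-contiguous pointer but match for the contiguous one, the compiler is creating (and later freeing) a temporary behind MPI's back, which would be consistent with the registered-memory speculation above.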
> > > > > > > > > > > > Brian > > > > > > > > > > > > On Jan 31, 2008, at 11:59 AM, David Stuebe wrote: > > > > > > > >> > > > >> Hi again Brian > > > >> > > > >> I just ran my test code on our cluster using ifort 10.1.011 and > > > >> MVAPICH 1.0.1, but the behavior is still the same. > > > >> > > > >> Have you had a chance to try it on any of your test machines? > > > >> > > > >> David > > > >> > > > >> > > > >> > > > >> > > > >> On Jan 25, 2008 12:31 PM, Brian Curtis > > >> state.edu> wrote: > > > >> David, > > > >> > > > >> I did some research on this issue and it looks like you have posted > > > >> the > > > >> bug with Intel. Please let us know what you find out. > > > >> > > > >> > > > >> Brian > > > >> > > > >> David Stuebe wrote: > > > >> > Hi Brian > > > >> > > > > >> > I downloaded the public release, it seems silly but I am not sure > > > >> how to get > > > >> > a rev number from the source... there does not seem to be a '- > > > >> version' > > > >> > option that gives more info, although I did not look to hard. > > > >> > > > > >> > I have not tried MVAPICH 1.0.1, but once I have intel ifort 10 on > > > >> the > > > >> > cluster I will try 1.0.1 and see if it goes away. > > > >> > > > > >> > In the mean time please let me know if you can recreate the > > > >> problem? > > > >> > > > > >> > David > > > >> > > > > >> > PS - Just want to make sure you understand my issue, I think it > > > >> is a bad > > > >> > idea to try and pass a non-contiguous F90 memory pointer, I > > > >> should not do > > > >> > that... but the way that it breaks has caused me headaches for > > > >> weeks now. If > > > >> > it reliably caused a sigsev on entering MPI_BCAST that would be > > > >> great! As it > > > >> > is it is really hard to trace the problem. > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Jan 23, 2008 3:23 PM, Brian Curtis > > >> state.edu> wrote: > > > >> > > > > >> > > > > >> >> David, > > > >> >> > > > >> >> Sorry to hear you are experience problems with the MVAPICH2 > > > >> Fortran 90 > > > >> >> interface. I will be investigating this issue, but need some > > > >> additional > > > >> >> information about your setup. What is the exact version of > > > >> MVAPICH2 1.0 > > > >> >> you are utilizing (daily tarball or release)? Have you tried > > > >> MVAPICH2 > > > >> >> 1.0.1? > > > >> >> > > > >> >> Brian > > > >> >> > > > >> >> David Stuebe wrote: > > > >> >> > > > >> >>> Hello MVAPICH > > > >> >>> I have found a strange bug in MVAPICH2 using IFORT. The > > > >> behavior is very > > > >> >>> strange indeed - it seems to be related to how ifort deals with > > > >> passing > > > >> >>> pointers to the MVAPICH FORTRAN 90 INTERFACE. > > > >> >>> The MPI call returns successfully, but later calls to a dummy > > > >> subroutine > > > >> >>> cause a sigsev. > > > >> >>> > > > >> >>> Please look at the following code: > > > >> >>> > > > >> >>> > > > >> >>> > > > >> >> ! > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> > > > ===================================================================== > > > >> >> > > > >> >> ! > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> > > > ===================================================================== > > > >> >> > > > >> >> ! 
> > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> > > > ===================================================================== > > > >> >> > > > >> >>> ! TEST CODE TO FOR POSSIBLE BUG IN MVAPICH2 COMPILED ON IFORT > > > >> >>> ! WRITEN BY: DAVID STUEBE > > > >> >>> ! DATE: JAN 23, 2008 > > > >> >>> ! > > > >> >>> ! COMPILE WITH: mpif90 -xP mpi_prog.f90 -o xtest > > > >> >>> ! > > > >> >>> ! KNOWN BEHAVIOR: > > > >> >>> ! PASSING A NONE CONTIGUOUS POINTER TO MPI_BCAST CAUSES FAILURE > > > >> OF > > > >> >>> ! SUBROUTINES USING MULTI DIMENSIONAL EXPLICT SHAPE ARRAYS > > > >> WITHOUT AN > > > >> >>> INTERFACE - > > > >> >>> ! EVEN THOUGH THE MPI_BCAST COMPLETES SUCCESUFULLY, RETURNING > > > >> VALID > > > >> >>> > > > >> >> DATA. > > > >> >> > > > >> >>> ! > > > >> >>> ! COMMENTS: > > > >> >>> ! I REALIZE PASSING NON CONTIGUOUS POINTERS IS DANGEROUS - > > > >> SHAME ON > > > >> >>> ! ME FOR MAKING THAT MISTAKE. HOWEVER, IT SHOULD EITHER WORK OR > > > >> NOT. > > > >> >>> ! RETURNING SUCCESSFULLY BUT CAUSING INTERFACE ERRORS LATER IS > > > >> >>> ! EXTREMELY DIFFICULT TO DEBUG! > > > >> >>> ! > > > >> >>> ! CONDITIONS FOR OCCURANCE: > > > >> >>> ! COMPILER MUST OPTIMIZE USING 'VECTORIZATION' > > > >> >>> ! ARRAY MUST BE 'LARGE' -SYSTEM DEPENDENT ? > > > >> >>> ! MUST BE RUN ON MORE THAN ONE NODE TO CAUSE CRASH... > > > >> >>> ! ie Running inside one SMP box does not crash. > > > >> >>> ! > > > >> >>> ! RUNNING UNDER MPD, ALL PROCESSES SIGSEV > > > >> >>> ! RUNNING UNDER MPIEXEC0.82 FOR PBS, > > > >> >>> ! ONLY SOME PROCESSES SIGSEV ? > > > >> >>> ! > > > >> >>> ! ENVIRONMENTAL INFO: > > > >> >>> ! NODES: DELL 1850 3.0GHZ, 2GB RAM, INFINIBAND PCI-EX 4X > > > >> >>> ! SYSTEM: ROCKS 4.2 > > > >> >>> ! gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) > > > >> >>> ! > > > >> >>> ! IFORT/ICC: > > > >> >>> ! Intel(R) Fortran Compiler for Intel(R) EM64T-based > > > >> applications, > > > >> >>> ! Version 9.1 Build 20061101 Package ID: l_fc_c_9.1.040 > > > >> >>> ! > > > >> >>> ! MVAPICH2: mpif90 for mvapich2-1.0 > > > >> >>> ! ./configure --prefix=/usr/local/share/mvapich2/1.0 > > > >> >>> --with-device=osu_ch3:mrail --with-rdma=vapi --with-pm=mpd -- > > > >> enable-f90 > > > >> >>> --enable-cxx --disable-romio --without-mpe > > > >> >>> ! > > > > > > >> > > > ===================================================================== > > > >> >> > > > >> >>> Module vars > > > >> >>> USE MPI > > > >> >>> implicit none > > > >> >>> > > > >> >>> > > > >> >>> integer :: n,m,MYID,NPROCS > > > >> >>> integer :: ipt > > > >> >>> > > > >> >>> integer, allocatable, target :: data(:,:) > > > >> >>> > > > >> >>> contains > > > >> >>> > > > >> >>> subroutine alloc_vars > > > >> >>> implicit none > > > >> >>> > > > >> >>> integer Status > > > >> >>> > > > >> >>> allocate(data(n,m),stat=status) > > > >> >>> if (status /=0) then > > > >> >>> write(ipt,*) "allocation error" > > > >> >>> stop > > > >> >>> end if > > > >> >>> > > > >> >>> data = 0 > > > >> >>> > > > >> >>> end subroutine alloc_vars > > > >> >>> > > > >> >>> SUBROUTINE INIT_MPI_ENV(ID,NP) > > > >> >>> > > > >> >>> > > > >> >> ! > > > > > > >> > > > ====================================================================| > > > >> >> > > > >> >>> ! INITIALIZE MPI > > > >> >>> > > > >> ENVIRONMENT | > > > >> >>> > > > >> >>> > > > >> >> ! 
> > > > > > >> > > > ====================================================================| > > > >> >> > > > >> >>> INTEGER, INTENT(OUT) :: ID,NP > > > >> >>> INTEGER IERR > > > >> >>> > > > >> >>> IERR=0 > > > >> >>> > > > >> >>> CALL MPI_INIT(IERR) > > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_INIT", ID > > > >> >>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR) > > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_RANK", ID > > > >> >>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NP,IERR) > > > >> >>> IF(IERR/=0) WRITE(*,*) "BAD MPI_COMM_SIZE", ID > > > >> >>> > > > >> >>> END SUBROUTINE INIT_MPI_ENV > > > >> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > > >> >> ! > > > =============================| > > > >> >> > > > >> >>> SUBROUTINE PSHUTDOWN > > > >> >>> > > > >> >>> > > > >> >>> > > > >> >> ! > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> = > > > >> > > > ====================================================================| > > > >> >> > > > >> >>> INTEGER IERR > > > >> >>> > > > >> >>> IERR=0 > > > >> >>> CALL MPI_FINALIZE(IERR) > > > >> >>> if(ierr /=0) write(ipt,*) "BAD MPI_FINALIZE", MYID > > > >> >>> close(IPT) > > > >> >>> STOP > > > >> >>> > > > >> >>> END SUBROUTINE PSHUTDOWN > > > >> >>> > > > >> >>> > > > >> >>> SUBROUTINE CONTIGUOUS_WORKS > > > >> >>> IMPLICIT NONE > > > >> >>> INTEGER, pointer :: ptest(:,:) > > > >> >>> INTEGER :: IERR, I,J > > > >> >>> > > > >> >>> > > > >> >>> write(ipt,*) "START CONTIGUOUS:" > > > >> >>> n=2000 ! Set size here... > > > >> >>> m=n+10 > > > >> >>> > > > >> >>> call alloc_vars > > > >> >>> write(ipt,*) "ALLOCATED DATA" > > > >> >>> ptest => data(1:N,1:N) > > > >> >>> > > > >> >>> IF (MYID == 0) ptest=6 > > > >> >>> write(ipt,*) "Made POINTER" > > > >> >>> > > > >> >>> call MPI_BCAST(ptest,N*N,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > > > >> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST", MYID > > > >> >>> > > > >> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > > >> >>> > > > >> >>> DO I = 1,N > > > >> >>> DO J = 1,N > > > >> >>> if(data(I,J) /= 6) & > > > >> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > > > >> >>> END DO > > > >> >>> > > > >> >>> DO J = N+1,M > > > >> >>> if(data(I,J) /= 0) & > > > >> >>> & write(ipt,*) "INCORRECT VALUE!", I,J,data(I,J) > > > >> >>> END DO > > > >> >>> > > > >> >>> END DO > > > >> >>> > > > >> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > > > >> ITERFACE > > > >> >>> ! THAT USE AN EXPLICIT SHAPE ARRAY > > > >> >>> write(ipt,*) "CALLING DUMMY1" > > > >> >>> CALL DUMMY1 > > > >> >>> > > > >> >>> write(ipt,*) "CALLING DUMMY2" > > > >> >>> call Dummy2(m,n) > > > >> >>> > > > >> >>> write(ipt,*) "CALLING DUMMY3" > > > >> >>> call Dummy3 > > > >> >>> write(ipt,*) "FINISHED!" > > > >> >>> > > > >> >>> END SUBROUTINE CONTIGUOUS_WORKS > > > >> >>> > > > >> >>> SUBROUTINE NON_CONTIGUOUS_FAILS > > > >> >>> IMPLICIT NONE > > > >> >>> INTEGER, pointer :: ptest(:,:) > > > >> >>> INTEGER :: IERR, I,J > > > >> >>> > > > >> >>> > > > >> >>> write(ipt,*) "START NON_CONTIGUOUS:" > > > >> >>> > > > >> >>> m=200 ! Set size here - crash is size dependent! > > > >> >>> n=m+10 > > > >> >>> > > > >> >>> call alloc_vars > > > >> >>> write(ipt,*) "ALLOCATED DATA" > > > >> >>> ptest => data(1:M,1:M) > > > >> >>> > > > >> >>> !=================================================== > > > >> >>> ! IF YOU CALL DUMMY2 HERE TOO, THEN EVERYTHING PASSES ??? > > > >> >>> !=================================================== > > > >> >>> ! CALL DUMMY1 ! 
THIS ONE HAS NO EFFECT > > > >> >>> ! CALL DUMMY2 ! THIS ONE 'FIXES' THE BUG > > > >> >>> > > > >> >>> IF (MYID == 0) ptest=6 > > > >> >>> write(ipt,*) "Made POINTER" > > > >> >>> > > > >> >>> call MPI_BCAST(ptest,M*M,MPI_INTEGER,0,MPI_COMM_WORLD,IERR) > > > >> >>> IF(IERR /= 0) WRITE(IPT,*) "BAD BCAST" > > > >> >>> > > > >> >>> write(ipt,*) "BROADCAST Data; a value:",data(1,6) > > > >> >>> > > > >> >>> DO I = 1,M > > > >> >>> DO J = 1,M > > > >> >>> if(data(J,I) /= 6) & > > > >> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > > > >> >>> END DO > > > >> >>> > > > >> >>> DO J = M+1,N > > > >> >>> if(data(J,I) /= 0) & > > > >> >>> & write(ipt,*) "INCORRECT VALUE!",I,J,DATA(I,J) > > > >> >>> END DO > > > >> >>> END DO > > > >> >>> > > > >> >>> ! CALL THREE DIFFERENT EXAMPLES OF SUBROUTINES W/OUT AN > > > >> ITERFACE > > > >> >>> ! THAT USE AN EXPLICIT SHAPE ARRAY > > > >> >>> write(ipt,*) "CALLING DUMMY1" > > > >> >>> CALL DUMMY1 > > > >> >>> > > > >> >>> write(ipt,*) "CALLING DUMMY2" > > > >> >>> call Dummy2(m,n) ! SHOULD CRASH HERE! > > > >> >>> > > > >> >>> write(ipt,*) "CALLING DUMMY3" > > > >> >>> call Dummy3 > > > >> >>> write(ipt,*) "FINISHED!" > > > >> >>> > > > >> >>> END SUBROUTINE NON_CONTIGUOUS_FAILS > > > >> >>> > > > >> >>> > > > >> >>> End Module vars > > > >> >>> > > > >> >>> > > > >> >>> Program main > > > >> >>> USE vars > > > >> >>> implicit none > > > >> >>> > > > >> >>> > > > >> >>> CALL INIT_MPI_ENV(MYID,NPROCS) > > > >> >>> > > > >> >>> ipt=myid+10 > > > >> >>> OPEN(ipt) > > > >> >>> > > > >> >>> > > > >> >>> write(ipt,*) "Start memory test!" > > > >> >>> > > > >> >>> CALL NON_CONTIGUOUS_FAILS > > > >> >>> > > > >> >>> ! CALL CONTIGUOUS_WORKS > > > >> >>> > > > >> >>> write(ipt,*) "End memory test!" > > > >> >>> > > > >> >>> CALL PSHUTDOWN > > > >> >>> > > > >> >>> END Program main > > > >> >>> > > > >> >>> > > > >> >>> > > > >> >>> ! TWO DUMMY SUBROUTINE WITH EXPLICIT SHAPE ARRAYS > > > >> >>> ! DUMMY1 DECLARES A VECTOR - THIS ONE NEVER CAUSES FAILURE > > > >> >>> ! 
DUMMY2 DECLARES AN ARRAY - THIS ONE CAUSES FAILURE > > > >> >>> > > > >> >>> SUBROUTINE DUMMY1 > > > >> >>> USE vars > > > >> >>> implicit none > > > >> >>> real, dimension(m) :: my_data > > > >> >>> > > > >> >>> write(ipt,*) "m,n",m,n > > > >> >>> > > > >> >>> write(ipt,*) "DUMMY 1", size(my_data) > > > >> >>> > > > >> >>> END SUBROUTINE DUMMY1 > > > >> >>> > > > >> >>> > > > >> >>> SUBROUTINE DUMMY2(i,j) > > > >> >>> USE vars > > > >> >>> implicit none > > > >> >>> INTEGER, INTENT(IN) ::i,j > > > >> >>> > > > >> >>> > > > >> >>> real, dimension(i,j) :: my_data > > > >> >>> > > > >> >>> write(ipt,*) "start: DUMMY 2", size(my_data) > > > >> >>> > > > >> >>> > > > >> >>> END SUBROUTINE DUMMY2 > > > >> >>> > > > >> >>> SUBROUTINE DUMMY3 > > > >> >>> USE vars > > > >> >>> implicit none > > > >> >>> > > > >> >>> > > > >> >>> real, dimension(m,n) :: my_data > > > >> >>> > > > >> >>> > > > >> >>> write(ipt,*) "start: DUMMY 3", size(my_data) > > > >> >>> > > > >> >>> > > > >> >>> END SUBROUTINE DUMMY3 > > > > > > -- > > Jeff Squyres > > Cisco Systems > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > From weikuan.yu at gmail.com Thu Jan 31 18:05:50 2008 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Thu Jan 31 18:06:00 2008 Subject: [mvapich-discuss] where can I find similar env setting on mvapich as these three: MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX In-Reply-To: References: Message-ID: <47A2544E.5030503@gmail.com> Hi, Terrence, Thanks for the answers. With the large data volume and compelling application, I got more curious for further information. Here are some comments I have. 1) Though the data volume is big, the program dies at file_open. This means that the larger buffer size, communicators, types are not yet needed. So without three parameters to increase communicator/type/buffer sizes, I would presume it should be safer. 2) Have you configured Panasas support when using MVAPICH? If so, have you seen any error output from the program? Could you please post here? Better if you can provide a core dump or stack trace. 3) Interesting to know that the problem also happened to SGI MPI, and increasing three parameters has solved the problem. Is the problem really the same for both SGI MPI and MVAPICH? In case it is possible for you to share the I/O kernel of your program, that would be very good. --Weikuan Terrence LIAO wrote: > Hi, WeiKuan, > > 1) What does your MPI code do? how does it die? > This is finite difference 3D wave equation solver uses in seismic > depth imaging processing and due to its large file I/O, input file size > in 1~5TB range and output intermediate file in 1~10GB ranges. MPI-IO is > used. The code dies with something like "MPI process abnormal exit.." > right after it call the MPI_file_open(). > > 2) What system you are running with? What file system you are using? > The cluster is AMD Opteron Dual-core with IB and Panasas. Using > PGI7.1 and mvapich 1.0 beta. > > 3) What are the three parameters for? How did they solve your problem? > You can find more info on googling those 3 parameters. They have > effect on how cached (or buffer) memory been used. We think it die on > MPI_file_open when it is trying to allocate buffer memory. Those 3 > parameters increase the buffer size. > > Thank you very much. 
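Since the failure is reported right at MPI_File_open, a stripped-down open/close kernel is often enough to separate an MPI-IO/ROMIO or file-system problem from the application itself. The sketch below is only an illustration: the file name is a placeholder for a path on the Panasas mount, and the access-mode choice is an assumption. Note that the default error handler for file operations is MPI_ERRORS_RETURN, so the ierr value can actually be checked here.

      program mpiio_open_test
      use mpi
      implicit none
      integer :: ierr, rank, fh, amode
      character(len=*), parameter :: fname = 'open_test.dat'   ! placeholder path

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      amode = MPI_MODE_CREATE + MPI_MODE_RDWR
      call MPI_FILE_OPEN(MPI_COMM_WORLD, fname, amode, MPI_INFO_NULL, fh, ierr)
      if (ierr /= MPI_SUCCESS) then
         write(*,*) 'rank', rank, ': MPI_FILE_OPEN failed, ierr =', ierr
      else
         write(*,*) 'rank', rank, ': MPI_FILE_OPEN ok'
         call MPI_FILE_CLOSE(fh, ierr)
      end if

      call MPI_FINALIZE(ierr)
      end program mpiio_open_test

If this already dies the way the application does, the problem is in the MPI-IO/file-system layer (ROMIO build, Panasas support) rather than in communicator, type, or group limits.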
> > -- Terrence > -------------------------------------------------------- > Terrence Liao > TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC > Email: terrence.liao@total.com > From Terrence.LIAO at total.com Thu Jan 31 17:10:38 2008 From: Terrence.LIAO at total.com (Terrence.LIAO@total.com) Date: Thu Jan 31 18:51:03 2008 Subject: [mvapich-discuss] where can I find similar env setting on mvapich as these three: MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX Message-ID: Hi, WeiKuan, 1) What does your MPI code do? how does it die? This is finite difference 3D wave equation solver uses in seismic depth imaging processing and due to its large file I/O, input file size in 1~5TB range and output intermediate file in 1~10GB ranges. MPI-IO is used. The code dies with something like "MPI process abnormal exit.." right after it call the MPI_file_open(). 2) What system you are running with? What file system you are using? The cluster is AMD Opteron Dual-core with IB and Panasas. Using PGI7.1 and mvapich 1.0 beta. 3) What are the three parameters for? How did they solve your problem? You can find more info on googling those 3 parameters. They have effect on how cached (or buffer) memory been used. We think it die on MPI_file_open when it is trying to allocate buffer memory. Those 3 parameters increase the buffer size. Thank you very much. -- Terrence -------------------------------------------------------- Terrence Liao TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC Email: terrence.liao@total.com -----Weikuan Yu wrote: ----- To: Terrence.LIAO@total.com From: Weikuan Yu Date: 01/31/2008 11:31AM cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] where can I find similar env setting on mvapich as these three: MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX Hi, Terrence, Your report is intriguing, hence my questions: 1) What does your MPI code do? how does it die? 2) What system you are running with? What file system you are using? 3) What are the three parameters for? How did they solve your problem? Any detail info? --Weikuan Terrence.LIAO@total.com wrote: > My MPI code die on MPI-IO using mvapich 1.0. On SGI Altix, the problem > was solved by tuning these 3 parameters: > MPI_COMM_MAX, MPI_TYPE_MAX and MPI_GROUP_MAX > However, there are SGI specific, does mvapich have similar parameters? > > Thank you very much. > > -- Terrence > -------------------------------------------------------- > Terrence Liao > TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC > Email: terrence.liao@total.com > > > ------------------------------------------------------------------------ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080131/9c662c34/attachment-0001.html From jbernstein at penguincomputing.com Thu Jan 31 18:55:12 2008 From: jbernstein at penguincomputing.com (Joshua Bernstein) Date: Thu Jan 31 18:58:35 2008 Subject: [mvapich-discuss] On "Got Completion" and IBV_EVENT Errors In-Reply-To: References: Message-ID: <47A25FE0.4020306@penguincomputing.com> Thank you for your response Matthew, Matthew Koop wrote: > Joshua, > > So are you able to run `ibv_rc_pingpong' with a variety of message sizes? > You may want to double-check that the cables between machines are well > connected as well. 
ibv_rc_pingpong seems to work correctly: [root@flatline ~]# ibv_rc_pingpong -i 2 local address: LID 0x0006, QPN 0x050016, PSN 0x55eeb7 remote address: LID 0x0004, QPN 0x100406, PSN 0x07ccc8 8192000 bytes in 0.04 seconds = 1669.28 Mbit/sec 1000 iters in 0.04 seconds = 39.26 usec/iter As a side note, it would be nice if there was some description about what all the ibv_* commands do. For example there is also ibv_srq_pingpong and ibv_uc_pingpong. If there is some documentation about this some place that I missed, I apologize. > With the earlier request you cited, the issue didn't occur for simple > microbenchmarks, only with an application. We have previously seen issues > when fork or system calls are used in applications (due to > incompatibilities with the underlying OpenFabrics drivers). I'm not quite sure I understand the implications of this. Can you elaborate? I see the same behavior with the supplied osu_* codes as well. I should have mentioned this earlier, but we are attempting to move over a pmgr_client plugin from the vapi transport to the ch_gen2 transport that uses bproc (Scyld) for job startup instead of RSH. In this code we do a fork. So I'd be interested to read your elaboration on this. Eventually, we (Penguin Computing) hope to be able to contribute this enhancement upstream. > It seems that your issue is more likely to be a setup issue. What does > ulimit -l report on your compute nodes? It is set to half the available memory on the system, as stated in the MVAPICH docs. > Also, it is unlikely that VIADEV_USE_SHMEM_COLL is causing any issue -- turning off this option > means there is less communication in the init phase (which allows you to > get to the stdout statements). no, no, I agree. In fact, my point was that using that environment variable I was able to get the application to run a bit further. After a bit of playing around, I've gotten the code to run a bit farther and now when the cpi program does an MPI_Bcast, I get a hang, and my old friend: Got completion with error IBV_WC_RETRY_EXC_ERR. *Both* processes call MPI_Bcast, but only *one* of them sees a return from MPI_Bcast (n==100) and subsequently calls MPI_Reduce. -Joshua Bernstein Penguin Computing Software Engineer From schuang at ats.ucla.edu Thu Jan 31 23:35:41 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Thu Jan 31 23:36:14 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: <47A1334B.5030906@ucla.edu> Message-ID: <20080201043541.GA5879@ats.ucla.edu> Hi Wei, We did 2 runs of mpiGraph that you suggested on 48 nodes, with one (1) MPI process per node: mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out The results from the two runs are posted here: http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/ http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/ During the tests, some other users were also running jobs on some of these 48 nodes. Could you please help us interpret these results, if possible? Thank you. Shao-Ching Huang On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote: > Hi Scott, > > We went up to 256 processes (32 nodes) and did not see the problem in few > hundred runs (cpi). Thus, to narrow down the problem, we want to make sure > the fabrics and system setup are ok. To diagnose this, we suggest you > running mpiGraph program from http://sourceforge.net/projects/mpigraph. > This test stresses the interconnects. 
It should fail at a much higher > frequency than simple cpi program if there is a problem with your system > setup. > > Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > My co-worker passed this along... > > > > Yes, the error happens on the cpi.c program too. It happened 2 times > > among the 9 cases I ran. > > > > I was using 128 processes (on 32 4-core nodes). > > > > --- > > > > and another... > > > > It happens for a simple MPI program which just does MPI_Init and > > MPI_Finalize and print out number of processors. It happened for > > anything from 4 nodes (16 processors ) and more. > > > > What environment variables should we look for? > > > > Thanks, > > Scott > > > > wei huang wrote: > > > Hi Scott, > > > > > > On how many processes (and how many nodes) you ran your program? Do you > > > have any environmental variables when you are running the program? Does > > > the error happen on simple test like cpi? > > > > > > Thanks. > > > > > > Regards, > > > Wei Huang > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > Dept. of Computer Science and Engineering > > > Ohio State University > > > OH 43210 > > > Tel: (614)292-8501 > > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > > > >> The low level ibv tests work fine. > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
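For reference, a minimal version of the "MPI_Init, print the number of processors, MPI_Finalize" test mentioned above might look like the following sketch (the program name is arbitrary):

      program mpi_min_test
      use mpi
      implicit none
      integer :: ierr, rank, nprocs

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      write(*,*) 'rank', rank, 'of', nprocs, 'processes'
      call MPI_FINALIZE(ierr)
      end program mpi_min_test

If even this fails intermittently at scale, the problem is in startup and connection setup rather than in any particular communication pattern of the application.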