[mvapich-discuss] MVAPICH2 Invalid communicator errors
Nathan Dauchy
Nathan.Dauchy at noaa.gov
Thu Sep 13 13:17:55 EDT 2007
We have also run into a very similar-sounding problem, with
mvapich2-0.9.8-2007.08.30 and intel-9.1.
mpiexec -np 54 /rt1/rtruc/13km_wjet/exec/hybcst_sp
Fatal error in MPI_Comm_rank: Invalid communicator, error stack:
MPI_Comm_rank(105): MPI_Comm_rank(comm=0x5b, rank=0x7fbfffc898) failed
MPI_Comm_rank(64).: Invalid communicatorFatal error in MPI_Comm_rank:
Invalid communicator, error stack:
Unfortunately, I haven't found a /usr/include/mpi.h file or other quick
fix yet.
Is there supposed to be "-I/usr/include" in the output of "mpicc -show"?
Perhaps something went wrong in the build process? Here is the output
on the system with the Invalid communicator errors:
$ mpicc -show
icc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
-DMPID_USE_SEQUENCE_NUMBERS -D_SHMEM_COLL_ -I/usr/include -O2
-I/opt/mvapich/2-0.9.8-2007.08.30/include
-L/opt/mvapich/2-0.9.8-2007.08.30/lib -lmpich -L/usr/lib64 -libverbs
-libumad -lpthread
Whereas another system using mvapich-0.9.9 works fine and does not have
"-I/usr/include":
$ mpicc -show
icc -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1
-DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1
-L/opt/mvapich/0.9.9-1326_single_rail_intel_9.1/lib -lmpich -L/usr/lib64
-Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread -lpthread -lrt
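A rough way to compare the two wrappers is to pull the -I directories out
of the "mpicc -show" line, in order, and see which of them actually hold an
mpi.h. The sketch below is only an illustration: the function names are
made up for this example, the compile line is an abbreviated copy of the
first output above, and it does not model how a particular compiler treats
a -I that names a system directory.

```python
import os
import shlex


def include_dirs(compile_line):
    """Return the -I directories from a compiler command line, in order."""
    dirs = []
    tokens = shlex.split(compile_line)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-I" and i + 1 < len(tokens):
            # Form "-I dir": the directory is the next token.
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-I") and len(tok) > 2:
            # Form "-Idir": the directory is attached to the flag.
            dirs.append(tok[2:])
        i += 1
    return dirs


def dirs_with_header(compile_line, header="mpi.h"):
    """Return the -I directories (in order) that contain the given header."""
    return [d for d in include_dirs(compile_line)
            if os.path.isfile(os.path.join(d, header))]


# Abbreviated copy of the failing system's "mpicc -show" output.
SHOW_LINE = (
    "icc -D_EM64T_ -D_SMP_ -I/usr/include -O2 "
    "-I/opt/mvapich/2-0.9.8-2007.08.30/include "
    "-L/opt/mvapich/2-0.9.8-2007.08.30/lib -lmpich -L/usr/lib64 "
    "-libverbs -libumad -lpthread"
)

print("include dirs, in order:", include_dirs(SHOW_LINE))
print("dirs containing mpi.h: ", dirs_with_header(SHOW_LINE))
```

If more than one directory is reported, or none of the reported ones is the
MVAPICH2 include directory, the header picked up at compile time is worth a
closer look.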
"ldd" on the executable shows:
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a9566c000)
libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a95778000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9589a000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a959af000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000002a95b36000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95c39000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000002a95e6d000)
libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95f7b000)
/lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
Any other suggestions of where to look? Hopefully I'm missing something
obvious!
Thanks much,
Nathan
Shaun Rowland wrote:
> Mark Potts wrote:
>> DK and Tom,
>> Thanks for your interest.
>>
>> I'm not certain what version info you wanted. However, one
>> designator is "mvapich2-0.9.8-12". The MVAPICH2 source was
>> obtained as part of OFED 1.2. I'll get more explicit
>> version info (OFED and MVAPICH2) if you tell me what and where
>> to look.
>
> That's the information we were looking for. The -12 is the RPM version
> number, which has to be incremented whenever there is any SRPM change.
> That should correspond to the latest MVAPICH2. There's a slightly
> updated one with OFED 1.2.5.
>
>> We have built MVAPICH (and lots of other packages) with Intel
>> compilers and are using them without problem. However, the
>> responses received to date indicate that the problem is not
>> a known issue with MVAPICH2 and Intel compilers and thus must
>> be a setup issue on our end.
>
> It seems we have seen a similar error before on one of the clusters we
> use. The cluster had a modules system to set up user environments, and
> it ended up causing a different mpi.h file to be included, instead of
> the one that was supposed to be used with the package the user expected
> (from their specific build). You should check your user environment to
> make sure there's not something like that happening, or that there's no
> mpi.h in /usr/include or something. Also, check the mpicc command with
> the -show argument I suggested and check the paths. The type of error we
> would see was:
>
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7ffffff2b308) failed
> MPI_Comm_size(69).: Invalid communicatorrank 0 in job 4 bm1_48690
> caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> which looks like your error.
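
Both error stacks report comm=0x5b, which is 91 in decimal. One possible
way to follow up on the "different mpi.h" idea is to read the
MPI_COMM_WORLD definition out of each candidate header and compare it with
that value. This is a minimal sketch: the function name is invented for the
example, and the two sample header lines are assumptions (what
MPICH1-derived and MPICH2-derived headers are commonly reported to
contain), not something taken from this thread. Check the real files
before drawing conclusions.

```python
import re

# Matches a "#define MPI_COMM_WORLD <something>" line anywhere in the text.
_DEFINE = re.compile(r"^\s*#\s*define\s+MPI_COMM_WORLD\s+(.+?)\s*$",
                     re.MULTILINE)
# First hex or decimal literal inside the definition.
_NUMBER = re.compile(r"0[xX][0-9a-fA-F]+|\d+")


def comm_world_value(header_text):
    """Return the integer used for MPI_COMM_WORLD, or None if not numeric."""
    m = _DEFINE.search(header_text)
    if m is None:
        return None
    n = _NUMBER.search(m.group(1))
    if n is None:
        return None
    return int(n.group(0), 0)


# ASSUMED sample lines, for illustration only.
samples = {
    "MPICH1-style (assumed)": "#define MPI_COMM_WORLD 91\n",
    "MPICH2-style (assumed)": "#define MPI_COMM_WORLD ((MPI_Comm)0x44000000)\n",
}

reported = 0x5b  # the comm= value from the error stack
for label, text in samples.items():
    value = comm_world_value(text)
    print(label, hex(value), "matches error stack:", value == reported)
```

To use it on a real system, pass the contents of each mpi.h found on the
include path, for example comm_world_value(open(path).read()).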