[mvapich-discuss] MVAPICH2 Invalid communicator errors

Sylvain Jeaugey sylvain.jeaugey at bull.net
Fri Sep 14 03:31:40 EDT 2007


Nathan,

You are clearly using the wrong mpi.h. The proof is here:
> MPI_Comm_rank(105): MPI_Comm_rank(comm=0x5b, rank=0x7fbfffc898) failed
We see here that comm=0x5b is 91 in decimal, which is the value of
MPI_COMM_WORLD in MPICH-1-style headers. In MPICH2-style headers,
MPI_COMM_WORLD is 0x44000000.

So, for some reason I don't know (something hidden in the Makefile,
perhaps), you are compiling with the wrong mpi.h.

A quick way to confirm it, though, would be to remove (or move) your
/usr/include/mpi.h, which is interfering.
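
If it helps, here is another quick check (just a rough sketch on my
side, assuming an MPICH-family mpi.h where MPI_Comm is an integer
handle; the file name below is arbitrary). The program prints the
MPI_COMM_WORLD value seen by whatever mpi.h your mpicc actually picks
up. Compile it with the same mpicc as your application and run the
binary directly; it never calls MPI_Init, so no mpiexec is needed.
0x5b means you are still getting an MPICH-1-style header, while
0x44000000 means the MVAPICH2 one.

   /* which_mpih.c - report which mpi.h was used at compile time.
    * Assumes an MPICH-family header where MPI_Comm is an integer
    * handle (true for both MPICH-1 and MPICH2/MVAPICH2). */
   #include <stdio.h>
   #include <mpi.h>

   int main(void)
   {
       /* 0x5b (91)  -> MPICH-1-style mpi.h
        * 0x44000000 -> MPICH2/MVAPICH2-style mpi.h */
       printf("MPI_COMM_WORLD = 0x%x\n", (unsigned int) MPI_COMM_WORLD);
       return 0;
   }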

Hope this helps,
Sylvain

On Thu, 13 Sep 2007, Nathan Dauchy wrote:

> We have also run into a very similar-sounding problem with
> mvapich2-0.9.8-2007.08.30 and intel-9.1.
>
> mpiexec -np 54 /rt1/rtruc/13km_wjet/exec/hybcst_sp
> Fatal error in MPI_Comm_rank: Invalid communicator, error stack:
> MPI_Comm_rank(105): MPI_Comm_rank(comm=0x5b, rank=0x7fbfffc898) failed
> MPI_Comm_rank(64).: Invalid communicatorFatal error in MPI_Comm_rank:
> Invalid communicator, error stack:
>
> Unfortunately, I haven't found a /usr/include/mpi.h file or other
> quick fix yet.
>
> Is there supposed to be "-I/usr/include" in the output of "mpicc -show"?
>  Perhaps something went wrong in the build process?  Here is the output
> on the system with the Invalid communicator errors:
>
> $ mpicc -show
> icc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPID_USE_SEQUENCE_NUMBERS -D_SHMEM_COLL_ -I/usr/include -O2
> -I/opt/mvapich/2-0.9.8-2007.08.30/include
> -L/opt/mvapich/2-0.9.8-2007.08.30/lib -lmpich -L/usr/lib64 -libverbs
> -libumad -lpthread
>
> Whereas another system using mvapich-0.9.9 works fine and does not have
> "-I/usr/include":
>
> $ mpicc -show
> icc -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1
> -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1
> -L/opt/mvapich/0.9.9-1326_single_rail_intel_9.1/lib -lmpich -L/usr/lib64
> -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread -lpthread -lrt
>
>
> "ldd" on the executable shows:
>
>        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a9566c000)
>        libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a95778000)
>        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9589a000)
>        libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a959af000)
>        libdl.so.2 => /lib64/libdl.so.2 (0x0000002a95b36000)
>        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95c39000)
>        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000002a95e6d000)
>        libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95f7b000)
>        /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
>
> Any other suggestions of where to look?  Hopefully I'm missing something
> obvious!
>
> Thanks much,
> Nathan
>
>
>
>
> Shaun Rowland wrote:
>> Mark Potts wrote:
>>> DK and Tom,
>>>    Thanks for your interest.
>>>
>>>    I'm not certain what version info you wanted.  However, one
>>>    designator is "mvapich2-0.9.8-12".  The MVAPICH2 source was
>>>    obtained as part of OFED 1.2.  I'll get more explicit
>>>    version info (OFED and MVAPICH2) if you tell me what and where
>>>    to look.
>>
>> That's the information we were looking for. The -12 is the RPM version
>> number, which has to be incremented whenever there is any SRPM change.
>> That should correspond to the latest MVAPICH2. There's a slightly
>> updated one with OFED 1.2.5.
>>
>>>    We have built MVAPICH (and lots of other packages) with Intel
>>>    compilers and are using them without problem.  However, the
>>>    responses received to date indicate that the problem is not
>>>    a known issue with MVAPICH2 and Intel compilers and thus must
>>>    be a setup issue on our end.
>>
>> It seems we have seen a similar error before on one of the clusters we
>> use. The cluster had a modules system to set up user environments, and
>> it ended up causing a different mpi.h to be included instead of the
>> one that belonged to the build the user actually expected. You should
>> check your user environment to make sure nothing like that is
>> happening, and that there is no stray mpi.h in /usr/include. Also, run
>> the mpicc command with the -show argument I suggested and verify the
>> include and library paths. The type of error we would see was:
>>
>> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
>> MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7ffffff2b308) failed
>> MPI_Comm_size(69).: Invalidcommunicatorrank 0 in job 4  bm1_48690
>> caused collective abort of all ranks
>>  exit status of rank 0: killed by signal 9
>>
>> which looks like your error.
>

