[mvapich-discuss] MVAPICH2 Invalid communicator errors

Mark Potts potts at hpcapplications.com
Thu Sep 13 14:37:20 EDT 2007


Nathan,
    I'm far from an expert, but when I manually removed the
    -I/usr/include from the icc command line generated by mpicc,
    my problem went away. The mpi.h in your /usr/include in all
    probability does conflict with the one provided in MVAPICH2.
    The harder problem in my case and perhaps yours was finding
    how the /usr/include was getting injected.  Hint: read the
    mpicc script for your intel builds very carefully.
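
    One way to narrow this down, as a rough sketch only: walk the
    -I directories that the wrapper emits, in order, and mark which
    ones hold an mpi.h. The function name list_mpi_headers is made
    up for illustration, and the sketch assumes the include paths
    contain no spaces or glob characters and are written as -Idir
    (not "-I dir").

```shell
#!/bin/sh
# Hypothetical helper: print every -I directory found in a compiler
# command line, in the order given, and mark those containing mpi.h.
list_mpi_headers() {
    # $1 is deliberately unquoted so the command line splits into words.
    # Assumption: no path contains whitespace or glob characters.
    for word in $1; do
        case "$word" in
            -I?*)
                dir=${word#-I}   # strip the leading -I
                if [ -f "$dir/mpi.h" ]; then
                    echo "FOUND   $dir/mpi.h"
                else
                    echo "absent  $dir"
                fi
                ;;
        esac
    done
}
```

    Typical use would be: list_mpi_headers "$(mpicc -show)"
    If more than one FOUND line appears, the first is the likely
    culprit, since the compiler normally searches -I directories in
    the order given. To see where the path is injected, something
    like: grep -n '/usr/include' "$(command -v mpicc)"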
          regards,

Nathan Dauchy wrote:
> We have also run into a very similar sounding problem, with
> mvapich2-0.9.8-2007.08.30 and intel-9.1.
> 
> mpiexec -np 54 /rt1/rtruc/13km_wjet/exec/hybcst_sp
> Fatal error in MPI_Comm_rank: Invalid communicator, error stack:
> MPI_Comm_rank(105): MPI_Comm_rank(comm=0x5b, rank=0x7fbfffc898) failed
> MPI_Comm_rank(64).: Invalid communicatorFatal error in MPI_Comm_rank:
> Invalid communicator, error stack:
> 
> Unfortunately, I haven't found a /usr/include/mpi.h file or other quick
> fix yet.
> 
> Is there supposed to be "-I/usr/include" in the output of "mpicc -show"?
>   Perhaps something went wrong in the build process?  Here is the output
> on the system with the Invalid communicator errors:
> 
> $ mpicc -show
> icc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPID_USE_SEQUENCE_NUMBERS -D_SHMEM_COLL_ -I/usr/include -O2
> -I/opt/mvapich/2-0.9.8-2007.08.30/include
> -L/opt/mvapich/2-0.9.8-2007.08.30/lib -lmpich -L/usr/lib64 -libverbs
> -libumad -lpthread
> 
> Whereas another system using mvapich-0.9.9 works fine and does not have
> "-I/usr/include":
> 
> $ mpicc -show
> icc -DUSE_STDARG -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_UNISTD_H=1
> -DHAVE_STDARG_H=1 -DUSE_STDARG=1 -DMALLOC_RET_VOID=1
> -L/opt/mvapich/0.9.9-1326_single_rail_intel_9.1/lib -lmpich -L/usr/lib64
> -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread -lpthread -lrt
> 
> 
> "ldd" on the executable shows:
> 
>         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a9566c000)
>         libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a95778000)
>         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a9589a000)
>         libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a959af000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x0000002a95b36000)
>         libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a95c39000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000002a95e6d000)
>         libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a95f7b000)
>         /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
> 
> Any other suggestions of where to look?  Hopefully I'm missing something
> obvious!
> 
> Thanks much,
> Nathan
> 
> 
> 
> 
> Shaun Rowland wrote:
>> Mark Potts wrote:
>>> DK and Tom,
>>>    Thanks for your interest.
>>>
>>>    I'm not certain what version info you wanted.  However, one
>>>    designator is "mvapich2-0.9.8-12".  The MVAPICH2 source was
>>>    obtained as part of OFED 1.2.  I'll get more explicit
>>>    version info (OFED and MVAPICH2) if you tell me what and where
>>>    to look.
>> That's the information we were looking for. The -12 is the RPM version
>> number, which has to be incremented whenever there is any SRPM change.
>> That should correspond to the latest MVAPICH2. There's a slightly
>> updated one with OFED 1.2.5.
>>
>>>    We have built MVAPICH (and lots of other packages) with Intel
>>>    compilers and are using them without problem.  However, the
>>>    responses received to date indicate that the problem is not
>>>    a known issue with MVAPICH2 and Intel compilers and thus must
>>>    be a setup issue on our end.
>> It seems we have seen a similar error before on one of the clusters we
>> use. The cluster had a modules system to set up user environments, and
>> it ended up causing a different mpi.h file to be included, instead of
>> the one that was supposed to be used with the package the user expected
>> (from their specific build). You should check your user environment to
>> make sure there's not something like that happening, or that there's no
>> mpi.h in /usr/include or something. Also, check the mpicc command with
>> the -show argument I suggested and check the paths. The type of error we
>> would see was:
>>
>> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
>> MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7ffffff2b308) failed
>> MPI_Comm_size(69).: Invalid communicatorrank 0 in job 4  bm1_48690
>> caused collective abort of all ranks
>>  exit status of rank 0: killed by signal 9
>>
>> which looks like your error.
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
***********************************
 >> Mark J. Potts, PhD
 >>
 >> HPC Applications Inc.
 >> phone: 410-992-8360 Bus
 >>        410-313-9318 Home
 >>        443-418-4375 Cell
 >> email: potts at hpcapplications.com
 >>        potts at excray.com
***********************************

