[mvapich-discuss] caused collective abort of all ranks + signal 9

Sangamesh B forum.san at gmail.com
Fri May 9 03:14:41 EDT 2008


Hi,

My answers are inline:

On Tue, May 6, 2008 at 9:15 PM, Matthew Koop <koop at cse.ohio-state.edu>
wrote:

> Sangamesh,
>
> Can you run any of the included benchmarks with the OFED package? Try
> running the ibv_rc_pingpong test between nodes in your system to first
> make sure the fabric is healthy.
>

Tested some of the OFED pingpong tests on two nodes, ibc12 (as the server) and
ibc11 (as the client). The output is:

Test 1:
[root at compute-0-12 bin]# ./ibv_rc_pingpong
  local address:  LID 0x0006, QPN 0x0c0405, PSN 0xf32003
  remote address: LID 0x0005, QPN 0x3b0405, PSN 0xcde263

[root at compute-0-11 bin]# ./ibv_rc_pingpong ibc12
  local address:  LID 0x0005, QPN 0x3b0405, PSN 0xcde263
  remote address: LID 0x0006, QPN 0x0c0405, PSN 0xf32003
Failed status 9 for wr_id 2

Test 2:

[root at compute-0-12 bin]# ibv_ud_pingpong
  local address:  LID 0x0006, QPN 0x0d0405, PSN 0x9affd4
  remote address: LID 0x0005, QPN 0x3c0405, PSN 0x6223f3

[root at compute-0-11 bin]# ibv_ud_pingpong ibc12
  local address:  LID 0x0005, QPN 0x3c0405, PSN 0x6223f3
  remote address: LID 0x0006, QPN 0x0d0405, PSN 0x9affd4

Test 3:
[root at compute-0-12 bin]# ibv_uc_pingpong
  local address:  LID 0x0006, QPN 0x200405, PSN 0x6f491e
  remote address: LID 0x0005, QPN 0x3e0405, PSN 0xc7e4ff

[root at compute-0-11 bin]# ibv_uc_pingpong ibc12
  local address:  LID 0x0005, QPN 0x3e0405, PSN 0xc7e4ff
  remote address: LID 0x0006, QPN 0x200405, PSN 0x6f491e

I think Test 2 and Test 3 were successful. In each test, I pressed Ctrl-C to
come out of the test.

Are these results successful?
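
Would something like the following be enough to confirm that the ports
themselves are healthy before re-running the pingpong tests? This is only a
rough sketch, assuming the standard OFED userspace tools (e.g. ibv_devinfo)
are installed on both nodes:

  # run on each node; the port should show PORT_ACTIVE and a non-zero lid
  ibv_devinfo | grep -E 'state|lid'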


>
> Also, can you give us some additional information on your setup? What type
> of cards are these?

I don't know much about the cards; this is my first MVAPICH install. I guess
these might be SDR cards.
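
If it helps, would the output of something like the following identify the
cards? (a rough sketch, assuming lspci and the OFED ibv_devinfo tool are
available on the nodes):

  lspci | grep -i -E 'infiniband|mellanox'
  ibv_devinfo | grep -E 'hca_id|vendor_part_id|board_id'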

> Also, how did you set the 'ulimit -l unlimited'.

I set it on the command line.

> We
> suggest placing it in /etc/init.d/sshd on all nodes and restarting sshd
> (and mpd).  This will ensure that the processes started will inherit the
> modified ulimit settings.
>
I did that, but I still get the same error. The program runs, but at the end
it gives this error:

[root at compute-0-12 mvapich2_ofed_intel]#
/opt/mvapich2_ofed_intel/bin/mpiexec -np 2 -env MV2_USE_COALESCE 0 -env
MV2_VBUF_TOTAL_SIZE 9216 ./samplmvaofedintel
Process 0 of 2 executed on compute-0-12.local
Process 1 of 2 executed on compute-0-11.local
rank 1 in job 6  compute-0-12.local_36014   caused collective abort of all
ranks
  exit status of rank 1: killed by signal 9
[root at compute-0-12 mvapich2_ofed_intel]#
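
For reference, the change on each node was made roughly like this (a sketch;
the exact placement inside the Rocks 4.2 sshd init script may differ):

  # /etc/init.d/sshd -- raise the memlock limit before sshd starts, so that
  # processes launched over ssh/mpd inherit it
  start() {
          ulimit -l unlimited      # added line
          ...                      # original body of start() unchanged
  }

  # then, on each node, restart sshd (and re-boot the mpd ring afterwards):
  service sshd restart

To confirm that non-interactive shells really inherit the new limit, this
should print "unlimited":

  ssh compute-0-11 'ulimit -l'

An alternative I have seen suggested is setting the limit in
/etc/security/limits.conf on every node:

  *  soft  memlock  unlimited
  *  hard  memlock  unlimited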

I also tried to build MVAPICH2 with different 'make' files, i.e.
make.mvapich2.vapi and make.mvapich2.udapl. Both of these failed with a
missing-library error:

'CC=/opt/intel/cce/10.1.015/bin/icc'
'CFLAGS=-D_X86_64_ -DONE_SIDED -DUSE_INLINE -DRDMA_FAST_PATH
  -DUSE_HEADER_CACHING -DLAZY_MEM_UNREGISTER -D_SMP_ -D_PCI_EX_ -D_SDR_
  -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -D_SMALL_CLUSTER
  -I/opt/OFED/include -O2'
'CXX=/opt/intel/cce/10.1.015/bin/icpc'
'F77=/opt/intel/fce/10.1.015/bin/ifort'
'F90=/opt/intel/fce/10.1.015/bin/ifort'
'FFLAGS=-L/opt/OFED/lib64'
Running on system: Linux compute-0-12.local 2.6.9-34.0.2.ELsmp #1 SMP
  Fri Jul 7 18:22:55 CDT 2006 x86_64 x86_64 x86_64 GNU/Linux
Executing mpich2prereq in /root/mvapich2-1.0.2/src/mpid/osu_ch3 with mrail
Executing mpich2prereq in
/root/mvapich2-1.0.2/src/mpid/osu_ch3/channels/mrail
sourcing /root/mvapich2-1.0.2/src/pm/mpd/mpich2prereq
sourcing /root/mvapich2-1.0.2/src/pm/mpd/setup_pm
checking for gcc... /opt/intel/cce/10.1.015/bin/icc
checking for C compiler default output file name...
configure: error: C compiler cannot create executables
See `config.log' for more details.
Failure in configuration.

 $ ./configure --prefix=/opt/mvapich2_vapi_intel --with-device=osu_ch3:mrail \
     --with-rdma=vapi --with-pm=mpd --disable-romio --without-mpe

configure:3202: $? = 0
configure:3225: checking for C compiler default output file name
configure:3228: /opt/intel/cce/10.1.015/bin/icc -D_X86_64_ -DONE_SIDED
  -DUSE_INLINE -DRDMA_FAST_PATH -DUSE_HEADER_CACHING -DLAZY_MEM_UNREGISTER
  -D_SMP_ -D_PCI_EX_ -D_SDR_ -DMPIDI_CH3_CHANNEL_RNDV
  -DMPID_USE_SEQUENCE_NUMBERS -D_SMALL_CLUSTER -I/opt/OFED/include -O2
  conftest.c -L/opt/OFED/lib64 -lmtl_common -lvapi -lpthread -lmosal
  -lmpga >&5
ld: cannot find -lmtl_common
configure:3231: $? = 1
configure: failed program was:
| /* confdefs.h.  */

The 'configure' is looking for these libraries inside OFED. I tried to locate
these libraries, but they are not available.

Is this a problem with OFED? (It was installed by building the RPMs.) Or do
the vapi and udapl versions require libraries other than what OFED provides?
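
As a sanity check, this is roughly how I understand one can see which verbs
libraries the OFED installation actually ships (a sketch; /opt/OFED is the
prefix used in the configure line above):

  ls /opt/OFED/lib64 | grep -E 'ibverbs|rdmacm|dapl|vapi|mosal|mpga'

From what I could find, libmtl_common, libvapi, libmosal and libmpga belong to
the older Mellanox VAPI stack rather than to OFED, which would explain why the
vapi build cannot find them.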

And:

[root at compute-0-12 ~]# ofed_info
OFED-1.2.5
ofa_kernel-1.2.5:
...

And:

[root at compute-0-12 ~]# ibhosts
Ca : 0x0002c90200272778 ports 1 "MT25204 InfiniHostLx Mellanox Technologies"
Ca : 0x0002c9020027277c ports 1 "MT25204 InfiniHostLx Mellanox Technologies"
Ca : 0x0002c902002728a4 ports 1 "MT25204 InfiniHostLx Mellanox Technologies"
Ca : 0x0002c9020027276c ports 1 "MT25204 InfiniHostLx Mellanox Technologies"
Ca : 0x0002c90200272798 ports 1 "MT25204 InfiniHostLx Mellanox Technologies"
[root at compute-0-12 ~]#

From the following link I came to know that it is a bug:
http://www.opensubscriber.com/message/general@lists.openfabrics.org/8956919.html

Should I reinstall OFED (without using RPMs) in order to build MVAPICH2 with
vapi or udapl?

What's the solution?

Thanks in advance,
Sangamesh


> Thanks,
>
> Matt
>
> On Tue, 6 May 2008, Sangamesh B wrote:
>
> > Hi all,
> >
> >
> > I have a problem; can someone help me with this issue?
> >
> > The scenario is: we have a Rocks (4.2) cluster with 12 nodes. We newly
> > installed InfiniBand cards in 5 nodes (the master node doesn't have an IB
> > card). The installation of OFED was successful and IPs got assigned.
> >
> > I installed MVAPICH2 on those nodes and set up a password-free
> > environment from compute-0-8 through compute-0-12 (the nodes which have
> > IB cards). So far everything is fine, and the MPD is booting up as well.
> >
> > I've compiled a sample MPI program, tried to execute it, and got the
> > following kind of results:
> >
> > Scenario 1: Using root to execute Hellow.o (compiled with mvapich2-mpicc)
> >
> > [root at compute-0-8 test]# /opt/mvapich2_ps/bin/mpiexec -np 2
> /test/Hellow.o
> > Hello world from process 0 of 2
> > Hello world from process 1 of 2
> > rank 1 in job 8  compute-0-8.local_34399   caused collective abort of all
> > ranks
> >   exit status of rank 1: killed by signal 9
> > rank 0 in job 8  compute-0-8.local_34399   caused collective abort of all
> > ranks
> >   exit status of rank 0: killed by signal 9
> >
> > Scenario 2: Using user id (srinu) to execute the same file.
> >
> > [srinu at compute-0-8 test]$ /opt/mvapich2_ps/bin/mpiexec -np 2
> /test/Hellow.o
> > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> >     This will severely limit memory registrations.
> > libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> >     This will severely limit memory registrations.
> > Fatal error in MPI_Init:
> > Other MPI error, error stack:
> > MPIR_Init_thread(259)....: Initialization failed
> > MPID_Init(102)...........: channel initialization failed
> > MPIDI_CH3_Init(178)......:
> > MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
> > rdma_iba_hca_init(645)...: cannot create cq
> > Fatal error in MPI_Init:
> > Other MPI error, error stack:
> > MPIR_Init_thread(259)....: Initialization failed
> > MPID_Init(102)...........: channel initialization failed
> > MPIDI_CH3_Init(178)......:
> > MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
> > rdma_iba_hca_init(645)...: cannot create cq
> > rank 1 in job 9  compute-0-8.local_34399   caused collective abort of all
> > ranks
> >   exit status of rank 1: return code 1
> >
> > For the 2nd scenario, I found a solution on the net: ulimit -l unlimited.
> > But this then produced the same error as the 1st scenario.
> > Can someone solve this error?
> >
> > Thanks in advance,
> >
> > Sangamesh
> >
>
>