[mvapich-discuss] How do I start the IB modules?

Bruno Gauthier bgauthier at terrascale.net
Fri Apr 11 16:06:16 EDT 2008


I guess you need to load your InfiniBand drivers.

lsmod should give you something like this:

Module                  Size  Used by
iscsi_tcp              27904  0 
ib_iser                37416  0 
libiscsi               29824  2 iscsi_tcp,ib_iser
scsi_transport_iscsi    36240  3 iscsi_tcp,ib_iser,libiscsi
rdma_ucm               16128  0 
rdma_cm                36132  2 ib_iser,rdma_ucm
iw_cm                  12552  1 rdma_cm
ib_addr                10248  1 rdma_cm
sunrpc                201608  3 
dm_mirror              26112  0 
dm_mod                 64240  1 dm_mirror
button                 11424  0 
ib_mthca              130052  0 
i2c_amd756              9220  0 
i2c_core               27648  1 i2c_amd756
ib_ipoib               82288  0 
ib_umad                19752  0 
ib_ucm                 19720  0 
ib_uverbs              45776  2 rdma_ucm,ib_ucm
ib_cm                  37208  3 rdma_cm,ib_ipoib,ib_ucm
ib_sa                  44248  3 rdma_cm,ib_ipoib,ib_cm
ib_mad                 40888  4 ib_mthca,ib_umad,ib_cm,ib_sa
ib_core                64128  12 ib_iser,rdma_ucm,rdma_cm,iw_cm,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6                  278408  19 ib_ipoib
tg3                   115076  0 
floppy                 66056  0 
sr_mod                 20644  0 
ext3                  136464  2 
jbd                    56232  1 ext3
sata_sil               14216  3 
libata                161144  1 sata_sil
usb_storage            72480  0 
uhci_hcd               27552  0 
ohci_hcd               25220  0 
ehci_hcd               36364  0 
sd_mod                 30592  4 
scsi_mod              168056  8 iscsi_tcp,ib_iser,libiscsi,scsi_transport_iscsi,sr_mod,libata,usb_storage,sd_mod

You might want to refer to your InfiniBand manufacturer's instructions
and/or the OpenIB instructions for a proper installation.
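
For what it's worth, the check can be scripted. The following is only a
minimal sketch, not an official OpenIB or MVAPICH tool: it looks in
/proc/modules (the list that lsmod formats) for the three modules named in
this thread, then checks for the sysfs file that libibverbs reads the uverbs
ABI version from. The function name missing_modules and the MODULES_FILE
override are made up for illustration, and ib_mthca is only the right driver
for the HCA in the listing above; other adapters use a different driver
module.

```shell
#!/bin/sh
# Sketch: report which of the InfiniBand modules named in this thread are
# not loaded. missing_modules and MODULES_FILE are illustrative names only.

missing_modules() {
    # $1: file in /proc/modules format; remaining args: module names
    modules_file=$1
    shift
    for mod in "$@"; do
        # The first field of each line is the module name; later fields
        # (size, use count, dependants) must not count as a match.
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' \
                "$modules_file" 2>/dev/null; then
            echo "$mod"
        fi
    done
}

# ib_mthca is an assumption taken from the listing above; substitute the
# driver module for your own adapter.
missing=$(missing_modules "${MODULES_FILE:-/proc/modules}" \
    ib_mthca ib_uverbs rdma_ucm)

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not loaded:" $missing
else
    echo "All listed modules are loaded."
fi

# libibverbs prints "couldn't read uverbs ABI version" when it cannot read
# this file, which the ib_uverbs module provides once loaded.
abi=/sys/class/infiniband_verbs/abi_version
if [ -r "$abi" ]; then
    echo "uverbs ABI version: $(cat "$abi")"
else
    echo "$abi is not readable; ib_uverbs is probably not loaded"
fi
```

Any module reported as missing can be loaded by root with modprobe, for
example 'modprobe ib_uverbs'; whether it then loads at boot depends on how
the InfiniBand stack was installed.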


On Fri, 2008-04-11 at 14:58 -0400, Christopher Tanner wrote:
> All -
> 
> How do I make sure that the pertinent IB modules are loading (i.e.  
> rdma_ucm, ib_uverbs, etc)? I am getting the following error when I try  
> to execute the OSU benchmarks:
> 
> libibverbs: Fatal: couldn't read uverbs ABI version.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(259)...........: Initialization failed
> MPID_Init(102)..................: channel initialization failed
> MPIDI_CH3_Init(178).............:
> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
> rdma_get_control_parameters(432):
> rdma_open_hca(367)..............: No IB device found
> rank 0 in job 15  master.cl.ae.gatech.edu_42042   caused collective
> abort of all ranks
> exit status of rank 0: return code 1
> 
> -------------------------------------------
> Chris Tanner
> Space Systems Design Lab
> Georgia Institute of Technology
> christopher.tanner at gatech.edu
> -------------------------------------------
> 
> 
> 
> On Apr 10, 2008, at 1:49 PM, wei huang wrote:
> > Hi Chris,
> >
> > You have to make sure related kernel modules are loaded (including
> > rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Thu, 10 Apr 2008, Christopher Tanner wrote:
> >
> >> Ok Wei -
> >>
> >> Even though I've copied the libib* libraries from the master node to
> >> all of the other nodes and included the /usr/local/lib directory in
> >> the LD_LIBRARY_PATH, it seems that osu_latency still cannot find
> >> libibverbs.so.1. I'm kind of stuck... Any thoughts?
> >>
> >> Also, whenever I try to execute osu_latency using just one core on the
> >> master node (mpiexec -n 1 ./osu_latency), I receive the following error:
> >>
> >> libibverbs: Fatal: couldn't read uverbs ABI version.
> >> Fatal error in MPI_Init:
> >> Other MPI error, error stack:
> >> MPIR_Init_thread(259)...........: Initialization failed
> >> MPID_Init(102)..................: channel initialization failed
> >> MPIDI_CH3_Init(178).............:
> >> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
> >> rdma_get_control_parameters(432):
> >> rdma_open_hca(367)..............: No IB device found
> >> rank 0 in job 15  master.cl.ae.gatech.edu_42042   caused collective
> >> abort of all ranks
> >>   exit status of rank 0: return code 1
> >>
> >> Does this output help solve the other problem?
> >>
> >> -------------------------------------------
> >> Chris Tanner
> >> Space Systems Design Lab
> >> Georgia Institute of Technology
> >> christopher.tanner at gatech.edu
> >> -------------------------------------------
> >>
> >>
> >>
> >> On Apr 10, 2008, at 11:53 AM, wei huang wrote:
> >>>
> >>> Do you see the same error?
> >>>
> >>> Try:
> >>> export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
> >>>
> >>> Regards,
> >>> Wei Huang
> >>>
> >>> 774 Dreese Lab, 2015 Neil Ave,
> >>> Dept. of Computer Science and Engineering
> >>> Ohio State University
> >>> OH 43210
> >>> Tel: (614)292-8501
> >>>
> >>>
> >>> On Thu, 10 Apr 2008, Christopher Tanner wrote:
> >>>
> >>>> Thanks Wei. Of course, the problem isn't solved yet...
> >>>>
> >>>> So I found the file in the /usr/local/lib64 directory on the master
> >>>> node only. I copied the file to the rest of the nodes to the
> >>>> /usr/local/lib64 directory and included the directory in my path.
> >>>> When I tried to execute the osu_latency program, it gave me the same
> >>>> error. A 'which librdmacm.so.1' command reveals that it can indeed
> >>>> find the library.
> >>>>
> >>>> Any clues? Or perhaps, any other ways to determine if the Infiniband
> >>>> is working?
> >>>>
> >>>> -------------------------------------------
> >>>> Chris Tanner
> >>>> Space Systems Design Lab
> >>>> Georgia Institute of Technology
> >>>> christopher.tanner at gatech.edu
> >>>> -------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>> On Apr 10, 2008, at 11:18 AM, wei huang wrote:
> >>>>> Hi Chris,
> >>>>>
> >>>>> It seems that some IB libraries are not in your default path. You may
> >>>>> need to explicitly export the path to the IB libraries in your
> >>>>> environment variables (bash profile or similar places). To find where
> >>>>> those libraries are, you may try looking at the /etc/infiniband/info
> >>>>> file. Or you can ask your system administrator about the path.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> Regards,
> >>>>> Wei Huang
> >>>>>
> >>>>> 774 Dreese Lab, 2015 Neil Ave,
> >>>>> Dept. of Computer Science and Engineering
> >>>>> Ohio State University
> >>>>> OH 43210
> >>>>> Tel: (614)292-8501
> >>>>>
> >>>>>
> >>>>> On Thu, 10 Apr 2008, Dhabaleswar Panda wrote:
> >>>>>
> >>>>>> ---------- Forwarded message ----------
> >>>>>> Date: Wed, 9 Apr 2008 20:01:00 -0400
> >>>>>> From: Christopher Tanner <christopher.tanner at gatech.edu>
> >>>>>> To: mvapich-discuss at cse.ohio-state.edu
> >>>>>> Subject: [mvapich-discuss] Running latency tests
> >>>>>>
> >>>>>> All -
> >>>>>>
> >>>>>> I believe I am gravy with the mvapich2 install so now I'm trying to
> >>>>>> run the latency tests to see if it's really working. But, I'm a dummy
> >>>>>> and can't get it to work. Here's what I've done so far:
> >>>>>>
> >>>>>> a) Initiated a mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh -n 16
> >>>>>> -1). I have multiple processors, each with multiple cores on each
> >>>>>> node, thus the '-1'.
> >>>>>> b) Compiled osu_latency.c using mpicc (to an executable called
> >>>>>> osu_latency)
> >>>>>> c) Tried to execute the compiled file via 'mpiexec -machinefile
> >>>>>> machine.list -n 16 ./osu_latency'
> >>>>>>
> >>>>>> I receive the following error (16 times naturally):
> >>>>>> ./osu_latency: error while loading shared libraries: librdmacm.so.1:
> >>>>>> cannot open shared object file: No such file or directory
> >>>>>>
> >>>>>> I don't know where this file would be -- it's not in the /usr/lib
> >>>>>> with all of the other *.so.* files.
> >>>>>> Any thoughts? Thanks.
> >>>>>>
> >>>>>> -------------------------------------------
> >>>>>> Chris Tanner
> >>>>>> Space Systems Design Lab
> >>>>>> Georgia Institute of Technology
> >>>>>> christopher.tanner at gatech.edu
> >>>>>> -------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Apr 9, 2008, at 2:17 PM, Matthew Koop wrote:
> >>>>>>> Hi Fred,
> >>>>>>>
> >>>>>>> If InfiniBand is not working then the job will not run. There is
> >>>>>>> currently no method by which it will fall back to TCP/IP.
> >>>>>>>
> >>>>>>> Does this answer your question?
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>> On Wed, 9 Apr 2008, Stecher, Fred wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> When I installed MVAPICH, I used the default. If Infiniband is not
> >>>>>>>> working will my executable still run?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Fred
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> mvapich-discuss mailing list
> >>>>>>> mvapich-discuss at cse.ohio-state.edu
> >>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss