[mvapich-discuss] How do I start the IB modules?
Bruno Gauthier
bgauthier at terrascale.net
Fri Apr 11 16:06:16 EDT 2008
I guess you need to load your InfiniBand drivers.
lsmod should give you something like this:
Module Size Used by
iscsi_tcp 27904 0
ib_iser 37416 0
libiscsi 29824 2 iscsi_tcp,ib_iser
scsi_transport_iscsi 36240 3 iscsi_tcp,ib_iser,libiscsi
rdma_ucm 16128 0
rdma_cm 36132 2 ib_iser,rdma_ucm
iw_cm 12552 1 rdma_cm
ib_addr 10248 1 rdma_cm
sunrpc 201608 3
dm_mirror 26112 0
dm_mod 64240 1 dm_mirror
button 11424 0
ib_mthca 130052 0
i2c_amd756 9220 0
i2c_core 27648 1 i2c_amd756
ib_ipoib 82288 0
ib_umad 19752 0
ib_ucm 19720 0
ib_uverbs 45776 2 rdma_ucm,ib_ucm
ib_cm 37208 3 rdma_cm,ib_ipoib,ib_ucm
ib_sa 44248 3 rdma_cm,ib_ipoib,ib_cm
ib_mad 40888 4 ib_mthca,ib_umad,ib_cm,ib_sa
ib_core 64128 12 ib_iser,rdma_ucm,rdma_cm,iw_cm,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6 278408 19 ib_ipoib
tg3 115076 0
floppy 66056 0
sr_mod 20644 0
ext3 136464 2
jbd 56232 1 ext3
sata_sil 14216 3
libata 161144 1 sata_sil
usb_storage 72480 0
uhci_hcd 27552 0
ohci_hcd 25220 0
ehci_hcd 36364 0
sd_mod 30592 4
scsi_mod 168056 8 iscsi_tcp,ib_iser,libiscsi,scsi_transport_iscsi,sr_mod,libata,usb_storage,sd_mod
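If it helps, here is a rough sketch of how to compare your own lsmod output against the modules mentioned in this thread. The list in REQUIRED_MODULES and the function name missing_ib_modules are only my choices for illustration; ib_mthca is the HCA driver on my machine and yours may be different.

```shell
#!/bin/sh
# Modules taken from this thread. ib_mthca is the HCA driver on this machine
# (assumption: replace it with the driver for your own adapter).
REQUIRED_MODULES="ib_core ib_mad ib_uverbs ib_umad rdma_cm rdma_ucm ib_mthca"

# Read lsmod-style output on stdin and print each required module that is
# not listed in the first column. The header line is skipped.
missing_ib_modules() {
    loaded=$(awk 'NR > 1 { print $1 }')
    for m in $REQUIRED_MODULES; do
        printf '%s\n' "$loaded" | grep -qx "$m" || printf '%s\n' "$m"
    done
}

# Typical use (the modprobe step needs root):
#   lsmod | missing_ib_modules
#   for m in $(lsmod | missing_ib_modules); do modprobe "$m"; done
```

If the function prints nothing, all of the listed modules are already loaded.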
You may want to refer to your InfiniBand vendor's instructions and/or
the OpenIB instructions for a proper installation.
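Once the modules are loaded, a quick sanity check is to look in sysfs. As far as I know, the "couldn't read uverbs ABI version" message means libibverbs could not read class/infiniband_verbs/abi_version under /sys, which is normally the case when ib_uverbs is not loaded. The sketch below is only an illustration: check_ib_sysfs and the SYSFS_ROOT override are names I made up, not part of any package.

```shell
#!/bin/sh
# Report whether the uverbs ABI file and at least one IB device are visible.
# SYSFS_ROOT is a hypothetical override so the check can be pointed at a
# different tree; by default it looks under /sys.
check_ib_sysfs() {
    root=${SYSFS_ROOT:-/sys}
    abi="$root/class/infiniband_verbs/abi_version"
    if [ -r "$abi" ]; then
        echo "uverbs ABI version: $(cat "$abi")"
    else
        echo "uverbs ABI file missing: is ib_uverbs loaded?"
        return 1
    fi
    # Each HCA known to the kernel shows up as a directory here.
    devs=$(ls "$root/class/infiniband" 2>/dev/null)
    if [ -n "$devs" ]; then
        echo "IB devices: $(echo $devs)"
    else
        echo "no IB devices found: is the HCA driver (e.g. ib_mthca) loaded?"
        return 1
    fi
}
```

If both checks pass, running ibv_devinfo (shipped with libibverbs) should also list the adapter.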
On Fri, 2008-04-11 at 14:58 -0400, Christopher Tanner wrote:
> All -
>
> How do I make sure that the pertinent IB modules are loading (i.e.
> rdma_ucm, ib_uverbs, etc)? I am getting the following error when I try
> to execute the OSU benchmarks:
>
> libibverbs: Fatal: couldn't read uverbs ABI version.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(259)...........: Initialization failed
> MPID_Init(102)..................: channel initialization failed
> MPIDI_CH3_Init(178).............:
> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
> rdma_get_control_parameters(432):
> rdma_open_hca(367)..............: No IB device found
> rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective
> abort of all ranks
> exit status of rank 0: return code 1
>
> -------------------------------------------
> Chris Tanner
> Space Systems Design Lab
> Georgia Institute of Technology
> christopher.tanner at gatech.edu
> -------------------------------------------
>
>
>
> On Apr 10, 2008, at 1:49 PM, wei huang wrote:
> > Hi Chris,
> >
> > You have to make sure related kernel modules are loaded (including
> > rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Thu, 10 Apr 2008, Christopher Tanner wrote:
> >
> >> Ok Wei -
> >>
> >> Even though I've copied the libib* libraries from the master node to
> >> all of the other nodes and included the /usr/local/lib directory in
> >> the LD_LIBRARY_PATH, it seems that osu_latency still cannot find
> >> libibverbs.so.1. I'm kind of stuck... Any thoughts?
> >>
> >> Also, whenever I try to execute osu_latency using just one core on the
> >> master node (mpiexec -n 1 ./osu_latency), I receive the following error:
> >>
> >> libibverbs: Fatal: couldn't read uverbs ABI version.
> >> Fatal error in MPI_Init:
> >> Other MPI error, error stack:
> >> MPIR_Init_thread(259)...........: Initialization failed
> >> MPID_Init(102)..................: channel initialization failed
> >> MPIDI_CH3_Init(178).............:
> >> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
> >> rdma_get_control_parameters(432):
> >> rdma_open_hca(367)..............: No IB device found
> >> rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective
> >> abort of all ranks
> >> exit status of rank 0: return code 1
> >>
> >> Does this output help solve the other problem?
> >>
> >> -------------------------------------------
> >> Chris Tanner
> >> Space Systems Design Lab
> >> Georgia Institute of Technology
> >> christopher.tanner at gatech.edu
> >> -------------------------------------------
> >>
> >>
> >>
> >> On Apr 10, 2008, at 11:53 AM, wei huang wrote:
> >>>
> >>> Do you see the same error?
> >>>
> >>> Try:
> >>> export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
> >>>
> >>> Regards,
> >>> Wei Huang
> >>>
> >>> 774 Dreese Lab, 2015 Neil Ave,
> >>> Dept. of Computer Science and Engineering
> >>> Ohio State University
> >>> OH 43210
> >>> Tel: (614)292-8501
> >>>
> >>>
> >>> On Thu, 10 Apr 2008, Christopher Tanner wrote:
> >>>
> >>>> Thanks Wei. Of course, the problem isn't solved yet...
> >>>>
> >>>> So I found the file in the /usr/local/lib64 directory on the master
> >>>> node only. I copied the file to the rest of the nodes to the
> >>>> /usr/local/lib64 directory and included the directory in my path. When I
> >>>> tried to execute the osu_latency program, it gave me the same error. A
> >>>> 'which librdmacm.so.1' command reveals that it can indeed find the
> >>>> library.
> >>>>
> >>>> Any clues? Or perhaps, any other ways to determine if the Infiniband
> >>>> is working?
> >>>>
> >>>> -------------------------------------------
> >>>> Chris Tanner
> >>>> Space Systems Design Lab
> >>>> Georgia Institute of Technology
> >>>> christopher.tanner at gatech.edu
> >>>> -------------------------------------------
> >>>>
> >>>>
> >>>>
> >>>> On Apr 10, 2008, at 11:18 AM, wei huang wrote:
> >>>>> Hi Chris,
> >>>>>
> >>>>> It seems that some ib libraries are not in your default path. You may
> >>>>> need to explicitly export the path to ib library in your environmental
> >>>>> variables (bash profile or similar places). To find where those
> >>>>> libraries are, you may try to see /etc/infiniband/info file. Or you can
> >>>>> ask your system administrator about the path.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> Regards,
> >>>>> Wei Huang
> >>>>>
> >>>>> 774 Dreese Lab, 2015 Neil Ave,
> >>>>> Dept. of Computer Science and Engineering
> >>>>> Ohio State University
> >>>>> OH 43210
> >>>>> Tel: (614)292-8501
> >>>>>
> >>>>>
> >>>>> On Thu, 10 Apr 2008, Dhabaleswar Panda wrote:
> >>>>>
> >>>>>> ---------- Forwarded message ----------
> >>>>>> Date: Wed, 9 Apr 2008 20:01:00 -0400
> >>>>>> From: Christopher Tanner <christopher.tanner at gatech.edu>
> >>>>>> To: mvapich-discuss at cse.ohio-state.edu
> >>>>>> Subject: [mvapich-discuss] Running latency tests
> >>>>>>
> >>>>>> All -
> >>>>>>
> >>>>>> I believe I am gravy with the mvapich2 install so now I'm trying to
> >>>>>> run the latency tests to see if it's really working. But, I'm a dummy
> >>>>>> and can't get it to work. Here's what I've done so far:
> >>>>>>
> >>>>>> a) Initiated a mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh -n 16
> >>>>>> -1). I have multiple processors, each with multiple cores on each
> >>>>>> node, thus the '-1'.
> >>>>>> b) Compiled osu_latency.c using mpicc (to an executable called
> >>>>>> osu_latency)
> >>>>>> c) Tried to execute the compiled file via 'mpiexec -machinefile
> >>>>>> machine.list -n 16 ./osu_latency'
> >>>>>>
> >>>>>> I receive the following error (16 times naturally):
> >>>>>> ./osu_latency: error while loading shared libraries: librdmacm.so.1:
> >>>>>> cannot open shared object file: No such file or directory
> >>>>>>
> >>>>>> I don't know where this file would be -- it's not in the /usr/lib with
> >>>>>> all of the other *.so.* files.
> >>>>>> Any thoughts? Thanks.
> >>>>>>
> >>>>>> -------------------------------------------
> >>>>>> Chris Tanner
> >>>>>> Space Systems Design Lab
> >>>>>> Georgia Institute of Technology
> >>>>>> christopher.tanner at gatech.edu
> >>>>>> -------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Apr 9, 2008, at 2:17 PM, Matthew Koop wrote:
> >>>>>>> Hi Fred,
> >>>>>>>
> >>>>>>> If InfiniBand is not working then the job will not run. There is
> >>>>>>> currently no method by which it will fall back to TCP/IP.
> >>>>>>>
> >>>>>>> Does this answer your question?
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>> On Wed, 9 Apr 2008, Stecher, Fred wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> When I installed MVAPICH, I used the default. If Infiniband is not
> >>>>>>>> working will my executable still run?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Fred
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> mvapich-discuss mailing list
> >>>>>>> mvapich-discuss at cse.ohio-state.edu
> >>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
>