[mvapich-discuss] IB is not loading

Christopher Tanner christopher.tanner at gatech.edu
Wed Apr 16 09:39:59 EDT 2008


Scott - thanks for your help, man. I'm still new to Linux, so the detailed
commands were great. I did use the automated installer, and I did just
the basic OFED 1.3 install. However...

a) 'ibstat' just doesn't exist. I've installed OFED three times now  
and each time 'ibstat' is not created or is not placed in an intuitive  
directory (/usr/bin, /usr/local/bin, etc.)

b) I've confirmed that the modules are NOT loaded - the lsmod
command returned nothing (literally)

c) chkconfig resulted in this:
openibd         0:off   1:off   2:on    3:on    4:on    5:on    6:off
Which I assume to mean the initscripts for run levels 3 and 5 are  
executing. So, I think we're gravy there.

d) service command resulted in this:
Loading Mellanox HCA driver:            [FAILED]
Loading Mellanox MLX4 HCA driver:       [FAILED]
Loading cxgb3 driver:                   [FAILED]
Loading HCA driver and Access Layer:    [FAILED]
Please open an issue in the http://openib.org/bugzilla and attach
/tmp/ib_debug_info.log
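
Results (b) and (c) answer two different questions: chkconfig only says
whether the openibd initscript is enabled for a runlevel, while lsmod says
whether any modules actually loaded. A minimal sketch of checking both from
captured output follows. The helper names are hypothetical (not part of
OFED), and the module-name prefixes are an assumption about a typical OFED
install. Matching on prefixes also avoids false hits such as libata, which a
plain 'grep ib' would match.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of OFED.

# Print the on/off state for one runlevel from a `chkconfig --list` line.
runlevel_state() {
    printf '%s\n' "$1" | awk -v lvl="$2" '{
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[1] == lvl) print kv[2]
        }
    }'
}

# Count InfiniBand-related modules in `lsmod` output read from stdin.
# The prefixes below are an assumption about typical module names.
count_ib_modules() {
    grep -c -E '^(ib_|mlx4_|rdma_|iw_)' || true
}

# Sample data taken from the results reported above.
chk_line='openibd         0:off   1:off   2:on    3:on    4:on    5:on    6:off'
lsmod_out=''    # stands in for the empty lsmod result

echo "openibd at runlevel 3: $(runlevel_state "$chk_line" 3)"
echo "IB modules loaded: $(printf '%s\n' "$lsmod_out" | count_ib_modules)"
```

On a live node the same helpers could be fed with
'chkconfig --list openibd' and '/sbin/lsmod | count_ib_modules'. A result of
"on" together with a count of 0 would suggest the script runs at boot but
the drivers fail to load, which matches the [FAILED] lines in (d).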

Tek recommended burning new firmware onto each of the InfiniBand
cards, but that seems like an arduous process for a relatively new
cluster.
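
Before reflashing anything, it may be worth checking whether the cards are
visible on the PCI bus at all, since firmware is only one possible cause. A
rough sketch, assuming lspci (from pciutils) is installed. The find_hcas
helper and its search keywords are illustrative assumptions, not an OFED
tool.

```shell
#!/bin/sh
# Sketch only: find_hcas is a hypothetical helper, not part of OFED.

# Filter `lspci` output (read from stdin) for lines that look like an
# InfiniBand adapter. The keywords are an assumption about how the
# cards are reported.
find_hcas() {
    grep -i -E 'infiniband|mellanox' || true
}

if command -v lspci >/dev/null 2>&1; then
    lspci | find_hcas
else
    echo "lspci not found (pciutils may not be installed)"
fi
```

If nothing is listed, the problem probably sits below the driver level, and
neither reinstalling OFED nor loading modules by hand is likely to help.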

Is it this hard to get an InfiniBand network running on every cluster,
or am I really missing something?

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner at gatech.edu
-------------------------------------------



On Apr 15, 2008, at 10:41 PM, Scott A. Friedman wrote:
> Hi Chris,
>
> I have been watching your messages and thought I'd send you a note.
>
> If your drivers (kernel modules) are not loading you should try a  
> few simple things - especially if you used the OFED installer.
>
> You should run 'ibstat' first. You normally need to be root to run  
> this but you can also run it with the full path if you are not root.
>
> /usr/sbin/ibstat
>
> You should also confirm that the kernel modules are in fact loaded.
>
> /sbin/lsmod | grep ib
>
> You should see a bunch of ib_blah entries - like ib_uverbs etc.
>
> If none of these work then the modules are probably not loaded at
> all. In that case you should check the initscript (on a
> redhat/fedora/centos type system) as root.
>
> chkconfig --list openibd
>
> It should show that the initscript runs (on) for the run level you  
> are using (typically 3 or 5). If it says it is off then that is your  
> problem - and why the modules are not loaded upon startup.
>
> chkconfig openibd on
> service openibd start
>
> Then try your MPI job again - no need to reboot.
>
> You will also need the subnet manager (opensmd) running on at least
> one node. /usr/sbin/sminfo will show you whether one is running
> somewhere on your IB network - you have to run this as root. If it
> isn't...
>
> chkconfig opensmd on
> service opensmd start
>
> Do this on, say, your head node.
>
> You may also need to set up IPoIB if you are using mvapich with the
> newer rdmacm connection management (which I think the default
> mvapich that comes with OFED uses, and which needs IPoIB in order
> to work).
>
> Let me know how it goes,
> Scott
>
> ----
> Scott A. Friedman, Ph.D
> Computer Scientist
> Research Computing Technologies Group
> UCLA Academic Technology Services
> 310-825-8607
>
>
>
> Christopher Tanner wrote:
>> Sorry to keep hassling everyone. I have received several
>> potential solutions to my problem, but none have worked (or I'm
>> too much of a novice to understand what to do). Thanks for all your
>> help though. Here's another try...
>> I'm pretty sure the IB drivers are not loading and I don't know how  
>> to load them. Here's the error I get when trying to execute the  
>> osu_latency benchmark in mvapich2:
>> libibverbs: Fatal: couldn't read uverbs ABI version.
>> Fatal error in MPI_Init:
>> Other MPI error, error stack:
>> MPIR_Init_thread(259)...........: Initialization failed
>> MPID_Init(102)..................: channel initialization failed
>> MPIDI_CH3_Init(178).............:
>> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
>> rdma_get_control_parameters(432):
>> rdma_open_hca(367)..............: No IB device found
>> rank 3 in job 1  master.cl.ae.gatech.edu_41302   caused collective  
>> abort of all ranks
>> exit status of rank 3: return code 1
>> Matt suggested running 'ibstat', which doesn't exist on my machine.  
>> I'm executing the script on four separate nodes via a machinefile  
>> (not the master), all of which have OFED and mvapich2 installed.
>> So... I'm essentially looking for a way to load the drivers.  
>> Rebooting the master and each node post install didn't work. Anyone  
>> have any thoughts? Thanks!
>> -------------------------------------------
>> Chris Tanner
>> Space Systems Design Lab
>> Georgia Institute of Technology
>> christopher.tanner at gatech.edu
>> -------------------------------------------
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


