[mvapich-discuss] Problem with mvapich2 on a cluster connected with GigE and IB

Matthew Koop koop at cse.ohio-state.edu
Wed Jul 26 17:03:04 EDT 2006


Salvador,

Sorry, I left out one other detail. In the hosts file, after each hostname,
place ifhn=<the hostname of the interface you wish to use>, e.g.:

n1 ifhn=n1-ib
n2 ifhn=n2-ib
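
Then re-boot the MPD ring with that file. As a quick sketch (assuming your
hosts file is named "hosts" and you are starting all 8 nodes as before):

mpdallexit
mpdboot -n 8 -f hosts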

The communication is definitely running over IB after the program starts
up. If you want to convince yourself that it is using the IB fabric, you
can compile and run the osu_bw or osu_latency tests in the osu_benchmarks
directory of the distribution. You can compare your results with the ones
posted on our webpage --

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ (under "Performance")
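
For example, a quick sketch (assuming the MVAPICH2 mpicc and mpiexec are
in your PATH, the MPD ring is up, and the benchmarks are under
~/mvapich2-0.9.3/osu_benchmarks -- adjust the path to your tree):

cd ~/mvapich2-0.9.3/osu_benchmarks
mpicc osu_bw.c -o osu_bw
mpiexec -n 2 ./osu_bw

If the peak bandwidth is well above anything GigE can deliver (GigE tops
out around 125 MB/s), the traffic is going over the IB fabric.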

As for the error you are seeing, "sched_setaffinity: Bad address", can you
give us a little more information about your setup -- such as the kernel
version, architecture, etc.? It would be especially helpful if you
could send the config-mine.log and make-mine.log. They should be in the
main directory where you compiled MVAPICH2. [ You can send the logs directly
to my address to avoid having large files sent to the whole list. ]
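
For the kernel version and architecture, the output of the following on
the head node and one of the compute nodes would be enough:

uname -a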

Thanks,

Matthew Koop
-
Network-Based Computing Lab
Ohio State University


On Wed, 26 Jul 2006, Salvador Ramirez wrote:

> Matthew,
>
>    Thanks for your answer. Here is the output when I run the
> "cpi" program example that comes with the distribution:
>
> ---------------------
> ~/mvapich2-0.9.3/examples> mpdtrace -l
> newen_2574 (192.168.0.250)   <--- GigE IP addresses
> n2_32945 (192.168.0.2)
> n3_32932 (192.168.0.3)
> n4_32923 (192.168.0.4)
> n1_33014 (192.168.0.1)
> n7_32909 (192.168.0.7)
> n8_32909 (192.168.0.8)
> n6_32911 (192.168.0.6)
>
> ~/mvapich2-0.9.3/examples> mpiexec -n 8 ./cpi
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> sched_setaffinity: Bad address
> Process 0 of 8 is on newen
> Process 4 of 8 is on n1
> Process 1 of 8 is on n2
> Process 3 of 8 is on n4
> Process 2 of 8 is on n3
> Process 6 of 8 is on n8
> Process 5 of 8 is on n7
> Process 7 of 8 is on n6
> pi is approximately 3.1415926544231247, Error is
> 0.0000000008333316
> wall clock time = 0.111762
> ----------------------
>
>    After what you said, I think I should just ignore those
> first messages and conclude that mvapich2 is actually working
> on all of the nodes, right? Anyway, what do those error
> messages mean? I've googled them but only found discussions
> about kernel development.
>
> Thanks again.
> Best regards,
>
> ---sram
>
> Matthew Koop wrote:
> > Salvador,
> >
> > When you run mpdtrace, it reports the hostname of the machine, not the
> > hostname associated with the IP address MPD is listening on. On our
> > systems here, I need to run:
> >
> > mpdboot -n 2 -f hosts --ifhn=d2ib
> >
> > (where d2ib is a hostname that resolves to the IPoIB interface of the
> > machine I am running mpdboot on, and which is also in the hosts file.) You
> > may not need this parameter. You can verify that things are running over IPoIB
> > by using the following command on n2 before and after running mpdboot:
> >
> > netstat -a | grep n1-ib
> >
> > It is very important to note that changing the interface is not likely to
> > change performance (or solve your startup problem). MVAPICH2 only uses the
> > specified interface to exchange a few startup parameters over IP. After
> > exchanging enough information to start up native IB connections, all
> > further communication goes over the native IB layer -- not the IPoIB
> > interface.
> >
> > What is the problem you are experiencing on startup? That will allow us to
> > better debug the problem.
> >
> > Thanks,
> >
> > Matthew Koop
> > -
> > Network-Based Computing Lab
> > Ohio State University
> >
> >
> >
> > On Wed, 26 Jul 2006, Salvador Ramirez wrote:
> >
> >
> >>Hello,
> >>
> >>    I recently downloaded and installed mvapich2 on a
> >>cluster whose nodes are connected by two networks: Gigabit
> >>Ethernet and InfiniBand. Each node therefore has two IP
> >>addresses (one for each network, of course) with corresponding
> >>names like n1 and n1-ib, n2 and n2-ib, etc.
> >>
> >>    For the compilation I selected VAPI and everything
> >>compiled without problems; the installation ended up in
> >>/usr/local/mvapich2. Then I created a hostfile like this:
> >>
> >>n1-ib
> >>n2-ib
> >>...
> >>
> >>    and then ran mpdboot -n 8 -f hostfile. Everything was
> >>fine up to here, but when I checked with mpdtrace -l I saw
> >>that the nodes are n1, n2, n3... with the IP addresses of
> >>the GigE network. So I wonder why mpd chose these addresses
> >>when the names in the hostfile are explicitly listed as
> >>their corresponding IB addresses?
> >>
> >>    Of course this leads to further problems: when I try to
> >>run an MPI program with mpiexec, I receive error messages from
> >>the VAPI library because the addresses are not on the IB network.
> >>
> >>Any help is very appreciated. Thanks.
> >>
> >>---sram
> >>
> >>_______________________________________________
> >>mvapich-discuss mailing list
> >>mvapich-discuss at mail.cse.ohio-state.edu
> >>http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
> >
>



