[mvapich-discuss] Problems running MPI jobs with large (?) numbers of processors
Webb, Michael
Michael.Webb at atk.com
Fri Jan 19 15:50:43 EST 2007
Sayantan,
I will check the C code ...
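For reference, the trivial test case discussed below amounts to something
like the following (a minimal sketch reconstructed from the descriptions in
this thread; the actual sample program may differ):

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char* argv[])
    {
        // On large runs the job dies somewhere in here, before any output.
        MPI::Init(argc, argv);

        int rank = MPI::COMM_WORLD.Get_rank();
        int size = MPI::COMM_WORLD.Get_size();
        std::cout << "Hello from rank " << rank
                  << " of " << size << std::endl;

        MPI::Finalize();   // the call that was originally omitted
        return 0;
    }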
Re: the libraries; I'm not sure which libraries you mean, but I checked
those in ../mvapich-0.9.7/lib, and I noticed the following.
Our system is set up with a "head" node, on which all users are
supposed to operate (compile, submit jobs, etc.). The lib directory of my
head node is:
drwxr-xr-x 3 root root 4096 Aug 3 10:58 .
drwxr-xr-x 8 root root 4096 Aug 3 10:57 ..
-rw-r--r-- 1 root root 294712 Aug 3 10:55 libfmpich.a
-rwxr-xr-x 1 root root 104417 Aug 3 10:56 libfmpich.so.1.0
-rw-r--r-- 1 root root 2735490 Aug 3 10:55 libmpich.a
-rwxr-xr-x 1 root root 1345255 Aug 3 10:56 libmpich.so.1.0
-rw-r--r-- 1 root root 1277294 Aug 3 10:56 libmpichf90.a
-rw-r--r-- 1 root root 6724 Aug 3 10:56 libmpichf90nc.a
-rw-r--r-- 1 root root 4422 Aug 3 10:56 libmpichfarg.a
-rw-r--r-- 1 root root 11118 Aug 3 10:55 libmpichfsup.a
-rw-r--r-- 1 root root 556860 Aug 3 10:56 libpmpich++.a
lrwxrwxrwx 1 root root 10 Aug 3 10:58 libpmpich.a -> libmpich.a
-rwxr-xr-x 1 root root 1345255 Aug 3 10:56 libpmpich.so.1.0
lrwxrwxrwx 1 root root 15 Aug 3 10:58 libtvmpich.so -> libtvmpich.so.1
lrwxrwxrwx 1 root root 17 Aug 3 10:58 libtvmpich.so.1 -> libtvmpich.so.1.0
-rwxr-xr-x 1 root root 92455 Aug 3 10:55 libtvmpich.so.1.0
drwxr-xr-x 2 root root 4096 Aug 3 10:58 shared
I checked a few "worker" nodes (10 of the 100+); they all have lib
directories that look like this:
drwxr-xr-x 3 root root 4096 Sep 4 09:48 .
drwxr-xr-x 8 root root 4096 Aug 3 10:57 ..
-rw-r--r-- 1 root root 294712 Aug 3 10:55 libfmpich.a
-rwxr-xr-x 1 root root 104417 Aug 3 10:56 libfmpich.so.1.0
-rw-r--r-- 1 root root 2735490 Aug 3 10:55 libmpich.a
-rwxr-xr-x 1 root root 1345255 Aug 3 10:56 libmpich.so.1.0
-rw-r--r-- 1 root root 1277294 Aug 3 10:56 libmpichf90.a
-rw-r--r-- 1 root root 6724 Aug 3 10:56 libmpichf90nc.a
-rw-r--r-- 1 root root 4422 Aug 3 10:56 libmpichfarg.a
-rw-r--r-- 1 root root 11118 Aug 3 10:55 libmpichfsup.a
-rw-r--r-- 1 root root 556860 Aug 3 10:56 libpmpich++.a
lrwxrwxrwx 1 root root 10 Sep 4 09:48 libpmpich.a -> libmpich.a
-rwxr-xr-x 1 root root 1345255 Aug 3 10:56 libpmpich.so.1.0
lrwxrwxrwx 1 root root 15 Sep 4 09:48 libtvmpich.so -> libtvmpich.so.1
lrwxrwxrwx 1 root root 17 Sep 4 09:48 libtvmpich.so.1 -> libtvmpich.so.1.0
-rwxr-xr-x 1 root root 92455 Aug 3 10:55 libtvmpich.so.1.0
drwxr-xr-x 2 root root 4096 Sep 4 09:48 shared
Notice that all the libraries have the same dates and sizes (the dates
on some of the links are different, but I don't see how that would
matter).
Also, the /lib/shared and /bin directory structures on the nodes I
checked match those of the head node.
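On the comm-backbone question (see my earlier message quoted below), and
short of running the full Intel MPI benchmark suite Sayantan suggests, I
gather a simple point-to-point test along these lines would exercise the
interconnect directly (a hypothetical sketch, not code from this thread):

    #include <mpi.h>
    #include <iostream>
    #include <vector>

    // Ping-pong a 1 MB buffer between rank 0 and every other rank.
    int main(int argc, char* argv[])
    {
        MPI::Init(argc, argv);
        int rank = MPI::COMM_WORLD.Get_rank();
        int size = MPI::COMM_WORLD.Get_size();
        std::vector<char> buf(1 << 20);

        if (rank == 0) {
            for (int peer = 1; peer < size; ++peer) {
                MPI::COMM_WORLD.Send(&buf[0], buf.size(), MPI::CHAR, peer, 0);
                MPI::COMM_WORLD.Recv(&buf[0], buf.size(), MPI::CHAR, peer, 1);
                std::cout << "rank 0 <-> rank " << peer << " OK" << std::endl;
            }
        } else {
            MPI::COMM_WORLD.Recv(&buf[0], buf.size(), MPI::CHAR, 0, 0);
            MPI::COMM_WORLD.Send(&buf[0], buf.size(), MPI::CHAR, 0, 1);
        }

        MPI::Finalize();
        return 0;
    }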
Michael
> -----Original Message-----
> From: Sayantan Sur [mailto:surs at cse.ohio-state.edu]
> Sent: Friday, January 19, 2007 1:20 PM
> To: Webb, Michael
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Problems running MPI jobs with
> large (?) numbers of processors
>
> Michael,
>
> > The original (much more complicated) code that spurred this report
> > does have an MPI::Finalize() at the end. I wrote the simple example
> > code to illustrate the problem. I neglected to include MPI::Finalize()
> > in the sample program, but that oversight doesn't matter, at least
> > operationally speaking. I just added it to the C++ code and I get the
> > same problem--the code quits before getting past MPI::Init(), and
> > produces no output.
>
> I'm just wondering -- does the C version work correctly?
> Ideally, there shouldn't be any difference at all based on
> which language you used to write that very simple code snippet.
>
> If the problem continues to show up only with C++ code, could you
> check whether all the nodes have the same version of the C++
> libraries installed?
>
> > This leads me to believe that there is an issue with the comm
> > backbone of the cluster, but our cluster administrators assure me
> > this is not the case. I am new to cluster work and have no idea how
> > to prove or disprove their contention.
>
> You could try running the Intel MPI benchmarks on this cluster to see
> whether large runs with a lot of communication are able to execute
> successfully. Please let us know if this works on your cluster.
>
> Thanks,
> Sayantan.
>
> --
> http://www.cse.ohio-state.edu/~surs
>