[mvapich-discuss] Problems running MPI jobs with large (?) numbers of processors

Webb, Michael Michael.Webb at atk.com
Fri Jan 19 15:50:43 EST 2007


Sayantan,

I will check the C code ...
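
In case it helps, the C test I have in mind is roughly the following
(just a sketch -- it may not match my original example exactly):
initialize, report the rank, finalize.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* If this prints from every rank, MPI_Init made it through. */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}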

Re: the libraries; I'm not sure which libraries you mean, but I checked
those in ../mvapich-0.9.7/lib, and I noticed the following.

Our system is set up with a "head" node, on which all users are
supposed to operate/compile/submit jobs, etc. The lib directory of my
head node is:

drwxr-xr-x  3 root root    4096 Aug  3 10:58 .
drwxr-xr-x  8 root root    4096 Aug  3 10:57 ..
-rw-r--r--  1 root root  294712 Aug  3 10:55 libfmpich.a
-rwxr-xr-x  1 root root  104417 Aug  3 10:56 libfmpich.so.1.0
-rw-r--r--  1 root root 2735490 Aug  3 10:55 libmpich.a
-rwxr-xr-x  1 root root 1345255 Aug  3 10:56 libmpich.so.1.0
-rw-r--r--  1 root root 1277294 Aug  3 10:56 libmpichf90.a
-rw-r--r--  1 root root    6724 Aug  3 10:56 libmpichf90nc.a
-rw-r--r--  1 root root    4422 Aug  3 10:56 libmpichfarg.a
-rw-r--r--  1 root root   11118 Aug  3 10:55 libmpichfsup.a
-rw-r--r--  1 root root  556860 Aug  3 10:56 libpmpich++.a
lrwxrwxrwx  1 root root      10 Aug  3 10:58 libpmpich.a -> libmpich.a
-rwxr-xr-x  1 root root 1345255 Aug  3 10:56 libpmpich.so.1.0
lrwxrwxrwx  1 root root      15 Aug  3 10:58 libtvmpich.so -> libtvmpich.so.1
lrwxrwxrwx  1 root root      17 Aug  3 10:58 libtvmpich.so.1 -> libtvmpich.so.1.0
-rwxr-xr-x  1 root root   92455 Aug  3 10:55 libtvmpich.so.1.0
drwxr-xr-x  2 root root    4096 Aug  3 10:58 shared

I checked a few "worker" nodes (10 of the 100+); they all have lib
directories that look like this:

drwxr-xr-x  3 root root    4096 Sep  4 09:48 .
drwxr-xr-x  8 root root    4096 Aug  3 10:57 ..
-rw-r--r--  1 root root  294712 Aug  3 10:55 libfmpich.a
-rwxr-xr-x  1 root root  104417 Aug  3 10:56 libfmpich.so.1.0
-rw-r--r--  1 root root 2735490 Aug  3 10:55 libmpich.a
-rwxr-xr-x  1 root root 1345255 Aug  3 10:56 libmpich.so.1.0
-rw-r--r--  1 root root 1277294 Aug  3 10:56 libmpichf90.a
-rw-r--r--  1 root root    6724 Aug  3 10:56 libmpichf90nc.a
-rw-r--r--  1 root root    4422 Aug  3 10:56 libmpichfarg.a
-rw-r--r--  1 root root   11118 Aug  3 10:55 libmpichfsup.a
-rw-r--r--  1 root root  556860 Aug  3 10:56 libpmpich++.a
lrwxrwxrwx  1 root root      10 Sep  4 09:48 libpmpich.a -> libmpich.a
-rwxr-xr-x  1 root root 1345255 Aug  3 10:56 libpmpich.so.1.0
lrwxrwxrwx  1 root root      15 Sep  4 09:48 libtvmpich.so -> libtvmpich.so.1
lrwxrwxrwx  1 root root      17 Sep  4 09:48 libtvmpich.so.1 -> libtvmpich.so.1.0
-rwxr-xr-x  1 root root   92455 Aug  3 10:55 libtvmpich.so.1.0
drwxr-xr-x  2 root root    4096 Sep  4 09:48 shared

Notice that all the libraries have the same dates and sizes (the dates
on some of the links are different, but I don't see how that would
matter).

Also, the /lib/shared and /bin directory structures on the nodes I
checked are the same as those on the head node.
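
If it would help to rule out differences that the dates and sizes
don't catch, I could also run a quick consistency check along these
lines, with one MPI process per node (just a sketch -- the install
path below is a guess, and I'd point it at wherever our mvapich-0.9.7
actually lives): every rank checksums the same library file and rank
0 flags any node that disagrees.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Placeholder path -- adjust to the real install location. */
    const char *path = "/usr/local/mvapich-0.9.7/lib/libmpich.a";
    unsigned long sum = 0, *all = NULL;
    int rank, size, i, c;
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simple rolling checksum of the file; 0 means it was missing. */
    fp = fopen(path, "rb");
    if (fp != NULL) {
        while ((c = fgetc(fp)) != EOF)
            sum = sum * 31UL + (unsigned long) c;
        fclose(fp);
    }

    if (rank == 0)
        all = malloc(size * sizeof(unsigned long));

    MPI_Gather(&sum, 1, MPI_UNSIGNED_LONG,
               all, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++)
            if (all[i] != all[0])
                printf("rank %d sees a different %s\n", i, path);
        printf("checked %d ranks\n", size);
        free(all);
    }

    MPI_Finalize();
    return 0;
}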

Michael 
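
P.S. Regarding the Intel MPI Benchmarks suggestion in your message
(quoted below): until I get IMB built here, I may first try a quick
hand-rolled all-to-all test along these lines -- just a sketch, and
certainly not a substitute for IMB:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int count = 4096;   /* ints sent to every other rank */
    const int iters = 100;
    int rank, size, i;
    int *sendbuf, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc((size_t) size * count * sizeof(int));
    recvbuf = malloc((size_t) size * count * sizeof(int));
    for (i = 0; i < size * count; i++)
        sendbuf[i] = rank;

    /* Repeated all-to-all exchanges to exercise the interconnect. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, count, MPI_INT,
                     recvbuf, count, MPI_INT, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks, %d all-to-alls, %.3f seconds\n",
               size, iters, t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}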
 

> -----Original Message-----
> From: Sayantan Sur [mailto:surs at cse.ohio-state.edu] 
> Sent: Friday, January 19, 2007 1:20 PM
> To: Webb, Michael
> Cc: mvapich-discuss at cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] Problems running MPI jobs with 
> large (?) numbers of processors
> 
> Michael,
> 
> > The original (much more complicated) code that spurred this report 
> > does have an MPI::Finalize() at the end. I wrote the simple example 
> > code to illustrate the problem. I neglected to include 
> MPI::Finalize() 
> in the sample program, but that oversight isn't important; at least 
> > operationally speaking, it doesn't matter. I just added it 
> to the C++ 
> > code and I get the same problem--the code quits before getting past 
> > MPI::Init(), and produces no output.
> 
> I'm just wondering -- does the C version work correctly? 
> Ideally, there shouldn't be any difference at all based on 
> which language you used to write that very simple code snippet.
> 
> If the problem continues to show up only with C++ code, could 
> you check if all the nodes have the same version of C++ 
> libraries installed?
> 
> > This leads me to believe that there is an issue with the 
> comm backbone 
> > of the cluster, but our cluster administrators assure me 
> this is not 
> > the case. I am new to cluster work and have no idea how to prove or 
> > disprove their contention.
> 
> You could try to run the Intel MPI Benchmarks on this cluster 
> to see if large runs with a lot of communication are able to 
> execute successfully.
> Please let us know if this works on your cluster.
> 
> Thanks,
> Sayantan.
> 
> --
> http://www.cse.ohio-state.edu/~surs
> 


