[mvapich-discuss] problems in executing higher number process job

Matthew Koop koop at cse.ohio-state.edu
Tue Aug 19 14:56:47 EDT 2008


Sangamesh,

I'm not sure what your issue here is; however, we have run each of these
sets of software in the past without any problem. I just re-verified
that NAMD works fine with that version of MVAPICH2 and compilers at 128
processes and above.

Can you give the parameters you used for building Charm++?
(conv-mach.sh)

I've posted this in the past as a guide for MVAPICH:
cd charm-5.9
cd ./src/arch

cp -r mpi-linux-amd64 mpi-linux-amd64-mvapich
cd mpi-linux-amd64-mvapich

* edit conv-mach.h and change:

#define CMK_MALLOC_USE_GNU_MALLOC                          1
#define CMK_MALLOC_USE_OS_BUILTIN                          0

to

#define CMK_MALLOC_USE_GNU_MALLOC                          0
#define CMK_MALLOC_USE_OS_BUILTIN                          1

* make sure the MVAPICH mpicc and mpiCC are first in your path. Otherwise,
add the full paths to the mpicc and mpiCC commands in conv-mach.sh

cd ../../..

./build charm++ mpi-linux-amd64-mvapich --no-build-shared

You may need to change mpiCC to mpicxx in the conv-mach.sh in
charm-5.9/src/arch/mpi-linux-amd64-mvapich
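The steps above can be collected into one script. This is only a sketch: it
assumes a charm-5.9 source tree in the current directory (the CHARM_DIR
variable is my addition, not part of the recipe above), GNU sed, and that
MVAPICH's mpicc/mpiCC are already first in $PATH.

```shell
#!/bin/sh
# Sketch of the MVAPICH build recipe above; CHARM_DIR is an
# assumption, defaulting to a charm-5.9 tree in the current dir.
set -e
CHARM_DIR=${CHARM_DIR:-charm-5.9}

# Flip the malloc macros in conv-mach.h: GNU malloc off, OS built-in on.
flip_malloc() {
  sed -i \
    -e 's/\(CMK_MALLOC_USE_GNU_MALLOC[[:space:]]*\)1/\10/' \
    -e 's/\(CMK_MALLOC_USE_OS_BUILTIN[[:space:]]*\)0/\11/' \
    "$1"
}

if [ -d "$CHARM_DIR/src/arch/mpi-linux-amd64" ]; then
  # Clone the stock MPI target as an MVAPICH-specific one.
  cp -r "$CHARM_DIR/src/arch/mpi-linux-amd64" \
        "$CHARM_DIR/src/arch/mpi-linux-amd64-mvapich"
  flip_malloc "$CHARM_DIR/src/arch/mpi-linux-amd64-mvapich/conv-mach.h"
  # Build with the MVAPICH compiler wrappers first in $PATH.
  ( cd "$CHARM_DIR" && ./build charm++ mpi-linux-amd64-mvapich --no-build-shared )
fi
```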

Matt

On Tue, 19 Aug 2008, Sangamesh B wrote:

> Hi DK Sir,
>
>      I'm using OpenIB. MVAPICH2 is built with OFED-1.3 and Intel compilers.
>
> This is the new cluster we built recently. Its environment is different from
> the earlier one, but earlier we also built MVAPICH2 for the OFA interface only.
>
> We've used make.mvapich2.ofa for installation. This will not install the uDAPL
> stack, right?
>
> Thank you,
> Sangamesh
>
> On Tue, Aug 19, 2008 at 5:51 AM, Joshua Bernstein <
> jbernstein at penguincomputing.com> wrote:
>
> > Agreed,
> >
> >        Generally the "OpenIB" transport provides better startup and
> > reliability over a large number of cores, so if you are using uDAPL, I would
> > suggest giving openib a shot.
> >
> > -Joshua Bernstein
> > Software Engineer
> > Penguin Computing
> >
> > Dhabaleswar Panda wrote:
> >
> >> Sangamesh,
> >>
> >> Some of your earlier queries were for the uDAPL interface of MVAPICH2
> >> running on your customized adapter. Do these problems occur on the same
> >> environment/interface? Since MVAPICH2 supports multiple interfaces, it
> >> will be good if you can indicate which interface of MVAPICH2 you are using
> >> here.
> >>
> >> DK
> >>
> >> On Mon, 18 Aug 2008, Sangamesh B wrote:
> >>
> >>> Dear all,
> >>> Problem No 1:
> >>>
> >>> Application: GROMACS 3.3.3
> >>>
> >>> Parallel Library: MVAPICH2-1.0.3
> >>>
> >>> Compilers: Intel C++ and Fortran 10
> >>>
> >>> A parallel GROMACS-3.3.3 (C application) 32-core job runs successfully
> >>> on a Rocks 4.3, 33-node cluster (dual-processor, quad-core Intel Xeon:
> >>> 264 cores total).
> >>>
> >>> But if I submit the same job with 64 or more processes, it exits without
> >>> doing anything.
> >>>
> >>> This is my command line:
> >>>
> >>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr
> >>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run
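> >>> For reference, the machinefile passed to -machinefile above is a plain
> >>> list of compute-node hostnames, one per line (the names below are
> >>> placeholders, not the cluster's real node names):

```
compute-0-0
compute-0-1
compute-0-2
compute-0-3
```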
> >>>
> >>>
> >>>
> >>> Problem No 2:
> >>>
> >>> Application: NAMD 2.6
> >>>
> >>> Parallel Library: MVAPICH2-1.0.3
> >>>
> >>> Compilers: Intel C++ and Fortran 10
> >>>
> >>> I built successfully charm++ with mvapich2 and intel compilers, and then
> >>> compiled NAMD2.
> >>>
> >>> The test examples given in the NAMD distribution works fine.
> >>>
> >>> With the following input file (the one used on the NAMD website for
> >>> benchmarking, where it is reported to run/scale up to 252 processes), in
> >>> my case it runs only for 8, 16, 32, and 64 processes.
> >>>
> >>> But when a 128-core job is submitted, it doesn't run at all. The following
> >>> is the command and error.
> >>>
> >>> #mpirun -machinefile ./machfile -np 128
> >>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee
> >>> namd_128cores
> >>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0
> >>> rank 65 in job 4  master_host_name_50238   caused collective abort of all
> >>> ranks
> >>>  exit status of rank 65: killed by signal 9
> >>>
> >>>
> >>> So I then built the network version of the charm++ library, without
> >>> using MVAPICH2. Now it works for jobs with any number of processes.
> >>>
> >>> So, for the above two problems, I guess there is some problem with
> >>> MVAPICH2 itself. Is there a solution for it?
> >>>
> >>>
> >>> Regards,
> >>> Sangamesh
> >>>
> >>>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
>


