[mvapich-discuss] problems in executing higher number process job

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Aug 18 09:06:00 EDT 2008


Sangamesh,

Some of your earlier queries were for the uDAPL interface of MVAPICH2
running on your customized adapter. Do these problems occur in the same
environment/interface? Since MVAPICH2 supports multiple interfaces, it
would help if you could indicate which MVAPICH2 interface you are using
here.
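For reference, one way to check which interface an MVAPICH2 installation was
built with is its mpiname utility, which reports the library version, device,
and configure flags. This is only a sketch; the exact output format varies by
MVAPICH2 version, and it assumes mpiname is on your PATH.

```shell
# Print the MVAPICH2 build details, including the device/interface
# (e.g. ch3:mrail for InfiniBand/OFA, ch3:udapl for uDAPL).
# Assumes the MVAPICH2 bin directory (containing mpiname) is on PATH.
if command -v mpiname >/dev/null 2>&1; then
    mpiname -a
else
    echo "mpiname not found; check your MVAPICH2 installation's bin directory"
fi
```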

DK

On Mon, 18 Aug 2008, Sangamesh B wrote:

>  Dear all,
>
> Problem No 1:
>
> Application: GROMACS 3.3.3
>
> Parallel Library: MVAPICH2-1.0.3
>
> Compilers: Intel C++ and Fortran 10
>
> A parallel GROMACS 3.3.3 (C application) 32-core job runs successfully on a
> Rocks 4.3, 33-node cluster (dual-processor, quad-core Intel Xeon: 264 cores
> total).
>
> But if I submit the same job with 64 or more processes, it returns without
> doing anything.
>
> This is my command line:
>
> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr
> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run
>
>
>
> Problem No 2:
>
> Application: NAMD 2.6
>
> Parallel Library: MVAPICH2-1.0.3
>
> Compilers: Intel C++ and Fortran 10
>
> I built successfully charm++ with mvapich2 and intel compilers, and then
> compiled NAMD2.
>
> The test examples given in the NAMD distribution work fine.
>
> I then ran the following input file (the one used on the NAMD website for
> benchmarking; it runs/scales up to 252 processes, as mentioned on the NAMD
> website). In my case, however, it runs only for 8, 16, 32, and 64
> processes.
>
> But when a 128-core job is submitted, it doesn't run at all. Here are the
> command and the error:
>
> #mpirun -machinefile ./machfile -np 128
> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee
> namd_128cores
> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0
> rank 65 in job 4  master_host_name_50238   caused collective abort of all
> ranks
>   exit status of rank 65: killed by signal 9
>
>
> So I then built charm++ with its network version, without using MVAPICH2.
> That build works for a job with any number of processes.
>
> Given the above two problems, I suspect there is something wrong with
> MVAPICH2 itself. Is there a solution for it?
>
>
> Regards,
> Sangamesh
>


