[mvapich-discuss] problems in executing higher number process job

Joshua Bernstein jbernstein at penguincomputing.com
Mon Aug 18 20:21:44 EDT 2008


Agreed,

	Generally, the OpenIB transport provides better startup performance and 
reliability across a large number of cores than uDAPL does, so if you are 
currently using uDAPL, I would suggest giving OpenIB a shot.
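
	Switching interfaces just means rebuilding MVAPICH2 against the Gen2 
(OpenIB) device instead of uDAPL. A rough sketch is below -- the install 
prefix is only an example and the exact option names for 1.0.3 may differ, 
so please double-check them (and the make.mvapich2.* convenience scripts 
shipped in the tarball) against the MVAPICH2 userguide:

	# Rebuild MVAPICH2 1.0.3 against the OpenFabrics/Gen2 (OpenIB) interface.
	# Option names are from memory -- verify against the 1.0.3 userguide.
	./configure --prefix=/opt/mvapich2-1.0.3-openib \
	            --with-device=osu_ch3:mrail --with-rdma=gen2 \
	            CC=icc CXX=icpc F77=ifort F90=ifort
	make && make install
	# (A uDAPL build would use --with-rdma=udapl instead.)

	Then rebuild GROMACS and charm++/NAMD against the new install and retry 
the 64- and 128-process runs.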

-Joshua Bernstein
Software Engineer
Penguin Computing

Dhabaleswar Panda wrote:
> Sangamesh,
> 
> Some of your earlier queries were about the uDAPL interface of MVAPICH2
> running on your customized adapter. Do these problems occur in the same
> environment and with the same interface? Since MVAPICH2 supports multiple
> interfaces, it would be good if you could indicate which interface of
> MVAPICH2 you are using here.
> 
> DK
> 
> On Mon, 18 Aug 2008, Sangamesh B wrote:
> 
>>  Dear all,
>>
>> Problem No 1:
>>
>> Application: GROMACS 3.3.3
>>
>> Parallel Library: MVAPICH2-1.0.3
>>
>> Compilers: Intel C++ and Fortran 10
>>
>>   A parallel GROMACS 3.3.3 (a C application) 32-core job runs successfully on
>> a Rocks 4.3, 33-node cluster (dual-processor, quad-core Intel Xeon: 264 cores
>> in total).
>>
>> But if I submit the same job with 64 or more processes, it returns without
>> doing anything.
>>
>> This is my command line:
>>
>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr
>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run
>>
>>
>>
>> Problem No 2:
>>
>> Application: NAMD 2.6
>>
>> Parallel Library: MVAPICH2-1.0.3
>>
>> Compilers: Intel C++ and Fortran 10
>>
>> I successfully built charm++ with MVAPICH2 and the Intel compilers, and then
>> compiled NAMD2.
>>
>> The test examples given in the NAMD distribution work fine.
>>
>> With the following input file (this is the input file used on the NAMD
>> website for benchmarking; it runs and scales up to 252 processes, as
>> mentioned there), it runs fine in my case for 8, 16, 32, and 64 processes.
>>
>> But when a 128-core job is submitted, it doesn't run at all. The following
>> is the command and the error:
>>
>> #mpirun -machinefile ./machfile -np 128
>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee
>> namd_128cores
>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0
>> rank 65 in job 4  master_host_name_50238   caused collective abort of all
>> ranks
>>   exit status of rank 65: killed by signal 9
>>
>>
>> So I then built charm++ with the network (net) layer, without using
>> MVAPICH2. Now a job with any number of processes works.
>>
>> So, for the above two problems, I guess there is some problem with
>> MVAPICH2 itself. Is there a solution for it?
>>
>>
>> Regards,
>> Sangamesh
>>
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

