[mvapich-discuss] problems in executing higher number process job

Sangamesh B forum.san at gmail.com
Tue Aug 19 00:14:53 EDT 2008


Hi DK Sir,

     I'm using OpenIB. MVAPICH2 is built with OFED-1.3 and Intel compilers.

This is a new cluster we built recently, so the environment is different from
the earlier one. However, on the earlier system we also built MVAPICH2 for the
OFA interface only.

We used make.mvapich2.ofa for the installation. That will not install the
uDAPL stack, right?
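
Roughly, the build was invoked like this (the compiler variables shown are
approximate and from memory, not copied from the actual build log):

    export CC=icc CXX=icpc F77=ifort F90=ifort   # Intel 10 compilers (assumed settings)
    cd mvapich2-1.0.3
    ./make.mvapich2.ofa                          # OFA (gen2/OpenIB) build script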

Thank you,
Sangamesh

On Tue, Aug 19, 2008 at 5:51 AM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:

> Agreed,
>
>        Generally, the "OpenIB" transport provides better startup performance
> and reliability across large numbers of cores, so if you are using uDAPL, I
> would suggest giving openib a shot.
>
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
>
> Dhabaleswar Panda wrote:
>
>> Sangamesh,
>>
>> Some of your earlier queries were about the uDAPL interface of MVAPICH2
>> running on your customized adapter. Do these problems occur in the same
>> environment, on the same interface? Since MVAPICH2 supports multiple
>> interfaces, it would be good if you could indicate which interface of
>> MVAPICH2 you are using here.
>>
>> DK
>>
>> On Mon, 18 Aug 2008, Sangamesh B wrote:
>>
>>> Dear all,
>>>
>>> Problem No 1:
>>>
>>> Application: GROMACS 3.3.3
>>>
>>> Parallel Library: MVAPICH2-1.0.3
>>>
>>> Compilers: Intel C++ and Fortran 10
>>>
>>> A parallel GROMACS 3.3.3 (C application) 32-core job runs successfully on a
>>> Rocks 4.3, 33-node cluster (dual-processor, quad-core Intel Xeon nodes: 264
>>> cores in total).
>>>
>>> But if I submit the same job with 64 or more processes, it returns without
>>> doing anything.
>>>
>>> These are my command lines:
>>>
>>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr
>>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run
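>>>
>>> (For context: before the mpirun above, the MPD ring is started across the
>>> nodes, roughly as below. The node count is illustrative, not our exact
>>> setup.)
>>>
>>> mpdboot -n 33 -f ./machfile1    # start one mpd daemon per node (33 is illustrative)
>>> mpdtrace | wc -l                # verify that all nodes joined the ring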
>>>
>>>
>>>
>>> Problem No 2:
>>>
>>> Application: NAMD 2.6
>>>
>>> Parallel Library: MVAPICH2-1.0.3
>>>
>>> Compilers: Intel C++ and Fortran 10
>>>
>>> I successfully built Charm++ with MVAPICH2 and the Intel compilers, and
>>> then compiled NAMD2.
>>>
>>> The test examples included in the NAMD distribution work fine.
>>>
>>> I then used the following input file (this is the input file used for
>>> benchmarking on the NAMD website, where it is reported to run/scale up to
>>> 252 processes). In my case it runs fine for 8, 16, 32, and 64 processes.
>>>
>>> But when a 128-core job is submitted, it does not run at all. The following
>>> are the command and the error:
>>>
>>> #mpirun -machinefile ./machfile -np 128
>>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee
>>> namd_128cores
>>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0
>>> rank 65 in job 4  master_host_name_50238   caused collective abort of all
>>> ranks
>>>  exit status of rank 65: killed by signal 9
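>>>
>>> (A guess I have not verified: at higher process counts this may be running
>>> into the locked-memory limit that InfiniBand needs for registered memory.
>>> A quick check on a compute node would look like this.)
>>>
>>> ulimit -l                                # OFED setups usually set this to "unlimited"
>>> grep memlock /etc/security/limits.conf   # per-user memlock limits, if any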
>>>
>>>
>>> So, going further, I built the network version of the Charm++ library
>>> (i.e., without using MVAPICH2). Now a job with any number of processes
>>> works.
>>>
>>> So, for the above two problems, I guess there is some problem with MVAPICH2
>>> itself. Is there a solution for it?
>>>
>>>
>>> Regards,
>>> Sangamesh
>>>
>>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>