[mvapich-discuss] crash of runs over InfiniBand

Iris Pernille Lohmann ipl at dhigroup.com
Wed Oct 21 02:32:49 EDT 2009


Hi again,

No, the second step didn't help. I also tried inserting in limits.conf:
* soft memlock unlimited
* hard memlock unlimited
on the two nodes that I am using, but with the same result - a crash.
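As far as I understand, one way to check whether the new limit actually reaches the remotely started processes is to compare what an interactive shell reports with what a non-interactive ssh session reports (the node name below is just a placeholder):

# limit in the shell I launch from
ulimit -l

# limit seen by a non-interactive session on the other node,
# which is roughly how mpirun_rsh starts the remote processes
ssh node02 'ulimit -l'

If the second value were still the old one, I suppose that would point back to the sshd step from section 9.3.4.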

These are my findings with the test case that I am running:
Mpirun_rsh/mpispawn:
1 node, 4 processors: OK
2 nodes, 4 processors (2 on each): crash
1 node, 8 processors: crash
2 nodes, 8 processors (4 on each): crash
Launched on one of the compute nodes; for example, for the 2 nodes / 8 processors case:
nohup mpirun_rsh -np 8 -hostfile hosts ./testprogram inputfile &
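(In case it matters, the hosts file contains nothing more than the node names, one per line - shown here with placeholder names instead of the real ones:

node01
node02

and mpirun_rsh then spreads the -np processes over these hosts.)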

Mpiexec/mpd:
1 node, 4 processors: OK
2 nodes, 4 processors (2 on each): OK
1 node, 8 processors: crash
2 nodes, 8 processors (4 on each): crash
Launched on one of the compute nodes; for example, for the 2 nodes / 8 processors case:
mpdboot -n 2 -h hosts
nohup mpiexec -n 8 ./testprogram inputfile > /dev/null 2>/dev/null &

With both mpirun_rsh and mpiexec, the crash happens after a while, during the initialization of the program: the cores have been distributed and the crash occurs while reading the mesh.

I should perhaps also mention that I tried the same test program with MPICH2 on 1 node only with 8 cores, with success. Perhaps this gives a clue?

The cluster has 12 compute nodes, each with 2 quad-core CPUs (Intel 5550 Nehalem, 2.7 GHz) and 12 GB RAM. When I use 8 cores I use all the cores of a node, and I assume that InfiniBand is being used in this case, even when I run on 1 node only...

I really hope you have some ideas about what the problem may be; please let me know if you need more information.

Best regards and thanks,
Iris



-----Original Message-----
From: Dhabaleswar Panda [mailto:panda at cse.ohio-state.edu] 
Sent: 19 October 2009 15:49
To: Iris Pernille Lohmann
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] crash of runs over InfiniBand

Iris - InfiniBand communication relies on pinning and registering
communication buffers (the associated memory) before communication can
take place. It appears that you are running out of memory that can be
pinned when running applications for a longer period of time. You can
carry out the second step and let us know whether the problem goes away or
not.
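For reference, the second step in section 9.3.4 amounts to something along the following lines (this is only a sketch; the init script path and restart command may differ on your distribution):

# in /etc/init.d/sshd, before the daemon is started, add:
ulimit -l <phys mem in KB>

# then restart sshd on every compute node, for example:
/etc/init.d/sshd restart

# afterwards, a non-interactive session should report the new limit:
ssh <node> 'ulimit -l'

The reason this matters is that mpirun_rsh and mpd start the remote processes through sshd, so the memlock limit those processes inherit is the one sshd was started with.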

Thanks,

DK

On Mon, 19 Oct 2009, Iris Pernille Lohmann wrote:

> Dear list members,
>
> I am using MVAPICH 1.4 on a linux cluster. I have made some computations on 1 and 2 nodes using mpirun_rsh. When I run a relatively small computation, the run on 2 nodes works fine, whereas with a relatively large computation, the run on 2 nodes crashes (I get no error messages). Running on 1 node works fine.
>
> I am thinking that it may have something to do with memory, and in the User Guide section 9.3.4 there is a description on setting the soft memlock.
>
> In my limits.conf the soft memlock and hard memlock are already set to 6000000.
>
> Could the problem be that the second step mentioned in section 9.3.4, namely to add the following to /etc/init.d/sshd:
> ulimit -l <phys mem in KB>
>
> has not been done? What does it actually mean?
>
> Or can it be something completely different?
>
>
> Best regards,
>
> Iris Lohmann
>
>
>
>
>
> Iris Pernille Lohmann, MSc, PhD
> Ports & Offshore Technology (POT)
>
> DHI
> Agern Allé 5
> DK-2970 Hørsholm
> Denmark
>
> Tel: +45 4516 9200
> Direct: 45169427
> ipl at dhigroup.com
> www.dhigroup.com
>
> WATER * ENVIRONMENT * HEALTH
