[mvapich-discuss] "Too many open files" error
Mike Heinz
michael.heinz at qlogic.com
Mon Mar 9 13:37:27 EDT 2009
Hey, we're QA testing a release of OFED 1.4, including MVAPICH, and the testers just ran into the following problem: they're running Pallas across 44 nodes and, part way through the run, machines start failing with a "too many open files" error (see below).
At first blush, this sounds like a ulimit problem, and I'm trying to get access to the failing machines to test that theory - but is there some known condition where mvapich will leak file handles?
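One way to test the ulimit theory on a failing node is to compare a process's open-descriptor count against its RLIMIT_NOFILE limit. Below is a minimal sketch in Python, assuming a Linux /proc filesystem; the helper names are made up for illustration and are not part of MVAPICH or OFED.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid=None):
    """Count open file descriptors of a process via /proc (Linux only).

    Inspecting another process requires the same user or root.
    """
    target = os.getpid() if pid is None else int(pid)
    entries = os.listdir(f"/proc/{target}/fd")
    if target == os.getpid():
        # The listing of our own fd directory includes the descriptor that
        # listdir itself holds while reading it, so drop that one.
        return len(entries) - 1
    return len(entries)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open fds: {count_open_fds()}  soft limit: {soft}  hard limit: {hard}")
```

Sampling count_open_fds(pid) for one rank between benchmarks should show whether the count keeps climbing (which would suggest a leak) or is simply sitting close to a low soft limit.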
[root at st28]# /usr/mpi/gcc/mvapich-1.1.0/bin/mpirun -np 44 -machinefile
(prior test cases trimmed)
#----------------------------------------------------------------
# Benchmarking Bcast
# ( #processes = 8 )
# ( 36 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
0       1000          0.05         0.07         0.05
1       1000          8.70         8.71         8.71
2       1000          8.16         8.18         8.17
4       1000          8.17         8.19         8.18
8       1000          7.83         7.84         7.83
16      1000          8.08         8.10         8.09
32      1000          8.36         8.38         8.37
64      1000          8.28         8.30         8.29
128     1000          9.02         9.03         9.03
256     1000          9.33         9.35         9.34
512     1000          10.13        10.14        10.13
1024    1000          12.33        12.35        12.33
2048    1000          14.86        14.89        14.87
4096    1000          20.21        20.23        20.22
8192    1000          33.47        33.51        33.49
16384   1000          126.25       126.32       126.27
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st28-0-1.tmp>: 24
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st37-0-1.tmp>: 24
open: Too many open files
open: Too many open files
open: Too many open files
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st30-0-1.tmp>: 24
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st46-0-1.tmp>: 24
[0] shmem_coll_mmap:error in mmapping shared memory: 2
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st47-0-1.tmp>: 24
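The trailing numbers in the messages above look like errno values (an assumption; the MVAPICH source would confirm it). If so, they can be decoded with the standard library. A small Python sketch, with a hypothetical helper name:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    # 24 and 2 are the two codes that appear in the log above.
    for code in (24, 2):
        name, message = describe_errno(code)
        print(code, name, message)
```

On Linux, 24 is EMFILE ("Too many open files"), which matches the text printed before each shmem_coll_init message, and 2 is ENOENT ("No such file or directory").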
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania