[mvapich-discuss] "Too many open files" error

Mike Heinz michael.heinz at qlogic.com
Mon Mar 9 13:37:27 EDT 2009


Hey, we're QA testing a release of OFED 1.4, including MVAPICH, and the testers just ran into the following problem: they're running Pallas across 44 nodes, and partway through the run machines start failing with a "too many open files" error (see below).

At first blush, this sounds like a ulimit problem, and I'm trying to get access to the failing machines to test that theory - but is there some known condition where mvapich will leak file handles?
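
For anyone who wants to check the ulimit theory on a failing node, here is a minimal sketch, assuming Linux with /proc mounted and Python 3 available. It reports the per-process open-file limit (RLIMIT_NOFILE) and counts the descriptors a process currently holds. The function names are made up for illustration and are not part of MVAPICH or OFED. The trailing "24" in the log below looks like an errno value, and on Linux errno 24 is EMFILE, which is what the script prints for comparison.

```python
import errno
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE of the calling process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid="self"):
    """Count descriptors held by a process by listing /proc/<pid>/fd.

    Linux-specific. For pid="self" the count includes the one descriptor
    that os.listdir() opens temporarily to read the directory.
    Reading another user's process requires sufficient privileges.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
    print(f"open descriptors in this process: {count_open_fds()}")
    # errno 24 on Linux is EMFILE, the per-process descriptor limit.
    print(f"errno 24 = {errno.errorcode[24]}: {os.strerror(24)}")
```

Calling count_open_fds() with the PID of one of the MPI ranks a few times during the run would show whether the count keeps climbing toward the soft limit (which would suggest a leak) or whether the limit is simply too low for a 44-process job.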


[root at st28]# /usr/mpi/gcc/mvapich-1.1.0/bin/mpirun -np 44 -machinefile

(prior test cases trimmed)

#----------------------------------------------------------------
# Benchmarking Bcast
# ( #processes = 8 )
# ( 36 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.07         0.05
            1         1000         8.70         8.71         8.71
            2         1000         8.16         8.18         8.17
            4         1000         8.17         8.19         8.18
            8         1000         7.83         7.84         7.83
           16         1000         8.08         8.10         8.09
           32         1000         8.36         8.38         8.37
           64         1000         8.28         8.30         8.29
          128         1000         9.02         9.03         9.03
          256         1000         9.33         9.35         9.34
          512         1000        10.13        10.14        10.13
         1024         1000        12.33        12.35        12.33
         2048         1000        14.86        14.89        14.87
         4096         1000        20.21        20.23        20.22
         8192         1000        33.47        33.51        33.49
        16384         1000       126.25       126.32       126.27
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st28-0-1.tmp>: 24
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st37-0-1.tmp>: 24
open: Too many open files
open: Too many open files
open: Too many open files
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st30-0-1.tmp>: 24
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st46-0-1.tmp>: 24
[0] shmem_coll_mmap:error in mmapping shared memory: 2
open: Too many open files
[5820] shmem_coll_init:error in opening shared memory file
</tmp/ib_shmem_bcast_coll-5820-st47-0-1.tmp>: 24


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania