[mvapich-discuss] deadlock with g95/gfortran

Shaun Rowland rowland at cse.ohio-state.edu
Fri Mar 2 16:09:30 EST 2007


Aliva Pattnaik wrote:
> Hi,
> 
> I am trying to run the Fortran example program (fpi.f) that comes with
> mvapich2-0.9.8. I am using g95 to compile it. When I run it with mpiexec it
> appears to deadlock, although in the "top" output I can see the processes
> taking 99% of CPU time. The same thing happens with gfortran. However, I am
> able to run the C example programs compiled with gcc successfully.
> 
> The cluster that I am using is 64-bit AMD Opteron with InfiniBand.
> 
> I would really appreciate it if someone could help me fix this problem.
> 
> Thank you very much for your help,
> Aliva

Hello Aliva. I've been looking into this problem with a variety of
compilers:

Intel
PGI
Pathscale
GCC (gfortran - also latest from gcc SVN trunk)
GCC (g77)

I tested with the fpi.f and pi3f90.f90 examples (with g77, only fpi.f).
All cases work as expected except those using gfortran. When running
with two processes, it appeared as if one of the processes was stuck in
libgfortran:

(gdb) bt
#0  0x000000342e00b0df in __read_nocancel () from /lib64/tls/libpthread.so.0
#1  0x0000002a9580b79a in find_or_create_unit ()
    from /usr/lib64/libgfortran.so.1
#2  0x0000002a95807b2e in _gfortran_transfer_real ()
    from /usr/lib64/libgfortran.so.1
#3  0x0000002a9580800d in _gfortran_transfer_real ()
    from /usr/lib64/libgfortran.so.1
#4  0x0000002a95806288 in _gfortran_st_open ()
    from /usr/lib64/libgfortran.so.1
#5  0x0000002a9580956b in _gfortran_st_read_done ()
    from /usr/lib64/libgfortran.so.1
#6  0x0000000000403b92 in MAIN__ () at fpi.f:46
#7  0x000000000047342e in main ()

while the other was waiting for it after getting to the MPI_BCAST call:

(gdb) bt
#0  0x0000002a95b3adeb in mthca_poll_cq (ibcq=0x5eb060, ne=1, wc=0x7fbffff080)
     at src/cq.c:482
#1  0x000000000042ca3b in ibv_poll_cq (cq=0x5eb060, num_entries=1,
     wc=0x7fbffff080) at /usr/local/ofed/include/infiniband/verbs.h:815
#2  0x000000000042bbf6 in MPIDI_CH3I_MRAILI_Cq_poll (vbuf_handle=0x7fbffff140,
     vc_req=0x0, receiving=0) at ibv_channel_manager.c:456
#3  0x0000000000414eef in MPIDI_CH3I_read_progress (vc_pptr=0x7fbffff158,
     v_ptr=0x7fbffff140) at ch3_read_progress.c:110
#4  0x0000000000413b56 in MPIDI_CH3I_Progress (is_blocking=1,
     state=0x7fbffff1a0) at ch3_progress.c:158
#5  0x000000000040a7dd in MPIC_Wait (request_ptr=0x5bde60) at helper_fns.c:316
#6  0x0000000000409db3 in MPIC_Recv (buf=0x7fbffff664, count=1,
     datatype=1275069467, source=0, tag=2, comm=1140850688, status=0x1)
     at helper_fns.c:86
#7  0x000000000040426f in MPIR_Bcast (buffer=0x7fbffff664, count=1,
     datatype=1275069467, root=0, comm_ptr=0x5a4940) at bcast.c:208
#8  0x000000000040594b in PMPI_Bcast (buffer=0x7fbffff664, count=1,
     datatype=1275069467, root=0, comm=1140850688) at bcast.c:785
#9  0x0000000000403bfd in pmpi_bcast_ (v1=0x7fbffff664, v2=0x4735ac,
     v3=0x4735a8, v4=0x4735a4, v5=0x473550, ierr=0x7fbffff66c) at bcastf.c:119
#10 0x0000000000403991 in MAIN__ () at fpi.f:50
#11 0x000000000047342e in main ()

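The process stuck in libgfortran is the one with myid of 0, which should
be prompting for the number of intervals. I believe it is blocked in the
read that follows the prompt, with the prompt itself never displayed.
However, the program runs to completion if I type all of the input up
front, before any output appears: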

[rowland@ro0-oib examples]$ ../bin/mpiexec -n 2 ./fpi
10
100
1000
10000
10
0
  Process            1  of            2  is alive
  Process            0  of            2  is alive
Enter the number of intervals: (0 quits)
   pi is approximately: 3.1424259850010983  Error is: 0.0008333314113051
Enter the number of intervals: (0 quits)
   pi is approximately: 3.1416009869231241  Error is: 0.0000083333333309
Enter the number of intervals: (0 quits)
   pi is approximately: 3.1415927369231254  Error is: 0.0000000833333322
Enter the number of intervals: (0 quits)
   pi is approximately: 3.1415926544231318  Error is: 0.0000000008333387
Enter the number of intervals: (0 quits)
   pi is approximately: 3.1424259850010983  Error is: 0.0008333314113051
Enter the number of intervals: (0 quits)

This seems to be necessary only with gfortran. It looks like an
input/output buffering issue: the "is alive" messages and the prompts
are not displayed until after all of the input has been read. I see the
same behavior with the pi3f90 example. Apart from this, the programs
built with gfortran seem to work correctly.
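
A minimal sketch of the suspected mechanism, written in Python rather than
Fortran so that it runs without an MPI installation. It assumes the cause is
output buffering on a non-terminal stdout, which is a guess and not a
confirmed diagnosis of libgfortran. The child program and the helper name
prompt_seen_before_input are hypothetical stand-ins for fpi.f and for the
user at the terminal. It needs a POSIX system, since select() is used on a
pipe.

```python
import os
import select
import subprocess
import sys

# Hypothetical stand-in for fpi.f: print a prompt, read one number, echo it.
# Without an explicit flush the prompt stays in the child's stdout buffer,
# because Python block-buffers stdout when it is a pipe and not a terminal.
CHILD = """
import sys
flush = sys.argv[1] == "flush"
print("Enter the number of intervals: (0 quits)", flush=flush)
n = int(sys.stdin.readline())
print("got", n)
"""


def prompt_seen_before_input(flush, timeout=3.0):
    """Run the child once; return (prompt_visible_before_input, all_output)."""
    # Drop PYTHONUNBUFFERED so the child keeps its default buffering.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, "flush" if flush else "noflush"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=env,
    )
    # Wait up to `timeout` seconds for any output before sending input.
    ready, _, _ = select.select([child.stdout], [], [], timeout)
    # Now supply the answer; the child finishes and flushes everything.
    out, _ = child.communicate(b"10\n")
    return bool(ready), out.decode()
```

Calling prompt_seen_before_input(flush=False) reports that no prompt was
visible while the child waited for input, even though the prompt is present in
the final output. That matches the symptom above, where the program looks hung
but is only waiting to read. With flush=True the prompt arrives first.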

Can you let us know whether you can reproduce these results?
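
For a reproduction that does not depend on typing, the answers can be fed on
stdin in one go. The following is a sketch only: run_with_answers is a
hypothetical helper, and it assumes that mpiexec forwards piped stdin to the
process with myid 0 the same way it forwards typed input.

```python
import subprocess


def run_with_answers(cmd, answers, timeout=60):
    """Run cmd, feed each answer on its own stdin line, and return stdout."""
    stdin_text = "".join(f"{a}\n" for a in answers)
    result = subprocess.run(
        cmd,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

With the command from the session above, the call would be
run_with_answers(["../bin/mpiexec", "-n", "2", "./fpi"], [10, 100, 1000, 10000, 10, 0]);
adjust the paths for your installation.
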
-- 
Shaun Rowland	rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

