[mvapich-discuss] fortran system calls crash mpi

Pavel Shamis (Pasha) pasha at mellanox.co.il
Tue Oct 3 06:45:27 EDT 2006


The current openib-gen2 in OFED-1.1 does not support the "fork" call, so
you may see this problem with C, Fortran, or any other code that forks
(a system call from Fortran does exactly that under the hood).
As far as I know, the problem has been solved in the latest gen2 tree (NOT in OFED).
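
Where the external command can be avoided, the simplest workaround is not to
fork at all. For the 'pwd' case in the test program quoted below, a minimal
sketch (assuming Intel Fortran, since GETCWD here comes from the IFPORT
portability module) could look like this:

    program barrtest_nofork
      use ifport                          ! provides GETCWD
      implicit none
      include 'mpif.h'

      integer :: mpierr, mpirank, istat
      character(len=256) :: dirname

      call MPI_INIT(mpierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mpirank, mpierr)
      if (mpirank .eq. 1) then
         ! query the working directory in-process; no child process is forked
         istat = getcwd(dirname)
         if (istat .eq. 0) write(*,*) trim(dirname)
      endif
      call MPI_BARRIER(MPI_COMM_WORLD, mpierr)
      call MPI_FINALIZE(mpierr)
    end program barrtest_nofork

For the "copy" replacement asked about below, a plain-Fortran sketch is
included after the quoted thread.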

Regards.

Sayantan Sur wrote:
> Hello Michael,
> 
> 
> Thanks for writing to the list. I have tried your program using MVAPICH
> on both OFED-1.0 and OFED-1.1-rc5. I removed the "USE IFPORT" and
> changed "systemqq" to "system". According to your mail, you had
> tried these changes and they had failed for you. However, I am able to
> execute the program just fine, with the following output:
> 
> [surs at e0-oib:temp]
> /home/7/surs/projects/mvapich/release/trunk/bin/mpirun_rsh -np 2 e0 e1
> ./a.out 
>  after init
>  after init
>  after comm_rank 0
>  after barrier 0
>  after barrier2 0
>  after comm_rank 1
>  after barrier 1
> /home/7/surs/temp
>  after barrier2 1
> 
> These system calls depend on 'fork', which has to work properly with the
> underlying Gen2 library. I see that the current version of Gen2 has
> 'fork' support via the call "ibv_fork_init(void)". This call is not yet
> available in the OFED releases, though. MVAPICH will use this call in
> future versions.
> 
> However, as far as I can see, there is no problem executing this program
> on the existing OFED release.
> 
> Thanks,
> Sayantan.
> 
> * On Oct 1, Michael Harding <harding at uni-mainz.de> wrote:
>> hi,
>>
>> I am trying to use the installed MVAPICH on lonestar2
>> (http://www.tacc.utexas.edu/services/userguides/lonestar2/) together
>> with our quantum chemistry program package aces2 ( http://www.aces2.de ).
>> On lonestar2 they run the generation 2 stack called OFED (OpenFabrics
>> Enterprise Distribution) together with MVAPICH. After quite a long time
>> I found out that doing a system call from Fortran ends in problems with
>> MPI:
>>
>> I tried the following small and stupid program:
>>
>> program barrtest
>>   USE IFPORT
>>   include 'mpif.h'
>>
>>   integer*4 mpierr
>>   integer*4 mpirank
>>   integer*4 i
>>
>>   call MPI_INIT(mpierr)
>>   write(*,*) "after init"
>>   call mpi_comm_rank(MPI_COMM_WORLD, mpirank, mpierr)
>>   write(*,*) "after comm_rank", mpirank
>>   call MPI_BARRIER(MPI_COMM_WORLD, mpierr)
>>   write(*,*) "after barrier", mpirank
>>   if (mpirank.eq.1) then
>>      i = systemqq('pwd')
>>   endif
>>   call MPI_BARRIER(MPI_COMM_WORLD, mpierr)
>>   write(*,*) "after barrier2", mpirank
>>   call mpi_finalize(mpierr)
>> end
>>
>> This explained to me why we currently cannot run our program suite on
>> this computer. The same happens if I drop "USE IFPORT" and call system
>> instead of systemqq.
>>
>> So my questions are:
>>
>> Is this normal for MVAPICH, or is it related to local (TACC, utexas)
>> modifications of the code?
>> I know that Scali MPI (also on an InfiniBand cluster) had no problems
>> running our code. If this is a general problem with MVAPICH, when can
>> one expect a fix for it?
>>
>> Thanks for any reply! I would also appreciate any hint on how I can
>> work around this problem without system calls. (I especially need a
>> replacement for copy.)
>>
>> I have now been trying to get our code working there for three weeks ...
>> (even on a completely unknown system it has never taken me more than
>> three days to get it working before)
>>
>>
>> michael
>>
>>
> 
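
Regarding the question above about a replacement for "copy" without a
system call: here is a minimal sketch of a byte-for-byte file copy in plain
Fortran. It assumes a compiler with Fortran 2003 stream I/O
(ACCESS='STREAM'); the unit numbers are only illustrative, and reading one
byte at a time keeps the sketch simple at the cost of speed.

    subroutine copy_file(src, dst, ios)
      ! Copy a file byte-for-byte with Fortran stream I/O,
      ! so no system()/fork is needed.
      implicit none
      character(len=*), intent(in)  :: src, dst
      integer,          intent(out) :: ios
      character(len=1) :: byte
      integer, parameter :: uin = 21, uout = 22   ! illustrative unit numbers

      open(unit=uin, file=src, access='stream', form='unformatted', &
           status='old', action='read', iostat=ios)
      if (ios .ne. 0) return
      open(unit=uout, file=dst, access='stream', form='unformatted', &
           status='replace', action='write', iostat=ios)
      if (ios .ne. 0) then
         close(uin)
         return
      endif

      do
         read(uin, iostat=ios) byte     ! one byte per iteration
         if (ios .ne. 0) exit           ! end of file or read error
         write(uout) byte
      enddo
      if (ios .lt. 0) ios = 0           ! end-of-file is the normal exit

      close(uin)
      close(uout)
    end subroutine copy_file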


-- 
Pavel Shamis (Pasha)
Software Engineer
Mellanox Technologies LTD.
pasha at mellanox.co.il

