[mvapich-discuss] mvapich2_munmap

burlen burlen.loring at gmail.com
Thu Dec 3 21:33:55 EST 2009


Hi Krishna,

I built mvapich2-1.4 today. Bad news, man: I get the same problem.

With mvapich2-1.4 the program crashes right away with a segfault, and the
stack is very similar to the one from the mvapich2-1.2p1 build (see below).
Both builds used the Intel compiler, just to be sure to mention it. The
stack shows that a call to free() inside MPIDI_CH3I_SMP_init, during
MPI_Init, initiated the issue. Any ideas?

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 46912874878096 (LWP 28347)]
    0x00002aaaaaddffcf in find_and_free_dregs_inside ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    (gdb) where
    #0  0x00002aaaaaddffcf in find_and_free_dregs_inside ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #1  0x00002aaaaadcd73b in mvapich2_mem_unhook ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #2  0x00002aaaaadcd77a in mvapich2_munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #3  0x00002aaaaf3cc37c in munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
    #4  0x00002aaaaadcd78f in mvapich2_munmap ()

    ... (repeated mvapich2_munmap, munmap sequence) ...

    #16567 0x00002aaaaf3cc37c in munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
    #16568 0x00002aaaaadcd78f in mvapich2_munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16569 0x00002aaaaf3cc37c in munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
    #16570 0x00002aaaaadcd78f in mvapich2_munmap ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16571 0x00002aaaaadc7ad5 in free ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16572 0x00002aaaaadd686a in MPIDI_CH3I_SMP_init ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16573 0x00002aaaaae49d24 in MPIDI_CH3_Init ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16574 0x00002aaaaae0b3fd in MPID_Init ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16575 0x00002aaaaae33d40 in MPIR_Init_thread ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
    #16576 0x00002aaaac6118ff in PMPI_Init ()
       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerManager.so
    #16577 0x00002aaaab5230a8 in vtkPVMain::Initialize (argc=0x7fffffffdb00, argv=0x7fffffffdab0)
        at /u/burlen/ParaView/ParaView3-3.7/Servers/Filters/vtkPVMain.cxx:107
    #16578 0x00000000004027bd in main (argc=3, argv=0x7fffffffdbf8)
        at /u/burlen/ParaView/ParaView3-3.7/Servers/Executables/pvserver.cxx:30
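The backtrace above alternates between mvapich2_munmap (in libvtkPVServerCommon.so) and a munmap symbol resolved from libicet_mpi.so for thousands of frames before the crash. One way such a chain can arise is a memory hook that forwards to "the next munmap" when that symbol resolves back to the hook itself. The C sketch below is a toy model of that pattern under stated assumptions, not MVAPICH2's code: it assumes, as the function names in the trace suggest, that the hook drops cached memory registrations for the unmapped range before forwarding. cache_add, invalidate_overlapping, hooked_munmap, next_munmap, and guard_enabled are all hypothetical names, the "registration cache" is a plain array, and no real memory is unmapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a munmap hook; every name here is hypothetical and this
   is NOT MVAPICH2's implementation.  No real memory is unmapped. */

#define MAX_REGIONS 16
#define DEPTH_LIMIT 1000 /* stands in for the stack running out */

typedef int (*munmap_fn)(void *addr, size_t len);

typedef struct {
    uintptr_t start; /* first byte of a "registered" (pinned) buffer */
    size_t len;
    int in_use;
} reg_entry;

static reg_entry cache[MAX_REGIONS]; /* toy registration cache */
static munmap_fn next_munmap;        /* what "munmap" resolves to after the hook */
static int guard_enabled = 1;        /* 1 = honour the reentrancy guard */
static int in_hook;                  /* guard flag (a real hook needs per-thread state) */
static int depth, max_depth;         /* current and deepest hook nesting */

/* Terminal stand-in for the system call: does nothing, reports success. */
static int terminal_munmap(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return 0;
}

/* Record a registered buffer; returns its slot or -1 if the cache is full. */
int cache_add(void *addr, size_t len)
{
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (!cache[i].in_use) {
            cache[i] = (reg_entry){ (uintptr_t)addr, len, 1 };
            return i;
        }
    }
    return -1;
}

/* Drop every cached registration that overlaps [addr, addr + len):
   once that memory is unmapped, the registration is stale. */
static int invalidate_overlapping(void *addr, size_t len)
{
    uintptr_t lo = (uintptr_t)addr, hi = lo + len;
    int dropped = 0;
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (cache[i].in_use && cache[i].start < hi &&
            cache[i].start + cache[i].len > lo) {
            cache[i].in_use = 0;
            dropped++;
        }
    }
    return dropped;
}

/* The hook: invalidate stale registrations, then forward the call. */
int hooked_munmap(void *addr, size_t len)
{
    if (guard_enabled && in_hook)
        return terminal_munmap(addr, len); /* re-entered: skip the hook */
    if (depth >= DEPTH_LIMIT)
        return -1;                          /* toy "stack overflow" */
    in_hook = 1;
    if (++depth > max_depth)
        max_depth = depth;
    invalidate_overlapping(addr, len);
    int rc = next_munmap(addr, len);        /* may resolve back to this hook */
    depth--;
    in_hook = 0;
    return rc;
}
```

With next_munmap pointing back at hooked_munmap and guard_enabled = 1, the re-entrant call goes straight to the terminal stand-in and max_depth stays at 1; with guard_enabled = 0 the recursion only stops at the artificial DEPTH_LIMIT, which mirrors the unbounded munmap/mvapich2_munmap chain in the traces. A real hook would keep the guard in thread-local storage, since the 1.2p1 build was configured with --enable-threads=multiple.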



Krishna Chaitanya Kandalla wrote:
> I am guessing that as long as you use the right InfiniBand-related
> paths, everything should be fine. You can build mvapich2-1.4rc1
> locally instead, and for that you won't need any sudo permissions.
>
> Krishna
>
> burlen wrote:
>> Right, I did say that; sorry for the confusion. When you said that, I
>> wondered (hoped) you might have seen something else that suggested the
>> wrong library was linked in. I am all for upgrading to the latest,
>> but I'm not a sysadmin on this system and I don't know the details
>> of the hardware. So if I build the new release with the same
>> configure options that were used for the current build, will the
>> InfiniBand support just work, or do I need access to drivers
>> etc.? I've never built mvapich before :)
>>
>> Krishna Chaitanya Kandalla wrote:
>>> Burlen,
>>> In your first mail, you mentioned:
>>> > I have this strange situation when running paraview on a
>>> > particular build/install/revision of mvapich.
>>>
>>> So, I concluded that you were using mvapich and not mvapich2. But
>>> it's still not very clear why you are seeing a seg-fault inside
>>> the function find_and_free_dregs_inside() with this flag set. I can
>>> think of a few options to move ahead. You can try out the 1.4 version
>>> of mvapich2 that we released a few weeks ago; 1.2p1 is quite old. If
>>> you get the same failure even with 1.4, would it be possible for you
>>> to point us to where this application can be found so that we can
>>> reproduce it on our cluster?
>>>
>>> Thanks,
>>> Krishna
>>>
>>>
>>>
>>>
>>> burlen wrote:
>>>> I get the same problem (as initially reported) using 
>>>> VIADEV_USE_DREG_CACHE, but for sure it's mvapich2.
>>>>
>>>> Krishna Chaitanya Kandalla wrote:
>>>>> Burlen,
>>>>>          I just noticed that you are using MVAPICH and not
>>>>> MVAPICH2. The equivalent flag in MVAPICH is VIADEV_USE_DREG_CACHE.
>>>>> So, please set this flag to 0 instead of the MV2_* flag. I am
>>>>> sorry for the confusion.
>>>>>
>>>>> Thanks,
>>>>> Krishna
>>>>>
>>>>> burlen wrote:
>>>>>> OK, I didn't use mpirun_rsh before because it doesn't pass
>>>>>> some of the environment variables through. With the mpirun_rsh
>>>>>> method and without the MV2_USE_LAZY_MEM_UNREGISTER flag, I get the
>>>>>> same result as before, but with it set to 0 I now get a segfault:
>>>>>>
>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>> [Switching to Thread 46912793699472 (LWP 24718)]
>>>>>> 0x00002aaaaadae366 in find_and_free_dregs_inside ()
>>>>>>    from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>> (gdb) where
>>>>>> #0  0x00002aaaaadae366 in find_and_free_dregs_inside ()
>>>>>>    from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>> Cannot access memory at address 0x7fffedb06ff0
>>>>>>
>>>>>>
>>>>>>
>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>> Burlen,
>>>>>>>           In MVAPICH2, we use mpirun_rsh for job launch.
>>>>>>>           So, for the default configuration, you would run
>>>>>>> something like:
>>>>>>>
>>>>>>> mpirun_rsh -np 1 pvserver --server-port=50001 --use-offscreen-rendering
>>>>>>>
>>>>>>>           But to turn off this memory optimization feature, you
>>>>>>> can run:
>>>>>>> mpirun_rsh -np 1 MV2_USE_LAZY_MEM_UNREGISTER=0 pvserver --server-port=50001 --use-offscreen-rendering
>>>>>>>
>>>>>>>           Please let us know whether there is any difference
>>>>>>> in behavior between these two cases.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Krishna
>>>>>>>
>>>>>>> burlen wrote:
>>>>>>>> Maybe it was a coincidence that it seemed to die faster...
>>>>>>>>
>>>>>>>> r50i1n14:~$ export MV2_USE_LAZY_MEM_UNREGISTER=0
>>>>>>>> r50i1n14:~$ mpiexec -np 1 pvserver --server-port=50001 --use-offscreen-rendering
>>>>>>>>
>>>>>>>> Is that right?
>>>>>>>>
>>>>>>>>
>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>> Burlen,
>>>>>>>>>          That's very strange. With this flag set to 0, one of
>>>>>>>>> our memory optimizations is turned off, and our memory
>>>>>>>>> footprint should actually get better. Can you also let us
>>>>>>>>> know how you are running the job? This flag should appear
>>>>>>>>> before the name of the executable that you are trying to run.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Krishna
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> burlen wrote:
>>>>>>>>>> Hi Krishna, I tried it, but it didn't seem to help. Now the
>>>>>>>>>> available RAM was exhausted very quickly, way faster than
>>>>>>>>>> before. The node quickly became unresponsive, gdb never
>>>>>>>>>> finished starting, and the job was killed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>>>> Burlen,
>>>>>>>>>>>          Can you run your application with the run-time flag
>>>>>>>>>>> MV2_USE_LAZY_MEM_UNREGISTER=0? This might lead to slightly
>>>>>>>>>>> poorer performance, but it can help us narrow down the problem.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Krishna
>>>>>>>>>>>
>>>>>>>>>>> burlen wrote:
>>>>>>>>>>>> I have this strange situation when running paraview on a
>>>>>>>>>>>> particular build/install/revision of mvapich. Shortly after
>>>>>>>>>>>> paraview starts up it hangs, and watching in top I see
>>>>>>>>>>>> memory grow until the process is killed for using too much.
>>>>>>>>>>>> Attaching a debugger, I see what looks like an infinite
>>>>>>>>>>>> recursion. It has only happened to me with this particular
>>>>>>>>>>>> build of mvapich, which happens to be the only one on this
>>>>>>>>>>>> system.
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious if anyone has seen anything like this before?
>>>>>>>>>>>>
>>>>>>>>>>>>    (gdb) where
>>>>>>>>>>>>    #0  0x00002aaaaadbb25b in avlfindex ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #1  0x00002aaaaadae427 in find_and_free_dregs_inside ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #2  0x00002aaaaad9d1f9 in mvapich2_mem_unhook ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #3  0x00002aaaaad9d244 in mvapich2_munmap ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #4  0x00002aaaadfa88c6 in munmap ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>    #5  0x00002aaaaad9d259 in mvapich2_munmap ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>
>>>>>>>>>>>>    ...
>>>>>>>>>>>>
>>>>>>>>>>>>    #73059 0x00002aaaadfa88c6 in munmap ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>    #73060 0x00002aaaaad9d259 in mvapich2_munmap ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #73061 0x00002aaaaad979a1 in free ()
>>>>>>>>>>>>        from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>    #73062 0x00002aaaae441e7e in icetResizeBuffer (size=91607685)
>>>>>>>>>>>>        at /u/burlen/ParaView/ParaView3-3.7/Utilities/IceT/src/ice-t/context.c:129
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> mvapich info:
>>>>>>>>>>>> Version:          1.2p1
>>>>>>>>>>>> Compiled with:    Intel version 11.0.074
>>>>>>>>>>>> Configured with:  --prefix=/nasa/mvapich2/1.2p1/intel
>>>>>>>>>>>>                   --enable-f77 --enable-f90
>>>>>>>>>>>>                   --enable-cxx --enable-mpe --enable-romio
>>>>>>>>>>>>                   --enable-threads=multiple
>>>>>>>>>>>>                   --with-rdma=gen2
>>>>>>>>>>>>
>>>>>>>>>>>>                   CFLAGS = -fPIC
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
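Part of the confusion in this thread was whether MV2_USE_LAZY_MEM_UNREGISTER (or VIADEV_USE_DREG_CACHE) actually reached the pvserver process: `export` plus mpiexec in one attempt, VAR=value before the executable on the mpirun_rsh command line in another, and the remark that mpirun_rsh "doesn't pass some of the environment vars through". A direct check is to read the variable from inside each process. The helpers below are a hypothetical sketch in plain C (env_flag and report_env_flag are illustrative names, not part of MVAPICH2 or ParaView) using only getenv, strtol, and getpid.

```c
#define _POSIX_C_SOURCE 200809L /* POSIX declarations (getpid, and setenv for callers) on strict -std= builds */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: how does the environment flag NAME look from
   inside this process?  Returns 0 if NAME is unset, -1 if it is set but
   is not a plain base-10 integer, and 1 if it parsed (value in *value). */
int env_flag(const char *name, long *value)
{
    const char *v = getenv(name);
    if (v == NULL)
        return 0;
    char *end;
    errno = 0;
    long parsed = strtol(v, &end, 10);
    if (end == v || *end != '\0' || errno != 0)
        return -1;
    *value = parsed;
    return 1;
}

/* Print one line per process so the output of several ranks can be compared. */
void report_env_flag(const char *name)
{
    const char *v = getenv(name);
    printf("pid %ld: %s=%s\n", (long)getpid(), name, v ? v : "(unset)");
}
```

In a small test program, calling report_env_flag("MV2_USE_LAZY_MEM_UNREGISTER") as the first statement of main, before MPI_Init (the 1.4 crash happens inside MPI_Init), and launching it once with each method would show whether the setting arrives in every process.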


