[mvapich-discuss] mvapich2_munmap

burlen burlen.loring at gmail.com
Thu Dec 3 23:40:50 EST 2009


Hi Krishna,

It's a perplexing bug. The only explanation I can come up with is that 
it's due to the wrong mix of libraries, since no one else has complained 
about this... Some info is below; if it's not detailed enough I can put 
you in touch with a sys admin.
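
To make the "wrong mix of libraries" guess concrete: the backtraces below
bounce between mvapich2_munmap in libvtkPVServerCommon.so and munmap in
libicet_mpi.so, which is the pattern you would get if two copies of an
munmap interposer end up resolving the "real" munmap to each other instead
of to libc. The following is only a rough sketch of that interposition
technique, not MVAPICH2's actual source; the commented reference to
find_and_free_dregs_inside() just marks where the library-internal cache
eviction would go. Build as a shared object, e.g. cc -shared -fPIC hook.c -ldl.

    /* Hypothetical sketch of an munmap() interposer; not MVAPICH2 code. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/mman.h>

    static int (*next_munmap)(void *, size_t);

    int munmap(void *addr, size_t length)
    {
        if (!next_munmap)
            next_munmap = (int (*)(void *, size_t))dlsym(RTLD_NEXT, "munmap");

        /* library-internal step would evict cached registrations that
         * overlap [addr, addr+length), e.g. find_and_free_dregs_inside() */

        /* Intended target is libc munmap.  If RTLD_NEXT instead resolves
         * to another interposed copy in a different shared object, the two
         * wrappers call each other until the stack overflows. */
        return next_munmap(addr, length);
    }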

Intel icpc/icc 10.1 20081024.

SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2

Linux pfe3 2.6.16.60-0.42.5.03schamp-nasa #1 SMP Tue Nov 10 20:46:20 UTC 
2009 x86_64 x86_64 x86_64 GNU/Linux

Pleiades System Facts
Manufacturer - SGI
System Architecture
    * 110 Compute Cabinets (64 nodes each; 7,040 nodes total)
    * 673.4 Tflop/s peak cluster
    * 544.3 Tflop/s LINPACK rating
    * Total cores: 56,320
    * Total memory: 74.7TB
    * Nodes
          o 5,888 nodes
                + 2 quad-core processors per node
                + Xeon E5472 (Harpertown) processors
                + Processor speed - 3GHz
                + Cache - 6MB per pair of cores
                + Memory Type - DDR2 FB-DIMMs
                + 1GB per core, 8GB per node
          o 1,152 nodes
                + 2 quad-core processors per node
                + Xeon X5570 (Nehalem) processors
                + Processor speed - 2.93GHz
                + Cache - 4MB per pair of cores
                + Memory Type - DDR3 FB-DIMMs
                + 3GB per core, 24GB per node
Subsystems
    * 8 front-end nodes
    * 1 PBS server
Interconnects
    * Internode - InfiniBand, 7,040 compute nodes in an 11D hypercube
    * Two independent InfiniBand fabrics
    * 24 miles of DDR, QDR, and hybrid cabling
    * Gigabit Ethernet management network
Storage
    * Nexis 9000 home filesystem
    * 4 DDN 9900 RAIDs - 2.8 PB total
    * 6 Lustre cluster-wide filesystems, each containing:
          o 8 Object Storage Servers (OSS)
          o 1 Metadata server (MDS)
Operating Environment
    * Operating system - SUSE Linux
    * Job Scheduler - PBS
    * Compilers - C, Intel Fortran, SGI MPI
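
For reference on the feature being toggled in the thread below
(--disable-registration-cache at configure time, MV2_USE_LAZY_MEM_UNREGISTER=0
at run time): lazy unregistration keeps RDMA memory registrations cached after
a transfer completes so that a later transfer from the same buffer avoids the
cost of re-registering, and the library intercepts free()/munmap() to evict
entries whose memory is going away. The sketch below only illustrates that
idea; register_region/deregister_region are placeholders for the real verbs
calls (ibv_reg_mr/ibv_dereg_mr), and none of it is MVAPICH2's actual code.

    #include <stddef.h>
    #include <stdlib.h>

    struct dreg_entry {
        void *addr;
        size_t len;
        struct dreg_entry *next;
    };

    static struct dreg_entry *cache;   /* cached (still pinned) registrations */

    static void register_region(void *a, size_t n)   { (void)a; (void)n; /* ibv_reg_mr(...)   */ }
    static void deregister_region(void *a, size_t n) { (void)a; (void)n; /* ibv_dereg_mr(...) */ }

    /* Called on every transfer: reuse a cached registration if one covers buf. */
    void dreg_register(void *buf, size_t len)
    {
        for (struct dreg_entry *e = cache; e; e = e->next)
            if ((char *)buf >= (char *)e->addr &&
                (char *)buf + len <= (char *)e->addr + e->len)
                return;                     /* cache hit: no new pin-down */

        struct dreg_entry *e = malloc(sizeof *e);
        e->addr = buf;
        e->len  = len;
        e->next = cache;
        cache   = e;
        register_region(buf, len);          /* cache miss: pin the pages */
    }

    /* Called from the free()/munmap() hooks: drop registrations inside
     * [addr, addr + len) before the memory is returned to the OS. */
    void dreg_evict(void *addr, size_t len)
    {
        struct dreg_entry **pp = &cache;
        while (*pp) {
            struct dreg_entry *e = *pp;
            if ((char *)e->addr >= (char *)addr &&
                (char *)e->addr + e->len <= (char *)addr + len) {
                deregister_region(e->addr, e->len);
                *pp = e->next;
                free(e);
            } else {
                pp = &e->next;
            }
        }
    }

Disabling the cache makes every transfer register and deregister its buffer
immediately, which is slower but removes any dependence on the free()/munmap()
hooks.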

Krishna Chaitanya Kandalla wrote:
> Burlen,
>         Sorry to hear that the problem persists even with 
> mvapich2-1.4. Can you please re-configure and re-build the library 
> with the config-time flag --disable-registration-cache? This will 
> turn the feature off completely and you will be using the default 
> memory-related functions.
>         It's very surprising that your application is failing inside 
> MPI_Init itself. We have tested the release version with Intel 
> compilers, but we have not seen such an issue before. Can you also give 
> us some more information about the compiler version, operating system, 
> and anything related to your hardware?
> Thanks,
> Krishna
>
> burlen wrote:
>> Hi Krishna,
>>
>> I built mvapich2-1.4 today. Bad news: I got the same problem.
>>
>> With mvapich2-1.4 the program crashes right away with a segfault, and 
>> a very similar stack to the mvapich2-1.2p1 build (see below). Just to 
>> be sure to mention it, an Intel compiler was used for both builds. The 
>> stack shows that a call to free() initiated the issue. Any ideas?
>>
>>    Program received signal SIGSEGV, Segmentation fault.
>>    [Switching to Thread 46912874878096 (LWP 28347)]
>>    0x00002aaaaaddffcf in find_and_free_dregs_inside ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    (gdb) where
>>    #0  0x00002aaaaaddffcf in find_and_free_dregs_inside ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #1  0x00002aaaaadcd73b in mvapich2_mem_unhook ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #2  0x00002aaaaadcd77a in mvapich2_munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #3  0x00002aaaaf3cc37c in munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>    #4  0x00002aaaaadcd78f in mvapich2_munmap ()
>>
>>    ... repeated mvapich2_munmap, munmap sequence ...
>>
>>    #16567 0x00002aaaaf3cc37c in munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>    #16568 0x00002aaaaadcd78f in mvapich2_munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16569 0x00002aaaaf3cc37c in munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>    #16570 0x00002aaaaadcd78f in mvapich2_munmap ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16571 0x00002aaaaadc7ad5 in free ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16572 0x00002aaaaadd686a in MPIDI_CH3I_SMP_init ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16573 0x00002aaaaae49d24 in MPIDI_CH3_Init ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16574 0x00002aaaaae0b3fd in MPID_Init ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16575 0x00002aaaaae33d40 in MPIR_Init_thread ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>    #16576 0x00002aaaac6118ff in PMPI_Init ()
>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerManager.so
>>    #16577 0x00002aaaab5230a8 in vtkPVMain::Initialize (argc=0x7fffffffdb00, argv=0x7fffffffdab0)
>>       at /u/burlen/ParaView/ParaView3-3.7/Servers/Filters/vtkPVMain.cxx:107
>>    #16578 0x00000000004027bd in main (argc=3, argv=0x7fffffffdbf8)
>>       at /u/burlen/ParaView/ParaView3-3.7/Servers/Executables/pvserver.cxx:30
>>
>>
>>
>> Krishna Chaitanya Kandalla wrote:
>>> I am guessing that as long as you use the right InfiniBand-related 
>>> paths, everything should be fine. You can build mvapich2-1.4rc1 
>>> locally instead, and for that you won't need any sudo permissions.
>>>
>>> Krishna
>>>
>>> burlen wrote:
>>>> Right, I did say that; sorry for the confusion. When you said that, I 
>>>> wondered/hoped you might have seen something else that suggested 
>>>> the wrong library was linked in. I am all for upgrading to the 
>>>> latest, but I'm not a sys admin on this system and I don't know the 
>>>> details of the hardware. So if I build the new release with the 
>>>> same configure options that were used for the current build, will the 
>>>> InfiniBand stuff just work, or do I have to have access to drivers, 
>>>> etc.? I have never built MVAPICH before :)
>>>>
>>>> Krishna Chaitanya Kandalla wrote:
>>>>> Burlen,
>>>>> In your first mail, you had mentioned:
>>>>> > I have this strange situation when running paraview on a 
>>>>> particular build/install/revision of mvapich.
>>>>>
>>>>> So I concluded that you were using MVAPICH and not MVAPICH2. 
>>>>> But it's still not very clear why you are seeing a seg-fault 
>>>>> inside the function find_and_free_dregs() with this flag on. I 
>>>>> can think of a few options for moving ahead. You can try out the 
>>>>> 1.4 version of MVAPICH2, which we released a few weeks ago; 1.2p1 
>>>>> is quite old. If you get the same failure even with 1.4, would it 
>>>>> be possible for you to point us to where this application can be 
>>>>> found so that we can reproduce it on our cluster?
>>>>>
>>>>> Thanks,
>>>>> Krishna
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> burlen wrote:
>>>>>> I get the same problem (as initially reported) using 
>>>>>> VIADEV_USE_DREG_CACHE, but the library is definitely mvapich2, 
>>>>>> not mvapich.
>>>>>>
>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>> Burlen,
>>>>>>>          I just noticed that you are using MVAPICH and not 
>>>>>>> MVAPICH2. The equivalent flag on MVAPICH is 
>>>>>>> VIADEV_USE_DREG_CACHE. So, please set this flag to 0 instead of 
>>>>>>> the MV2_* flag. I am sorry for the confusion.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Krishna
>>>>>>>
>>>>>>> burlen wrote:
>>>>>>>> OK, I didn't use mpirun_rsh before because it doesn't pass 
>>>>>>>> some of the environment vars through. So with the mpirun_rsh 
>>>>>>>> method, without the MV2_USE_LAZY_MEM_UNREGISTER flag I get the 
>>>>>>>> same result as before, but with it set to 0 I now get a segfault:
>>>>>>>>
>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>> [Switching to Thread 46912793699472 (LWP 24718)]
>>>>>>>> 0x00002aaaaadae366 in find_and_free_dregs_inside ()
>>>>>>>>    from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>> (gdb) where
>>>>>>>> #0  0x00002aaaaadae366 in find_and_free_dregs_inside ()
>>>>>>>>    from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>> Cannot access memory at address 0x7fffedb06ff0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>> Burlen,
>>>>>>>>>           In MVAPICH2, we use mpirun_rsh for job launch.
>>>>>>>>>           So, for the default configuration, you would be 
>>>>>>>>> doing something like:
>>>>>>>>>
>>>>>>>>> mpirun_rsh -np 1 pvserver --server-port=50001 
>>>>>>>>> --use-offscreen-rendering
>>>>>>>>>
>>>>>>>>>           But to turn this memory optimization feature off, 
>>>>>>>>> you can do:
>>>>>>>>> mpirun_rsh -np 1 MV2_USE_LAZY_MEM_UNREGISTER=0 pvserver 
>>>>>>>>> --server-port=50001 --use-offscreen-rendering
>>>>>>>>>
>>>>>>>>>           Please let us know if there is any difference in 
>>>>>>>>> behavior between these two cases.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Krishna
>>>>>>>>>
>>>>>>>>> burlen wrote:
>>>>>>>>>> Maybe it was a coincidence that it seemed to die faster...
>>>>>>>>>>
>>>>>>>>>> r50i1n14:~$export MV2_USE_LAZY_MEM_UNREGISTER=0
>>>>>>>>>> r50i1n14:~$mpiexec -np 1 pvserver --server-port=50001 
>>>>>>>>>> --use-offscreen-rendering
>>>>>>>>>>
>>>>>>>>>> is that right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>>>> Burlen,
>>>>>>>>>>>          That's very strange. With this flag set to 0, one of 
>>>>>>>>>>> our memory optimizations is turned off and our memory 
>>>>>>>>>>> footprint should actually get better. Can you also let us 
>>>>>>>>>>> know how you are running the job? This flag should appear 
>>>>>>>>>>> before the name of the executable that you are trying to run.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Krishna
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> burlen wrote:
>>>>>>>>>>>> Hi Krishna, I tried it, but it didn't seem to help. Now the 
>>>>>>>>>>>> available RAM was exhausted very quickly, way faster than 
>>>>>>>>>>>> before. The node quickly became unresponsive, gdb never 
>>>>>>>>>>>> finished starting, and the job was killed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>>>>>> Burlen,
>>>>>>>>>>>>>          Can you run your application with the run-time 
>>>>>>>>>>>>> flag MV2_USE_LAZY_MEM_UNREGISTER=0? This might lead to 
>>>>>>>>>>>>> slightly poorer performance, but it can help us narrow down 
>>>>>>>>>>>>> the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Krishna
>>>>>>>>>>>>>
>>>>>>>>>>>>> burlen wrote:
>>>>>>>>>>>>>> I have this strange situation when running paraview on a 
>>>>>>>>>>>>>> particular build/install/revision of mvapich. Shortly 
>>>>>>>>>>>>>> after paraview starts up it hangs, and watching in top I 
>>>>>>>>>>>>>> see its memory grow until it's killed for using too much. 
>>>>>>>>>>>>>> Attaching a debugger, I see what looks like infinite 
>>>>>>>>>>>>>> recursion. It has only happened to me with this particular 
>>>>>>>>>>>>>> build of mvapich, which happens to be the only one on this 
>>>>>>>>>>>>>> system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just curious if anyone has seen anything like this before?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    (gdb) where
>>>>>>>>>>>>>>    #0  0x00002aaaaadbb25b in avlfindex ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #1  0x00002aaaaadae427 in find_and_free_dregs_inside ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #2  0x00002aaaaad9d1f9 in mvapich2_mem_unhook ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #3  0x00002aaaaad9d244 in mvapich2_munmap ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #4  0x00002aaaadfa88c6 in munmap ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>>>    #5  0x00002aaaaad9d259 in mvapich2_munmap ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    #73059 0x00002aaaadfa88c6 in munmap ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>>>    #73060 0x00002aaaaad9d259 in mvapich2_munmap ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #73061 0x00002aaaaad979a1 in free ()
>>>>>>>>>>>>>>       from /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>    #73062 0x00002aaaae441e7e in icetResizeBuffer (size=91607685)
>>>>>>>>>>>>>>       at /u/burlen/ParaView/ParaView3-3.7/Utilities/IceT/src/ice-t/context.c:129
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mvapich info:
>>>>>>>>>>>>>> Version:          1.2p1.
>>>>>>>>>>>>>> Compiled with:    Intel version 11.0.074
>>>>>>>>>>>>>> Configured with:  --prefix=/nasa/mvapich2/1.2p1/intel 
>>>>>>>>>>>>>> --enable-f77 --enable-f90
>>>>>>>>>>>>>>                  --enable-cxx --enable-mpe --enable-romio 
>>>>>>>>>>>>>> --enable-threads=multiple
>>>>>>>>>>>>>>                  --with-rdma=gen2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                  CFLAGS = -fPIC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>


