[mvapich-discuss] mvapich2_munmap
burlen
burlen.loring at gmail.com
Thu Dec 3 23:40:50 EST 2009
Hi Krishna,
It's a perplexing bug; the only thing I can come up with is that it's due
to the wrong mix of libraries, because no one else has complained about
this (a sketch of the failure mode I suspect follows the system details
below). Some info below; if it's not detailed enough I can put you in
touch with a sys-admin.
Intel icpc/icc 10.1 20081024.
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
Linux pfe3 2.6.16.60-0.42.5.03schamp-nasa #1 SMP Tue Nov 10 20:46:20 UTC
2009 x86_64 x86_64 x86_64 GNU/Linux
Pleiades System Facts
Manufacturer - SGI
System Architecture
* 110 Compute Cabinets (64 nodes each; 7,040 nodes total)
* 673.4 Tflop/s peak cluster
* 544.3 Tflop/s LINPACK rating
* Total cores: 56,320
* Total memory: 74.7TB
* Nodes
o 5,888 nodes
+ 2 quad-core processors per node
+ Xeon E5472 (Harpertown) processors
+ Processor speed - 3GHz
+ Cache - 6MB per pair of cores
+ Memory Type - DDR2 FB-DIMMs
+ 1GB per core, 8GB per node
o 1,152 nodes
+ 2 quad-core processors per node
+ Xeon X5570 (Nehalem) processors
+ Processor speed - 2.93GHz
+ Cache - 4MB per pair of cores
+ Memory Type - DDR3 FB-DIMMs
+ 3GB per core, 24GB per node
Subsystems
* 8 front-end nodes
* 1 PBS server
Interconnects
* Internode - InfiniBand, 7,040 compute nodes in an 11D hypercube
* Two independent InfiniBand fabrics
* 24 miles of DDR, QDR, and hybrid cabling
* Gigabit Ethernet management network
Storage
* Nexis 9000 home filesystem
* 4 DDN 9900 RAIDs - 2.8 PB total
* 6 Lustre cluster-wide filesystems, each containing:
o 8 Object Storage Servers (OSS)
o 1 Metadata server (MDS)
Operating Environment
* Operating system - SUSE Linux
* Job Scheduler - PBS
* Compilers - C, Intel Fortran, SGI MPI
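To illustrate my guess about the failure mode: in the backtraces below,
mvapich2_munmap (resolved from libvtkPVServerCommon.so) and a munmap symbol
resolved from libicet_mpi.so keep calling each other until the stack is
exhausted. Here is a minimal, hypothetical C sketch of that kind of mutual
recursion between two munmap wrappers; the names are made up, and this is
not MVAPICH2 or IceT source, it just shows the pattern:

/* Hypothetical sketch: two libraries each wrap munmap, and each believes
 * it forwards to the real libc munmap. If symbol resolution makes each
 * wrapper call the other instead, any free()/munmap recurses until the
 * stack overflows (SIGSEGV), like the tens of thousands of frames below. */
#include <stddef.h>

static int wrapper_b(void *addr, size_t len);

/* stands in for an MPI library's munmap hook that flushes its
 * registration cache before the pages go away */
static int wrapper_a(void *addr, size_t len)
{
    /* ... registration-cache bookkeeping would go here ... */
    return wrapper_b(addr, len);   /* intended target: libc munmap;
                                      actually the other wrapper */
}

/* stands in for a second library's munmap wrapper that also forwards */
static int wrapper_b(void *addr, size_t len)
{
    return wrapper_a(addr, len);   /* same mistake the other way around:
                                      unbounded mutual recursion */
}

int main(void)
{
    char buf[4096];
    return wrapper_a(buf, sizeof buf);  /* overflows the stack when built
                                           without optimization */
}

If that is what is going on, then linking against a single, consistent MPI
stack, or turning the registration-cache hooks off (as Krishna suggests
below), should make it disappear.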
Krishna Chaitanya Kandalla wrote:
> Burlen,
> Sorry to know that the problem persists even with
> mvapich2-1.4. Can you please re-configure and re-build the library
> with the config-time flag --disable-registration-cache. This will
> turn this feature off completely and you will be using the default
> memory related functions.
> It's very surprising that your application is failing inside
> MPI_Init itself. We have tested the release version with Intel
> compilers, but we have not seen such an issue before. Can you also give
> us some more information about the compiler version, operating system
> and anything related to your hardware?
> Thanks,
> Krishna
>
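For reference, a local rebuild along the lines Krishna suggests would look
roughly like this; the install prefix is just an example, and the remaining
options are the ones from the configure line quoted at the bottom of this
thread, plus CC/CXX pointed at the Intel compilers:

  cd mvapich2-1.4
  ./configure --prefix=$HOME/apps/mvapich2-1.4-noregcache \
      --disable-registration-cache \
      --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio \
      --enable-threads=multiple --with-rdma=gen2 \
      CC=icc CXX=icpc
  make && make install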
> burlen wrote:
>> Hi Krishna,
>>
>> I built mvapich2-1.4 today. Bad news: I got the same problem.
>>
>> With mvapich2-1.4 the program crashes right off with a segfault, and
>> a very similar stack to the mvapich2-1.2p1 build (see below). In both
>> builds an Intel compiler was used (just to be sure to mention). The
>> stack showed that a call to free() initiated the issue. Any ideas?
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 46912874878096 (LWP 28347)]
>> 0x00002aaaaaddffcf in find_and_free_dregs_inside () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>
>> (gdb) where
>> #0  0x00002aaaaaddffcf in find_and_free_dregs_inside () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #1  0x00002aaaaadcd73b in mvapich2_mem_unhook () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #2  0x00002aaaaadcd77a in mvapich2_munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #3  0x00002aaaaf3cc37c in munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>> #4  0x00002aaaaadcd78f in mvapich2_munmap ()
>>
>> ... repeated mvapich2_munmap, munmap sequence
>>
>> #16567 0x00002aaaaf3cc37c in munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>> #16568 0x00002aaaaadcd78f in mvapich2_munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16569 0x00002aaaaf3cc37c in munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>> #16570 0x00002aaaaadcd78f in mvapich2_munmap () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16571 0x00002aaaaadc7ad5 in free () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16572 0x00002aaaaadd686a in MPIDI_CH3I_SMP_init () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16573 0x00002aaaaae49d24 in MPIDI_CH3_Init () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16574 0x00002aaaaae0b3fd in MPID_Init () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16575 0x00002aaaaae33d40 in MPIR_Init_thread () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>> #16576 0x00002aaaac6118ff in PMPI_Init () from
>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerManager.so
>> #16577 0x00002aaaab5230a8 in vtkPVMain::Initialize (argc=0x7fffffffdb00, argv=0x7fffffffdab0)
>>     at /u/burlen/ParaView/ParaView3-3.7/Servers/Filters/vtkPVMain.cxx:107
>> #16578 0x00000000004027bd in main (argc=3, argv=0x7fffffffdbf8)
>>     at /u/burlen/ParaView/ParaView3-3.7/Servers/Executables/pvserver.cxx:30
>>
>>
>>
>> Krishna Chaitanya Kandalla wrote:
>>> I am guessing that as long as you use the right InfiniBand-related
>>> paths, everything should be fine. You can build mvapich2-1.4rc1
>>> locally instead, and for that you won't need any sudo permissions.
>>>
>>> Krishna
>>>
>>> burlen wrote:
>>>> Right, I did say that; sorry for the confusion. When you said that I
>>>> wondered/hoped you might have seen something else that suggested
>>>> the wrong library was linked in. I am all for upgrading to the
>>>> latest, but I'm not a sys admin on this system and I don't know the
>>>> details of the hardware. So if I build the new release with the
>>>> same configure options that were used on the current build, will the
>>>> InfiniBand stuff just work, or do I have to have access to drivers,
>>>> etc.? I've never built MVAPICH before :)
>>>>
>>>> Krishna Chaitanya Kandalla wrote:
>>>>> Burlen,
>>>>> In your first mail, you had mentioned:
>>>>> > I have this strange situation when running paraview on a
>>>>> particular build/install/revision of mvapich.
>>>>>
>>>>> So, I concluded that you were using mvapich and not mvapich2.
>>>>> But it's still not very clear why you are seeing a segfault
>>>>> inside the function find_and_free_dregs(), with this flag on. I
>>>>> can think of a few options to move ahead. You can try out the 1.4
>>>>> version of mvapich2 that we released a few weeks ago. 1.2p1 is
>>>>> quite old. If you get the same failure even with 1.4, would it be
>>>>> possible for you to point us to where this application can be
>>>>> found so that we can reproduce it on our cluster?
>>>>>
>>>>> Thanks,
>>>>> Krishna
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> burlen wrote:
>>>>>> I get the same problem (as initially reported) using
>>>>>> VIADEV_USE_DREG_CACHE, but for sure it's mvapich2.
>>>>>>
>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>> Burlen,
>>>>>>> I just noticed that you are using MVAPICH and not
>>>>>>> MVAPICH2. The equivalent flag on MVAPICH is
>>>>>>> VIADEV_USE_DREG_CACHE. So, please set this flag to 0 instead of
>>>>>>> the MV2_* flag. I am sorry for the confusion.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Krishna
>>>>>>>
>>>>>>> burlen wrote:
>>>>>>>> OK, I didn't use mpirun_rsh before because it doesn't pass
>>>>>>>> some of the environment vars through. So with the mpirun_rsh
>>>>>>>> method, without the MV2_USE_LAZY_MEM_UNREGISTER flag, I get the
>>>>>>>> same result as before, but with it set to 0 I now get a segfault:
>>>>>>>>
>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>> [Switching to Thread 46912793699472 (LWP 24718)]
>>>>>>>> 0x00002aaaaadae366 in find_and_free_dregs_inside () from
>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>
>>>>>>>> (gdb) where
>>>>>>>> #0 0x00002aaaaadae366 in find_and_free_dregs_inside () from
>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>
>>>>>>>> Cannot access memory at address 0x7fffedb06ff0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>> Burlen,
>>>>>>>>> In MVAPICH2, we use the mpirun_rsh feature for
>>>>>>>>> job-launch.
>>>>>>>>> So, for the default configuration, you would be
>>>>>>>>> doing something like:
>>>>>>>>>
>>>>>>>>> mpirun_rsh -np 1 pvserver --server-port=50001
>>>>>>>>> --use-offscreen-rendering
>>>>>>>>>
>>>>>>>>> But, to turn off this memory optimization feature,
>>>>>>>>> you can do:
>>>>>>>>> mpirun_rsh -np 1 MV2_USE_LAZY_MEM_UNREGISTER=0 pvserver
>>>>>>>>> --server-port=50001 --use-offscreen-rendering
>>>>>>>>>
>>>>>>>>> Please let us know if there is any difference in the
>>>>>>>>> behavior across these two cases.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Krishna
>>>>>>>>>
>>>>>>>>> burlen wrote:
>>>>>>>>>> Maybe it was a coincidence that it seemed to die faster...
>>>>>>>>>>
>>>>>>>>>> r50i1n14:~$export MV2_USE_LAZY_MEM_UNREGISTER=0
>>>>>>>>>> r50i1n14:~$mpiexec -np 1 pvserver --server-port=50001
>>>>>>>>>> --use-offscreen-rendering
>>>>>>>>>>
>>>>>>>>>> is that right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>>>> Burlen,
>>>>>>>>>>> That's very strange. With this flag set to 0, one of
>>>>>>>>>>> our memory optimizations is turned off and our memory
>>>>>>>>>>> footprint should actually get smaller. Can you also let us
>>>>>>>>>>> know how you are running the job? This flag should appear
>>>>>>>>>>> before the name of the executable that you are trying to run.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Krishna
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> burlen wrote:
>>>>>>>>>>>> Hi Krishna, I tried it, but it didn't seem to help. Now the
>>>>>>>>>>>> available RAM was exhausted very quickly, way faster than
>>>>>>>>>>>> before. The node quickly became unresponsive, gdb never
>>>>>>>>>>>> finished starting, and the job was killed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Krishna Chaitanya Kandalla wrote:
>>>>>>>>>>>>> Burlen,
>>>>>>>>>>>>> Can you run your application with the run-time
>>>>>>>>>>>>> flag MV2_USE_LAZY_MEM_UNREGISTER=0? This might lead to
>>>>>>>>>>>>> slightly poorer performance, but can help us narrow down
>>>>>>>>>>>>> the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Krishna
>>>>>>>>>>>>>
>>>>>>>>>>>>> burlen wrote:
>>>>>>>>>>>>>> I have this strange situation when running paraview on a
>>>>>>>>>>>>>> particular build/install/revision of mvapich. Shortly
>>>>>>>>>>>>>> after paraview starts up it hangs, and watching in top I
>>>>>>>>>>>>>> see memory grow before it's killed for using too much.
>>>>>>>>>>>>>> Attaching a debugger I see what looks like an infinite
>>>>>>>>>>>>>> recursion. It's only happened to me using this particular
>>>>>>>>>>>>>> build of mvapich which happens to be the only one on this
>>>>>>>>>>>>>> system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just curious if anyone has seen anything like this before?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>> #0  0x00002aaaaadbb25b in avlfindex () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #1  0x00002aaaaadae427 in find_and_free_dregs_inside () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #2  0x00002aaaaad9d1f9 in mvapich2_mem_unhook () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #3  0x00002aaaaad9d244 in mvapich2_munmap () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #4  0x00002aaaadfa88c6 in munmap () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>>> #5  0x00002aaaaad9d259 in mvapich2_munmap () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #73059 0x00002aaaadfa88c6 in munmap () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libicet_mpi.so
>>>>>>>>>>>>>> #73060 0x00002aaaaad9d259 in mvapich2_munmap () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #73061 0x00002aaaaad979a1 in free () from
>>>>>>>>>>>>>> /u/burlen/apps/PV3-3.7-D-IV/lib/paraview-3.7/libvtkPVServerCommon.so
>>>>>>>>>>>>>> #73062 0x00002aaaae441e7e in icetResizeBuffer (size=91607685) at
>>>>>>>>>>>>>> /u/burlen/ParaView/ParaView3-3.7/Utilities/IceT/src/ice-t/context.c:129
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mvapich info:
>>>>>>>>>>>>>> Version: 1.2p1.
>>>>>>>>>>>>>> Compiled with: Intel version 11.0.074
>>>>>>>>>>>>>> Configured with: --prefix=/nasa/mvapich2/1.2p1/intel
>>>>>>>>>>>>>> --enable-f77 --enable-f90
>>>>>>>>>>>>>> --enable-cxx --enable-mpe --enable-romio
>>>>>>>>>>>>>> --enable-threads=multiple
>>>>>>>>>>>>>> --with-rdma=gen2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CFLAGS = -fPIC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> mvapich-discuss mailing list
>>>>>>>>>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>>>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>