[mvapich-discuss] Segfault on HDF5 1.10.5 "make check" with MVAPICH 2.3.1

Ryan Novosielski novosirj at rutgers.edu
Thu Apr 18 17:47:52 EDT 2019


UPDATE: 

I’m now seeing this same thing with HDF5 1.10.5 and MVAPICH 2.3.1 with the Intel compilers — both 18.0.5 and 19.0.3 — as well. I’ve not yet gone back to try any of the GCC versions.

I’m not even entirely sure that this is MVAPICH2’s fault, but it notably /has not/ occurred with any combination of compilers, OpenMPI, and HDF5 1.10.5 (1.10.4 apparently was not tested with OpenMPI — that’s new for 1.10.5).

If anyone has any idea of how I could produce some more information, I’d be happy to do it. Even if you know where this strange “Alarm clock” message comes from, that would be a help. It appears to always happen at the 20 minute mark, but is /not/ the scheduler — note an hour is specified for these jobs:

RUNPARALLEL="srun --mpi=pmi2 --mem=12G -p main -t 1:00:00 -n6 -N1"
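
One way to gather more detail would be to rerun just the failing test by hand, with the same launcher and MVAPICH2's debugging variables switched on. The following is only a sketch under assumptions: the test binary path and the DRY_RUN switch are hypothetical, and MV2_DEBUG_SHOW_BACKTRACE and MV2_DEBUG_CORESIZE are the runtime parameters described in the MVAPICH2 user guide, which should be checked against the installed build.

```shell
#!/bin/sh
# Sketch: rerun one parallel test by hand with extra diagnostics.
# RUNPARALLEL is the launcher line from this message; TEST_BIN and
# DRY_RUN are hypothetical names used only for illustration.
RUNPARALLEL="srun --mpi=pmi2 --mem=12G -p main -t 1:00:00 -n6 -N1"
TEST_BIN="./t_filters_parallel"   # assumed to live in the testpar build directory

# MVAPICH2 debugging knobs (assumption: names as given in the user guide).
export MV2_DEBUG_SHOW_BACKTRACE=1
export MV2_DEBUG_CORESIZE=unlimited

cmd="$RUNPARALLEL $TEST_BIN"
if [ "${DRY_RUN:-1}" = 1 ]; then
    # Default: only print the command, so the sketch is safe to run anywhere.
    echo "would run: $cmd"
else
    # Intentionally unquoted so the launcher options are split into words.
    $cmd
fi
```

Run with DRY_RUN=0 from the testpar build directory to actually launch the test; a backtrace or core file from rank 0 would say far more than the bare "signal 11" line.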

And again, the error is always something like this:

[slepner012.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: slepner012: task 0: Segmentation fault
srun: error: slepner012: tasks 1-5: Alarm clock
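
On the "Alarm clock" text itself: on Linux that is the C library's description of SIGALRM (signal 14), the signal a process receives when a timer set with alarm() expires, and srun prints it just as it prints "Segmentation fault" for signal 11. That would be consistent with the surviving ranks being killed by a 20-minute timer armed inside the test programs rather than by the scheduler, though that is an assumption that still needs checking against the HDF5 test sources. A minimal sketch, assuming a POSIX shell on Linux, of how a SIGALRM death shows up as an exit status:

```shell
#!/bin/sh
# Illustration only: make a child shell kill itself with SIGALRM,
# then decode the exit status that the parent shell sees.
sh -c 'kill -ALRM $$'
status=$?                 # 128 + signal number when a signal killed the child
sig=$((status - 128))
name=$(kill -l "$sig")    # map the signal number back to its name
echo "exit status $status -> signal $sig (SIG$name)"
```

Some shells also print the signal's description ("Alarm clock") themselves when the child dies this way.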

Thanks for any help you can offer.

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Mar 22, 2019, at 2:16 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> 
> I don’t think I ultimately saw a response to this. Any feedback?
> 
> I suppose I shall try with MVAPICH2 2.3.1 in the meantime.
> 
>> On Feb 21, 2019, at 8:20 PM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
>> 
>> Of course, sorry — I didn’t think to include that. 
>> 
>> I have no particular reason to believe that it is related to GPFS, I just mention it on the off chance that it matters. I can try on XFS as well, probably more easily than you can get GPFS to rule it out. 
>> 
>> For both, I used the method where I build outside the source directory (I forget the term for that). 
>> 
>> MVAPICH2:
>> 
>> #!/bin/sh
>> 
>> module purge
>> module load gcc/8.2
>> module list
>> 
>> ../mvapich2-2.3/configure --with-pmi=pmi2 --with-pm=slurm --prefix=/opt/sw/packages/gcc-8_2/mvapich2/2.3 && \
>>        make -j32
>> 
>> HDF5:
>> 
>> #!/bin/sh
>> 
>> module purge
>> module load gcc/8.2 mvapich2/2.3
>> module list
>> 
>> RUNPARALLEL="srun --mpi=pmi2 --mem=12G -p main -t 1:00:00 -n6 -N1" CC=mpicc F9X=mpifort CXX=mpicxx ../hdf5-1.10.4/configure --prefix=/opt/sw/packages/gcc-8_2/mvapich2-2_3/hdf5/1.10.4 --enable-fortran --enable-build-mode=production --enable-parallel \
>>        && make -j32 && make check
>> 
>> Thank you!
>> 
>> 
>> On Feb 21, 2019, at 19:55, Subramoni, Hari <subramoni.1 at osu.edu> wrote:
>> 
>>> Hi, Ryan.
>>> 
>>> Can you please let us know how you configured MVAPICH2 and how you built and ran HDF5 with the said version of MVAPICH2?
>>> 
>>> Unfortunately, we do not have GPFS locally. However, let me try to reproduce the problem locally.
>>> 
>>> Thx,
>>> Hari.
>>> 
>>> -----Original Message-----
>>> From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> On Behalf Of Ryan Novosielski
>>> Sent: Thursday, February 21, 2019 3:40 PM
>>> To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
>>> Subject: [mvapich-discuss] Segfault on HDF5 1.10.4 "make check" with MVAPICH 2.3 compiled by GCC 8.2
>>> 
>>> Hi there,
>>> 
>>> I’m only seeing this particular failure with GCC 8.2, MVAPICH 2.3 (that’s the only version I’ve tried on though), and HDF5 1.10.4. GCC 4.8 and 7.4 both allow the make check on HDF5 to pass properly. All of this is on CentOS 7.5, compiling on GPFS 4.2 storage (I’ve seen some screwy FS-dependent things lately, so I mention it).
>>> 
>>> The below is what happens. Is there any more data I can gather to help with this? It appears to hang for almost exactly 20 minutes each time, and then something kills it. A successful run is only 2-3 seconds long. Note that running a “sleep 1800” (30 minutes) does not do this. Related or not, the combination of OpenMPI 3.1.3 and GCC 4.8 (but not 7.4 or 8.2) does a similar thing, but on the t_mpi test, not t_filters_parallel, and without mentioning the signal 11 (that might just be a difference in how the two present errors; I don’t know):
>>> 
>>> I’m launching the tests via srun via make check with these options:
>>> 
>>> RUNPARALLEL = srun --mpi=pmi2 --mem=12G -p main -t 1:00:00 -n6 -N1
>>> 
>>> HDF5 make check when it gets to the sticking point:
>>> 
>>> Testing  t_filters_parallel
>>> ============================
>>> t_filters_parallel  Test Log
>>> ============================
>>> srun: job 84117363 queued and waiting for resources
>>> srun: job 84117363 has been allocated resources
>>> [slepner063.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
>>> srun: error: slepner063: task 0: Segmentation fault
>>> srun: error: slepner063: tasks 1-3: Alarm clock
>>> 0.01user 0.01system 20:01.44elapsed 0%CPU (0avgtext+0avgdata 5144maxresident)k
>>> 0inputs+0outputs (0major+1524minor)pagefaults 0swaps
>>> make[4]: *** [t_filters_parallel.chkexe_] Error 1
>>> make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
>>> make[3]: *** [build-check-p] Error 1
>>> make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
>>> make[2]: *** [test] Error 2
>>> make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
>>> make[1]: *** [check-am] Error 2
>>> make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
>>> make: *** [check-recursive] Error 1
>>> 
>>> 
>>> 
> 
> 



