[mvapich-discuss] Failures inside Torque batch jobs due to affinity and cgroups
Doug Johnson
djohnson at osc.edu
Sat Nov 12 11:04:01 EST 2016
Hi,
We have encountered processor affinity problems with MVAPICH2-2.2 when MPI
jobs are run from Torque batch jobs that request only 1 core per node.
There appears to be a bad interaction between Torque's cgroup support and
the MVAPICH2 processor affinity code; the error occurs whenever affinity is
enabled.  We set MV2_HCA_AWARE_PROCESS_MAPPING=0 because these are
single-HCA systems.  The error below is from a job that requests
'nodes=2:ppn=1'.  While this is a corner case, it has forced us to disable
processor affinity for all jobs.
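For reference, a quick way to confirm what the cgroup actually leaves the MPI
processes to bind to is to print the allowed CPU set from inside the batch
job.  The helper below (affinity_check.c; not part of our test case, just a
minimal sketch) uses sched_getaffinity for that, and under 'nodes=2:ppn=1'
it should report a single allowed CPU per node once the cpuset cgroup is in
effect:

/* affinity_check.c -- not part of the original report; a minimal sketch
 * that prints the CPUs the Torque cgroup allows this process to use,
 * i.e. what the MVAPICH2 affinity code has to work with at MPI_Init time. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    int cpu;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("allowed CPUs:");
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}

Compiled with 'gcc affinity_check.c -o affinity_check' and launched with
'mpiexec ./affinity_check' inside the job, it shows one allowed CPU per rank
when the cgroup restriction is applied.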
mpiexec ./simple
Warning! : Core id 11025 does not exist on this architecture!
CPU Affinity is undefined
Error parsing CPU mapping string
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3I_set_affinity:2391
[cli_0]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(514):
MPID_Init(370).......:
Warning! : Core id 11171 does not exist on this architecture!
CPU Affinity is undefined
Error parsing CPU mapping string
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3I_set_affinity:2391
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(514):
MPID_Init(370).......:
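The test program itself is not attached; since the abort happens inside
MPI_Init before any user code runs, any minimal MPI program should reproduce
it.  For completeness, a hypothetical simple.c along those lines:

/* simple.c -- assumed minimal reproducer; the actual test program is not
 * included in the report, but the failure occurs inside MPI_Init, so the
 * program contents should not matter. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);   /* aborts here with the errors shown above */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

For context, setting MV2_ENABLE_AFFINITY=0 is the usual way to turn off
processor affinity in MVAPICH2, which is presumably the workaround referred
to above when affinity is disabled for all jobs.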
The system is RHEL 7.2, and the output of 'hwloc-ls' from inside the
'nodes=2:ppn=1' batch job is included below.  Let me know if any other
details are needed.
Doug
hwloc-ls
Machine (64GB)
  NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (35MB) + L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  NUMANode L#1 (P#1)
  HostBridge L#0
    PCIBridge
      PCI 1000:0072
        Block L#0 "sda"
    PCIBridge
      PCI 15b3:1013
        Net L#1 "ib0"
        OpenFabrics L#2 "mlx5_0"
    PCIBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 102b:0534
              GPU L#3 "card0"
              GPU L#4 "controlD64"
    PCI 8086:8d02
  HostBridge L#7
    PCIBridge
      PCI 8086:10fb
        Net L#5 "eth0"
      PCI 8086:10fb
        Net L#6 "eth1"