[Mvapich-discuss] Error parsing CPU mapping string/Invalid error code (-1) (error ring index 127 invalid)
Korzennik, Sylvain
skorzennik at cfa.harvard.edu
Mon Sep 30 15:12:08 EDT 2024
Hi,
sorry for the late reply, I was busy w/ end of FY purchases.
Attached are 2 files:
cpu_arch.list - the CPU info for each compute node.
err.log - sorted result of an egrep 'setaff|compute' on the job log files;
each log file lists which nodes are in the MPI machine file. Failed cases
are listed below the successful ones.
Our compute node naming convention is compute-NN-MM, where NN refers to
the Dell model number, hence all nodes w/ the same NN have the same CPU.
The number of processors requested is on purpose more than the number
available on a single node, to make sure the job runs on more than one.
I do not see any pattern, maybe you do. Let me know if you need
more/different info.
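
For anyone who wants to look for a per-model pattern in such logs, here is a
minimal sketch of one way to tally jobs by the NN part of the node names. It
is not the procedure used to produce err.log. It assumes that each job log is
available as a list of text lines, that node names follow compute-NN-MM, and
that a job failed if any of its lines contains "setaff" (as in the
smpi_setaffinity error). The function name, the log names and the node names
in the sample are made up for illustration.

```python
import re
from collections import defaultdict

# Node names are assumed to follow the compute-NN-MM convention.
NODE_RE = re.compile(r"compute-(\d+)-(\d+)")


def tally_by_model(job_logs):
    """Count passed/failed jobs per node model.

    job_logs maps a job log name to the list of lines in that log.
    Returns {model: [ok_jobs, failed_jobs]}, where model is the NN string.
    A job is counted as failed if any line mentions "setaff" (assumption).
    """
    counts = defaultdict(lambda: [0, 0])
    for lines in job_logs.values():
        failed = any("setaff" in line for line in lines)
        # Collect each model once per job, however many nodes share it.
        models = set()
        for line in lines:
            for match in NODE_RE.finditer(line):
                models.add(match.group(1))
        for model in models:
            counts[model][1 if failed else 0] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real err.log.
    sample = {
        "job-a.log": ["compute-64-01", "compute-65-02"],
        "job-b.log": [
            "compute-64-03",
            "smpi_setaffinity(2791): Error parsing CPU mapping string",
        ],
    }
    for model, (ok, bad) in sorted(tally_by_model(sample).items()):
        print(f"model {model}: {ok} ok, {bad} failed")
```

A model whose failed count is consistently nonzero while others stay at zero
would point at a CPU-specific cause; an even spread would not.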
Cheers,
Sylvain
--
On Tue, Sep 17, 2024 at 10:05 AM Shineman, Nat <shineman.5 at osu.edu> wrote:
> Hi Sylvain,
>
> Typically, this is caused by a non-standard CPU situation on your node.
> Are all tests being run on the same node or is there a pattern on the nodes
> that see failure? Can you send us the info from lscpu on the failing run?
>
> Thanks,
> Nat
> ------------------------------
> *From:* Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf
> of Korzennik, Sylvain via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
> *Sent:* Sunday, September 8, 2024 13:17
> *To:* Panda, Dhabaleswar <panda at cse.ohio-state.edu>
> *Cc:* Announcement about MVAPICH2 (MPI over InfiniBand, RoCE, Omni-Path,
> iWARP and EFA) Libraries developed at NBCL/OSU <
> mvapich-discuss at lists.osu.edu>
> *Subject:* [Mvapich-discuss] Error parsing CPU mapping string/Invalid
> error code (-1) (error ring index 127 invalid)
>
> While testing mvapich-3.0 built with newest compilers (gcc 14.2.0, intel
> 2024.[12] and nvidia 24.[57]) I'm encountering the following error, when
> running a trivial set of tests (a hello world or a ring passing, in C or
> F90):
>
> Error parsing CPU mapping string
> Invalid error code (-1) (error ring index 127 invalid)
> INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in
> smpi_setaffinity:2791
> Abort(2141583) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(175)...........:
> MPID_Init(597)..................:
> MPIDI_MVP_mpi_init_hook(268)....:
> MPIDI_MVP_CH4_set_affinity(3746):
> smpi_setaffinity(2791)..........: Error parsing CPU mapping string
>
> This error crops up somewhat randomly: the same job+compiler combo will
> work most of the time, but not all the time.
> Any suggestions on how to track this down?
>
> Thx, cheers,
> Sylvain
> --
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu_arch.list
Type: application/octet-stream
Size: 5996 bytes
Desc: not available
URL: <http://lists.osu.edu/pipermail/mvapich-discuss/attachments/20240930/14c78152/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: err.log
Type: text/x-log
Size: 6327 bytes
Desc: not available
URL: <http://lists.osu.edu/pipermail/mvapich-discuss/attachments/20240930/14c78152/attachment-0002.bin>