One-time crash (so far) with TCO639-jane (pmixp_coll_ring.c:742: collective timeout)

Added by Jan Streffing almost 2 years ago

On Friday I created the required initial and remapping files for the highest-resolution FESOM2 ocean mesh (jane, 33 million surface nodes) coupling to a medium-high resolution OpenIFS (TCO639L137). During MPI init I got:

4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_reset_if_to: l40547 [37]: pmixp_coll_ring.c:742: 0x15553c0078a0: collective timeout seq=0
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_log: l40547 [37]: pmixp_coll.c:281: Dumping collective state
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:760: 0x15553c0078a0: COLL_FENCE_RING state seq=0
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:762: my peerid: 37:l40547
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:769: neighbor id: next 0:l10357, prev 36:l40546
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007918, #0, in-use=0
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007950, #1, in-use=0
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007988, #2, in-use=1
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:790:          seq=0 contribs: loc=1/prev=22/fwd=23
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:792:          neighbor contribs [38]:
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:825:                  done contrib: l[30377,30600,30604-30605,30611-30612,30618-30620,40406-40409,40513,40539-40546]
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:827:                  wait contrib: l[10357,10362,10365,10374-10376,10394-10395,30354,30356-30357,30360,30370-30372]
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:829:          status=PMIXP_COLL_RING_PROGRESS
4225: slurmstepd: error:  mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:833:          buf (offset/size): 261015/532347

See: /work/ab0246/a270092/runtime/awicm3-v3.1/jane_maybe_some_steps_1/log/jane_maybe_some_steps_1_awicm3_compute_20000101-20000101_1092005.log

This is rather peculiar, as the previous experiment, which was used to create the OASIS grids.nc, areas.nc and masks.nc files for the offline remapping tool, did not crash during MPI init, and nothing has changed in the setup since then. The only difference between the runs is that I link in the rmp_* files that I created offline.
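To make explicit what changed between the two runs: before launch, the offline-generated weight files are simply linked into the run directory so that OASIS finds rmp_* files at initialisation instead of computing the weights online. A minimal sketch of that step in Python, with purely illustrative paths (not the actual experiment layout):

import glob
import os

# Illustrative paths only; the real locations depend on the experiment setup.
weights_dir = "/path/to/offline_rmp_files"   # where the precomputed rmp_*.nc files were written
run_dir = "/path/to/experiment/work"         # work directory of the coupled run

# Link every precomputed remapping-weight file into the run directory so the
# coupler picks up the rmp_* files at startup and skips online weight generation.
for src in sorted(glob.glob(os.path.join(weights_dir, "rmp_*.nc"))):
    dst = os.path.join(run_dir, os.path.basename(src))
    if not os.path.islink(dst) and not os.path.exists(dst):
        os.symlink(src, dst)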

Indeed, I ran the model again this morning and no such crash reappeared. So far I have only seen this error once; in case it appears again, we have it documented already. The error message gets a handful of hits on Google. In particular, https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html documents sporadic errors with heterogeneous parallel jobs (which this one is, albeit using tasksets instead).
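For anyone hitting the same error: "tasksets" here means that the coupled executables run in a single MPI world within one srun step, mapped to rank ranges via srun --multi-prog, rather than being launched as a Slurm heterogeneous job. A sketch of what such a multi-prog configuration file looks like, with purely illustrative rank ranges and binary names (not the actual task counts of this run):

# taskset.conf (illustrative), used as: srun --multi-prog taskset.conf
0-3999     ./fesom.x
4000-4607  ./oifs.x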

Cheers, Jan