Workaround for reading block decomposition files with OpenMPI v5.x #1318

Open
wants to merge 1 commit into
base: develop

gdicker1
Collaborator

This PR adds a workaround for problems encountered on systems using OpenMPI v5.x (observed with 5.0.7). The inlist argument to mpas_dmpar_scatter_ints seems to be affected by the MPI_Scatterv call, which then causes run-time failures in mpas_block_decomp_cells_for_proc when re-reading the block decomposition file. Making global_list an allocatable, associating a pointer with it, and passing that pointer as the inlist argument to mpas_dmpar_scatter_ints seems to resolve the issue.
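As a rough illustration of the pattern described above, here is a minimal, self-contained sketch. It is not the actual diff: `scatter_stub` is a hypothetical stand-in for mpas_dmpar_scatter_ints (it makes no MPI calls), `global_list_ptr` is an illustrative name, and the array sizes are made up. It assumes the inlist dummy argument is a pointer, which the description implies but does not state.

```fortran
! Sketch only: scatter_stub is a hypothetical stand-in for
! mpas_dmpar_scatter_ints and performs no MPI communication.
program global_list_pointer_sketch
   implicit none

   ! global_list is an allocatable; the TARGET attribute is required
   ! so that a pointer may be associated with it.
   integer, dimension(:), allocatable, target :: global_list
   integer, dimension(:), pointer :: global_list_ptr
   integer, dimension(4) :: local_list
   integer :: i

   allocate(global_list(8))
   global_list = [(i, i = 1, 8)]

   ! Associate a pointer with the allocatable array.
   global_list_ptr => global_list

   ! Pass the pointer, rather than the allocatable itself, as inlist.
   call scatter_stub(global_list_ptr, 4, local_list)

   print *, local_list

   nullify(global_list_ptr)
   deallocate(global_list)

contains

   ! Hypothetical stand-in: copies the first noutlist entries of inlist.
   subroutine scatter_stub(inlist, noutlist, outlist)
      integer, dimension(:), pointer :: inlist
      integer, intent(in) :: noutlist
      integer, dimension(noutlist), intent(out) :: outlist

      outlist(:) = inlist(1:noutlist)
   end subroutine scatter_stub

end program global_list_pointer_sketch
```

In this arrangement the allocatable owns the storage and is deallocated directly, while the pointer is only an alias used for the call.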

NOTE: So that this PR can be applied broadly, it is based on a very old commit: I think from around v4.0, and at least as old as the v6.0 tag.

…_proc

This is a workaround for problems encountered on systems using OpenMPI
v5.x (observed with 5.0.7).  The inlist argument to
mpas_dmpar_scatter_ints seems to be affected by the MPI_Scatterv call,
which then causes run-time failures in mpas_block_decomp_cells_for_proc
when re-reading the block decomposition file. Making global_list an
allocatable, associating a pointer with it, and passing that pointer to
mpas_dmpar_scatter_ints seems to resolve the issue.
@gdicker1
Collaborator Author

The problems addressed by this PR were first noted around August 2024, when a collaborator on the EarthWorks project who was working on TACC's Vista system ran into issues getting MPAS-A to work as a step towards running EarthWorks (CESM).

Another EarthWorks user posted about this in EarthWorksOrg/EarthWorks#109 and proposed this fix. When they were running on the Narval system in Canada, runs would die with the message "FIO-F-231/list-directed read/unit=1/error on data conversion.", which pointed back to mpas_block_decomp.F near line 181.

On Derecho, I ran the develop branch (from the merge of PR#1298) with the nvhpc/25.1 and openmpi/5.0.7 modules. The run reported "double free or corruption (!prev)", which crashed the model. Examining the crash with gdb seems to point back to mpas_block_decomp.F line 261. Once the fix was applied, I could no longer reproduce this error.
