MPICH test 58/72 co_reduce-factorial-int8 fails on Fedora #522

Closed
susilehtola opened this issue Mar 27, 2018 · 11 comments

@susilehtola

I've packaged OpenCoarrays for Fedora both using OpenMPI and MPICH, see review request at
https://bugzilla.redhat.com/show_bug.cgi?id=1560874

With OpenMPI all tests run successfully, but with MPICH test 58/72 fails:

58/72 Test #58: co_reduce-factorial-int8 ...............***Failed  Required regular expression not found.Regex=[Test passed.
]  0.02 sec
@zbeekman
Collaborator

Hi Susi, thanks for the report!

Can you please re-run the tests using either

  1. make check
  2. ctest --output-on-failure
  3. If CMake >= 3.x, set CTEST_OUTPUT_ON_FAILURE=TRUE in your environment and run the tests again however you first ran them

and then report the results here?
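
As a rough sketch of options 2 and 3 combined, the snippet below re-runs only the failing test with its output shown. It assumes you are in the CMake build directory (the one containing CTestTestfile.cmake); the variable names test_regex and rerun_cmd are just illustrative.

```shell
# Ask ctest to print the output of any failing test (option 3 above).
export CTEST_OUTPUT_ON_FAILURE=TRUE

# Select only the failing test by name; -R takes a regular expression.
test_regex='co_reduce-factorial-int8'
rerun_cmd="ctest --output-on-failure -R ${test_regex}"
echo "Would run: ${rerun_cmd}"

# Only run when ctest exists and this looks like a configured build directory.
if command -v ctest >/dev/null 2>&1 && [ -f CTestTestfile.cmake ]; then
  ${rerun_cmd} || echo "test failed; see the output above"
fi
```

Restricting the run with -R keeps the log short enough to paste into a comment.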

This will show us the test output and help us track down the source of the problem. Also, which version of MPICH are you using? There is an MPICH bug that has been fixed, but AFAIK the fix has not yet made it into a release or pre-release. Passing -DCAF_ENABLE_FAILED_IMAGES=FALSE to cmake during configure turns off failed image support, which avoids triggering that MPICH bug. I don't know whether this is the source of your issue, but it may be worth trying.

Some additional details (suggested on the new issue default form) would be helpful too, such as # of physical cores, version of GFortran, GCC & MPICH and any links to build logs.

Thanks!

@zbeekman
Collaborator

For anyone interested in testing, the RPM looks like it is available here: https://jussilehtola.fedorapeople.org/OpenCoarrays-2.0.0-1.fc27.src.rpm

@zbeekman
Collaborator

Another idea as to the cause of this issue: #324 perhaps? I doubt it, but I haven't followed all the recent changes to OpenCoarrays closely enough to know if this could impact MPICH.

@vehre or @neok-m4700 if either of you have ideas as to the cause, feel free to chime in.

@zbeekman zbeekman added the bug label Mar 27, 2018
@t-bltg
Contributor

t-bltg commented Mar 27, 2018

Hi @susilehtola ,

Yes, I've also hit this bug. With my configuration (mpich 3.2.1, gcc 7.3.0, ...) I've observed sporadic failures: around 1 out of 6 builds fails, and only because of the co_reduce-factorial-int8 test. From what I recall, I've seen this behavior since release 1.9.3, and possibly in earlier releases (though not tested).

I've dug into the code but could not find a starting point for this bug; it might be happening at a lower level (in the MPICH implementation). Deserves a bounty!

@susilehtola
Author

58/72 Test #58: co_reduce-factorial-int8 ...............***Failed  Required regular expression not found.Regex=[Test passed.
]  0.02 sec
Number of images = 4
value [ 1 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
value [ 2 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
value [ 3 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
value [ 4 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
Product  value = 0
Expected value = num_images()!
 2! = 2, 3! = 6, 4! = 24, ...
Answer should have been num_images()! = 24
ERROR STOP Wrong answer for n! using co_reduce
Error: Command:
   `/usr/lib64/mpich/bin/mpiexec -n 4 --disable-auto-cleanup /home/jzlehtol/rpmbuild/BUILD/OpenCoarrays-2.0.0/mpich/bin/OpenCoarrays-2.0.0-tests/co_reduce-factorial-int8`
failed to run.

After adding -DCAF_ENABLE_FAILED_IMAGES=FALSE the tests run successfully.
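
For anyone wanting to reproduce this workaround outside the RPM build, a minimal sketch of the configure step could look like the following. The source path and the OPENCOARRAYS_SRC variable are assumptions for illustration; only the -DCAF_ENABLE_FAILED_IMAGES=FALSE flag comes from this thread.

```shell
# Hypothetical location of the unpacked sources; override via OPENCOARRAYS_SRC.
src_dir="${OPENCOARRAYS_SRC:-../OpenCoarrays-2.0.0}"

# Flag from this thread: turn off failed image support at configure time.
cmake_flags="-DCAF_ENABLE_FAILED_IMAGES=FALSE"
echo "Would run: cmake ${src_dir} ${cmake_flags}"

# Configure, build, and test only if cmake and the sources are actually present.
if command -v cmake >/dev/null 2>&1 && [ -f "${src_dir}/CMakeLists.txt" ]; then
  cmake "${src_dir}" ${cmake_flags} && make && ctest --output-on-failure \
    || echo "configure, build, or tests failed"
fi
```

Run it from an empty build directory so the configure step does not pick up an older cache.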

> Some additional details (suggested on the new issue default form) would be helpful too, such as # of physical cores, version of GFortran, GCC & MPICH and any links to build logs.

Running on an 8-core machine with

$ rpm -q gcc-gfortran mpich
gcc-gfortran-7.3.1-5.fc27.x86_64
mpich-3.2.1-2.fc27.x86_64

@zbeekman zbeekman changed the title from "MPICH test 58/72 fails on Fedora" to "MPICH test 58/72 co_reduce-factorial-int8 fails on Fedora" Apr 27, 2018
@zbeekman
Collaborator

I have confirmed that this bug happens intermittently. Here is some more detailed debug output from a recent Travis-CI job. To trigger it, it helps to run the tests multiple times in a row:

Relevant output from https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/372200706#L1603:

36/58 Test #43: co_reduce-factorial-int8 ..........................................................***Failed  Required regular expression not found.Regex=[Test passed.
]  0.07 sec
1/4: Entering sync all.
3/4: Entering sync all.
4/4: Entering sync all.
2/4: Entering sync all.
1/4: _gfortran_caf_sync_all: MPI_Barrier = 0.
1/4: Leaving sync all.
3/4: _gfortran_caf_sync_all: MPI_Barrier = 0.
3/4: Leaving sync all.
2/4: _gfortran_caf_sync_all: MPI_Barrier = 0.
2/4: Leaving sync all.
4/4: _gfortran_caf_sync_all: MPI_Barrier = 0.
4/4: Leaving sync all.
2/4: finalize_internal(status_code = 0)
2/4: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
4/4: finalize_internal(status_code = 0)
3/4: finalize_internal(status_code = 0)
4/4: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
3/4: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
Number of images = 4
1/4: _gfortran_caf_get() src_vector = 0x0, image_index = 1, offset = 0.
1/4: _gfortran_caf_get() in caf_this == image_index, size = 1, dst_kind = 1, src_kind = 1
value [ 1 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
1/4: _gfortran_caf_get() src_vector = 0x0, image_index = 2, offset = 0.
value [ 2 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
value [ 3 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
value [ 4 ] is 0
since RESULT_IMAGE is present, value on other images is undefined by the standard
Product  value = 0
Expected value = num_images()!
1/4: _gfortran_caf_get() src_vector = 0x0, image_index = 3, offset = 0.
1/4: _gfortran_caf_get() src_vector = 0x0, image_index = 4, offset = 0.
 2! = 2, 3! = 6, 4! = 24, ...
Answer should have been num_images()! = 24
ERROR STOP Wrong answer for n! using co_reduce
1/4: terminate_internal (stat_code = 6000, exit_code = 1).
1/4: finalize_internal(status_code = 6000)
3/4: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 76623973
3/4: finalize(): Freeed all slave tokens.
2/4: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 76623973
2/4: finalize(): Freeed all slave tokens.
4/4: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 75575397
4/4: finalize(): Freeed all slave tokens.
2/4: finalize_internal: before Win_unlock_all.
2/4: finalize_internal: before Win_free(stat_tok)
2/4: finalize_internal: before Comm_free(CAF_COMM_WORLD)
2/4: finalize_internal: after Comm_free(CAF_COMM_WORLD)
2/4: finalize_internal: Finalisation done!!!
3/4: finalize_internal: before Win_unlock_all.
3/4: finalize_internal: before Win_free(stat_tok)
3/4: finalize_internal: before Comm_free(CAF_COMM_WORLD)
3/4: finalize_internal: after Comm_free(CAF_COMM_WORLD)
4/4: finalize_internal: before Win_unlock_all.
4/4: finalize_internal: before Win_free(stat_tok)
4/4: finalize_internal: before Comm_free(CAF_COMM_WORLD)
4/4: finalize_internal: after Comm_free(CAF_COMM_WORLD)
3/4: finalize_internal: Finalisation done!!!
4/4: finalize_internal: Finalisation done!!!
Error: Command:
   `/usr/local/bin/mpiexec -n 4 --disable-auto-cleanup /Users/travis/build/sourceryinstitute/OpenCoarrays/cmake-build/bin/OpenCoarrays-2.0.0-28-g9765332-tests/co_reduce-factorial-int8`
failed to run.
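
Since the failure is intermittent, running the test several times in a row (as suggested above) can be sketched as a small loop that counts failing runs. The helper name count_failures is hypothetical, and the ctest invocation in the usage comment assumes a configured build directory.

```shell
# count_failures RUNS CMD...: run CMD RUNS times and print how many runs failed.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status is counted here.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}

# Example usage (not executed here):
#   count_failures 20 ctest -R co_reduce-factorial-int8
```

A non-zero count over a few dozen runs would be consistent with the sporadic failures reported earlier in the thread; drop the output redirection to capture the debug log of a failing run.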

@zbeekman
Collaborator

I wonder if @vehre's MPICH patch might help/fix this. This is a reminder to myself to see if it needs to be backported to MPICH release branches.

@zbeekman zbeekman mentioned this issue Apr 27, 2018
@vehre
Collaborator

vehre commented May 1, 2018

Nope, the MPICH patch is not addressing this issue. The MPICH patch is only addressing an issue in the parts of MPICH that are needed to support failed images.

This test is presumably failing because the datatype (4-byte int) used in the reduce is too large for the target datatype (1-byte int).

@zbeekman
Collaborator

zbeekman commented May 1, 2018

Hmmm I see. So

_gfortran_caf_get() in caf_this == image_index, size = 1, dst_kind = 1, src_kind = 1

is trying to send and receive 1 byte ints without converting, or converting incorrectly.

@vehre
Collaborator

vehre commented May 1, 2018

No, that one is of course working as expected.

@zbeekman
Collaborator

zbeekman commented May 1, 2018

OK, I guess there is no added debug output highlighting where the type conversion error is happening. Thanks for responding!
