-
-
Notifications
You must be signed in to change notification settings - Fork 55
Pull in failed images support. #370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if (me == 2) print *, "Test failed." | ||
|
||
sync all (STAT=syncAllStat) | ||
if (me == 1) print *, "Test passed." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check syncAllStat==STAT_FAILED_IMAGE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, I see you have done a very nice and thorough job testing all aspects (that are currently obvious to me) of failed images! Nice work! 🎉 💯 🥇 🙏 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but I like to keep unit tests as small as necessary to test a single feature. The check whether sync all returns the correct stat-value is done in a separate test.
@zbeekman Any chance this can be merged this week? I start teaching the short course at the University of Cyprus on Wednesday and would like to give the attendees an outline of what I'll cover. I hope to introduce failed images next week. |
Yes I'm close. Adding required build system stuff. Hope it will be done by
the call.
…On Mon, May 22, 2017 at 11:20 PM Damian Rouson ***@***.***> wrote:
@zbeekman <https://github.com/zbeekman> Any chance this can be merged
this week? I start teaching the short course at the University of Cyprus on
Wednesday and would like to give the attendees an outline of what I'll
cover. I hope to introduce failed images next week.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#370 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAREPAahlaNzb-d21QYawiV2vb1BFKJJks5r8lBugaJpZM4NMi7C>
.
|
- Ensure that tests requiring GCC 7 are only built when it is available - Add a default option (set to ON) to enable failed images support when the required experimental/proposed fault tolerant MPI features are detected.
This will create better grouping in `ccmake` and `cmake-gui`
- Test for existence, assume symbols/functionality provided elsewhere if it is missing. Should induce compile time error if this is not true.
…rrays into vehre/failed-images
Codecov Report
@@ Coverage Diff @@
## master #370 +/- ##
==========================================
+ Coverage 39.3% 43.38% +4.07%
==========================================
Files 3 3
Lines 1707 1950 +243
Branches 294 320 +26
==========================================
+ Hits 671 846 +175
- Misses 937 993 +56
- Partials 99 111 +12 |
May fix #388 (issue with Clang 4 on FreeBSD)
@vehre It looks like there may be a race condition or another parallelism issue: If you run with, e.g., This test is similar to the tests bearing the same name, but with the 1 and 2 suffixes. The biggest difference is that this test calls I've created a branch with a spurious |
@vehre: Adding the extra synchronization in 3e11eac causes This leads me to suspect that there is a race condition in |
@verhe: Additional update: It seems that some other the other failed images tests can fail sometimes and that my attempted work around exhibits this... I'm not really sure what's going on. Is it possible that MPI could have a bug/issue where the |
- Intermittent failure on Travis-CI when run repeatedly `ctest --output-on-failure --repeat-until-fail 10` - Bug report: #390
@zbeekman I don't have a single issue with running the testcases even multiple times. My mpi-version is mpich-3.2-6 Fedora system supplied. As for the failing travis-run for gcc7, I suppose that it is simply timing out after 10 minutes total runtime. |
…nds. That the number of images failed is reported incorrectly, may happen when in the finalization stage. To prevent this, the mpi-error handler now checks for the correct number and on incorrect ones just exists ok. This commit should Fix #390.
…encoarrays into vehre/failed-images
I only see segfaults on massive oversubscription like 64 processes on a 4-core machine. I have added a commit that addresses these. |
Ok, the time out is because the one test hangs for 10+ minutes, not because
the entire run takes over ten minutes. But I suspect this might be a
virtualization/cloud/MPI issue so until we get reports of others
encountering it or find a way to reliably reproduce I'm comfortable closing
the issue & merging
…On Thu, May 25, 2017 at 7:54 AM Dr. Andre Vehreschild < ***@***.***> wrote:
I only see segfaults on massive oversubscription like 64 processes on a
4-core machine. I have added a commit that addresses these.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#370 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAREPMu3_Vf5S5cl1pBtciMaZix4sd9Oks5r9WwEgaJpZM4NMi7C>
.
|
Summary of changes
Added failed images support.
Fixed #309.
Fixed #354.
Rationale for changes
It wasn't present yet?!
Additional info and certifications
This pull request (PR) is a:
I certify that:
WIP:
to thebeginning of the PR title
DO NOT MERGE:
to the beginning of the titleissue-<#>
where<#>
is replaced by the issue that this PRis addressing
touches
touches
changes
ctest
) and all tests passset of changes
Come on! Get a grip on all these options. My time is limited!
ATTENTION !!! Build-system work is needed !!!
For this patch to be most effective these requirements need to be added to the build system:
I propose to add a general flag to the build system, to allow the user to switch on failed_images support on demand. Additionally to this flag the following lines are needed in the build system:
if (compiler is 'gcc' and gcc-version >= 7 and mpi is 'mpich' and mpich-version >= 3.2) then
add_definition ('-DUSE_FAILED_IMAGES') to compile-statement of caf_mpi.c
// There is currently no sufficient support for openmpi or other mpi-implementations for fault tolerant mpi. When they add support other may be added.
In the top-most CMakeList, the #ifdef WITH_FAIL_IMAGES needs to be replaced by something reasonable to switch on these tests only when failed images support is present.
I have also added another preprocessor flag 'STOP_ON_UNSUPPORTED' independent of the above to allow stopping execution when an unsupported or unimplemented feature is hit. (see caf_mpi.c at the end).
Tested with gcc-trunk (aka 8.0), gcc-6.3.1 (the failed images test are not compilable, because 6.3 does know nothing about the new functions) and mpich v3.2 on Fedora 25 x86_64.